merge branch 'dev'

geraldinepascal · Jun 9, 2017 · 4278388
2 parents 2e53b35 + 08c2dc8
commit 4278388

Showing 25 changed files with 2,056 additions and 93 deletions.
diff --git a/README.md b/README.md
@@ -148,6 +148,24 @@
                  OR
                  sudo yum install util-linux
 
+    Pynast
+        Version: >= 1.2.2
+        Named as: pynast
+        Tools: tree
+        Download: https://pypi.python.org/pypi/pynast
+
+    Mafft
+        Version: >= v7.222
+        Named as: mafft
+        Tools: tree
+        Download: http://mafft.cbrc.jp/alignment/software/
+
+    Fasttree
+        Version: >= 2.1.9
+        Named as: FastTree
+        Tools: tree
+        Download: http://www.microbesonline.org/fasttree/#Install
+
 ### 4. Check intallation
     To check your installation you can type:
         cd <FROGS_PATH>/test
@@ -287,11 +305,13 @@
         - Extract databanks.
         - To use these databank, you need to create a .loc file named
           'frogs_db.loc'. The path provided must be the '.fasta'.
+          (see the frogs_db.loc example file)
     b] Contaminant databank
         - Upload databank and indexes from http://genoweb.toulouse.inra.fr/frogs_databanks/contaminants
         - Extract databank.
         - To use this databank, you need to create a .loc file named
           'phiX_db.loc'. The path provided must be the '.fasta'.
+          (see the phiX_db.loc example file)
 
 ### 9. Tools images
     The tools help contain images. These images must be in galaxy images

diff --git a/RELEASES_NOTES.md b/RELEASES_NOTES.md
@@ -1,10 +1,22 @@
+# v2.0.0  [DEV]
+### Tools added : 
+  * Tree : perform phylogenetic tree reconstruction based on Pynast or Mafft and Fasttree
+
+### Bugs fixes:
+  * Preprocess : min overlap at least equal to 1
+
+### Functions added:
+  * Preprocess: add Flash mismatch rate option
+
 # v1.4.0  [2017-02-04]
 ### Bugs fixes:
   * Preprocess: error in final dereplication with hudge number of samples.
   * Remove_chimera: error when using library Queue and hudge number of samples.
   * Clusters_stat: error with empty samples in hierarchical clustering.
   * Filters: error when only the filter on contamination is used.
   * Filters: bug when using other filters than abundance (check parameter when None).
+  * Tsv2Biom : bug fix when using a tsv file coming from a standard biom file
+  * Affiliations_stat : bug in rarefaction step computation when samples are empty
 
 ### Functions added:
   * Preprocess: new amplicon length graph.

diff --git a/app/tree.py b/app/tree.py
@@ -0,0 +1 @@
+../tools/tree/tree.py
diff --git a/app/tree.xml b/app/tree.xml
@@ -0,0 +1 @@
+../tools/tree/tree.xml
diff --git a/app/tree_tpl.html b/app/tree_tpl.html
@@ -0,0 +1 @@
+../tools/tree/tree_tpl.html
diff --git a/frogs_db.loc b/frogs_db.loc
@@ -0,0 +1,48 @@
+# Copyright (C) 2014 INRA
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+#
+#This is a sample file that enables the FROGS_affiliations_OTU tool to use a taxonomy database for
+#taxonomy affiliation. You will need to create or download the Blast+ index and to train your
+#database for the RDP classifier.
+#download link : http://genoweb.toulouse.inra.fr/frogs_databanks/assignation
+#Finally you will need to create a frogs_db.loc file similar to this one in your galaxy
+#tool-data directory. The frogs_db.loc file has this format (longer white space characters are
+#TAB characters):
+#
+#<unique_database_name>   <file_path>
+#
+#First column will be the visible name in galaxy.
+#So, for example, if you had 16S silva 128 indexed stored in
+#/galaxy_databanks/16S/silva_128/ 
+#then the frogs_db.loc entry would look like this:
+#
+#silva 128 16S  /galaxy_databanks/16S/silva_128/silva_128_16S.fasta
+#
+#and your /galaxy_databanks/16S/silva_128/ directory
+#would contain index files:
+#
+#-rw-r--r-- 1 mbernard FROGS    8097966  5 déc.  16:56 bergeyTrainingTree.xml
+#-rw-r--r-- 1 mbernard FROGS 1572981589  5 déc.  16:56 genus_wordConditionalProbList.txt
+#-rw-r--r-- 1 mbernard FROGS       1654  5 déc.  16:56 LICENCE.txt
+#-rw-r--r-- 1 mbernard FROGS    1072228  5 déc.  16:56 logWordPrior.txt
+#-rw-r--r-- 1 mbernard FROGS  940834335  5 déc.  16:56 silva_128_16S.fasta
+#-rw-r--r-- 1 mbernard FROGS  152606489  5 déc.  16:56 silva_128_16S.fasta.nhr
+#-rw-r--r-- 1 mbernard FROGS    6918588  5 déc.  16:56 silva_128_16S.fasta.nin
+#-rw-r--r-- 1 mbernard FROGS  205320030  5 déc.  16:56 silva_128_16S.fasta.nsq
+#-rw-r--r-- 1 mbernard FROGS        281  5 déc.  16:56 silva_128_16S.fasta.properties
+#-rw-r--r-- 1 mbernard FROGS    3420464  5 déc.  16:56 silva_128_16S.tax
+#-rw-r--r-- 1 mbernard FROGS     964048  5 déc.  16:57 wordConditionalProbIndexArr.txt
+#
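
The two-column layout described in the sample file above (visible name, TAB, path to the '.fasta') can be read with a few lines of Python. This is only a minimal sketch: `parse_loc` is a hypothetical helper written for illustration, not Galaxy's own loader and not part of FROGS, and it assumes that comment lines start with '#' and that columns are separated by a single TAB.

```python
def parse_loc(text):
    """Return {visible_name: fasta_path} from the content of a .loc file.

    Hypothetical helper for illustration; assumes '#' comment lines and
    TAB-separated columns, as described in the sample file.
    """
    databanks = {}
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected 2 TAB-separated columns: %r" % line)
        # First column is the name shown in Galaxy, second is the fasta path.
        databanks[fields[0].strip()] = fields[1].strip()
    return databanks


example = (
    "#<unique_database_name>\t<file_path>\n"
    "silva 128 16S\t/galaxy_databanks/16S/silva_128/silva_128_16S.fasta\n"
)
print(parse_loc(example))
```

Raising on a line with fewer than two columns makes the most common mistake (spaces typed instead of a TAB) visible immediately rather than producing an empty databank list.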
diff --git a/img/frogs_tree_otufile.png b/img/frogs_tree_otufile.png
diff --git a/img/frogs_tree_summary.png b/img/frogs_tree_summary.png
diff --git a/img/frogs_tree_templatefile.png b/img/frogs_tree_templatefile.png
diff --git a/img/frogs_tree_treefile.png b/img/frogs_tree_treefile.png
diff --git a/img/frogs_tree_view_phyloviz.png b/img/frogs_tree_view_phyloviz.png
diff --git a/lib/frogsNode.py b/lib/frogsNode.py
@@ -18,7 +18,7 @@
 __author__ = 'Frederic Escudie - Plateforme bioinformatique Toulouse'
 __copyright__ = 'Copyright (C) 2015 INRA'
 __license__ = 'GNU General Public License'
-__version__ = '0.2.0'
+__version__ = '0.2.1'
 __email__ = '[email protected]'
 __status__ = 'dev'
 
@@ -107,7 +107,7 @@ def get_ancestors(self):
             ancestors.extend( [self.parent] )
         return ancestors
 
-    def get_descendants_by_depth(self, depth=1):
+    def get_descendants(self, depth=1):
         """
         @summary: Returns the node descendants with the provided depth from the node. Example: depth=1 returns all the children of the node ; depth=2 returns all the grandchildren of the node.
         @param: [int] The selected depth.
@@ -191,4 +191,4 @@ def to_extended_newick(self):
             if len(self.metadata.keys()) != 0:
                 return '(' + ','.join(children_newick) + ')"' + self.name + '":' + json.dumps(self.metadata)
             else:
-                return '(' + ','.join(children_newick) + ')"' + self.name + '"'
+                return '(' + ','.join(children_newick) + ')"' + self.name + '"'
diff --git a/phiX_db.loc b/phiX_db.loc
@@ -0,0 +1,39 @@
+# Copyright (C) 2014 INRA
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+#
+#This is a sample file that enables the FROGS_filters tool to identify the phiX contaminant. You will need to create or download the Blast+ index.
+#download link : http://genoweb.toulouse.inra.fr/frogs_databanks/contaminants
+#Finally you will need to create a phiX_db.loc file similar to this one in your galaxy
+#tool-data directory. The phiX_db.loc file has this format (longer white space characters are
+#TAB characters):
+#
+#<contaminant_name>   <file_path>
+#
+#First column will be the visible name in galaxy.
+#So, for example, if you had phix indexed stored in
+#/galaxy_databanks/phiX/ 
+#then the phiX_db.loc entry would look like this:
+#
+#phiX    /galaxy_databanks/phiX/phi.fa
+#
+#and your /galaxy_databanks/phiX/ directory
+#would contain index files:
+#
+#-rwxrwxr-x 1 gpascal FROGS 5535 16 sept.  2015 phi.fa
+#-rw-rwxr-- 1 gpascal FROGS  132 16 sept.  2015 phi.fa.nhr
+#-rw-rwxr-- 1 gpascal FROGS   88 16 sept.  2015 phi.fa.nin
+#-rw-rwxr-- 1 gpascal FROGS 1348 16 sept.  2015 phi.fa.nsq
+#
diff --git a/tools/demultiplex/demultiplex.xml b/tools/demultiplex/demultiplex.xml
@@ -15,8 +15,8 @@
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 -->
-<tool id="FROGS_demultiplex" name="FROGS Demultiplex reads" version="1.1.0">
-	<description>Split by samples the reads in function of inner barcode.</description>
+<tool id="FROGS_demultiplex" name="FROGS Demultiplex reads" version="2.0.0">
+	<description>Attribute reads to samples in function of inner barcode.</description>
 	<command interpreter="python2.7">
 		demultiplex.py
 		#if str( $fastq_input.fastq_input_selector ) == "paired":
@@ -37,7 +37,7 @@
 		<param format="tabular" name="barcode_file" type="data" label="Barcode file" help="This file describes barcodes and samples (one line by sample tabulated separated from barcode sequence(s)). See Help section" optional="false" />
 
 		<conditional name="fastq_input">
-	      	<param name="fastq_input_selector" type="select" label="Single or Paired-end reads" help="Select between paired and single end data">
+	      	<param name="fastq_input_selector" type="select" label="Single or Paired-end reads" help="Select between paired and single-end data">
 	      		<option value="single">Single</option>
 	        	<option value="paired">Paired</option>
 	        </param>
@@ -51,8 +51,8 @@
       	</conditional>
 
 		<!-- Option -->
-		<param name="mismatches" type="integer" label="barcode mismatches" help="Number of mismatches allowed in barcode" value="0" optional="false" />
-		<param name="end" type="select" label="barcode on which end ?" help="The barcode is at the begining of the forward end or of the reverse end or both?">
+		<param name="mismatches" type="integer" label="Barcode mismatches" help="Number of mismatches allowed in barcode" value="0" optional="false" />
+		<param name="end" type="select" label="Barcode on which end ?" help="The barcode is placed either at the beginning of the forward end or of the reverse end or both?">
 			<option value="bol" selected="true">Forward</option>
 			<option value="eol">Reverse</option>
 			<option value="both">Both ends</option>
@@ -69,7 +69,7 @@
 
 What it does
 
-Classify single or paired end reads in function of barcode forward or reverse in the first or both reads.
+This tool classifies single or paired-end reads in function of barcode forward or reverse in the first or both reads.
 
 **Command line**::
 
@@ -80,9 +80,9 @@ Classify single or paired end reads in function of barcode forward or reverse in
    :widths: 20, 80
    :class: table table-striped
 
-   "FQ_INPUT1", "Fastq input file for the first read (single end or forward read of pair end sequences)"
-   "FQ_INPUT2", "Fastq input file for the second read (only for pair end sequences)"
-   "TXT_BARCODE", "Tabulated text file that describe barcode sequence used to multiplexe sample in your run: SAMPLE_NAME	BARCODE1	[BARCODE2]"
+   "FQ_INPUT1", "Fastq input file for the first read (single-end or forward read of paired-end sequences)"
+   "FQ_INPUT2", "Fastq input file for the second read (only for paired-end sequences)"
+   "TXT_BARCODE", "Tabulated text file that describes barcode sequences used to multiplexe samples: SAMPLE_NAME	BARCODE1	[BARCODE2]"
 
 .. csv-table:: Options
    :header: "Option name", "Meaning"
@@ -97,47 +97,51 @@ Classify single or paired end reads in function of barcode forward or reverse in
    :widths: 20, 80
    :class: table table-striped
 
-   "TXT_SUMMARY_OUTPUT", "A tabulated text file which summarise the number of sequence (pair) for each sample"
+   "TXT_SUMMARY_OUTPUT", "A tabulated text file which summarises the number of sequences (single or paired) for each sample"
    "TARGZ_DEMULT_ARCHIVE_OUTPUT", "A TAR.GZ archive that contains all fastq files for each sample"
-   "TARGZ_UNDEMULT_ARCHIVE_OUTPUT", "A TAR.GZ archive that contains all fastq files for undemultiplexed read"
+   "TARGZ_UNDEMULT_ARCHIVE_OUTPUT", "A TAR.GZ archive that contains all fastq files for undemultiplexed reads"
 
 .. class:: h3
 
 Format
 
-BARCODE_FILE : This file is expected to be tabulated:
+BARCODE_FILE :
+ This file is expected to be tabulated
+
  -first column corresponds to the sample name
 
- -second one to the sequence barcode used
+ -second column corresponds to the sequence barcode used
 
- -optional third one to the reverse sequence barcode.
+ -third column (optional) corresponds to the reverse sequence barcode
 
 .. class:: warningmark
 
 Take care to indicate sequence barcode in the strand of the read, so you may need to reverse complement the reverse barcode sequence
 
 .. class:: warningmark
 
-Barcode sequence must have the same length
+All barcode sequences must have the same length
 
-example of barcode file. The last column is optional, like this it describes sample multiplexed by both fragment ends.
+Example of barcode file: Here the sample is multiplexed by both fragment ends.
 
 .. image:: ${static_path}/images/tools/frogs/demultiplex_barcode.png
    :height: 18
    :width: 286
 
-FASTQ : Text file describing biological sequence in 4 lines format: 
- -first line start by "@" corresponds to the sequence identifier and optionally the sequence description.
+FASTQ : 
+ Text file describing biological sequences in a 4 line format: 
+
+ -first line starts with "@" and corresponds to the sequence identifier and optionally the sequence description
 
  -second line is the sequence itself
 
  -third line is a "+" following by the sequence identifier or not depending on the version
 
- -fourth line is the quality sequence, one code per base. The code depends on the version and the sequencer
+ -fourth line is the quality sequence, one code per base. The code depends on the format version and the sequencer
 
-`detailed fastq format article &lt;https://en.wikipedia.org/wiki/FASTQ_format&gt;`_
+`Click here for more details on the fastq format &lt;https://en.wikipedia.org/wiki/FASTQ_format&gt;`_
 
-example of fastq read corresponding to the previous barcode file  
+Example of fastq read corresponding to the previous barcode file  
 
 .. image:: ${static_path}/images/tools/frogs/demultiplex_fastq_ex.png
    :height: 57
@@ -148,24 +152,24 @@ example of fastq read corresponding to the previous barcode file
 
 How it works
 
-For each sequence or sequence pair the sequence fragment at the beginning (forward multiplexing) of the (first) read or at the end (reverse multiplexing) of the (second) read will be compare to all barcode sequence.
+For each sequence or sequence pair, the sequence fragment at the beginning (forward multiplexing) of the (first) read or at the end (reverse multiplexing) of the (second) read will be compared to all barcodes of the barcode file.
 
-If this fragment is equal (with less or equal mismatch than the threshold) to one (and only one) barcode, the fragment is trimmed and the sequence will be attributed to the corresponding sample.
+If this fragment matches one and only one barcode (with a number of mismatches less than or equal to the threshold), the fragment is trimmed and the sequence is attributed to the corresponding sample.
 
-Finally fastq files (or pair of fastq files) for each sample are included in an archive, and a report describes how many sequences are attributed for each sample. 
+Finally, fastq files (or pairs of fastq files) for each sample are included in an archive, and a report describing how many sequences are attributed to each sample is created.
 
 
 .. class:: infomark page-header h2
 
 Advices
 
-Do not forget to indicate barcode sequence as they actually are in the fastq sequence file, especially if you have data multiplexed via the reverse strand.
+Do not forget to indicate barcode sequences as they actually appear in the fastq sequence file, especially if your data were multiplexed via the reverse strand.
 
-For the mismatch threshold, we advised you to let the threshold to 0, and if you are not satisfied by the result try with 1. The number of mismatch depends on the length of the barcode, but oftenly those sequences are very short so 1 mismatch is already more than the sequencing error rate.
+For the mismatch threshold, we advise you to leave the threshold at 0. Then, if you are not satisfied with the result, try with 1. The number of mismatches depends on the length of the barcode, but these sequences are frequently very short, so 1 mismatch is already more than the sequencing error rate.
 
-If you have different barcode length, you must demultiplex your data in different times beginning by the longest barcode set and used the "unmatched" or "ambiguous" sequence with smaller barcode and so on.
+If you have different barcode lengths, you must demultiplex your data in several steps, beginning with the longest barcode set. Then, to trim the shorter barcodes, use the "unmatched" or "ambiguous" sequence file with the shorter barcode set, and so on.
 
-If you have Roche 454 sequences in sff format, you must convert it with some program like `sff2fastq &lt;https://github.com/indraniel/sff2fastq&gt;`_ or sff_to_fastq (installable in Galaxy)
+If you have Roche 454 sequences in sff format, you must convert them with a program such as `sff2fastq &lt;https://github.com/indraniel/sff2fastq&gt;`_ or sff_to_fastq (installable in Galaxy)
 
 
 ----
@@ -176,7 +180,7 @@ Contacts: [email protected]
 
 Repository: https://github.com/geraldinepascal/FROGS
 
-Please cite the FROGS Publication: *Escudie F., Auer L., Bernard M., Cauquil L., Vidal K., Maman S., Mariadassou M., Hernadez-Raquet G., Pascal G., 2015. FROGS: Find Rapidly OTU with Galaxy Solution. In: The environmental genomic Conference, Montpellier, France,* http://bioinfo.genotoul.fr/fileadmin/user_upload/FROGS_2015_GE_Montpellier_poster.pdf
+Please cite the FROGS Publication: *Escudie F., Auer L., Bernard M., Cauquil L., Vidal K., Maman S., Mariadassou M., Combes S., Hernandez-Raquet G., Pascal G., 2016. FROGS: Find Rapidly OTU with Galaxy Solution. In: ISME-2016 Montreal, CANADA ,* http://bioinfo.genotoul.fr/wp-content/uploads/FROGS_ISME2016_poster.pdf
 
 Depending on the help provided you can cite us in acknowledgements, references or both.
 	</help>
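
The "How it works" paragraph in the help text above can be sketched in Python for the forward-barcode case. This is an illustration under stated assumptions, not the code of demultiplex.py: `hamming` and `assign_sample` are hypothetical names, mismatches are counted position by position, and all barcodes have the same length, as the help text requires. The "unmatched" and "ambiguous" labels follow the wording used in the Advices section.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_sample(read, barcodes, max_mismatches=0):
    """Attribute a read to a sample from the barcode at its beginning.

    barcodes maps sample name -> barcode sequence (all the same length).
    Returns (sample, trimmed_read) when exactly one barcode matches,
    otherwise ("unmatched", read) or ("ambiguous", read).
    Hypothetical sketch; not the FROGS implementation.
    """
    if not barcodes:
        return "unmatched", read
    length = len(next(iter(barcodes.values())))
    fragment = read[:length]
    if len(fragment) < length:
        # Read shorter than the barcode: nothing to compare.
        return "unmatched", read
    hits = [sample for sample, barcode in barcodes.items()
            if hamming(fragment, barcode) <= max_mismatches]
    if len(hits) == 1:
        # One and only one barcode matches: trim it and attribute the read.
        return hits[0], read[length:]
    return ("unmatched" if not hits else "ambiguous"), read


barcodes = {"sampleA": "ACGT", "sampleB": "TGCA"}
print(assign_sample("ACGTTTGGCC", barcodes))
print(assign_sample("ACGATTGGCC", barcodes, max_mismatches=1))
```

With short barcodes, raising `max_mismatches` quickly makes several barcodes match the same fragment, which is why the help text advises starting at 0.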