FROGS is a galaxy/CLI workflow designed to produce an OTUs count matrix from high depth sequencing amplicon data.

vindarbot/FROGS

 
 

Visit our website: http://frogs.toulouse.inrae.fr/

Description

FROGS is a CLI workflow designed to produce an OTU count matrix from high depth sequencing amplicon data.

FROGS-wrappers allows FROGS to be added to a Galaxy instance (see https://github.com/geraldinepascal/FROGS-wrappers).

This workflow is focused on:

  • User-friendliness, with many rich graphic outputs and integration in Galaxy thanks to FROGS-wrappers.
  • Accuracy, with clustering without a global similarity threshold, management of separate PCRs in the chimera removal step, and management of multi-affiliations.
  • Handling of non-overlapping pairs of sequences from long amplicons such as ITS or RPB2.
  • Speed, with fast, parallelised algorithms that are easy to use.
  • Scalability, with algorithms designed to support data growth.

Convenient input data

Legend for the next schemas:

.: Complete nucleic sequence
!: Region of interest
*: PCR primers
  • Paired-end classical protocol: in the paired-end protocol, R1 and R2 may share a nucleic region. For example, amplicons of the 16S V3-V4 regions can be between 350 and 500 nt long; with 2×300 bp sequencing, the overlap is then between 250 nt and 100 nt.
        From:                                    To:
         rDNA .........!!!!!!................    ......!!!!!!!!!!!!!!!!!!!.....
         Ampl      ****!!!!!!****                  ****!!!!!!!!!!!!!!!!!!!****
           R1      --------------                  --------------
           R2      --------------                               --------------

In any case, R1 and R2 may overlap completely.

The minimum authorized overlap between R1 and R2 is 10 nt. With a shorter overlap, the merge can be incorrect, so the pair is rejected or treated as non-overlapping reads.
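
The overlap figures quoted above follow from simple arithmetic on read and amplicon lengths. The sketch below is only an illustration of that arithmetic, not FROGS code: the function names are hypothetical, and it assumes that each read starts at one end of the amplicon and that the 10 nt threshold is applied to this expected overlap.

```python
MIN_OVERLAP = 10  # minimum authorized overlap (nt) stated in the text


def expected_overlap(amplicon_len, r1_len, r2_len):
    """Expected R1/R2 overlap in nt (hypothetical helper).

    Assumes R1 starts at one end of the amplicon and R2 at the other.
    """
    raw = r1_len + r2_len - amplicon_len
    # The overlap cannot exceed the shorter read or the amplicon itself
    # (complete overlap), and is 0 when the two reads do not meet.
    return max(0, min(raw, r1_len, r2_len, amplicon_len))


def is_mergeable(amplicon_len, r1_len, r2_len, min_overlap=MIN_OVERLAP):
    """True when the expected overlap reaches the minimum authorized overlap."""
    return expected_overlap(amplicon_len, r1_len, r2_len) >= min_overlap


if __name__ == "__main__":
    # 2x300 sequencing of amplicons of various lengths
    for amplicon in (350, 500, 595, 700):
        print(amplicon,
              expected_overlap(amplicon, 300, 300),
              is_mergeable(amplicon, 300, 300))
```

With 2×300 reads, a 350 nt amplicon gives 250 nt of overlap and a 500 nt amplicon gives 100 nt, matching the example above; amplicons longer than about 590 nt fall below the 10 nt minimum.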

  • Single-end classical protocol:
        rDNA .........!!!!!!................
        Ampl      ****!!!!!!****
        Read      --------------

  • Custom protocol
        rDNA .....!!!!!!!!!!!!!!............
        Ampl      ****!!!!!!****
        Read      --------------       

The amplicons can have highly variable lengths, as with ITS. R1 and R2 can have different lengths.

Installation

This FROGS repository is for command-line users. If you want to install FROGS on Galaxy, please refer to FROGS-wrappers.

Tool dependencies

FROGS is written in Python 3.7 (with the external NumPy and SciPy libraries) and also uses home-made scripts written in Perl 5 and R 3.6.

FROGS relies on different specific tools for each of the analysis steps.

| FROGS tool | Dependency | Version tested |
|---|---|---|
| Preprocess and Remove_chimera | vsearch | 2.17.0 |
| Preprocess | flash (optional) | 1.2.11 |
| Preprocess | cutadapt (must be >=2.8) | 3.1 |
| Clustering | swarm (must be >=2.1) | 3.0.0 |
| ITSx | ITSx | 1.1.2 |
| Affiliation_OTU | NCBI BLAST+ | 2.10.1 |
| Affiliation_OTU | RDP Classifier | 2.0.3 |
| Affiliation_OTU | EMBOSS needleall | 6.6.0 |
| Tree | MAFFT | 7.475 |
| Tree | Fasttree | 2.1.10 |
| Tree / FROGSSTAT | plotly, phangorn, rmarkdown, phyloseq, DESeq2, optparse, calibrate, formattable, DT | R 3.6.3 |
| FROGSSTAT | pandoc | 2.11.3 |
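
Two of the dependencies above have a minimum version (cutadapt >=2.8 and swarm >=2.1). The following sketch shows one way to compare an installed version string against those minimums. It is not part of FROGS; the helper names are hypothetical, and it assumes purely numeric, dot-separated version strings. It deliberately does not call the tools themselves.

```python
# Minimum versions taken from the dependency table above.
MINIMUM_VERSIONS = {"cutadapt": "2.8", "swarm": "2.1"}


def parse_version(text):
    """Turn '2.17.0' into (2, 17, 0) so that versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(tool, installed_version):
    """Return True if the installed version satisfies the table's minimum.

    Tools without a stated minimum are always accepted.
    """
    minimum = MINIMUM_VERSIONS.get(tool)
    if minimum is None:
        return True
    return parse_version(installed_version) >= parse_version(minimum)


if __name__ == "__main__":
    print(meets_minimum("cutadapt", "3.1"))   # tested version from the table
    print(meets_minimum("swarm", "3.0.0"))    # tested version from the table
    print(meets_minimum("cutadapt", "2.10"))  # numeric, not alphabetical, comparison
```

Comparing tuples of integers rather than raw strings avoids ranking "2.10" below "2.8".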

Use PEAR as the read pair merging software in preprocess

PEAR is one of the most effective tools for merging read pairs, but because its license is not free for private use, we cannot distribute it with FROGS. If you work in an academic lab on a private Galaxy server, or if you have paid for a license, you can use PEAR in the FROGS preprocess step. To do so, you need to:

  • have PEAR in your PATH or in the FROGS libexec directory. We have tested PEAR version 0.9.10 (the latest version is 0.9.11).
  • use the --merge-software pear option on the preprocess.py command line
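
The two conditions above can be checked before launching preprocess.py. The sketch below is a hypothetical helper, not FROGS code: it assumes the PEAR executable is named `pear`, and the `libexec_dir` argument stands for wherever your FROGS libexec directory is located. It only returns the extra option quoted above and does not build a full preprocess.py command.

```python
import os
import shutil


def pear_merge_args(libexec_dir=None, search_path=None, exe_name="pear"):
    """Return the extra preprocess.py arguments if PEAR can be found.

    search_path: PATH-like string to search (None means the current PATH).
    libexec_dir: FROGS libexec directory to check as a fallback (assumption:
                 the executable sits directly inside it).
    """
    found = shutil.which(exe_name, path=search_path) is not None
    if not found and libexec_dir is not None:
        candidate = os.path.join(libexec_dir, exe_name)
        found = os.path.isfile(candidate) and os.access(candidate, os.X_OK)
    # Option name and value are the ones given in the text above.
    return ["--merge-software", "pear"] if found else []


if __name__ == "__main__":
    extra = pear_merge_args()
    print("PEAR available:", bool(extra), extra)
```

An empty list means PEAR was not found, in which case the option should simply be left out.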

FROGS and dependencies installation

From conda

FROGS is now available on bioconda (https://anaconda.org/bioconda/frogs).

  • to create a specific environment for a specific FROGS version
conda env create --name frogs@3.2.3 --file frogs-conda-requirements.yaml
# to use FROGS, you first need to activate your environment
conda activate frogs@3.2.3

From source

see INSTALL_from_source.md

Check installation

To check your installation you can type:

cd <FROGS_PATH>/test
# when using conda FROGS_PATH=<conda_env_dir>/frogs@3.2.3/share/FROGS_3.2.3

sh test.sh <FROGS_PATH> <NB_CPU> <JAVA_MEM> <OUT_FOLDER>

"Bioinformatic" tools are run on a small simulated dataset of one sample replicated three times. "Statistical" tools are run on an extract of the published results of Chaillou et al., ISME 2014.

This test executes the FROGS tools in command line mode. Example:

[user@computer:/home/frogs/FROGS/test/]$ sh test.sh ../ 1 2 res
Step preprocess : Flash mardi 10 novembre 2020, 10:56:56 (UTC+0100)
Step preprocess : Vsearch mardi 10 novembre 2020, 10:59:57 (UTC+0100)
Step clustering mardi 10 novembre 2020, 11:02:51 (UTC+0100)
Step remove_chimera mardi 10 novembre 2020, 11:08:31 (UTC+0100)
Step otu filters mardi 10 novembre 2020, 11:13:43 (UTC+0100)
Step ITSx mardi 10 novembre 2020, 11:14:00 (UTC+0100)
Step affiliation_OTU mardi 10 novembre 2020, 11:14:01 (UTC+0100)
Step affiliation_filter: masking mode mardi 10 novembre 2020, 11:14:53 (UTC+0100)
Step affiliation_filter: deleted mode mardi 10 novembre 2020, 11:14:54 (UTC+0100)
Step affiliation_postprocess mardi 10 novembre 2020, 11:14:54 (UTC+0100)
Step normalisation mardi 10 novembre 2020, 11:14:55 (UTC+0100)
Step clusters_stat mardi 10 novembre 2020, 11:14:55 (UTC+0100)
Step affiliations_stat mardi 10 novembre 2020, 11:14:58 (UTC+0100)
Step biom_to_tsv mardi 10 novembre 2020, 11:15:05 (UTC+0100)
Step biom_to_stdBiom mardi 10 novembre 2020, 11:15:06 (UTC+0100)
Step tsv_to_biom mardi 10 novembre 2020, 11:15:06 (UTC+0100)
Step tree mardi 10 novembre 2020, 11:15:06 (UTC+0100)
Step phyloseq_import_data mardi 10 novembre 2020, 11:16:36 (UTC+0100)
Step phyloseq_composition mardi 10 novembre 2020, 11:18:00 (UTC+0100)
Step phyloseq_alpha_diversity mardi 10 novembre 2020, 11:19:31 (UTC+0100)
Step phyloseq_beta_diversity mardi 10 novembre 2020, 11:20:19 (UTC+0100)
Step phyloseq_structure mardi 10 novembre 2020, 11:20:45 (UTC+0100)
Step phyloseq_clustering mardi 10 novembre 2020, 11:21:59 (UTC+0100)
Step phyloseq_manova mardi 10 novembre 2020, 11:22:20 (UTC+0100)
Step deseq2_preprocess mardi 10 novembre 2020, 11:22:42 (UTC+0100)
Step deseq2_visualisation mardi 10 novembre 2020, 11:23:29 (UTC+0100)
Completed with success
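
If you capture the output of test.sh in a file, the run can be checked automatically. The sketch below is a hypothetical helper, not part of FROGS: it assumes the output has the shape shown above, that is, one line per step starting with "Step " and a final line reading "Completed with success".

```python
def summarize_test_log(log_text):
    """Return (number_of_steps, success) for the captured output of test.sh.

    Assumes step lines start with 'Step ' and that a successful run ends
    with the line 'Completed with success', as in the example above.
    """
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    steps = [line for line in lines if line.startswith("Step ")]
    success = bool(lines) and lines[-1] == "Completed with success"
    return len(steps), success


if __name__ == "__main__":
    sample = (
        "Step preprocess : Flash mardi 10 novembre 2020, 10:56:56 (UTC+0100)\n"
        "Step clustering mardi 10 novembre 2020, 11:02:51 (UTC+0100)\n"
        "Completed with success\n"
    )
    print(summarize_test_log(sample))
```

Only the "Step " prefix and the final line are inspected, so the check does not depend on the locale used for the timestamps.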

Memory and parallelisation advice

If you have more than one CPU, it is recommended to increase the number of CPUs used by tools. All the CPUs must be on the same computer/node.

| Tool | RAM per CPU | Minimal RAM | Configuration example |
|---|---|---|---|
| Preprocess | 8 GB | - | 12 CPUs and 96 GB |
| Clustering | - | 10 GB | 16 CPUs and 60 GB |
| ITSx / Remove_Chimera | 3 GB | 5 GB | 12 CPUs and 36 GB |
| Affiliation_OTU | - | 20 GB | 30 CPUs and 300 GB |
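
For the tools with a per-CPU figure, the configuration examples above match the number of CPUs multiplied by the RAM per CPU (12 × 8 = 96 GB, 12 × 3 = 36 GB). The sketch below turns the table into a rough lower bound. It is an illustration only: the function name is hypothetical, and combining the two columns by taking the larger value is an assumption, not a documented FROGS rule. For Clustering and Affiliation_OTU the table gives only a minimum, so the result is just that floor; the configuration examples for those tools are larger.

```python
# (RAM per CPU in GB, minimal RAM in GB); None where the table gives no value.
RAM_GUIDELINES = {
    "Preprocess": (8, None),
    "Clustering": (None, 10),
    "ITSx / Remove_Chimera": (3, 5),
    "Affiliation_OTU": (None, 20),
}


def estimate_ram_gb(tool, nb_cpu):
    """Rough lower bound on the RAM (GB) to reserve for a tool.

    Assumption: the requirement is the larger of (RAM per CPU x CPUs)
    and the minimal RAM, whenever each value is given in the table.
    """
    per_cpu, minimal = RAM_GUIDELINES[tool]
    scaled = per_cpu * nb_cpu if per_cpu is not None else 0
    return max(scaled, minimal or 0)


if __name__ == "__main__":
    for tool, cpus in (("Preprocess", 12), ("ITSx / Remove_Chimera", 12),
                       ("ITSx / Remove_Chimera", 1), ("Clustering", 16)):
        print(tool, cpus, estimate_ram_gb(tool, cpus))
```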

Download databanks

Reference databases are needed to filter contaminants, assign taxonomy to each OTU, or filter ambiguities for amplicons of hypervariable length.

We provide some databanks that you simply need to download and extract.

Please take the time to read the individual README.txt and LICENCE.txt files.

Troubleshooting

Abnormal increase in memory consumption with the number of CPUs

With some old versions of glibc, the virtual memory used per CPU grows multiplicatively with the number of CPUs.

| Nb CPUs | Expected RAM consumption | Observed RAM consumption |
|---|---|---|
| 1 | 1 GB | 1 GB |
| 2 | 2 GB | 2*2 GB |
| 3 | 3 GB | 3*3 GB |
| 4 | 4 GB | 4*4 GB |

The memory and CPU parameters given in the examples take this problem into account.
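
The table above can be read as quadratic growth: with n CPUs, the observed consumption is n × n GB where n GB would be expected. A minimal sketch of that arithmetic follows. The function names are hypothetical, and extending the pattern to a base other than 1 GB per CPU is an assumption that the table itself does not cover.

```python
def expected_ram_gb(nb_cpu, ram_per_cpu_gb=1):
    """Expected consumption: grows linearly with the number of CPUs."""
    return nb_cpu * ram_per_cpu_gb


def observed_ram_gb(nb_cpu, ram_per_cpu_gb=1):
    """Consumption seen with the affected glibc versions.

    Each of the nb_cpu processes uses nb_cpu times the single-CPU amount,
    which reproduces the 2*2, 3*3 and 4*4 GB rows of the table.
    """
    return nb_cpu * (nb_cpu * ram_per_cpu_gb)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, expected_ram_gb(n), observed_ram_gb(n))
```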

License

GNU GPL v3

Copyright

2020 INRAE

Citation

Depending on which type of amplicon you are working on (mergeable or unmergeable), please cite one of the two FROGS publications:

Contact

[email protected]
