-
Notifications
You must be signed in to change notification settings - Fork 86
StrainPhlAn 4
StrainPhlAn is a computational tool for tracking individual strains across a large set of samples. The input of StrainPhlAn is a set of metagenomic samples and for each species, the output is a multiple sequence alignment (MSA) file of all species strains reconstructed directly from the samples. From this MSA, StrainPhlAn calls (PhyloPhlAn)[http://segatalab.cibio.unitn.it/tools/phylophlan3/index.html] to build the phylogenetic tree showing the strain evolution of the sample strains.
For each sample, StrainPhlAn is able to identify strains of a species by merging and concatenating all reads mapped against species-specific MetaPhlAn markers.
In detail, let us start from a toy example with 6 HMP gut metagenomic samples (SRS055982-subjectID_638754422, SRS022137-subjectID_638754422, SRS019161-subjectID_763496533, SRS013951-subjectID_763496533, SRS014613-subjectID_763840445, SRS064276-subjectID_763840445) from 3 three subjects (each was sampled at two time points) and one Bacteroides caccae genome G000273725. We would like to:
- extract the Bacteroides caccae (SGB1877) strains from these samples and compare them with the reference genome in a phylogenetic tree.
- know how many SNPs between those strains and the reference genome.
By running StrainPhlAn on these samples, we will obtain the Bacteroides caccae (SGB1877) phylogenetic tree and its multiple sequence alignment in the following figure (produced with ete2 and Jalview):
We can see that the strains from the same subject are grouped together. The tree also highlights that the strains from each subject did not evolve between the two sampling time points. From the tree, we also know that the strains of subject "763496533" are closer to the reference genome than those of the others.
In addition, the table below shows the number of SNPs between the sample strains and the reference genome based on the strain alignment returned by StrainPhlAn.
In the next sections, we will illustrate step by step how to run StrainPhlAn on this toy example to reproduce the above figures.
StrainPhlAn is included in MetaPhlAn. You can get more info about MetaPhlAn installation here.
Let's reproduce the toy example result in the introduction section. The steps are as follows:
Step 1. Download the 6 HMP gut metagenomic samples, the metadata.txt file and one reference genome from the folder "fastqs" and "reference_genomes" in this link and put these folders under the "strainer_tutorial" folder.
Step 2. Obtain the sam files from these samples by mapping them against MetaPhlAn database:
This step will run MetaPhlAn: the metagenomic samples will be mapped against the MetaPhlAn marker database and then SAM files (*.sam.bz2) will produced. Each SAM file (in SAM format) contains the reads mapped against the marker database of MetaPhlAn. The commands to run are:
mkdir -p sams
mkdir -p bowtie2
mkdir -p profiles
for f in fastq/SRS*
do
echo "Running MetaPhlAn on ${f}"
bn=$(basename ${f})
metaphlan ${f} --input_type fastq -s sams/${bn}.sam.bz2 --bowtie2out bowtie2/${bn}.bowtie2.bz2 -o profiles/${bn}_profiled.tsv
done
After this step, you will have a folder "sams" containing the SAM files (*.sam.bz2) and other MetaPhlAn output files in the "bowtie2" and "profiles" folders. If you want to skip this step, you can download the SAM files from the folder "sams" in this link.
Step 3. Produce the consensus-marker files which are the input for StrainPhlAn:
For each sample, this step will reconstruct all species strains found in it and store them in a pickle file (*.pkl). Those strains are referred as sample-reconstructed strains. The commands to run are:
mkdir -p consensus_markers
sample2markers.py -i sams/*.sam.bz2 -o consensus_markers -n 8
The result is the same if you want to run several sample2markers.py
scripts in parallel with each run for a sample (this may be useful for some cluster-system settings).
After this step, you will have a folder consensus_markers
containing all sample-marker files (*.pkl).
If you want to skip this step, you can download the consensus marker files from the folder "consensus_markers" in this link.
Step 4. Extract the markers of Bacteroides_caccae (SGB1877) from MetaPhlAn database (to add its reference genome later):
This step will extract the markers of Bacteroides_caccae (SGB1877) in the database and then StrainPhlAn will identify the sequences in the reference genomes that are closet to them (in the next step by using blast). Those will be concatenated and referred as reference-genome-reconstructed strains. The commands to run are:
mkdir -p db_markers
extract_markers.py -c t__SGB1877 -o db_markers/
After this step, you should have one file in folder "db_markers": "t__SGB1877.fna" containing the markers of Bacteroides caccae (SGB1877). Those markers can be found in the folder "db_markers" in this link.
Step 5. Build the multiple sequence alignment and the phylogenetic tree:
This step will filter the selected clade markers based on their presence in the sample-reconstructed strains (stored in the marker files produced in step 3) and reference-genomes (if specified). Also, the sample-reconstructed strains and reference-genomes will be filtered based on the presence of the selected clade markers. From this filtered markers and samples, StrainPhlAn will call PhyloPhlAn to produce a multiple sequence alignment (MSA) to then build the phylogenetic tree.
The commands to run are:
mkdir -p output
strainphlan -s consensus_markers/*.pkl -m db_markers/t__SGB1877.fna -r reference_genomes/G000273725.fna.bz2 -o output -n 8 -c t__SGB1877 --mutation_rates
After this step, you will find the tree "output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre". All the output files can be found in the folder "output" in this link. You can view it by Archaeopteryx or any other viewers.
In order to add the metadata, we provide a script called add_metadata_tree.py which can be used as follows:
add_metadata_tree.py -t output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre -f fastq/metadata.txt -m subjectID --string_to_remove .fastq.bz2
The script "add_metadata_tree.py" can accept multiple metadata files (space-separated, wild card can also be used) and multiple trees. A metadata file is a tab-separated file where the first row is the meta-headers, and the following rows contain the metadata for each sample. Multiple metadata files are used in the case where your samples come from more than one dataset and you do not want to merge the metadata files.
For more details of using "add_metadata_tree.py", please see its help (with option "-h"). An example of a metadata file is the "fastqs/metadata.txt" file with the below content:
sampleID subjectID
SRS055982 638754422
SRS022137 638754422
SRS019161 763496533
SRS013951 763496533
SRS014613 763840445
SRS064276 763840445
G000273725 ReferenceGenomes
Note that "sampleID" is a compulsory field.
After adding the metadata, you will obtain the tree files "*.tre.metadata" with metadata and view them by Archaeopteryx as in the previous step.
Now, you can plot the tree with the plot_tree_graphlan.py script:
plot_tree_graphlan.py -t output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre.metadata -m subjectID
and obtain the following figure (output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre.metadata.png):
usage: strainphlan [-h] [-d DATABASE] [-m CLADE_MARKERS]
[-s SAMPLES [SAMPLES ...]] [-r REFERENCES [REFERENCES ...]]
[-c CLADE] [-o OUTPUT_DIR] [-n NPROCS]
[--secondary_samples SECONDARY_SAMPLES [SECONDARY_SAMPLES ...]]
[--secondary_references SECONDARY_REFERENCES [SECONDARY_REFERENCES ...]]
[--trim_sequences TRIM_SEQUENCES]
[--marker_in_n_samples MARKER_IN_N_SAMPLES]
[--sample_with_n_markers SAMPLE_WITH_N_MARKERS]
[--secondary_sample_with_n_markers SECONDARY_SAMPLE_WITH_N_MARKERS]
[--sample_with_n_markers_after_filt SAMPLE_WITH_N_MARKERS_AFTER_FILT]
[--phylophlan_mode {accurate,fast}]
[--phylophlan_configuration PHYLOPHLAN_CONFIGURATION]
[--tmp TMP] [--mutation_rates] [--print_clades_only]
[--debug]
optional arguments:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
The input MetaPhlAn 4.0.0 database (default:
/shares/CIBIO-Storage/CM/cmstore/tools/anaconda3/envs/
test_mpa4/lib/python3.7/site-packages/metaphlan/metaph
lan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl)
-m CLADE_MARKERS, --clade_markers CLADE_MARKERS
The clade markers as FASTA file (default: None)
-s SAMPLES [SAMPLES ...], --samples SAMPLES [SAMPLES ...]
The reconstructed markers for each sample (default:
[])
-r REFERENCES [REFERENCES ...], --references REFERENCES [REFERENCES ...]
The reference genomes (default: [])
-c CLADE, --clade CLADE
The clade to investigate (default: None)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
The output directory (default: None)
-n NPROCS, --nprocs NPROCS
The number of threads to use (default: 1)
--secondary_samples SECONDARY_SAMPLES [SECONDARY_SAMPLES ...]
The reconstructed markers for each secondary sample
(default: [])
--secondary_references SECONDARY_REFERENCES [SECONDARY_REFERENCES ...]
The secondary reference genomes (default: [])
--trim_sequences TRIM_SEQUENCES
The number of bases to remove from both ends when
trimming markers (default: 50)
--marker_in_n_samples MARKER_IN_N_SAMPLES
Theshold defining the minimum percentage of samples to
keep a marker (default: 80)
--sample_with_n_markers SAMPLE_WITH_N_MARKERS
Threshold defining the minimun percentage of markers
to keep a sample (default: 80)
--secondary_sample_with_n_markers SECONDARY_SAMPLE_WITH_N_MARKERS
Threshold defining the minimun percentage of markers
to keep a secondary sample (default: 80)
--sample_with_n_markers_after_filt SAMPLE_WITH_N_MARKERS_AFTER_FILT
Threshold defining the minimun percentage of markers
to keep a sample after filtering the markers [only for
dev] (default: 50)
--phylophlan_mode {accurate,fast}
The presets for fast or accurate phylogenetic analysis
(default: fast)
--phylophlan_configuration PHYLOPHLAN_CONFIGURATION
The PhyloPhlAn configuration file (default: None)
--tmp TMP If specified, the directory where to store the
temporal files. (default: None)
--mutation_rates If specified, StrainPhlAn will produce a mutation
rates table for each of the aligned markers and a
summary table for the concatenated MSA. This operation
can take long time to finish (default: False)
--print_clades_only If specified, StrainPhlAn will only print the
potential clades and stop the execution (default:
False)
--debug If specified, StrainPhlAn will not remove the temporal
folders (default: False)
In the output folder, you can find the following files:
- clade_name.info: this file shows the general information like the total length of the concatenated markers (full sequence length), number of used markers, etc.
- clade_name.polymorphic: this file shows the statistics on the polymorphic site, where "sample" is the sample name, "percentage_of_polymorphic_sites" is the percentage of sites that are suspected to be polymorphic, "avg_freq" is the average frequency of the dominant alleles on all polymorphic sites, "avg_coverage" is the average coverage at all polymorphic sites.
- clade_name.StrainPhlAn4_concatenated.aln: the alignment file of all metagenomic strains.
- clade_name.mutation: this file contains a mutation rates summary table for the concatenated MSA. This file will be generated if --mutation_rates param is specified.
- clade_name_mutation_rates: this folder contains the mutation rates table for each of the aligned markers. This folder will be generated if --mutation_rates param is specified.