Be sure to launch your VM with -X option, so we can visualize assembly graphs with Bandage.
ssh -X -A [email protected]
If this is not the case yet remember to activate the correct conda env:
conda activate LongReads
We are going to work in the following directory:
mkdir -p ~/data/mydatalocal/HiFi
cd ~/data/mydatalocal/HiFi
There are 2 sample, available at:
~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/samples/
- HumanReal: Sample from a pool of vegans and omnivore
- Zymo: mock community
Choose either for the assembly section.
As previously we can use the command seqkit stats to assess this sample.
Solution
seqkit stats ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/samples/Zymo_sample1e5.fastq.gz
Mostly 3 software have been developed for assembling HiFi metagenomic datasets.
In this tutorial, we are going to run hifiasm-meta and metaMDBG separatly, then use a method to merge their results.
As usual, try to craft your own command line to run the software.
Let's run hifiasm-meta first.
Use the following commands to see usage information for hifiasm-meta:
hifiasm_meta -h
Solution
hifiasm_meta -o ~/data/mydatalocal/HiFi/hifiasm-meta_asm ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/samples/Zymo_sample1e5.fastq.gz -t 4
Assembly takes a lot of time, so instead lets comment on the pre-run version.
#Copy hifiasm human assembly in your local folder
ln -s ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/HumanReal_asm ~/data/mydatalocal/HiFi/hifiasm-meta_human
#Copy hifiasm zymo assembly in your local folder
ln -s ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/Zymo_asm/ ~/data/mydatalocal/HiFi/hifiasm-meta_zymo
#Print hifiasm output files
ls -lh ~/data/mydatalocal/HiFi/hifiasm-meta_human/
Hifiasm represents the set of contigs with the GFA1 format. There is a quite a diversity of output file, what are they? What is a unitig a contig? Let's check the documentation.
We can use the following command to transform hifiasm-meta gfa file in a more conventional fasta file:
awk '/^S/{print ">"$2;print $3}' ~/data/mydatalocal/HiFi/hifiasm-meta_human/asm.p_ctg.gfa > ~/data/mydatalocal/HiFi/hifiasm-meta_human_asm.p_ctg.fasta
awk '/^S/{print ">"$2;print $3}' ~/data/mydatalocal/HiFi/hifiasm-meta_zymo/asm.p_ctg.gfa > ~/data/mydatalocal/HiFi/hifiasm-meta_zymo_asm.p_ctg.fasta
--> look at assembly statistics
--> look at actual size of longer contigs
Use Bandage to compare the Human and Zymo assemblies.
Bandage load ~/data/mydatalocal/HiFi/hifiasm-meta_human/asm.p_ctg.gfa
Bandage load ~/data/mydatalocal/HiFi/hifiasm-meta_zymo/asm.p_ctg.gfa
If you have issues with Bandage
Did you use -X or -Y when connecting to the VM? If not, please disconect and retype ssh with that flag:
ssh -X [email protected]
If you have Bandage on your laptop, use the scp command to download the gfa file on your laptop:
scp [email protected]:~/data/mydatalocal/HiFi/prerun_asm/asm.p_ctg.gfa .
This will copy the file to the directory you executed that command from. Also to be clear this command should not be run on the vm. This is a command for your laptop to request that file from the distant server. So it should be run on a terminal before you connect to the vm.
Try and follow explanation on how to forward display from this google doc: https://docs.google.com/document/d/1VPnL-5mXXQimkXQNiQagPhgzRn8j1JBHCLV42r8-Wqc/edit#
The Zymo is a bit more exciting than the HumanReal in terms of circular components, however some of those long contigs even if not circular are could already satisfy medium or high MAGs criterion for quality.
Now, let's try to run metaMDBG on the zymo mock community.
metaMDBG asm -h
Solution
metaMDBG asm ~/data/mydatalocal/HiFi/metaMDBG_asm ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/samples/Zymo_sample1e5.fastq.gz -t 4
Let's wait for metaMDBG to finish a few multi-k iterations.
With the command "metaMDBG gfa", we can generate the assembly graph corresponding to each multi-k iteration. Lower-k graph will have more connectivity, while higher-k graph will have more repeats solved but also more fragmentation. It is interesting to work on lower-k graph if you have external source of data that could solve long repeat (for instance, HiC, ultra long reads, binning metrics).
Let' try to generate an assembly graph with a low k value. The following command shows the available values for k and their corresponding size in bps.
metaMDBG gfa ~/data/mydatalocal/HiFi/metaMDBG_asm 0
Choose a value for k and wait for metaMDBG to generate the graph.
Solution
metaMDBG gfa ~/data/mydatalocal/HiFi/metaMDBG_asm 10
Visualize the assembly graph with Bandage
Solution
Bandage load /home/ubuntu/data/mydatalocal/HiFi/metaMDBG_asm/assemblyGraph_k10_1813bps.gfa
Now let's check the final metaMDBG assembly results:
#Copy metaMDBG prerun in your local folder
ln -s ~/repos/Ebame/tmp/metaMDBG_zymo/ ~/data/mydatalocal/HiFi/metaMDBG_zymo
#Print metaMDBG output files
ls -lh ~/data/mydatalocal/HiFi/metaMDBG_zymo/
--> look at assembly statistics
--> visualize the final assembly graph with Bandage
Let's focus here only on circular contigs.
We can check if a contig is circular by looking at the contig headers in the fasta files. If a header ends with a "c", it means that the contig is circular, otherwise it is linear. You can check this info with the following command:
grep ">" ~/data/mydatalocal/HiFi/hifiasm-meta_zymo_asm.p_ctg.fasta
Let's create folders for the circular contigs:
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/hifiasm_meta/
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/metaMDBG/
Now, try to run the following homemade script to extract the circular contigs:
~/repos/Ebame/scripts/extractCircularContigs.py
It should not be too hard if you use the -h.
Solution
~/repos/Ebame/scripts/extractCircularContigs.py ~/data/mydatalocal/HiFi/hifiasm-meta_zymo_asm.p_ctg.fasta ~/data/mydatalocal/HiFi/circularContigs/hifiasm-meta/
~/repos/Ebame/scripts/extractCircularContigs.py ~/data/mydatalocal/HiFi/metaMDBG_zymo/contigs.fasta.gz ~/data/mydatalocal/HiFi/circularContigs/metaMDBG/
We are now going to merge the results of metaMDBG and hifiasm-meta. The idea is to compute the similarity (ANI) between the circular contigs, and to choose only one representative if two contigs are duplicated (ANI > 0.95 by default).
For this task, we are going to use the software dRep. dRep has a lot of options, let's craft the command together, first display de-replicate options:
dRep dereplicate -h
dRep takes a list of genomes as input (option -g), let's add all circular contig paths in a single file:
Solution
#Collect circular contig paths
ls ~/data/mydatalocal/HiFi/circularContigs/hifiasm-meta/*.fa > ~/data/mydatalocal/HiFi/circularContigs/allCircularContigs.txt
ls ~/data/mydatalocal/HiFi/circularContigs/metaMDBG/*.fa >> ~/data/mydatalocal/HiFi/circularContigs/allCircularContigs.txt
#Check input file
cat ~/data/mydatalocal/HiFi/circularContigs/allCircularContigs.txt
Let's run dRep in a fast fashion, first disable quality check with option --ignoreGenomeQuality (we'll do it after dereplication), and try to select the "fastANI" method for comparing the circular contigs:
Solution
dRep dereplicate ~/data/mydatalocal/HiFi/circularContigs/drep/ -p 4 -g ~/data/mydatalocal/HiFi/circularContigs/allCircularContigs.txt --S_algorithm fastANI --ignoreGenomeQuality
Read dRep output information and try to list the folder containing dereplicated contigs.
Solution
ls -lh ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/
dRep provides a lot of useful information, for instance, we can look at the similarity between the pair of circular contigs:
column -s, -t < ~/data/mydatalocal/HiFi/circularContigs/drep/data_tables/Ndb.csv
Are there any circular contigs which are only found by one assembler?
column -s, -t < ~/data/mydatalocal/HiFi/circularContigs/drep/data_tables/Cdb.csv
Clearly some of these are not genomes, let's run checkm on the dereplicated contigs:
checkm lineage_wf ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/ ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/checkm/ -r -x .fa -t 4 --pplacer_threads 4 --tab_table -f ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/checkm/results.tsv
CheckM is a bit slow, so let's check the prerun results
ln -s ~/repos/Ebame/tmp/checkm_drepCircularContigs/ ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/checkm_prerun
Print columns corresponding to completeness and contamination:
awk -F"\t" '{ print $1, "\t", $12, "\t", $13 }' ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/checkm_prerun/results.tsv
Some of the smaller circular contigs are likely to be plasmids or virus. Let's use genomad, a machine learning approach to verify this.
As usual, let's check the software usage information:
genomad -h
genomad end-to-end -h
Genomad takes as input a single fasta file. It will then process the contigs one by one and determine how likely they are to be plasmids or virus. Let's concatenate the small dereplicated circular contigs in a single fasta file.
Solution
#Concatenate all circular contigs in a single fasta file
cat ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/*.fa > ~/data/mydatalocal/HiFi/circularContigs/allCircularContigs.fasta
#Concatenante only small circular contigs
find ~/data/mydatalocal/HiFi/circularContigs/drep/dereplicated_genomes/*.fa -size -500k | xargs cat > ~/data/mydatalocal/HiFi/circularContigs/allSmallCircularContigs.fasta
The genomad database is located here:
~/data/public/teachdata/ebame-2023/virome/db/genomad_db
Let's run genomad (you can use option --sensitivity 1.0 to speed-up prediction, use only for this tutorial):
Solution
genomad end-to-end ~/data/mydatalocal/HiFi/circularContigs/allSmallCircularContigs.fasta ~/data/mydatalocal/HiFi/circularContigs/genomad/ ~/data/public/teachdata/ebame-2023/virome/db/genomad_db --threads 4 --sensitivity 1.0
Read genomad logs and try to print plasmids and virus summaries:
Solution
cat ~/data/mydatalocal/HiFi/circularContigs/genomad/allSmallCircularContigs_summary/allSmallCircularContigs_plasmid_summary.tsv
cat ~/data/mydatalocal/HiFi/circularContigs/genomad/allSmallCircularContigs_summary/allSmallCircularContigs_virus_summary.tsv
Retry to use checkm on a contigs you chose and saved with Bandage.
--> find a long contigs from Bandage. Then click on "Output" menu -> "Save selected node sequence to FASTA"
Or from the command line, use the following command line replacing <NODE>:
Bandage reduce ~/data/mydatalocal/HiFi/hifiasm-meta_zymo/asm.p_ctg.gfa ~/data/mydatalocal/HiFi/<Node>.gfa --scope aroundnodes --nodes <NODE> --distance 0
--> use a bash online to extract, name and sequence from that gfa graph:
cd ~/data/mydatalocal/HiFi/linear_contigs
awk '/^S/{print ">"$2"\n"$3}' <Node>.gfa > <Node>.fasta
--> use checkm