Reconstruction of ancient phage genomes from the human gut

downloading 72 Illumina metagenomic libraries of palaeofeces and human gut content samples (see: Supplementary Table S1)
trimming paired-end reads using Cutadapt (v.4.1)
filtering out human DNA using KneadData (v.0.12.0)
assembly reads from samples into contigs using Metaspades (v.3.15.5)
filtering out contigs shorter than 4kb and with coverage <20 using custom python script
mapping filtered reads to contigs using Bowtie2 (v.2.4.4)
sorting and indexing with SAMtools (v.1.14)
authenticating ancient contigs using Pydamage (v.0.70)
quality control in FastQC before and after downloading, Cutadapt, KneadData

Metadata in Suplementary Table S1 is coming from AncientMetagenomeDir, a community curated resource of lists of all published shotgun-sequenced ancient metagenomes.

The first step is installation snakemake following instructions from here.

Then you can download this repository and run our pipeline which automatically create specific environments using conda and download all 72 libraries. You can modify config.yaml from config folder to select different samples to analyze.

conda activate snakemake

git clone https://github.com/rozwalak/Ancient_gut_phages.git

cd Ancient_gut_phages/snake-download-preprocessing-assembly-authentication/workflow

#run on Slurm cluster, set your parameters in ../config/snakemake/slurm/config.yaml
snakemake --profile ../config/snakemake/slurm --use-conda

#run on local computer 
snakemake --cores 4 --use-conda

Viral contigs identification

We identified viral sequences in the collection of ancient contigs using three different tools:

Jaeger (v.1.1.0)
VIBRANT (v.1.2.1)
VirSorter2 (v.2.2.3)

In order to predict viral sequences after installing the necessary tools, we used the combined fasta file with all of the ancient contigs and the following commands:

Jaeger:
python inference.py -i input_file.fasta -o jaeger_output.fasta

VIBRANT:
python VIBRANT_run.py -i input_file.fasta -t 37 -folder vibrant_output

VirSorter2:
virsorter run --prep-for-dramv -w virsorter2_output -i input_file.fasta -j 37 --include-groups dsDNAphage,NCLDV,ssDNA,lavidaviridae all

Finally, we selected contigs classified as viral by at least 2 methods to downstream analyses.

Genomes quality assessment

We assessed quality of ancient viral contigs using CheckV (v.1.0.1)
After installation we run:

checkv download_database ./

checkv end_to_end input_file.fasta output_directory -t 8 -d PATH_to_DB

We considered ancient viral genomes classified as complete, high-quality, medium-quality, or fragments longer than 20kb. Additionally, we filtered out sequences with viral genes <= 1 and host genes >= 1 assigned by CheckV to clean putative contaminants.

python checkv2fasta.py

Viral contigs clustering

We clustered selected ancient viral genomes on the basis of 95% average nucleotide identity (ANI) and 85% alignment fraction of the shorter sequence, as recommended in MIUViG (Minimum information about an uncultivated virus genome)

For this purpose, we used custom scripts published in the CheckV repository

#First, create a blast+ database:
makeblastdb -in <my_seqs.fna> -dbtype nucl -out <my_db>

#Next, use megablast from blast+ package to perform all-vs-all blastn of sequences:
blastn -query <my_seqs.fna> -db <my_db> -outfmt '6 std qlen slen' -max_target_seqs 10000 -o <my_blast.tsv> -num_threads 32

#Next, calculate pairwise ANI by combining local alignments between sequence pairs:
anicalc.py -i <my_blast.tsv> -o <my_ani.tsv>

#Finally, perform UCLUST-like clustering using the MIUVIG recommended-parameters (95% ANI + 85% AF):
aniclust.py --fna <my_seqs.fna> --ani <my_ani.tsv> --out <my_clusters.tsv> --min_ani 95 --min_tcov 85 --min_qcov 0

Gene-sharing network

At first, we predicted protein-coding genes in ancient viral genomes using Prodigal (v.2.6.3):

prodigal -i ancient_viruses.fasta -a ancient_viruses_proteins.faa -d ancient_viruses_proteins.ffn -p meta -f gff > ancient_viruses_proteins.gff

Next, we downloaded IMG_VR_2022-09-20_6.1/IMGVR_all_proteins-high_confidence.faa.gz from IMG/VR (v.4 high-confidence genomes only) and searched with modern viral proteins from IMG/VR as a query in DIAMOND (v2.0.15):

gunzip IMGVR_all_proteins-high_confidence.faa.gz

diamond makedb --in ancient_viruses_proteins.faa -d ancient_viruses_proteins

diamond blastp -d ancient_viruses_proteins -q IMGVR_all_proteins-high_confidence.faa -o IMGVR_vs_all_ancient_viruses_proteins.tsv -p 96 -f 6 qseqid sseqid scovhsp pident length mismatch gapopen qstart qend sstart send evalue bitscore

We selected 10 modern genomes related to every ancient virus genome based on the highest number of shared proteins at a minimum of 50% query coverage and 50% identity in Diamond output using custom script:

python extractTOP10.py

To the collection of ancient phages and related genomes from IMG/VR, we added sequences downloaded from NCBI RefSeq or GenBank using an accession number from ICTV Virus Metadata Resources (VMR_20-190822_MSL37.2, created 08/31/2022) and finally we made gene-sharing network in vContact2 (v.0.11.3)

#predict genes for all viruses in the network
prodigal -i SelectedIMGVR_ICTV_AncientViruses.fasta -a SelectedIMGVR_ICTV_AncientViruses.faa -p meta

#prepare file gene_2_genome
vcontact2_gene2genome -p SelectedIMGVR_ICTV_AncientViruses.faa -o SelectedIMGVR_ICTV_AncientViruses_g2g.csv -s 'Prodigal-FAA'

#make gene-sharing network
vcontact2 --raw-proteins SelectedIMGVR_ICTV_AncientViruses.faa --proteins-fp SelectedIMGVR_ICTV_AncientViruses_g2g.csv --db 'None' --output-dir SelectedIMGVR_ICTV_AncientViruses_vContact2

We visualised the network (Fig. 2A) from vContact2 in Cytoscape (v.3.9.0) and refined it in Inkscape (v.1.2.2.)

Host prediction

We used four computational tools to predict hosts of ancient viral genomes:

#BLAST
cat hosts/*.fna > blastdb.fna

makeblastdb -in blastdb.fna -dbtype nucl

ls aMGVs/*.fna | xargs -I {} blastn -task blastn -query {} -db blastdb.fna -outfmt 6 -num_threads 16 -out blast/{}.txt

#PHIST
./phist.py --t 16 aMGVs/ hosts/ phist/

#VirHostMatcher-Net
./VirHostMatcher-Net.py -t 16 -q aMGVs -o vhm-net/

#RaFAH
perl RaFAH.pl --predict --genomes_dir aMGVs/ --extension .fasta --file_prefix RaFAH_aMGVs

Taxonomy assignment, clustering and phylogenetic analysis

We performed taxonomic assignment of ancient viral genomes using geNomad (v.1.3.3)

genomad download-database .

genomad annotate --cleanup ancient_viruses.fasta ancient_viruses_genomad genomad_db

To identify genus- and family-level ancient viral genomes, we clustered genomes using a combination of gene sharing and AAI following scripts from here (Nayfach et al., 2021)

Firstly, we selected ancient viral genomes assessed as high-quality or complete by CheckV, and we then added prokaryotic viruses from RefSeq (n = 4703; access: 30.01.2023) and IMG/VR sequences (n = 265) forming clusters (VC) with ancient viruses (see: Gene-sharing network) = ancient_viruses_RefSeq_selectedIMGVR.fasta

#Predict protein-coding genes
prodigal -i ancient_viruses_RefSeq_selectedIMGVR.fasta -a ancient_viruses_RefSeq_selectedIMGVR.faa -p meta

#Make DIAMOND database
diamond makedb --in ancient_viruses_RefSeq_selectedIMGVR.faa --db viral_proteins --threads 10

#Perform all-vs-all BLASTP
diamond blastp --query ancient_viruses_RefSeq_selectedIMGVR.faa --db viral_proteins --out blastp.tsv --outfmt 6 --evalue 1e-5 --max-target-seqs 10000 --query-cover 50 --subject-cover 50

#Compute AAI from BLAST results
python amino_acid_identity.py --in_faa ancient_viruses_RefSeq_selectedIMGVR.faa --in_blast blastp.tsv --out_tsv aai.tsv

#Amino acid identity is computed based on the average BLAST percent identity between all genes shared between each pair of genomes (E-value <1e-5)

#Filter edges and prepare MCL input
python filter_aai.py --in_aai aai.tsv --min_percent_shared 20 --min_num_shared 16 --min_aai 40 --out_tsv genus_edges.tsv
python filter_aai.py --in_aai aai.tsv --min_percent_shared 10 --min_num_shared 8 --min_aai 20 --out_tsv family_edges.tsv

#Here we're keeping edges between genomes with >=20% AAI and genomes with either 8 shared genes or at least 20% of shared genes (relative to both genomes)

#Perform MCL-based clustering
mcl genus_edges.tsv -te 8 -I 2.0 --abc -o genus_clusters.txt
mcl family_edges.tsv -te 8 -I 1.2 --abc -o family_clusters.txt

#In the output each row indicates the members belonging to each cluster (including singletons)

To visualise phylogenetic relationships of genus- and family-level groups, we generated a proteomic tree (Fig. 3) of 5017 viral sequences using ViPTreeGen (v1.1.3) and GraPhlAn (v1.1.3).

ViPTreeGen --ncpus 24 ancient_viruses_RefSeq_selectedIMGVR.fasta ancient_viruses_RefSeq_selectedIMGVR_VipTree

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
snake-download-preprocessing-assembly-authentication		snake-download-preprocessing-assembly-authentication
.gitignore		.gitignore
README.md		README.md
checkv2fasta.py		checkv2fasta.py
extractTOP10.py		extractTOP10.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reconstruction of ancient phage genomes from the human gut

Table of contents

Introduction

Downloading, preprocessing, assembly and authentication

Viral contigs identification

Genomes quality assessment

Viral contigs clustering

Gene-sharing network

Host prediction

Taxonomy assignment, clustering and phylogenetic analysis

About

Releases 1

Packages

Languages

rozwalak/Ancient_gut_phages

Folders and files

Latest commit

History

Repository files navigation

Reconstruction of ancient phage genomes from the human gut

Table of contents

Introduction

Downloading, preprocessing, assembly and authentication

Viral contigs identification

Genomes quality assessment

Viral contigs clustering

Gene-sharing network

Host prediction

Taxonomy assignment, clustering and phylogenetic analysis

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages