Our project is to adapt Brooke's script for determining AMR reads per million from meta genomic data
Here is the the google doc with project info.
Lab meeting presentation where we did git tutorial.
Github site for practising github pushes and pulls.
Google doc plan for January 2024 hackathon
Preparing the local environment
** Note: as of 2024-01-30 there is a bug in the nextflow script when only ONE pair of fastqs is in the input directory**≈y
git clone [email protected]:Read-Lab-Confederation/lab_hackathon2.git
conda create -c bioconda -n hack2 nextflow kma kraken2 csvtk fastp
conda activate hack2
add data (create data directory if it doesnt already exist)
cd lab_hackathon2/data/
wget https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/AMR_CDS
mkdir fastqs
cd fastqs
wget -O simulated_metagenome_R1.fastq.gz https://zenodo.org/record/6543357/files/simulated_metagenome_1.fq.gz?download=1
wget -O simulated_metagenome_R2.fastq.gz https://zenodo.org/record/6543357/files/simulated_metagenome_2.fq.gz?download=1
cp /mnt/tiramisu/emergent/projects/SEMAPHORE/data/fastqs/semaphore/microbiome/data_files/S.190905.00152_* ./
cd ../
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
tar -xvzf k2_standard_08gb_20230605.tar.gz
- Clone the github repo
- Create conda environments based on YAML. (Any other softwhere we need to add to the environment , like nextflow?)
conda create -c bioconda -n hack2 nextflow kma kraken2 csvtk
conda activate hack2
- Download test data sets
wget https://zenodo.org/record/6543357/files/simulated_metagenome_1.fq.gz?download=1
wget https://zenodo.org/record/6543357/files/simulated_metagenome_2.fq.gz?download=1
- Download kraken database and the AMR gene database
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
(move the kraken database outside of your github directory)
wget https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/AMR_CDS
Create kma index
kma index -i AMR_CDS -o templates
##Agenda for friday 29th June 2023
- Downsample the synthetic datasets?
- One group work on Rscript
Rscript abundanceTable.r --input all_kma_numerators_raw.tab --input_denominator all_kraken_bacteria.tab
- The other group work on converting the existing shell script to nextflow
- Docker container?
##Andrei's work on the pipeline
- Results and instructions:
a) wrapped the whole pipeline in nextflow
b) created new yaml for conda dependencies
c) to install the environment type: conda env create -fhack2_nextflow.yaml
d) to run the pipeline type: bash -i runAll.sh
e) current version uses local paths for references, so data/, bin/, main.nf, and runAll.sh should be in the same directory
f) data/ directory can be downloaded from s3://transfer-files-emory/amrKma/data.tar.gz
- Need to modify:
a) edit kma indexing and alignment so that index files would not need to be copped in data directory (currently not elegant)
wget for yaml
Fastq files have been filtered for quality, adaptor sequences, optical duplicates, and host reads
R, Python, KMA, CSVTK, Kraken2 (version ?), pandas
- reads in paired fastq files (.gz extension only as of 2024.02.02)
- out directory
main.nf --reads {dir} --outdir {dir}
File Name | Explanation |
---|---|
abundance_plot.png | bar plot of RPKM by gene |
gene_abundance_table.tsv | summarized RPKM values for all samples |
kma_all_joinned.tsv | contains read counts for RPKM calculations (numerator) |
kraken_all_report.tsv | bacterial read count for all samples (denominator) |
next up: add an explanation of each column of output file
The formula to calculate gene relative abundances is given by:
[ \frac{{\text{{Gene Reads}}}}{{\text{{Length of gene per kb}}}} \times (\text{{Bacteria Depth}}) \times 10^9 ]
This formula was adapted from Munk et al. 2022
Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/doi-number
Munk, P., Brinch, C., Møller, F.D. et al. Genomic analysis of sewage from 101 countries reveals global landscape of antimicrobial resistance. Nat Commun 13, 7251 (2022). https://doi.org/doi-number