lab_hackathon2

Our project is to adapt Brooke's script for determining AMR reads per million from metagenomic data.

Here is the Google doc with project info.

Lab meeting presentation where we did the git tutorial.

GitHub site for practising GitHub pushes and pulls.

January 2024 Hackathon

Google doc plan for January 2024 hackathon

Preparing the local environment

**Note: as of 2024-01-30 there is a bug in the Nextflow script when only ONE pair of fastqs is in the input directory.**

git clone git@github.com:Read-Lab-Confederation/lab_hackathon2.git

conda create -c bioconda -n hack2 nextflow kma kraken2 csvtk fastp

conda activate hack2

Add data (create the data directory if it doesn't already exist)

cd lab_hackathon2/data/

wget https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/AMR_CDS

mkdir fastqs

cd fastqs

wget -O simulated_metagenome_R1.fastq.gz "https://zenodo.org/record/6543357/files/simulated_metagenome_1.fq.gz?download=1"

wget -O simulated_metagenome_R2.fastq.gz "https://zenodo.org/record/6543357/files/simulated_metagenome_2.fq.gz?download=1"

Copy small fastq files from the Michael David metagenome project

cp /mnt/tiramisu/emergent/projects/SEMAPHORE/data/fastqs/semaphore/microbiome/data_files/S.190905.00152_* ./

cd ../

wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz

tar -xvzf k2_standard_08gb_20230605.tar.gz
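
After these steps it may help to confirm that data/ holds what the pipeline expects. The following is a minimal Python sketch and is not part of the repository. The function name `missing_inputs` is hypothetical, and the sketch assumes the Kraken2 tarball unpacks `hash.k2d`, `opts.k2d`, and `taxo.k2d` directly into data/, so adjust the list if your layout differs. Run it from the repository root.

```python
from pathlib import Path

# Files and directories the steps above should leave in data/.
# Assumption: the Kraken2 tarball unpacks its *.k2d files directly into data/.
EXPECTED = ["AMR_CDS", "fastqs", "hash.k2d", "opts.k2d", "taxo.k2d"]


def missing_inputs(data_dir):
    """Return the expected entries that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_inputs("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("data/ looks complete")
```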

Agenda for Thursday 29th June 2023

Clone the GitHub repo
Create conda environments based on YAML. (Is there any other software we need to add to the environment, like Nextflow?)

conda create -c bioconda -n hack2 nextflow kma kraken2 csvtk

conda activate hack2

Download test data sets

wget "https://zenodo.org/record/6543357/files/simulated_metagenome_1.fq.gz?download=1"

wget "https://zenodo.org/record/6543357/files/simulated_metagenome_2.fq.gz?download=1"

Download kraken database and the AMR gene database

wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz

(move the kraken database outside of your github directory)

wget https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest/AMR_CDS

Create kma index

kma index -i AMR_CDS -o templates

## Agenda for Friday 30th June 2023

Downsample the synthetic datasets?
One group works on the R script: Rscript abundanceTable.r --input all_kma_numerators_raw.tab --input_denominator all_kraken_bacteria.tab
The other group works on converting the existing shell script to Nextflow
Docker container?
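
For the downsampling item, one option is to keep a random subset of read pairs while making sure both mates of a pair are kept or dropped together. The sketch below is an assumption about how this could be done in plain Python, not something in the repository. The names `fastq_records` and `downsample_pairs`, the keep fraction, and the seed are hypothetical.

```python
import gzip
import random
from itertools import islice


def fastq_records(handle):
    """Yield 4-line fastq records from an open text handle."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def downsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=1):
    """Write a random subset of read pairs, keeping R1 and R2 in sync."""
    rng = random.Random(seed)
    kept = 0
    with gzip.open(r1_in, "rt") as f1, gzip.open(r2_in, "rt") as f2, \
         gzip.open(r1_out, "wt") as o1, gzip.open(r2_out, "wt") as o2:
        for rec1, rec2 in zip(fastq_records(f1), fastq_records(f2)):
            # One random draw per pair decides the fate of both mates.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

For example, calling `downsample_pairs` on the two simulated metagenome files with `fraction=0.1` would keep roughly 10% of the pairs. A fixed seed makes the subset reproducible.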

## Andrei's work on the pipeline

Results and instructions:

a) wrapped the whole pipeline in nextflow

b) created new yaml for conda dependencies

c) to install the environment type: conda env create -f hack2_nextflow.yaml

d) to run the pipeline type: bash -i runAll.sh

e) current version uses local paths for references, so data/, bin/, main.nf, and runAll.sh should be in the same directory

f) data/ directory can be downloaded from s3://transfer-files-emory/amrKma/data.tar.gz

Need to modify:

a) edit kma indexing and alignment so that index files do not need to be copied into the data directory (currently not elegant)

Install

wget for yaml

Preprocessing assumptions

Fastq files have been filtered for quality, adaptor sequences, optical duplicates, and host reads

Dependencies

R, Python, KMA, CSVTK, Kraken2 (version ?), pandas

Inputs

- reads in paired fastq files (.gz extension only as of 2024.02.02)
- out directory
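
Because the pipeline expects paired gzipped fastqs, it can be useful to check the reads directory before a run. The sketch below is a hypothetical helper, not part of main.nf. It assumes pairs are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, like the renamed simulated test data, so other naming schemes would need a different pattern.

```python
import re
from pathlib import Path

# Assumed naming scheme: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
PAIR_RE = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")


def find_pairs(reads_dir):
    """Group fastq.gz files by sample and return (complete, incomplete).

    complete maps sample -> (R1 path, R2 path); incomplete lists samples
    with only one mate present.
    """
    mates = {}
    for path in sorted(Path(reads_dir).iterdir()):
        match = PAIR_RE.match(path.name)
        if match:
            mates.setdefault(match["sample"], {})[match["mate"]] = path
    complete = {s: (m["1"], m["2"]) for s, m in mates.items() if len(m) == 2}
    incomplete = sorted(s for s, m in mates.items() if len(m) < 2)
    return complete, incomplete
```

A call such as `find_pairs("data/fastqs")` returns the complete pairs and any samples missing a mate. If it finds exactly one complete pair, the input directory matches the single-pair case in the bug note under "Preparing the local environment".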

Example Usage

main.nf --reads {dir} --outdir {dir}

Outputs

| File Name | Explanation |
| --- | --- |
| abundance_plot.png | bar plot of RPKM by gene |
| gene_abundance_table.tsv | summarized RPKM values for all samples |
| kma_all_joinned.tsv | contains read counts for RPKM calculations (numerator) |
| kraken_all_report.tsv | bacterial read count for all samples (denominator) |

Next up: add an explanation of each column of the output files.

Formula

The formula to calculate gene relative abundances is given by:

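\[ \text{RPKM} = \frac{\text{Gene Reads} \times 10^9}{\text{Gene Length (bp)} \times \text{Bacterial Reads}} \]

Here Gene Reads is the KMA read count for a gene (numerator) and Bacterial Reads is the Kraken2 bacterial read count for the sample (denominator). This formula was adapted from Munk et al. 2022.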
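
As a worked illustration, the calculation can be read as reads per kilobase of gene per million bacterial reads. The minimal Python sketch below follows that reading. It assumes gene length is given in base pairs and the bacterial read count is the raw Kraken2 count. The function name `rpkm` and the numbers are made up for the example and are not taken from the pipeline's scripts.

```python
def rpkm(gene_reads, gene_length_bp, bacterial_reads):
    """Reads per kilobase of gene per million bacterial reads."""
    if gene_length_bp <= 0 or bacterial_reads <= 0:
        raise ValueError("gene length and bacterial read count must be positive")
    return gene_reads * 1e9 / (gene_length_bp * bacterial_reads)


# Made-up example: 50 reads on a 1,000 bp gene, 2,000,000 bacterial reads.
print(rpkm(50, 1000, 2_000_000))  # 25.0
```

The factor of 10^9 combines the per-kilobase scaling (10^3) with the per-million-reads scaling (10^6).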

Citations

Kraken

Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/doi-number

Formula

Munk, P., Brinch, C., Møller, F.D. et al. Genomic analysis of sewage from 101 countries reveals global landscape of antimicrobial resistance. Nat Commun 13, 7251 (2022). https://doi.org/doi-number
