Read mapping and counts tables

This walkthrough goes through read mapping and generating counts tables.

What you'll end up with at the end of this are counts tables at the contig, MAG, SLC, ORF, and SSO levels.

It is assumed you've completed either the end-to-end metagenomics or the recovering viruses from metatranscriptomics walkthroughs.

We focus on global read mapping in this tutorial but local is similar.

Steps:

Create a global index of genomes and their gene models
Map reads to global reference and create base counts tables
Merge the counts tables for all the samples

Conda Environment: conda activate VEBA. Use this for intermediate scripts.

1. Create a global index of genomes and their gene models

Here we are going to concatenate all of the binned contigs (i.e., MAGs) and their respective gene models (i.e., GFF files) then index using Bowtie2.

# Set the number of threads
N_JOBS=16

# Set a useful name for log files
N=index-global
rm -f logs/${N}.*

# Get list to all genomes
ls veba_output/binning/*/*/output/genomes/*.fa > veba_output/misc/genomes.list

GENOMES=veba_output/misc/genomes.list

# Get list to all gene models
ls veba_output/binning/*/*/output/genomes/*.gff > veba_output/misc/gene_models.list

GENE_MODELS=veba_output/misc/gene_models.list

# Set up command
CMD="source activate VEBA && veba --module index --params \"-r ${GENOMES} -g ${GENE_MODELS} -o veba_output/index/global/ -p ${N_JOBS}\""

# Either run this command or use SunGridEnginge/SLURM

The following output files will produced:

reference.fa.gz - Concatenated reference fasta
reference.fa.gz.*.bt2 - Bowtie2 index of reference fasta
reference.gff - Concatenated gene models
reference.saf - SAF format for reference

2. Map reads to global reference and create base counts tables

Here we are map all of the reads to the global reference and create base counts tables for contigs and ORFs using featureCounts.

Note: Versions prior to v1.1.2 require the output directory to include the sample name. (e.g., -o veba_output/mapping/global/${ID} where -n is not used. In v1.1.2+, the output directory is automatic (e.g., veba_output/mapping/global/ and -n ${ID} are used)

# If you have run the cluster.py module you can use this:
SCAFFOLDS_TO_MAGS=veba_output/cluster/output/global/scaffolds_to_mags.tsv
SCAFFOLDS_TO_SLCS=veba_output/cluster/output/global/scaffolds_to_slcs.tsv
PROTEINS_TO_ORTHOGROUPS=veba_output/cluster/output/global/proteins_to_orthogroups.tsv
MAGS_TO_SLCS=veba_output/cluster/output/global/mags_to_slcs.tsv

# If you skipped the clustering, you can oncatenate all of the scaffolds to bins from all of the domains
cat veba_output/binning/*/*/output/scaffolds_to_bins.tsv > veba_output/misc/all_genomes.scaffolds_to_mags.tsv
SCAFFOLDS_TO_MAGS=veba_output/misc/all_genomes.scaffolds_to_mags.tsv

# Set a lower number of threads since we are running for each sample
N_JOBS=2

# Get the index directory
INDEX_DIRECTORY=veba_output/index/global/output/

for ID in $(cat identifiers.list); do
	N="mapping-global__${ID}";
	rm -f logs/${N}.*
	
	# Get the forward and reverse reads
	R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz
	R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz
	
	# Specify the output directory
	OUT_DIR=veba_output/mapping/global
	
	# Set up command	
	CMD="source activate VEBA && veba --module mapping --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -x ${INDEX_DIRECTORY} --scaffolds_to_bins ${SCAFFOLDS_TO_MAGS}\"" #--scaffolds_to_clusters ${SCAFFOLDS_TO_SLCS} --proteins_to_orthogroups ${PROTEINS_TO_ORTHOGROUPS}
	
	# Either run this command or use SunGridEnginge/SLURM

	done

The following output files will produced for each sample:

counts.orfs.tsv.gz - ORF-level counts table
counts.scaffolds.tsv.gz - Contig-level counts table
mapped.sorted.bam - Sorted BAM file
unmapped_1.fastq.gz - Unmapped reads (forward)
unmapped_2.fastq.gz - Unmapped reads (reverse)
genome_spatial_coverage.tsv.gz - Spatial coverage of MAG within sample (based on samtools coverage)

3. Merge the counts tables for all the samples:

We have individual counts vectors for ORFs and contigs per sample. Now we need to merge the contig counts and ORF counts separately. While we are at it, let's aggregate the contig counts into MAGs and SLCs then aggregate the ORF counts into SSPCs (previously referred to as SSO). We are going to use the merge_contig_mapping.py and merge_orf_mapping.py scripts installed with VEBA. If for some reason they aren't installed, then download them here: https://github.com/jolespin/veba/tree/main/src/scripts

# Get mapping directory
MAPPING_DIRECTORY=veba_output/mapping/global

# Set output directory (this is default)
OUT_DIR=veba_output/counts

# Merge contig-level counts (excu
merge_contig_mapping.py -m ${MAPPING_DIRECTORY} -c ${MAGS_TO_SLCS}  -i ${SCAFFOLDS_TO_MAGS} -o ${OUT_DIR}

# Merge ORF-level counts
merge_orf_mapping.py -m ${MAPPING_DIRECTORY} -c ${PROTEINS_TO_ORTHOGROUPS} -o ${OUT_DIR}

The following output files will produced:

scaffold_to_mag.tsv - Identifier mapping [id_contig][id_mag]
mag_to_slc.tsv - Identifier mapping [id_mag][id_slc]
orf_to_orthogroup.tsv - Identifier mapping [id_orf][id_sso]
X_contigs.tsv.gz - Counts tables gzipped and tab-delimited (samples, contigs)
X_mags.tsv.gz - Counts tables gzipped and tab-delimited (samples, MAGs)
X_orfs.tsv.gz - Counts tables gzipped and tab-delimited (samples, ORFs)
X_orthogroups.tsv.gz - Counts tables gzipped and tab-delimited (samples, SSOs)
X_slcs.tsv.gz - Counts tables gzipped and tab-delimited (samples, SLCs)

Next steps:

Now it's time to analyze the data using compositional data analysis (CoDA).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_mapping_and_counts_tables.md

read_mapping_and_counts_tables.md

Read mapping and counts tables

Steps:

1. Create a global index of genomes and their gene models

2. Map reads to global reference and create base counts tables

3. Merge the counts tables for all the samples:

Next steps:

Recommended reading for CoDA:

Files

read_mapping_and_counts_tables.md

Latest commit

History

read_mapping_and_counts_tables.md

File metadata and controls

Read mapping and counts tables

Steps:

1. Create a global index of genomes and their gene models

2. Map reads to global reference and create base counts tables

3. Merge the counts tables for all the samples:

Next steps:

Recommended reading for CoDA: