Merge pull request #3 from nbokulich/nb-edits
updating text: style edits
nbokulich authored Nov 19, 2024
2 parents d203599 + b78489e commit 3115416
Showing 15 changed files with 56 additions and 43 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -163,3 +163,5 @@ cython_debug/

moshpit_docs/_build
.idea/

.DS_Store
7 changes: 4 additions & 3 deletions moshpit_docs/chapters/00_data_retrieval.md
@@ -13,9 +13,10 @@ kernelspec:
---
(data-retrieval)=
# Data retrieval
The dataset which we are using in this tutorial is available through the Sequence Read Archive. To retrieve it we will
use the q2-fondue plugin: we only need to provide a list of accession IDs which we are interested in downloading -
everything else will be taken care of for us.
The dataset used in this tutorial is available through the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) (SRA).
To retrieve it we will use the [q2-fondue plugin](https://github.com/bokulich-lab/q2-fondue) for programmatic access to
sequences and metadata from SRA; we only need to provide a list of accession IDs to download - q2-fondue will take care of
the rest.
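A minimal sketch of such a call is shown below; the accession-list file name and cache keys are placeholders, and the
exact parameter and output names should be checked against `qiime fondue get-all --help` for your q2-fondue version:
```{code-cell}
qiime fondue get-all \
  --m-accession-ids-file ./sra_accession_ids.tsv \
  --p-email your.name@example.com \
  --o-metadata ./cache:fondue_metadata \
  --o-paired-reads ./cache:reads \
  --o-single-reads ./cache:single_reads \
  --o-failed-runs ./cache:failed_runs \
  --verbose
```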

```{note}
You need to provide an e-mail address when running this command - this is required by the NCBI as a way to
5 changes: 3 additions & 2 deletions moshpit_docs/chapters/00_setup.md
@@ -14,8 +14,9 @@ kernelspec:
(setup)=
# Setup
Before we dive into the tutorial, let's make sure we have all the necessary components in place. Make sure you have a
working QIIME 2 metagenome environment available - please follow the instructions from the official [QIIME 2 documentation](https://docs.qiime2.org/2024.5/install/native/)
to learn more.
working QIIME 2 metagenome environment available - please follow the instructions from the official
[QIIME 2 documentation](https://docs.qiime2.org/2024.10/install/native/#qiime-2-metagenome-distribution) to install
the QIIME 2 "Metagenome Distribution".

In this tutorial we will be storing all the data in the QIIME 2 cache. To learn more about how the cache works you can
consult [this](https://dev.qiime2.org/latest/api-reference/cache) page of the QIIME 2 developer documentation. You should create a single cache
to be used throughout the tutorial.
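One way to create it is with the `cache-create` tool; a minimal sketch, assuming the cache should live at `./cache`
(the path used by the cache keys in the following sections):
```{code-cell}
qiime tools cache-create --cache ./cache
```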
10 changes: 5 additions & 5 deletions moshpit_docs/chapters/01_filtering/host-filtering.md
@@ -13,14 +13,14 @@ kernelspec:
---
# Host read removal
There are a few different options to perform host read removal in QIIME 2: a more generic one using the `filter-reads` action
and a more specific one using the `filter-reads-pangenome` action. Below you can see how to use both of them. In this tutorial we will
and a more specific one using the `filter-reads-pangenome` action. Below you can learn how to use both of them. In this tutorial we will
use the `filter-reads-pangenome` action to remove human reads from the dataset.

## Removal of contaminating reads
Removal of contaminating reads can generally be done by mapping the reads to a reference database and filtering out the reads
that map to it. In QIIME 2 this can be done by using the `filter-reads` action from the `quality-control` plugin. Before the filtering
we need to construct the index of the reference database which will be used by Bowtie 2:
- start with the FASTA files contaning the reference sequences - we will import them into a QIIME 2 artifact:
that map to it. In QIIME 2 this can be done by using the `filter-reads` action from the `quality-control` plugin. Before filtering
we need to construct the index of the reference database that will be used by Bowtie 2:
- start with the FASTA files containing the reference sequences - we will import them into a QIIME 2 artifact:
```{code-cell}
qiime tools cache-import \
--cache ./cache \
@@ -44,7 +44,7 @@ qiime quality-control filter-reads \

## Human host reads
Contaminating human reads can also be filtered out using the approach shown above by providing a human reference genome.
Since a single human reference genome is not enough to cover all the human genetic diversity, it is now recommended to use a
Since a single human reference genome is not enough to cover all the human genetic diversity, it is recommended to use a
collection of genomes represented by the human pangenome (__CIT). We have built a new QIIME 2 action `filter-reads-pangenome`
which allows us to first fetch the human pangenome sequence, combine it with the GRCh38 reference genome, build a combined
Bowtie 2 index and, finally, filter the reads against it. In addition to the filtered reads, the action will also return the generated index.
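A sketch of how this call could look is shown below; the parameter and output names here are assumptions based on the
description above, so please verify them against `qiime moshpit filter-reads-pangenome --help`:
```{code-cell}
qiime moshpit filter-reads-pangenome \
  --i-reads ./cache:reads \
  --p-threads 8 \
  --o-filtered-reads ./cache:reads_filtered \
  --o-reference-index ./cache:pangenome_index \
  --verbose
```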
12 changes: 7 additions & 5 deletions moshpit_docs/chapters/01_filtering/intro.md
@@ -13,8 +13,10 @@ kernelspec:
---
(quality-control)=
# Quality control
Just like any other NGS experiment, shotgun metagenomics data should be quality controlled before any downstream analysis.
The filtering steps may include adapter removal, quality trimming, and filtering out low-quality reads. Moreover, metagenomic
reads may contain host DNA, which should be removed. QIIME 2 provides functionality to address some of those issues - the
MOSHPIT plugin suite only expands on those by focusing more on host DNA removal. The next sections contain a brief overview
of the filtering steps which can be done using QIIME 2 and MOSHPIT.
As with any other NGS experiment, metagenome data should be quality controlled before any downstream analysis.
The filtering steps may include adapter removal, quality trimming, and filtering out low-quality reads. Moreover,
depending on the sample type and preparation procedures, metagenomic reads may contain host DNA, which should be
removed. Other QIIME 2 plugins already provide generalized functionality to address quality filtering/control of
next-generation sequencing data - the MOSHPIT plugin suite expands on these by focusing more on host DNA removal
from metagenome data. The next sections contain a brief overview of the filtering steps which can be done using
QIIME 2 and MOSHPIT.
2 changes: 1 addition & 1 deletion moshpit_docs/chapters/01_filtering/quality-filtering.md
@@ -15,7 +15,7 @@ kernelspec:
## Quality overview
We can get an overview of the read quality by using the `summarize` action from the `demux` QIIME 2 plugin. This command
will generate a visualization of the quality scores at each position. You can learn more about this action in the [QIIME 2
documentation](https://docs.qiime2.org/2024.5/plugins/available/demux/summarize/).
documentation](https://docs.qiime2.org/2024.10/plugins/available/demux/summarize/).
```{code-cell}
qiime demux summarize \
--i-data ./cache:reads \
2 changes: 1 addition & 1 deletion moshpit_docs/chapters/02_mag_reconstruction/abundance.md
@@ -13,7 +13,7 @@ kernelspec:
---
# MAG abundance estimation
Once we recover MAGs from metagenomic data, we may be interested in estimating their abundance in the samples. We can do
it by mapping the original reads to the derepliacted MAGs and calculating the abundance based on the read mapping results.
it by mapping the original reads to the dereplicated MAGs and calculating the abundance based on the read mapping results.
There are a couple of ways to estimate MAG abundance, such as RPKM (Reads Per Kilobase per Million mapped reads) and TPM
(Transcripts Per Million). Here we will use TPM to estimate the abundance of each MAG in all samples.
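For intuition, TPM first normalizes the read count of each MAG by its genome length and then scales by the sample total,
so that values sum to one million within each sample:
```{math}
\mathrm{TPM}_i = 10^6 \cdot \frac{r_i / \ell_i}{\sum_j r_j / \ell_j}
```
where $r_i$ is the number of reads mapped to MAG $i$ and $\ell_i$ is its genome length.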

4 changes: 2 additions & 2 deletions moshpit_docs/chapters/02_mag_reconstruction/dereplication.md
@@ -14,8 +14,8 @@ kernelspec:
(dereplication)=
# MAG set dereplication
Depending on the application, it may be necessary to dereplicate the set of MAGs to remove redundancy and retain only
unique genome representatives. Our workflow includes a dereplication step which uses any genome distance matrix, which is
used to find clusters of similar genomes (based on a specific similarity threshold) and identify the most representative
unique genome representatives. Our workflow includes a dereplication step that can use any genome distance matrix to
find clusters of similar genomes (based on a specific similarity threshold) and identify the most representative
MAG (in our case, it will be the longest genome in the cluster). Here we use Sourmash to generate the distance matrix
but any other tool could also be used.
## Compute MinHash signatures with Sourmash
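The first step could look roughly like the sketch below; the action and parameter names are assumptions based on the
q2-sourmash plugin and should be checked against `qiime sourmash compute --help`. The resulting signatures are then
compared with `qiime sourmash compare` to produce the distance matrix used for clustering:
```{code-cell}
qiime sourmash compute \
  --i-sequence-file ./cache:mags \
  --p-ksizes 31 \
  --p-scaled 1000 \
  --o-min-hash-signature ./cache:mags_minhash
```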
13 changes: 7 additions & 6 deletions moshpit_docs/chapters/02_mag_reconstruction/reconstruction.md
@@ -21,8 +21,8 @@ memory resources before running these commands.
## Assemble contigs with MEGAHIT
The first step in recovering MAGs is genome assembly itself. There are many genome assemblers available, two of which
you can use through the q2-assembly plugin - here, we will use MEGAHIT. MEGAHIT takes short DNA sequencing reads,
constructs a simplified De Bruijn graph, and generates longer contiguous sequences called contigs, providing valuable
genetic information for the next steps of our analysis.
constructs a simplified [De Bruijn graph](https://en.wikipedia.org/wiki/De_Bruijn_graph), and generates longer contiguous
sequences called contigs, providing valuable genetic information for the next steps of our analysis.
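For intuition: in one common formulation, with $k = 3$ the read `ACGTC` yields the k-mers `ACG`, `CGT` and `GTC`; each
k-mer becomes a node in the graph, k-mers overlapping by $k-1$ bases are connected by edges, and contigs correspond to
unambiguous paths such as `ACG → CGT → GTC`, which spells out `ACGTC` again.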

- The `--p-presets` parameter specifies the preset mode for MEGAHIT. In this case, it's set to "meta-sensitive" for metagenomic data.
- The `--p-cpu-threads` parameter specifies the number of CPU threads to use during assembly.
@@ -46,7 +46,7 @@ qiime assembly assemble-megahit \

## Contig QC with QUAST
Once the reads are assembled into contigs, we can use QUAST to evaluate the quality of our assembly. There are many
metrics which can be used for that purpose but here we will focus on the two most popular metrics:
metrics that can be used for that purpose but here we will focus on the two most popular metrics:
- **N50**: represents the contiguity of a genome assembly. It's defined as the length of the contig (or scaffold) at
which 50% of the entire genome is covered by contigs of that length or longer - the higher this number, the better.
- **L50**: represents the number of contigs required to cover 50% of the genome's total length - the smaller this number, the better.
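As a quick worked example: for an assembly with contigs of lengths 10, 8, 6, 4 and 2 kb (30 kb in total), half of the
total length is 15 kb; the two longest contigs (10 + 8 = 18 kb) are the first to reach it, so N50 = 8 kb and L50 = 2.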
@@ -102,10 +102,10 @@ qiime moshpit bin-contigs-metabat \
--o-unbinned-contigs ./cache:unbinned_contigs \
--verbose
```
This tep generated a couple artifacts:
This step generated several artifacts:

- `mags`: these are our actual MAGs, per sample.
- `contig_map`: this is a mapping between MAG IDs and IDs of contigs which belong to a given MAG.
- `contig_map`: this is a mapping between MAG IDs and IDs of contigs that belong to a given MAG.
- `unbinned_contigs`: these are all the contigs that could not be assigned to any particular MAG.
From now on, we will focus on the `mags`.

@@ -141,11 +141,12 @@ The `--p-lineage-dataset bacteria_odb10` parameter specifies the particular line
the bacteria_odb10 dataset. This is a standard database for bacterial genomes.

Your visualization should look similar to [this one](https://view.qiime2.org/visualization/?src=https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/moshpit_docs/data/mags.qzv).

## Filter MAGs
This step filters MAGs based on completeness. In this example, we filter out any MAGs with completeness below 50%.
The filtering process ensures only high-quality genomes are kept for downstream analysis.
```{tip}
We recommed this step to be done before dereplication (as in this example). Alternatively, we can also use the
We recommend performing this step before dereplication (as in this example). Alternatively, we can also use the
[dereplicated set](dereplication) and filter this one using `qiime moshpit filter-derep-mags`.
```
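A sketch of the filtering call; how the BUSCO results are passed and the exact name of the completeness column are
assumptions here, so verify them against `qiime moshpit filter-mags --help`:
```{code-cell}
qiime moshpit filter-mags \
  --i-mags ./cache:mags \
  --m-metadata-file ./cache:busco_results \
  --p-where 'complete>50' \
  --o-filtered-mags ./cache:mags_filtered \
  --verbose
```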

2 changes: 1 addition & 1 deletion moshpit_docs/chapters/03_taxonomic_classification/intro.md
@@ -48,7 +48,7 @@ For more information on Kraken 2, consult [Wood et al., 2019](https://genomebiol
```

## Kaiju: protein-based classification
Kaiju compares reads by translating DNA sequences into protein sequences (BLASTx-like). This allows Kaiju to identify
Kaiju compares reads by translating DNA sequences into protein sequences (similar to BLASTx). This allows Kaiju to identify
organisms accurately when nucleotide sequences are too divergent to be identified with DNA-based methods. Kaiju uses a
fast exact matching algorithm based on Burrows-Wheeler Transform (BWT) and FM-index to align translated DNA reads
against a reference database of protein sequences.
2 changes: 1 addition & 1 deletion moshpit_docs/chapters/03_taxonomic_classification/mags.md
@@ -12,7 +12,7 @@ kernelspec:
name: python3
---
# Taxonomic classification of MAGs
Kraken 2 can also be used to obtain a classification of metagenome-assembled genomes (MAGs). In this tutorial we use this
Kraken 2 can also be used to taxonomically classify metagenome-assembled genomes (MAGs). In this tutorial we use this
tool to classify a subset of dereplicated MAGs but the same approach can be used for the entire set of MAGs contained in
the `SampleData[MAGs]` or `SampleData[Contigs]` artifacts.
```{code-cell}
10 changes: 5 additions & 5 deletions moshpit_docs/chapters/03_taxonomic_classification/reads.md
@@ -14,7 +14,7 @@ kernelspec:
(kraken-reads)=
# Taxonomic classification of reads
In this section we will focus on the taxonomic classification of shotgun metagenomic reads using two different tools: Kraken 2 and Kaiju.
We will use the data which we got from the steps described in the [data retrieval section](../00_data_retrieval.md).
We will use the data obtained in the [data retrieval section](../00_data_retrieval.md).

## Approach 1: Kraken 2
Before we can use Kraken 2, we need to build or download a database. We will use the `build-kraken-db` action to fetch the PlusPF database
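A sketch of fetching a prebuilt database; the collection name and the output parameter names are assumptions to check
against `qiime moshpit build-kraken-db --help`:
```{code-cell}
qiime moshpit build-kraken-db \
  --p-collection pluspf \
  --o-kraken2-database ./cache:kraken_db \
  --o-bracken-database ./cache:bracken_db \
  --verbose
```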
@@ -43,7 +43,7 @@ qiime moshpit classify-kraken2 \

```{seealso}
[Bracken](https://ccb.jhu.edu/software/bracken/) is a related tool that additionally estimates relative abundances of species or genera to adjust for
genome size which the reads originated from. In order to use this tool we need the Bracken database that was fetched in the first step.
the genome size of the organisms from which each read originated. In order to use this tool we need the Bracken database that was fetched in the first step.
```

```{code-cell}
@@ -57,7 +57,7 @@ qiime moshpit estimate-bracken \
--o-reports ./cache:bracken_reports
```

To remove the unclassified read fraction we can use the `filter-table` action from the `taxa` QIIME 2 plugin:
To remove the unclassified read fraction we can use the `filter-table` action from the `q2-taxa` QIIME 2 plugin:
```{code-cell}
qiime taxa filter-table \
--i-table ./cache:bracken_ft \
@@ -67,8 +67,8 @@ qiime taxa filter-table \
```

## Approach 2: Kaiju
Similarly to Kraken 2, Kaiju requires a reference database to perform the classification. We will use the `fetch-kaiju-db`
action to download the [nr_euk](https://bioinformatics-centre.github.io/kaiju/downloads.html) database that includes both,
Similarly to Kraken 2, Kaiju requires a reference database to perform taxonomic classification. We will use the `fetch-kaiju-db`
action to download the [nr_euk](https://bioinformatics-centre.github.io/kaiju/downloads.html) database that includes both
prokaryotes and eukaryotes (more info on the taxa [here](https://github.com/bioinformatics-centre/kaiju/blob/master/util/kaiju-taxonlistEuk.tsv)).
```{code-cell}
qiime moshpit fetch-kaiju-db \
5 changes: 3 additions & 2 deletions moshpit_docs/chapters/04_functional_annotation/intro.md
@@ -27,15 +27,16 @@ extracted from complex microbial communities, bypassing the need to culture the

This process provides insights into the genes that code for enzymes, transporters, and other proteins critical to the
survival and function of the microbes in various ecosystems. Annotating these genomes allows for the study of their
contributions to nutrient cycles, disease processes, or specialized ecological functions.
contributions to nutrient cycles, disease processes, or specialized ecological functions, to name only a few examples.

This workflow outlines the step-by-step process for functional annotation of MAGs or contigs using tools like EggNOG and
the Diamond aligner in QIIME 2.

```{note}
Functional annotation can be performed on fully reconstructed **MAGs** or directly on **contigs** (the contiguous sequences
assembled from sequencing reads). Annotating **contigs** can provide early insights into important functional genes even
before complete genomes are assembled.
before complete genomes are assembled. Annotating **MAGs** has the added benefit of showing how these annotated genes are
connected and organized within a single genome.
In this tutorial, we will focus on functional annotation of our previously reconstructed MAGs (see **Recovery of MAGs section**).
```
11 changes: 6 additions & 5 deletions moshpit_docs/chapters/04_functional_annotation/mags.md
@@ -13,7 +13,7 @@ kernelspec:
---
# Functional annotation
## Required databases
In order to perform the functional annotation, we will need a couple of different reference databases. Below you will find instructions on how to download these databases using respective QIIME 2 actions.
In order to perform the functional annotation, we will need a couple of different reference databases. Below you will find instructions on how to download these databases using MOSHPIT.

```{code-cell}
qiime moshpit fetch-diamond-db \
@@ -57,7 +57,7 @@ qiime moshpit eggnog-annotate \
## Extract annotations
This method extracts a specific annotation from the table generated by EggNOG and calculates its frequencies across all MAGs.
```{note}
The `qiime moshpit extract-annotations` command allows us to extract specific types of functional annotations, such as
The `qiime moshpit extract-annotations` method allows us to extract specific types of functional annotations, such as
**CAZymes**, **KEGG pathways**, **COG categories**, or other functional elements, and calculate their frequency across
all dereplicated MAGs.
@@ -75,7 +75,7 @@ qiime moshpit extract-annotations \
## Multiply tables
This step simply calculates the dot product of the `mags_derep_ft` and `caz_annot_ft` feature tables. This is useful for
combining the annotation data (e.g., **CAZymes**) with MAG abundance to determine how specific functional annotations
are distributed across MAGs.
are distributed across MAGs, and for estimating the total frequency of each annotation in each sample.
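Conceptually, if $F$ is the samples $\times$ MAGs abundance table and $A$ is the MAGs $\times$ annotations frequency table,
their product is a samples $\times$ annotations table:
```{math}
(FA)_{sc} = \sum_{m} F_{sm} \, A_{mc}
```
where $F_{sm}$ is the abundance of MAG $m$ in sample $s$ and $A_{mc}$ is the frequency of annotation $c$ in MAG $m$.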

```{code-cell}
qiime moshpit multiply-tables \
@@ -86,15 +86,16 @@ qiime moshpit multiply-tables \
```

## Let's have a look at our CAZyme functional diversity!
We will start by calculating Bray-curtis beta diversity matrix.
We will start by calculating a Bray-Curtis dissimilarity matrix to measure the dissimilarity between samples, based on the
observed frequency of different CAZyme annotations in each sample.
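For reference, for two samples $x$ and $y$ with annotation frequencies $x_i$ and $y_i$, the Bray-Curtis dissimilarity is:
```{math}
d_{BC}(x, y) = \frac{\sum_i |x_i - y_i|}{\sum_i (x_i + y_i)}
```
which ranges from 0 (identical composition) to 1 (no shared annotations).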
```{code-cell}
qiime diversity beta \
--i-table ./cache:caz_ft \
--p-metric braycurtis \
--o-distance-matrix ./cache:caz_braycurtis_dist
```

Then, we will perform PCoA from the obtained Bray-curtis matrix.
Next, we will perform principal coordinate analysis (PCoA) on the obtained Bray-Curtis matrix.
```{code-cell}
qiime diversity pcoa \
--i-distance-matrix ./cache:caz_braycurtis_dist \
12 changes: 8 additions & 4 deletions moshpit_docs/intro.md
@@ -1,9 +1,13 @@
# MOSHPIT tutorial

Welcome to the MOSHPIT tutorial! This tutorial will guide you through the process of analyzing metagenomic data using
the QIIME 2 framework and the MOSHPIT plugin suite. The tutorial is divided into several chapters, each focusing on a
different aspect of metagenomic data analysis. We will use a small published dataset to demonstrate the capabilities of
most of the methods available in MOSHPIT.
MOSHPIT (MOdular SHotgun metagenome Pipelines with Integrated provenance Tracking) is a suite of plugins for whole
metagenome assembly and analysis as part of the microbiome multi-omics data science platform [QIIME 2](https://qiime2.org/).
MOSHPIT enables flexible, modular, fully reproducible workflows for read-based or assembly-based analysis of
metagenome data.

This tutorial will guide you through the process of analyzing metagenomic data using the QIIME 2 framework and MOSHPIT.
The tutorial is divided into several chapters, each focusing on a different aspect of metagenomic data analysis.
We will use a small published dataset to demonstrate the capabilities of most of the methods available in MOSHPIT.

We will begin by setting up our computational environment and fetching all the necessary data (see [Setup](setup) and
[Data retrieval](data-retrieval)). Then, we will move to quality control and filtering of the raw reads (see
