Merge pull request #3 from nbokulich/nb-edits
updating text: style edits
nbokulich authored Nov 19, 2024
2 parents d203599 + b78489e commit 3115416
Showing 15 changed files with 56 additions and 43 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -163,3 +163,5 @@ cython_debug/

moshpit_docs/_build
.idea/

.DS_Store
7 changes: 4 additions & 3 deletions moshpit_docs/chapters/00_data_retrieval.md
@@ -13,9 +13,10 @@ kernelspec:
---
(data-retrieval)=
# Data retrieval
The dataset which we are using in this tutorial is available through the Sequence Read Archive. To retrieve it we will
use the q2-fondue plugin: we only need to provide a list of accession IDs which we are interested in downloading -
everything else will be taken care of for us.
The dataset used in this tutorial is available through the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) (SRA).
To retrieve it we will use the [q2-fondue plugin](https://github.com/bokulich-lab/q2-fondue) for programmatic access to
sequences and metadata from SRA; we only need to provide a list of accession IDs to download - q2-fondue will take care of
the rest.
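A minimal sketch of such a call is shown below; the accession-list file name and cache keys are placeholders, and the
exact parameter and output names should be checked against `qiime fondue get-all --help` for your q2-fondue version:
```{code-cell}
qiime fondue get-all \
  --m-accession-ids-file ./sra_accession_ids.tsv \
  --p-email your.name@example.com \
  --o-metadata ./cache:fondue_metadata \
  --o-paired-reads ./cache:reads \
  --o-single-reads ./cache:single_reads \
  --o-failed-runs ./cache:failed_runs \
  --verbose
```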

```{note}
You need to provide an e-mail address when running this command - this is required by the NCBI as a way to
5 changes: 3 additions & 2 deletions moshpit_docs/chapters/00_setup.md
@@ -14,8 +14,9 @@ kernelspec:
(setup)=
# Setup
Before we dive into the tutorial, let's make sure we have all the necessary components in place. Make sure you have a
working QIIME 2 metagenome environment available - please follow the instructions from the official [QIIME 2 documentation](https://docs.qiime2.org/2024.5/install/native/)
to learn more.
working QIIME 2 metagenome environment available - please follow the instructions from the official
[QIIME 2 documentation](https://docs.qiime2.org/2024.10/install/native/#qiime-2-metagenome-distribution) to install
the QIIME 2 "Metagenome Distribution".

In this tutorial we will be storing all the data in the QIIME 2 cache. To learn more about how the cache works you can
consult [this](https://dev.qiime2.org/latest/api-reference/cache) page of the QIIME 2 developer documentation. You should create a single cache
to be used throughout the tutorial.
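One way to create it is with the `cache-create` tool; a minimal sketch, assuming the cache should live at `./cache`
(the path used by the cache keys in the following sections):
```{code-cell}
qiime tools cache-create --cache ./cache
```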
10 changes: 5 additions & 5 deletions moshpit_docs/chapters/01_filtering/host-filtering.md
@@ -13,14 +13,14 @@ kernelspec:
---
# Host read removal
There are a few different options to perform host read removal in QIIME 2: a more generic one using the `filter-reads` action
and a more specific one using the `filter-reads-pangenome` action. Below you can see how to use both of them. In this tutorial we will
and a more specific one using the `filter-reads-pangenome` action. Below you can learn how to use both of them. In this tutorial we will
use the `filter-reads-pangenome` action to remove human reads from the dataset.

## Removal of contaminating reads
Removal of contaminating reads can generally be done by mapping the reads to a reference database and filtering out the reads
that map to it. In QIIME 2 this can be done by using the `filter-reads` action from the `quality-control` plugin. Before the filtering
we need to construct the index of the reference database which will be used by Bowtie 2:
- start with the FASTA files contaning the reference sequences - we will import them into a QIIME 2 artifact:
that map to it. In QIIME 2 this can be done by using the `filter-reads` action from the `quality-control` plugin. Before filtering
we need to construct the index of the reference database that will be used by Bowtie 2:
- start with the FASTA files containing the reference sequences - we will import them into a QIIME 2 artifact:
```{code-cell}
qiime tools cache-import \
--cache ./cache \
@@ -44,7 +44,7 @@ qiime quality-control filter-reads \

## Human host reads
Contaminating human reads can also be filtered out using the approach shown above by providing a human reference genome.
Since a single human reference genome is not enough to cover all the human genetic diversity, it is now recommended to use a
Since a single human reference genome is not enough to cover all the human genetic diversity, it is recommended to use a
collection of genomes represented by the human pangenome (__CIT). We have built a new QIIME 2 action `filter-reads-pangenome`
which allows us to first fetch the human pangenome sequence, combine it with the GRCh38 reference genome, build a combined
Bowtie 2 index and, finally, filter the reads against it. In addition to the filtered reads, the action will also return the generated index.
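A sketch of how this call could look is shown below; the parameter and output names here are assumptions based on the
description above, so please verify them against `qiime moshpit filter-reads-pangenome --help`:
```{code-cell}
qiime moshpit filter-reads-pangenome \
  --i-reads ./cache:reads \
  --p-threads 8 \
  --o-filtered-reads ./cache:reads_filtered \
  --o-reference-index ./cache:pangenome_index \
  --verbose
```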
12 changes: 7 additions & 5 deletions moshpit_docs/chapters/01_filtering/intro.md
@@ -13,8 +13,10 @@ kernelspec:
---
(quality-control)=
# Quality control
Just like any other NGS experiment, shotgun metagenomics data should be quality controlled before any downstream analysis.
The filtering steps may include adapter removal, quality trimming, and filtering out low-quality reads. Moreover, metagenomic
reads may contain host DNA, which should be removed. QIIME 2 provides functionality to address some of those issues - the
MOSHPIT plugin suite only expands on those by focusing more on host DNA removal. The next sections contain a brief overview
of the filtering steps which can be done using QIIME 2 and MOSHPIT.
As with any other NGS experiment, metagenome data should be quality controlled before any downstream analysis.
The filtering steps may include adapter removal, quality trimming, and filtering out low-quality reads. Moreover,
depending on the sample type and preparation procedures, metagenomic reads may contain host DNA, which should be
removed. Other QIIME 2 plugins already provide generalized functionality to address quality filtering/control of
next-generation sequencing data - the MOSHPIT plugin suite expands on these by focusing more on host DNA removal
from metagenome data. The next sections contain a brief overview of the filtering steps which can be done using
QIIME 2 and MOSHPIT.
2 changes: 1 addition & 1 deletion moshpit_docs/chapters/01_filtering/quality-filtering.md
@@ -15,7 +15,7 @@ kernelspec:
## Quality overview
We can get an overview of the read quality by using the `summarize` action from the `demux` QIIME 2 plugin. This command
will generate a visualization of the quality scores at each position. You can learn more about this action in the [QIIME 2
documentation](https://docs.qiime2.org/2024.5/plugins/available/demux/summarize/).
documentation](https://docs.qiime2.org/2024.10/plugins/available/demux/summarize/).
```{code-cell}
qiime demux summarize \
--i-data ./cache:reads \
2 changes: 1 addition & 1 deletion moshpit_docs/chapters/02_mag_reconstruction/abundance.md
@@ -13,7 +13,7 @@ kernelspec:
---
# MAG abundance estimation
Once we recover MAGs from metagenomic data, we may be interested in estimating their abundance in the samples. We can do
it by mapping the original reads to the derepliacted MAGs and calculating the abundance based on the read mapping results.
it by mapping the original reads to the dereplicated MAGs and calculating the abundance based on the read mapping results.
There are a couple of ways to estimate MAG abundance, such as RPKM (Reads Per Kilobase per Million mapped reads) and TPM
(Transcripts Per Million). Here we will use TPM to estimate the abundance of each MAG in all samples.
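For intuition, TPM first normalizes the read count of each MAG by its genome length and then scales by the sample total,
so that values sum to one million within each sample:
```{math}
\mathrm{TPM}_i = 10^6 \cdot \frac{r_i / \ell_i}{\sum_j r_j / \ell_j}
```
where $r_i$ is the number of reads mapped to MAG $i$ and $\ell_i$ is its genome length.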

4 changes: 2 additions & 2 deletions moshpit_docs/chapters/02_mag_reconstruction/dereplication.md
@@ -14,8 +14,8 @@ kernelspec:
(dereplication)=
# MAG set dereplication
Depending on the application, it may be necessary to dereplicate the set of MAGs to remove redundancy and retain only
unique genome representatives. Our workflow includes a dereplication step which uses any genome distance matrix, which is
used to find clusters of similar genomes (based on a specific similarity threshold) and identify the most representative
unique genome representatives. Our workflow includes a dereplication step that can use any genome distance matrix to
find clusters of similar genomes (based on a specific similarity threshold) and identify the most representative
MAG (in our case, it will be the longest genome in the cluster). Here we use Sourmash to generate the distance matrix
but any other tool could also be used.
## Compute MinHash signatures with Sourmash
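The first step could look roughly like the sketch below; the action and parameter names are assumptions based on the
q2-sourmash plugin and should be checked against `qiime sourmash compute --help`. The resulting signatures are then
compared with `qiime sourmash compare` to produce the distance matrix used for clustering:
```{code-cell}
qiime sourmash compute \
  --i-sequence-file ./cache:mags \
  --p-ksizes 31 \
  --p-scaled 1000 \
  --o-min-hash-signature ./cache:mags_minhash
```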
13 changes: 7 additions & 6 deletions moshpit_docs/chapters/02_mag_reconstruction/reconstruction.md
@@ -21,8 +21,8 @@ memory resources before running these commands.
## Assemble contigs with MEGAHIT
The first step in recovering MAGs is genome assembly itself. There are many genome assemblers available, two of which
you can use through the q2-assembly plugin - here, we will use MEGAHIT. MEGAHIT takes short DNA sequencing reads,
constructs a simplified De Bruijn graph, and generates longer contiguous sequences called contigs, providing valuable
genetic information for the next steps of our analysis.
constructs a simplified [De Bruijn graph](https://en.wikipedia.org/wiki/De_Bruijn_graph), and generates longer contiguous
sequences called contigs, providing valuable genetic information for the next steps of our analysis.
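For intuition: in one common formulation, with $k = 3$ the read `ACGTC` yields the k-mers `ACG`, `CGT` and `GTC`; each
k-mer becomes a node in the graph, k-mers overlapping by $k-1$ bases are connected by edges, and contigs correspond to
unambiguous paths such as `ACG → CGT → GTC`, which spells out `ACGTC` again.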

- The `--p-presets` parameter specifies the preset mode for MEGAHIT. In this case, it's set to "meta-sensitive" for metagenomic data.
- The `--p-cpu-threads` parameter specifies the number of CPU threads to use during assembly.
@@ -46,7 +46,7 @@ qiime assembly assemble-megahit \

## Contig QC with QUAST
Once the reads are assembled into contigs, we can use QUAST to evaluate the quality of our assembly. There are many
metrics which can be used for that purpose but here we will focus on the two most popular metrics:
metrics that can be used for that purpose but here we will focus on the two most popular metrics:
- **N50**: represents the contiguity of a genome assembly. It's defined as the length of the contig (or scaffold) at
which 50% of the entire genome is covered by contigs of that length or longer - the higher this number, the better.
- **L50**: represents the number of contigs required to cover 50% of the genome's total length - the smaller this number, the better.
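As a quick worked example: for an assembly with contigs of lengths 10, 8, 6, 4 and 2 kb (30 kb in total), half of the
total length is 15 kb; the two longest contigs (10 + 8 = 18 kb) are the first to reach it, so N50 = 8 kb and L50 = 2.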
@@ -102,10 +102,10 @@ qiime moshpit bin-contigs-metabat \
--o-unbinned-contigs ./cache:unbinned_contigs \
--verbose
```
This tep generated a couple artifacts:
This step generated several artifacts:

- `mags`: these are our actual MAGs, per sample.
- `contig_map`: this is a mapping between MAG IDs and IDs of contigs which belong to a given MAG.
- `contig_map`: this is a mapping between MAG IDs and IDs of contigs that belong to a given MAG.
- `unbinned_contigs`: these are all the contigs that could not be assigned to any particular MAG.
From now on, we will focus on the `mags`.

@@ -141,11 +141,12 @@ The `--p-lineage-dataset bacteria_odb10` parameter specifies the particular line
the bacteria_odb10 dataset. This is a standard database for bacterial genomes.

Your visualization should look similar to [this one](https://view.qiime2.org/visualization/?src=https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/moshpit_docs/data/mags.qzv).

## Filter MAGs
This step filters MAGs based on completeness. In this example, we filter out any MAGs with completeness below 50%.
The filtering process ensures only high-quality genomes are kept for downstream analysis.
```{tip}
We recommed this step to be done before dereplication (as in this example). Alternatively, we can also use the
We recommend performing this step before dereplication (as in this example). Alternatively, we can also use the
[dereplicated set](dereplication) and filter this one using `qiime moshpit filter-derep-mags`.
```
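A sketch of the filtering call; how the BUSCO results are passed and the exact name of the completeness column are
assumptions here, so verify them against `qiime moshpit filter-mags --help`:
```{code-cell}
qiime moshpit filter-mags \
  --i-mags ./cache:mags \
  --m-metadata-file ./cache:busco_results \
  --p-where 'complete>50' \
  --o-filtered-mags ./cache:mags_filtered \
  --verbose
```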

2 changes: 1 addition & 1 deletion moshpit_docs/chapters/03_taxonomic_classification/intro.md
@@ -48,7 +48,7 @@ For more information on Kraken 2, consult [Wood et al., 2019](https://genomebiol
```

## Kaiju: protein-based classification
Kaiju compares reads by translating DNA sequences into protein sequences (BLASTx-like). This allows Kaiju to identify
Kaiju compares reads by translating DNA sequences into protein sequences (similar to BLASTx). This allows Kaiju to identify
organisms accurately when nucleotide sequences are too divergent to be identified with DNA-based methods. Kaiju uses a
fast exact matching algorithm based on Burrows-Wheeler Transform (BWT) and FM-index to align translated DNA reads
against a reference database of protein sequences.
2 changes: 1 addition & 1 deletion moshpit_docs/chapters/03_taxonomic_classification/mags.md
@@ -12,7 +12,7 @@ kernelspec:
name: python3
---
# Taxonomic classification of MAGs
Kraken 2 can also be used to obtain a classification of metagenome-assembled genomes (MAGs). In this tutorial we use this
Kraken 2 can also be used to taxonomically classify metagenome-assembled genomes (MAGs). In this tutorial we use this
tool to classify a subset of dereplicated MAGs but the same approach can be used for the entire set of MAGs contained in
the `SampleData[MAGs]` or `SampleData[Contigs]` artifacts.
```{code-cell}
10 changes: 5 additions & 5 deletions moshpit_docs/chapters/03_taxonomic_classification/reads.md
@@ -14,7 +14,7 @@ kernelspec:
(kraken-reads)=
# Taxonomic classification of reads
In this section we will focus on the taxonomic classification of shotgun metagenomic reads using two different tools: Kraken 2 and Kaiju.
We will use the data which we got from the steps described in the [data retrieval section](../00_data_retrieval.md).
We will use the data obtained in the [data retrieval section](../00_data_retrieval.md).

## Approach 1: Kraken 2
Before we can use Kraken 2, we need to build or download a database. We will use the `build-kraken-db` action to fetch the PlusPF database
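A sketch of fetching a prebuilt database; the collection name and the output parameter names are assumptions to check
against `qiime moshpit build-kraken-db --help`:
```{code-cell}
qiime moshpit build-kraken-db \
  --p-collection pluspf \
  --o-kraken2-database ./cache:kraken_db \
  --o-bracken-database ./cache:bracken_db \
  --verbose
```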
@@ -43,7 +43,7 @@ qiime moshpit classify-kraken2 \

```{seealso}
[Bracken](https://ccb.jhu.edu/software/bracken/) is a related tool that additionally estimates relative abundances of species or genera to adjust for
genome size which the reads originated from. In order to use this tool we need the Bracken database that was fetched in the first step.
the genome size of the organisms from which each read originated. In order to use this tool we need the Bracken database that was fetched in the first step.
```

```{code-cell}
@@ -57,7 +57,7 @@ qiime moshpit estimate-bracken \
--o-reports ./cache:bracken_reports
```

To remove the unclassified read fraction we can use the `filter-table` action from the `taxa` QIIME 2 plugin:
To remove the unclassified read fraction we can use the `filter-table` action from the `q2-taxa` QIIME 2 plugin:
```{code-cell}
qiime taxa filter-table \
--i-table ./cache:bracken_ft \
@@ -67,8 +67,8 @@ qiime taxa filter-table \
```

## Approach 2: Kaiju
Similarly to Kraken 2, Kaiju requires a reference database to perform the classification. We will use the `fetch-kaiju-db`
action to download the [nr_euk](https://bioinformatics-centre.github.io/kaiju/downloads.html) database that includes both,
Similarly to Kraken 2, Kaiju requires a reference database to perform taxonomic classification. We will use the `fetch-kaiju-db`
action to download the [nr_euk](https://bioinformatics-centre.github.io/kaiju/downloads.html) database that includes both
prokaryotes and eukaryotes (more info on the taxa [here](https://github.com/bioinformatics-centre/kaiju/blob/master/util/kaiju-taxonlistEuk.tsv)).
```{code-cell}
qiime moshpit fetch-kaiju-db \
5 changes: 3 additions & 2 deletions moshpit_docs/chapters/04_functional_annotation/intro.md
@@ -27,15 +27,16 @@ extracted from complex microbial communities, bypassing the need to culture the

This process provides insights into the genes that code for enzymes, transporters, and other proteins critical to the
survival and function of the microbes in various ecosystems. Annotating these genomes allows for the study of their
contributions to nutrient cycles, disease processes, or specialized ecological functions.
contributions to nutrient cycles, disease processes, or specialized ecological functions, to name only a few examples.

This workflow outlines the step-by-step process for functional annotation of MAGs or contigs using tools like EggNOG and
the Diamond aligner in QIIME 2.

```{note}
Functional annotation can be performed on fully reconstructed **MAGs** or directly on **contigs** (the contiguous sequences
assembled from sequencing reads). Annotating **contigs** can provide early insights into important functional genes even
before complete genomes are assembled.
before complete genomes are assembled. Annotating **MAGs** has the added benefit of showing how these annotated genes are
connected and organized within a single genome.
In this tutorial, we will focus on functional annotation of our previously reconstructed MAGs (see **Recovery of MAGs section**).
```
11 changes: 6 additions & 5 deletions moshpit_docs/chapters/04_functional_annotation/mags.md
@@ -13,7 +13,7 @@ kernelspec:
---
# Functional annotation
## Required databases
In order to perform the functional annotation, we will need a couple of different reference databases. Below you will find instructions on how to download these databases using respective QIIME 2 actions.
In order to perform the functional annotation, we will need a couple of different reference databases. Below you will find instructions on how to download these databases using MOSHPIT.

```{code-cell}
qiime moshpit fetch-diamond-db \
@@ -57,7 +57,7 @@ qiime moshpit eggnog-annotate \
## Extract annotations
This method extracts a specific annotation from the table generated by EggNOG and calculates its frequencies across all MAGs.
```{note}
The `qiime moshpit extract-annotations` command allows us to extract specific types of functional annotations, such as
The `qiime moshpit extract-annotations` method allows us to extract specific types of functional annotations, such as
**CAZymes**, **KEGG pathways**, **COG categories**, or other functional elements, and calculate their frequency across
all dereplicated MAGs.
@@ -75,7 +75,7 @@ qiime moshpit extract-annotations \
## Multiply tables
This step simply calculates the dot product of the `mags_derep_ft` and `caz_annot_ft` feature tables. This is useful for
combining the annotation data (e.g., **CAZymes**) with MAG abundance to determine how specific functional annotations
are distributed across MAGs.
are distributed across MAGs, and for estimating the total frequency of each annotation in each sample.
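Conceptually, if $F$ is the samples $\times$ MAGs abundance table and $A$ is the MAGs $\times$ annotations frequency table,
their product is a samples $\times$ annotations table:
```{math}
(FA)_{sc} = \sum_{m} F_{sm} \, A_{mc}
```
where $F_{sm}$ is the abundance of MAG $m$ in sample $s$ and $A_{mc}$ is the frequency of annotation $c$ in MAG $m$.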

```{code-cell}
qiime moshpit multiply-tables \
@@ -86,15 +86,16 @@ qiime moshpit multiply-tables \
```

## Let's have a look at our CAZyme functional diversity!
We will start by calculating Bray-curtis beta diversity matrix.
We will start by calculating a Bray-Curtis dissimilarity matrix to measure the dissimilarity between samples, based on the
observed frequency of different CAZyme annotations in each sample.
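For reference, for two samples $x$ and $y$ with annotation frequencies $x_i$ and $y_i$, the Bray-Curtis dissimilarity is:
```{math}
d_{BC}(x, y) = \frac{\sum_i |x_i - y_i|}{\sum_i (x_i + y_i)}
```
which ranges from 0 (identical composition) to 1 (no shared annotations).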
```{code-cell}
qiime diversity beta \
--i-table ./cache:caz_ft \
--p-metric braycurtis \
--o-distance-matrix ./cache:caz_braycurtis_dist
```

Then, we will perform PCoA from the obtained Bray-curtis matrix.
Next, we will perform principal coordinate analysis (PCoA) on the obtained Bray-Curtis matrix.
```{code-cell}
qiime diversity pcoa \
--i-distance-matrix ./cache:caz_braycurtis_dist \
12 changes: 8 additions & 4 deletions moshpit_docs/intro.md
@@ -1,9 +1,13 @@
# MOSHPIT tutorial

Welcome to the MOSHPIT tutorial! This tutorial will guide you through the process of analyzing metagenomic data using
the QIIME 2 framework and the MOSHPIT plugin suite. The tutorial is divided into several chapters, each focusing on a
different aspect of metagenomic data analysis. We will use a small published dataset to demonstrate the capabilities of
most of the methods available in MOSHPIT.
MOSHPIT (MOdular SHotgun metagenome Pipelines with Integrated provenance Tracking) is a suite of plugins for whole
metagenome assembly and analysis as part of the microbiome multi-omics data science platform [QIIME 2](https://qiime2.org/).
MOSHPIT enables flexible, modular, fully reproducible workflows for read-based or assembly-based analysis of
metagenome data.

This tutorial will guide you through the process of analyzing metagenomic data using the QIIME 2 framework and MOSHPIT.
The tutorial is divided into several chapters, each focusing on a different aspect of metagenomic data analysis.
We will use a small published dataset to demonstrate the capabilities of most of the methods available in MOSHPIT.

We will begin by setting up our computational environment and fetching all the necessary data (see [Setup](setup) and
[Data retrieval](data-retrieval)). Then, we will move to quality control and filtering of the raw reads (see
