Merge pull request #156 from atrigila/adress_review_comments

address reviewer comments for first release
nf-core · Nov 17, 2024 · da2d16b
2 parents 10a1dda + 0662a58
commit da2d16b
Showing 16 changed files with 126 additions and 452 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -18,7 +18,7 @@ env:
   NXF_ANSI_LOG: false
   NXF_SINGULARITY_CACHEDIR: ${{ github.workspace }}/.singularity
   NXF_SINGULARITY_LIBRARYDIR: ${{ github.workspace }}/.singularity
-  NFTEST_VER: "0.9.0"
+  NFT_VER: "0.9.2"
   NFT_WORKDIR: "~"
   NFT_DIFF: "pdiff"
   NFT_DIFF_ARGS: "--line-numbers --expand-tabs=2"
@@ -97,10 +97,9 @@ jobs:
           python -m pip install --upgrade pip
           pip install pdiff
 
-      - name: Install nf-test
-        run: |
-          wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
-          sudo mv nf-test /usr/local/bin/
+      - uses: nf-core/setup-nf-test@v1
+        with:
+          version: ${{ env.NFT_VER }}
 
       - name: "Run pipeline with test data ${{ matrix.NXF_VER }} | ${{ matrix.TEST_PROFILE }} | ${{ matrix.profile }}"
         run: |

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,9 +10,9 @@
 
 ## Pipeline tools
 
-- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)
+- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)
 
-> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.
+> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
 
 - [GLIMPSE](https://www.nature.com/articles/s41588-020-00756-0)
 
@@ -22,21 +22,21 @@
 
 > Rubinacci, S., Hofmeister, R. J., Sousa da Mota, B., & Delaneau, O. (2023). Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nature genetics 55, 1088–1090.
 
-- [STITCH](https://doi.org/10.1038/ng.3594)
+- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
-> Davies, R. W., Flint, J., Myers, S., & Mott, R.(2016). Rapid genotype imputation from sequence without reference panels. Nature genetics 48, 965–969.
+> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
 
-- [Shapeit](https://doi.org/10.1038/s41588-023-01415-w)
+- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)
 
-> Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w
+> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.
 
-- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)
+- [Shapeit](https://doi.org/10.1038/s41588-023-01415-w)
 
-> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
+> Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w
 
-- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
+- [STITCH](https://doi.org/10.1038/ng.3594)
 
-> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+> Davies, R. W., Flint, J., Myers, S., & Mott, R. (2016). Rapid genotype imputation from sequence without reference panels. Nature genetics 48, 965–969.
 
 ## Software packaging/containerisation tools
 

diff --git a/README.md b/README.md
@@ -19,11 +19,33 @@
 
 ## Introduction
 
-**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. The pipeline is constituted of five main steps:
+**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data.
 
-| Metro map                                                                       | Modes                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
-| ------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| <img src="docs/images/metro/MetroMap_animated.svg" alt="metromap" width="800"/> | - **Check chromosomes names**: Validates the presence of the different contigs in all variants and alignment files, ensuring data compatibility for further processing <br> - **Panel preparation**: Perfoms the phasing, QC, variant filtering, variant annotation of the reference panel <br> - **Imputation**: Imputes genotypes in the target dataset using the reference panel <br> - **Simulate**: Generates simulated datasets from high-quality target data for testing and validation purposes. <br> - **Concordance**: Evaluates the accuracy of imputation by comparing the imputed data against a truth dataset. |
+<img src="docs/images/metro/phaseimpute.drawio.png" alt="metromap"/>
+
+The pipeline consists of five main steps, each of which can be run independently, so users can select only the steps they need rather than running all of them in sequence:
+
+1. **QC: Chromosome Name Check**: Ensures compatibility by validating that all expected contigs are present in the variant and alignment files.
+
+2. **Simulation (`--simulate`)**: Generates artificial datasets by downsampling high-density data to simulate low-pass genetic information. This enables the comparison of imputation results against a high-quality dataset (truth set). Simulations may include:
+
+   - **Low-pass data generation** by downsampling BAM or CRAM files with [`samtools view -s`](https://www.htslib.org/doc/samtools-view.html) at different depths.
+
+3. **Panel Preparation (`--panelprep`)**: Prepares the reference panel through phasing, quality control, variant filtering, and annotation. Key processes include:
+
+   - **Normalization** of the reference panel to retain essential variants.
+   - **Phasing** of haplotypes in the reference panel using [Shapeit5](https://odelaneau.github.io/shapeit5/).
+   - **Chunking** of the reference panel into specific regions across chromosomes.
+   - **Position Extraction** for targeted imputation sites.
+
+4. **Imputation (`--impute`)**: This is the primary step, where genotypes in the target dataset are imputed using the prepared reference panel. The main steps are:
+
+   - **Imputation** of the target dataset using tools like [Glimpse1](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html), [Glimpse2](https://odelaneau.github.io/GLIMPSE/), [Stitch](https://github.com/rwdavies/stitch), or [Quilt](https://github.com/rwdavies/QUILT).
+   - **Ligation** of imputed chunks to produce a final VCF file per sample, with all chromosomes unified.
+
+5. **Validation (`--validate`)**: Assesses imputation accuracy by comparing the imputed dataset to a truth dataset. This step leverages the [Glimpse2](https://odelaneau.github.io/GLIMPSE/) concordance process to summarize differences between two VCF files.
+
+For more detailed instructions, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage).
 
 ## Usage
 
@@ -32,9 +54,7 @@
 
 The primary function of this pipeline is to impute a target dataset based on a phased panel. Begin by preparing a samplesheet with your input data, formatted as follows:
 
-`samplesheet.csv`:
-
-```csv
+```csv title="samplesheet.csv"
 sample,file,index
 SAMPLE_1X,/path/to/.<bam/cram>,/path/to/.<bai/crai>
 ```
@@ -43,7 +63,7 @@ Each row represents either a bam or a cram file along with its corresponding ind
 
 For certain tools and steps within the pipeline, you will also need to provide a samplesheet for the reference panel. Here's an example of what a final samplesheet for a reference panel might look like, covering three chromosomes:
 
-```csv
+```csv title="panel.csv"
 panel,chr,vcf,index
 Phase3,1,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.csi
 Phase3,2,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.csi
@@ -52,16 +72,11 @@ Phase3,3,ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.
 
 ## Running the pipeline
 
-Execute the pipeline with the following command:
+Run one of the steps of the pipeline (imputation with glimpse1) using the following command and test profile:
 
 ```bash
 nextflow run nf-core/phaseimpute \
-   -profile <docker/singularity/.../institute> \
-   --input <samplesheet.csv>  \
-   --genome "GRCh38" \
-   --panel <phased_reference_panel.csv> \
-   --steps "panelprep,impute" \
-   --tools "glimpse1" \
+   -profile test,<docker/singularity/.../institute> \
    --outdir <OUTDIR>
 ```
 
@@ -70,18 +85,6 @@ nextflow run nf-core/phaseimpute \
 
 For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage) and the [parameter documentation](https://nf-co.re/phaseimpute/parameters).
 
-## Description of the different steps of the pipeline
-
-Here is a short description of the different steps of the pipeline.
-For more information please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage).
-
-| steps           | Flow chart                                                                       | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
-| --------------- | -------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **--panelprep** | <img src="docs/images/metro/PanelPrep.png" alt="Panel preparation" width="600"/> | The preprocessing mode is responsible for preparing multiple input files that will be used by the phasing and imputation process. <br> The main processes are : <br> - **Haplotypes phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/). <br> - **Normalize** the reference panel to select only the necessary variants. <br> - **Chunking the reference panel** into a subset of regions for all the chromosomes. <br> - **Extract** the positions where to perform the imputation.                                                                                                                                                                                                                                                                                                                                           |
-| **--impute**    | <img src="docs/images/metro/Impute.png" alt="Impute target" width="600"/>        | The imputation mode is the core mode of this pipeline. <br> It consists of 3 main steps: <br> - **Imputation**: Impute the target dataset on the reference panel using either: <br> &emsp; - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html): It comes with the necessity to compute the genotype likelihoods of the target dataset (done using [`bcftools mpileup`](https://samtools.github.io/bcftools/bcftools.html#mpileup)). <br> &emsp; - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/) <br> &emsp; - [**Stitch**](https://github.com/rwdavies/stitch) This step does not require a reference panel but needs to merge the samples. <br> &emsp; - [**Quilt**](https://github.com/rwdavies/QUILT) <br> - **Ligation**: all the different chunks are merged together then all chromosomes are reunited to output one VCF per sample. |
-| **--simulate**  | <img src="docs/images/metro/Simulate.png" alt="simulate_metro" width="600"/>     | The simulation mode is used to create artificial low informative genetic information from high density data. This allows the comparison of the imputed result to a _truth_ and therefore evaluates the quality of the imputation. <br> For the moment it is possible to simulate: <br> - Low-pass data by **downsample** BAM or CRAM using [`samtools view -s`](https://www.htslib.org/doc/samtools-view.html) at different depth.                                                                                                                                                                                                                                                                                                                                                                                                                                      |
-| **--validate**  | <img src="docs/images/metro/Validate.png" alt="concordance_metro" width="600"/>  | This mode compares two VCF files together to compute a summary of the differences between them. <br> This step uses [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/) concordance process.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
-
 ## Pipeline output
 
 To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/phaseimpute/results) tab on the nf-core website pipeline page.
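
The simulation step described in the README above downsamples BAM or CRAM files with `samtools view -s`. As a rough illustration of how a value for that option could be derived, the sketch below computes the `SEED.FRACTION` argument from a current and a target depth. The function name `samtools_subsample_arg`, the default seed, and the four-decimal rounding are assumptions made for this example; they are not taken from the pipeline's code.

```python
def samtools_subsample_arg(current_depth: float, target_depth: float, seed: int = 1) -> str:
    """Build a value for `samtools view -s` (hypothetical helper).

    samtools reads the integer part of the value as the random seed and the
    part after the decimal point as the fraction of reads to keep.
    """
    if current_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    if target_depth >= current_depth:
        raise ValueError("target depth must be lower than current depth")

    # Fraction of reads to keep, rounded to four decimals (an arbitrary choice).
    fraction = round(target_depth / current_depth, 4)
    if not 0 < fraction < 1:
        raise ValueError("fraction rounds to 0 or 1; choose different depths")

    # Keep only the digits after the decimal point, e.g. 0.25 -> "2500".
    decimals = f"{fraction:.4f}".split(".")[1]
    return f"{seed}.{decimals}"


if __name__ == "__main__":
    # Downsample a 4x alignment to roughly 1x with seed 1.
    print(samtools_subsample_arg(4.0, 1.0))  # prints 1.2500
```

Under these assumptions, going from 4x to 1x with seed 1 yields `1.2500`, which would be passed as `samtools view -s 1.2500`.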

diff --git a/assets/chr_rename_del.txt b/assets/chr_rename_del.txt
@@ -36,4 +36,4 @@ chr35 35
 chr36 36
 chr37 37
 chr38 38
-chr39 X
+chrX X
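
To make the effect of the `chr39 X` to `chrX X` fix above concrete, here is a minimal sketch of how a two-column, whitespace-separated rename map of this shape could be applied to contig names. It assumes the file is used as an old-name to new-name mapping (the layout accepted by `bcftools annotate --rename-chrs`); the helper names `parse_rename_map` and `rename_contig` are hypothetical and do not come from the pipeline.

```python
def parse_rename_map(text: str) -> dict[str, str]:
    """Parse lines of the form 'old_name new_name' into a dict (hypothetical helper)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        old, new = line.split()
        mapping[old] = new
    return mapping


def rename_contig(name: str, mapping: dict[str, str]) -> str:
    """Return the renamed contig, or the name unchanged if it is not in the map."""
    return mapping.get(name, name)


# Tail of the map as it reads after this change.
fixed = parse_rename_map("chr37 37\nchr38 38\nchrX X\n")
# Tail of the map as it read before this change.
previous = parse_rename_map("chr37 37\nchr38 38\nchr39 X\n")

print(rename_contig("chrX", fixed))     # prints X
print(rename_contig("chrX", previous))  # prints chrX (left unrenamed)
```

With the previous entry, a contig named `chrX` has no match in the map and keeps its prefix, which is the mismatch the corrected line avoids.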
diff --git a/docs/development.md b/docs/development.md