address reviewer comments for first release #156

Merged
merged 42 commits into from
Nov 17, 2024
Changes from 37 commits
1134498
update nf-test version
atrigila Nov 5, 2024
def5bef
Change main figure in table to figure+caption
atrigila Nov 5, 2024
7a4d516
add file titles
atrigila Nov 5, 2024
07bd474
replace table with figures for ordered list
atrigila Nov 5, 2024
13237e3
Order tools alphabetically based on the tool name
atrigila Nov 5, 2024
3e104a8
add useful info from development.md to CONTRIBUTING.md
atrigila Nov 5, 2024
2b3651a
remove logo.svg not used
atrigila Nov 5, 2024
4037e8e
fix typo
atrigila Nov 5, 2024
213ffbf
Fix typos
atrigila Nov 5, 2024
d8d6dff
add tree structure
atrigila Nov 5, 2024
11b0cd3
one line per sentence in markdown
atrigila Nov 5, 2024
ad0eda4
consistent VCF vs. vcf
atrigila Nov 5, 2024
dde1285
add dots to sentences
atrigila Nov 5, 2024
6daae70
remove duplicate
atrigila Nov 5, 2024
fbfa9d0
show example for 3 files
atrigila Nov 5, 2024
13f7611
improve description
atrigila Nov 5, 2024
8779a19
improve usage.md
atrigila Nov 5, 2024
ed26114
change how to install nf-test
atrigila Nov 6, 2024
af00b5c
remove tsv that is not produced anymore
atrigila Nov 6, 2024
d9cce4a
remove repeated information
atrigila Nov 6, 2024
69eb8bb
give higher level heading to tools, fix grammar
atrigila Nov 6, 2024
17c4817
add groovylang
atrigila Nov 6, 2024
93f51ca
delete tags.yml
atrigila Nov 6, 2024
13d2a2d
align
atrigila Nov 6, 2024
b65f85b
fix patterns in schema
atrigila Nov 6, 2024
c159855
revert setup nf-test
atrigila Nov 6, 2024
e5e4818
test sharding strategy in ci
atrigila Nov 6, 2024
6654bc1
Revert "test sharding strategy in ci"
atrigila Nov 6, 2024
9923c93
make contributing.md same as template
atrigila Nov 6, 2024
a1cd930
fix typo in chrX chr39
atrigila Nov 6, 2024
e92e2da
pattern checking for comma separated list
atrigila Nov 8, 2024
b806079
reorganize diagrams in usage section
atrigila Nov 8, 2024
2228cc3
replace for quick working example
atrigila Nov 8, 2024
0c5cdf2
remove txt2image
atrigila Nov 8, 2024
4f9da30
update image with white background
atrigila Nov 10, 2024
969af5e
improve introduction flow
atrigila Nov 10, 2024
f722171
use setup nf-test
atrigila Nov 10, 2024
05397b4
Update nextflow_schema.json
atrigila Nov 10, 2024
4f4d034
remove spaces
atrigila Nov 10, 2024
4e7255e
Merge branch 'adress_review_comments' of https://github.com/atrigila/…
atrigila Nov 10, 2024
200176a
modify diagram
atrigila Nov 10, 2024
0662a58
update diagram
atrigila Nov 17, 2024
9 changes: 4 additions & 5 deletions .github/workflows/ci.yml
@@ -18,7 +18,7 @@ env:
NXF_ANSI_LOG: false
NXF_SINGULARITY_CACHEDIR: ${{ github.workspace }}/.singularity
NXF_SINGULARITY_LIBRARYDIR: ${{ github.workspace }}/.singularity
NFTEST_VER: "0.9.0"
NFT_VER: "0.9.2"
NFT_WORKDIR: "~"
NFT_DIFF: "pdiff"
NFT_DIFF_ARGS: "--line-numbers --expand-tabs=2"
@@ -97,10 +97,9 @@ jobs:
python -m pip install --upgrade pip
pip install pdiff

- name: Install nf-test
run: |
wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
sudo mv nf-test /usr/local/bin/
- uses: nf-core/setup-nf-test@v1
with:
version: ${{ env.NFT_VER }}

- name: "Run pipeline with test data ${{ matrix.NXF_VER }} | ${{ matrix.TEST_PROFILE }} | ${{ matrix.profile }}"
run: |
24 changes: 12 additions & 12 deletions CITATIONS.md
@@ -10,33 +10,33 @@

## Pipeline tools

- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)
- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)

> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.
> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.

- [GLIMPSE](https://www.nature.com/articles/s41588-020-00756-0)

> Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.
> Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.

- [GLIMPSE2](https://doi.org/10.1038/s41588-023-01438-3)

> Rubinacci, S., Hofmeister, R. J., Sousa da Mota, B., & Delaneau, O. (2023). Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nature genetics 55, 1088–1090.
> Rubinacci, S., Hofmeister, R. J., Sousa da Mota, B., & Delaneau, O. (2023). Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nature genetics 55, 1088–1090.

- [STITCH](https://doi.org/10.1038/ng.3594)
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Davies, R. W., Flint, J., Myers, S., & Mott, R.(2016). Rapid genotype imputation from sequence without reference panels. Nature genetics 48, 965–969.
> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Shapeit](https://doi.org/10.1038/s41588-023-01415-w)
- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)

> Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w
> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.

- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)
- [Shapeit](https://doi.org/10.1038/s41588-023-01415-w)

> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
> Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
- [STITCH](https://doi.org/10.1038/ng.3594)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
> Davies, R. W., Flint, J., Myers, S., & Mott, R. (2016). Rapid genotype imputation from sequence without reference panels. Nature genetics 48, 965–969.

## Software packaging/containerisation tools

57 changes: 30 additions & 27 deletions README.md
@@ -19,11 +19,33 @@

## Introduction

**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. The pipeline is constituted of five main steps:
**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data.

| Metro map | Modes |
| ------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <img src="docs/images/metro/MetroMap_animated.svg" alt="metromap" width="800"/> | - **Check chromosomes names**: Validates the presence of the different contigs in all variants and alignment files, ensuring data compatibility for further processing <br> - **Panel preparation**: Perfoms the phasing, QC, variant filtering, variant annotation of the reference panel <br> - **Imputation**: Imputes genotypes in the target dataset using the reference panel <br> - **Simulate**: Generates simulated datasets from high-quality target data for testing and validation purposes. <br> - **Concordance**: Evaluates the accuracy of imputation by comparing the imputed data against a truth dataset. |
<img src="docs/images/metro/phaseimpute.drawio.png" alt="metromap"/>

The whole pipeline consists of five main steps, each of which can be run separately and independently. Users are not required to run all steps sequentially and can select specific steps based on their needs:

1. **QC: Chromosome Name Check**: Ensures compatibility by validating that all expected contigs are present in the variant and alignment files.

2. **Simulation (`--simulate`)**: Generates artificial datasets by downsampling high-density data to simulate low-pass genetic information. This enables the comparison of imputation results against a high-quality dataset (truth set). Simulations may include:

- **Low-pass data generation** by downsampling BAM or CRAM files with [`samtools view -s`](https://www.htslib.org/doc/samtools-view.html) at different depths.

3. **Panel Preparation (`--panelprep`)**: Prepares the reference panel through phasing, quality control, variant filtering, and annotation. Key processes include:

- **Normalization** of the reference panel to retain essential variants.
- **Phasing** of haplotypes in the reference panel using [Shapeit5](https://odelaneau.github.io/shapeit5/).
- **Chunking** of the reference panel into specific regions across chromosomes.
- **Position Extraction** for targeted imputation sites.

4. **Imputation (`--impute`)**: This is the primary step, where genotypes in the target dataset are imputed using the prepared reference panel. The main steps are:

- **Imputation** of the target dataset using tools like [Glimpse1](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html), [Glimpse2](https://odelaneau.github.io/GLIMPSE/), [Stitch](https://github.com/rwdavies/stitch), or [Quilt](https://github.com/rwdavies/QUILT).
- **Ligation** of imputed chunks to produce a final VCF file per sample, with all chromosomes unified.

5. **Validation (`--validate`)**: Assesses imputation accuracy by comparing the imputed dataset to a truth dataset. This step leverages the [Glimpse2](https://odelaneau.github.io/GLIMPSE/) concordance process to summarize differences between two VCF files.

For more detailed instructions, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage).
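
The simulation step above relies on `samtools view -s`, whose argument packs a random seed into the integer part and the fraction of reads to keep into the decimal part. As a rough sketch of the arithmetic only, and not the pipeline's own implementation, the fraction for a given target depth could be derived as below. The function name, the four-digit rounding, and the default seed are assumptions made for illustration.

```python
def samtools_subsample_arg(current_depth: float, target_depth: float, seed: int = 1) -> str:
    """Build the SEED.FRACTION string accepted by `samtools view -s`.

    Hypothetical helper: the name, rounding, and default seed are
    illustrative assumptions, not taken from the pipeline.
    """
    if current_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    # Fraction of reads to keep so that coverage drops to the target depth.
    fraction = round(target_depth / current_depth, 4)
    if not 0 < fraction < 1:
        raise ValueError("target depth must be strictly below the current depth")
    # Integer part carries the seed, decimal part carries the fraction.
    decimals = f"{fraction:.4f}".split(".")[1]
    return f"{seed}.{decimals}"


if __name__ == "__main__":
    # Going from roughly 30x to 1x keeps about 3.33% of the reads.
    print(samtools_subsample_arg(30, 1))
```

The resulting string would be passed as `samtools view -s <value>`; the depth estimate itself has to come from elsewhere.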

## Usage

@@ -32,9 +54,7 @@

The primary function of this pipeline is to impute a target dataset based on a phased panel. Begin by preparing a samplesheet with your input data, formatted as follows:

`samplesheet.csv`:

```csv
```csv title="samplesheet.csv"
sample,file,index
SAMPLE_1X,/path/to/.<bam/cram>,/path/to/.<bai,crai>
```
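
A quick local check of a samplesheet in this shape might look like the sketch below. It only assumes the three-column header shown above and that a `.bam` file pairs with a `.bai` index and a `.cram` file with a `.crai` index. The function name and messages are hypothetical, and the pipeline's real validation is done through its schema rather than this code.

```python
import csv
import io

EXPECTED_HEADER = ["sample", "file", "index"]
# Assumed pairing between alignment file extension and index extension.
INDEX_FOR = {".bam": ".bai", ".cram": ".crai"}


def check_samplesheet(text: str) -> list[str]:
    """Return a list of human-readable problems found in a samplesheet."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        file_path = row["file"] or ""
        index_path = row["index"] or ""
        ext = next((e for e in INDEX_FOR if file_path.endswith(e)), None)
        if ext is None:
            problems.append(f"line {line_no}: file is neither .bam nor .cram")
        elif not index_path.endswith(INDEX_FOR[ext]):
            problems.append(f"line {line_no}: index should end with {INDEX_FOR[ext]}")
    return problems


if __name__ == "__main__":
    sheet = "sample,file,index\nSAMPLE_1X,/data/s1.bam,/data/s1.bam.bai\n"
    print(check_samplesheet(sheet))
```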
@@ -43,7 +63,7 @@ Each row represents either a bam or a cram file along with its corresponding ind

For certain tools and steps within the pipeline, you will also need to provide a samplesheet for the reference panel. Here's an example of what a final samplesheet for a reference panel might look like, covering three chromosomes:

```csv
```csv title="panel.csv"
panel,chr,vcf,index
Phase3,1,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.csi
Phase3,2,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.csi
@@ -52,16 +72,11 @@ Phase3,3,ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.

## Running the pipeline

Execute the pipeline with the following command:
Run one of the steps of the pipeline (imputation with glimpse1) using the following command and test profile:

```bash
nextflow run nf-core/phaseimpute \
-profile <docker/singularity/.../institute> \
--input <samplesheet.csv> \
--genome "GRCh38" \
--panel <phased_reference_panel.csv> \
--steps "panelprep,impute" \
--tools "glimpse1" \
-profile test, <docker/singularity/.../institute> \
--outdir <OUTDIR>
```

@@ -70,18 +85,6 @@ nextflow run nf-core/phaseimpute \

For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage) and the [parameter documentation](https://nf-co.re/phaseimpute/parameters).

## Description of the different steps of the pipeline

Here is a short description of the different steps of the pipeline.
For more information please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage).

| steps | Flow chart | Description |
| --------------- | -------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **--panelprep** | <img src="docs/images/metro/PanelPrep.png" alt="Panel preparation" width="600"/> | The preprocessing mode is responsible for preparing multiple input files that will be used by the phasing and imputation process. <br> The main processes are : <br> - **Haplotypes phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/). <br> - **Normalize** the reference panel to select only the necessary variants. <br> - **Chunking the reference panel** into a subset of regions for all the chromosomes. <br> - **Extract** the positions where to perform the imputation. |
| **--impute** | <img src="docs/images/metro/Impute.png" alt="Impute target" width="600"/> | The imputation mode is the core mode of this pipeline. <br> It consists of 3 main steps: <br> - **Imputation**: Impute the target dataset on the reference panel using either: <br> &emsp; - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html): It comes with the necessity to compute the genotype likelihoods of the target dataset (done using [`bcftools mpileup`](https://samtools.github.io/bcftools/bcftools.html#mpileup)). <br> &emsp; - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/) <br> &emsp; - [**Stitch**](https://github.com/rwdavies/stitch) This step does not require a reference panel but needs to merge the samples. <br> &emsp; - [**Quilt**](https://github.com/rwdavies/QUILT) <br> - **Ligation**: all the different chunks are merged together then all chromosomes are reunited to output one VCF per sample. |
| **--simulate** | <img src="docs/images/metro/Simulate.png" alt="simulate_metro" width="600"/> | The simulation mode is used to create artificial low informative genetic information from high density data. This allows the comparison of the imputed result to a _truth_ and therefore evaluates the quality of the imputation. <br> For the moment it is possible to simulate: <br> - Low-pass data by **downsample** BAM or CRAM using [`samtools view -s`](https://www.htslib.org/doc/samtools-view.html) at different depth. |
| **--validate** | <img src="docs/images/metro/Validate.png" alt="concordance_metro" width="600"/> | This mode compares two VCF files together to compute a summary of the differences between them. <br> This step uses [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/) concordance process. |

## Pipeline output

To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/phaseimpute/results) tab on the nf-core website pipeline page.
2 changes: 1 addition & 1 deletion assets/chr_rename_del.txt
@@ -36,4 +36,4 @@ chr35 35
chr36 36
chr37 37
chr38 38
chr39 X
chrX X
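
A typo like the `chr39 X` row is easy to catch mechanically. The sketch below assumes the file is a whitespace-separated two-column map and that every row is meant to strip the `chr` prefix and nothing else, which is an inference from the file name and the visible rows. The function name is hypothetical and this check is not part of the pipeline.

```python
def find_suspect_rows(text: str, prefix: str = "chr") -> list[tuple[int, str, str]]:
    """Return (line number, old name, new name) for rows that do more than drop the prefix."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if len(fields) != 2:
            # Skip blank or malformed lines rather than guessing their meaning.
            continue
        old, new = fields
        stripped = old[len(prefix):] if old.startswith(prefix) else old
        if stripped != new:
            suspects.append((line_no, old, new))
    return suspects


if __name__ == "__main__":
    before = "chr37 37\nchr38 38\nchr39 X\n"
    after = "chr37 37\nchr38 38\nchrX X\n"
    print(find_suspect_rows(before))
    print(find_suspect_rows(after))
```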
96 changes: 0 additions & 96 deletions docs/development.md

This file was deleted.
