From 5c4d53d746f4f4bd129df54f0de430d4f10e0bda Mon Sep 17 00:00:00 2001 From: Ivan Blagoev Topolsky Date: Wed, 13 Oct 2021 18:56:51 +0200 Subject: [PATCH] Config README.md - refer to it in common error msg - Link main README to configuration --- README.md | 4 +- config/README.md | 135 ++++++++++++++++++++++++++++++++++++++ workflow/rules/common.smk | 4 +- 3 files changed, 140 insertions(+), 3 deletions(-) create mode 100644 config/README.md diff --git a/README.md b/README.md index 422af83fd..3b2815f35 100644 --- a/README.md +++ b/README.md @@ -14,10 +14,10 @@ V-pipe is written using the Snakemake workflow management system. Different ways of initializing V-pipe are presented below. -V-pipe expects the input samples to be organized in a [two-level](https://github.com/cbg-ethz/V-pipe/wiki/getting-started#input-files) directory hierarchy, +V-pipe expects the input samples to be organized in a [two-level](config/#samples) directory hierarchy, and the sequencing reads must be provided in a sub-folder named `raw_data`. Further details can be found on the [website](https://cbg-ethz.github.io/V-pipe/usage/). -We provide virus-specific base configuration files which contain handy defaults for, e.g., HIV and SARS-CoV-2. Set the virus in the general section of the configuration file: +We provide [virus-specific base configuration files](config/#virus_base_config) which contain handy defaults for, e.g., HIV and SARS-CoV-2. Set the virus in the general section of the configuration file: ```yaml general: virus_base_config: hiv diff --git a/config/README.md b/config/README.md new file mode 100644 index 000000000..57757425c --- /dev/null +++ b/config/README.md @@ -0,0 +1,135 @@ +# Configuring V-pipe + +In order to start using V-pipe, you need to provide three things: + + 1. Samples in a specific directory structure + 2. _(optional)_ TSV file listing the samples + 3. Configuration file + +## Configuration file + +The V-pipe workflow is customized using a structured configuration file called `config.yaml`, `config.json` or, for backward compatibility, `vpipe.config` (INI-like format). + +This configuration file is a text file written using a basic structure composed of sections, properties and values. When using [YAML](https://yaml.org/spec/1.0/#id2564813) or [JSON](https://www.json.org/json-en.html) format use these languages associative array/dictionaries in two levels for sections and properties. When using the older [INI format](https://docs.python.org/3/library/configparser.html), sections are expected in squared brackets, and properties are followed by corresponding values. + +Further more, it is possible to specify additional options on the command line using Snakemake's `--configfile` to pass additional YAML/JSON configuration files, and/or using Snakemake's `--config` to pass sections and properties in a [YAML Flow style](https://yaml.org/spec/1.2.0/#Flow)/JSON syntax. + +Here is an **example** of `config.yaml`: + +```yaml +general: + virus_base_config: hiv + +input: + datadir: samples + samples_file: config/samples.tsv + +output: + datadir: results + snv: true + local: true + global: false + visualization: true + QA: true +``` + +At minimum, a valid configuration **MUST** provide a reference sequence against which to align the short reads from the raw data. This can be done in several ways: + + - by using a [_virus base config_ ](#virus-base-config) that will provide default presets for specific viruses + - by directly passing a reference .fasta file in the section _input_ -> property _reference_ that will override the default + + +### virus base config + +We provide virus-specific base configuration files which contain handy defaults for some viruses. + +Currently, the following _virus base config_ are available: + + - [hiv](hiv.yaml): provides HXB2 as a reference sequence for HIV, and sets the default aligner to _ngshmmalign_. + - [sars-cov-2](sars-cov-2.yaml): provides NC_045512.2 as a reference sequence for SARS-CoV-2, sets the default aligner to _bwa_ and sets the variant calling to be done against the reference instead of the cohort's consensus. + + +### configuration manual + +More information about all the available configuration options and an exhaustive list can be found in [config.html](config.html) +or [online](https://htmlpreview.github.io/?https://github.com/cbg-ethz/V-pipe/blob/master/config/config.html). + + +### legacy V-pipe 1.xx/2.xx users + +If you want to re-use your old configuration +from a [legacy V-pipe v1.x/2.x installation](https://github.com/cbg-ethz/V-pipe/wiki/options) +or [sars-cov2 branch](https://cbg-ethz.github.io/V-pipe/tutorial/sars-cov2/#running-v-pipe) +it is possible, if you keep in mind the following caveats: + +- The older INI-like syntax is still supported for a `vpipe.config` configuration file. + - This configuration will be overridden by `config.yaml` or `config.json`, + you might want to delete those files from your working directory if you are not using them. +- V-pipe starting from version 2.99.1 follows the [Standardized usage](https://snakemake.github.io/snakemake-workflow-catalog/?rules=true) rules of the + [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/?usage=cbg-ethz/V-pipe) + - This defines a newer [directory structure](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#distribution-and-reproducibility) + - samples TSV table is now expected to be in `config/samples.tsv` + (use the section _input_ -> property _samples_file_ to override). + - the per sample output isn't written in the same `samples/` directory as the input anymore, but in a separate directory called `results/` + (use the section _output_ -> property _datadir_ to override). + - the cohort-wide output isn't written in a different `variants/` directory anymore, but at at the base of the _output datadir_ - i.e by default in `results/` + (use the section _output_ -> property _cohortdir_ to specify a different path **relative to the output datadir**). + - Add the following sections and properties to your `vpipe.config` configuration file to **bring back the legacy behaviour**: +```ini +[input] +datadir=samples +samples_file=samples.tsv + +[output] +datadir=samples +cohortdir=../variants +``` + +As of version 2.99.1, only the analysis of viral sequencing data has been +[extensively tested](https://github.com/cbg-ethz/V-pipe/actions/workflows/run_regression_tests.yaml) +and is guaranteed stable. +For other more advanced functionality you might want to wait until a future release. + +## samples tsv + +File containing sample unique identifiers and dates as tab-separated values. + +**Example:** here, we have two samples from patient 1 and one sample from patient 2: + +```tsv +patient1 20100113 +patient1 20110202 +patient2 20081130 +``` + +By default, V-pipe searches for a file named `config/samples.tsv`, if this file does not exist, a list of samples is built by searching the contents of the input datadir. + +Optionally, the samples file can contain a third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths. + +## samples + +V-pipe expects the input samples to be organized in a two-level directory hierarchy. + + - The first level can be, e.g., patient samples or biological replicates of an experiment. + - The second level can be, e.g., different sampling dates or different sequencing runs of the same sample. + - Inside that directory, the sub-directory `raw_data/` holds the sequencing data in FASTQ format (optionally compressed with GZip). + +**For example:** + +``` +samples +├── patient1 +│ ├── 20100113 +│ │ └──raw_data +│ │ ├──patient1_20100113_R1.fastq +│ │ └──patient1_20100113_R2.fastq +│ └── 20110202 +│ └──raw_data +│ ├──patient1_20100202_R1.fastq +│ └──patient1_20100202_R2.fastq +└── patient2 + └── 20081130 + └──raw_data + ├──patient2_20081130_R1.fastq.gz + └──patient2_20081130_R2.fastq.gz +``` diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk index 88136e055..450b2fc20 100644 --- a/workflow/rules/common.smk +++ b/workflow/rules/common.smk @@ -434,7 +434,9 @@ def get_reference_name(reference_file): if not VPIPE_BENCH: reference_file = config["input"]["reference"] if not reference_file: - raise ValueError(f"ERROR: No input reference in configuration.") + raise ValueError( + f"ERROR: No input reference in configuration. Please read: config/README.md or https://github.com/cbg-ethz/V-pipe/tree/master/config" + ) elif not is_local_file(reference_file): reference_file_alt = cachepath(reference_file) LOGGER.info(f"Caching {reference_file} into {reference_file_alt}")