Skip to content

Latest commit

 

History

History
640 lines (481 loc) · 44 KB

README.md

File metadata and controls

640 lines (481 loc) · 44 KB

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

License install with bioconda

Contact: Ruibang Luo, Zhenxian Zheng
Email: [email protected], [email protected]


Introduction

Clair3 is a germline small variant caller for long-reads. Clair3 makes the best of two major method categories: pileup calling handles most variant candidates with speed, and full-alignment tackles complicated candidates to maximize precision and recall. Clair3 runs fast and has superior performance, especially at lower coverage. Clair3 is simple and modular for easy deployment and integration.

Clair3 is the 3rd generation of Clair (the 2nd) and Clairvoyante (the 1st).

Clair3 is published at Nature Computational Science, and available as a preprint at bioRxiv.

For germline variant calling using long-read RNA-seq samples, please try Clair3-RNA.

For somatic variant calling using paired tumor/normal samples, please try ClairS.

For somatic variant calling using tumor-only samples, please try ClairS-TO.


Contents


Latest Updates

v1.0.10 (Jul 28, 2024) : 1. Fixed an out of range bug when outputing GVCF for non-human genome (#317). 2. Improve the calling speed when working with targeted amplicon data, Use —chunk_num=-1 to call variant without splitting genome into chunks (#306, contributor Elliot Hallmark). 3. Updated LongPhase to version 1.7.3 (#321).

v1.0.9 (May 15, 2024) : 1. Fixed an issue in VCF header(#305 by @Monica Palafox Roberts). 2. Updated DP FORMAT description in header.

v1.0.8 (Apr 29, 2024) : 1. Fixed an issue in VCF output that caused occasional quality score small differences compared to GVCF output. 2. Updated LongPhase to version 1.7.

v1.0.7 (Apr 7, 2024) : 1. Added memory guards and test function on full-alignment C implement (#286 by @Chris Wright). 2. Increased the maximum mpileup read coverage to 2^20 to adapt high-coverage amplicon data (#292 by @Devon Ryan). 3. Updated LongPhase to version 1.6.

v1.0.6 (Mar 15, 2024) : 1. Fixed stack overflow issue when the read coverage is excessively high (#282 by @Chris Wright, #265 by @ymcki). 2. Added reference caching for CRAM input (#278 by @Alex Leonard, #180 by @bartcharbon and @SamStudio8). 3. Fixed a bug that outputs RefCall calls when no variant is called by full-alignment model (#271). 4. Fixed a bug that variants below min coverage were still being called (#262). 5. Set --min_snp_af and --min_indel_af to 0.0 when --vcf_fn is provided (#261).

v1.0.5 (Dec 20, 2023) : 1. Fixed the issue showing wrong multi-allelic AF when read coverage is excessively high (#241). 2. Added --base_err and --gq_bin_size options that can resolve the problem of having excessive GT ./. in GVCF output (#220). 3. Modified logs (#231, #225)

v1.0.4 (Jul 11, 2023) : 1. Added showing command line and reference source in output VCF header. 2. Fixed a bug in showing the AF tag for 1/2 genotypes. 3. Added AD tag output.

v1.0.3 (Jun 20, 2023) : 1. Colon ':' is now allowed in reference sequence name (#203).

v1.0.2 (May 22, 2023) : 1. Added PacBio HiFi Revio model, check Pre-trained model for model usage. 2. Fixed a bug that halts the pipeline when there exists too few variant candidates (#198).

v1.0.1 (Apr 24, 2023) : 1. Bumped up "WhatsHap" version to 1.7, whatshap haplotag step is ~15% faster.(#193). 2. Fixed PL issue when alternative base is N (#191, contributor @ Dennis Hendriksen).

v1.0.0 (Mar 6, 2023) : 1. Added Clair3 version in VCF header(#141). 2. Fixed the Numpy int issue using the latest numpy version (#165, contributor @Aaron Tyler). 3. Coverted all IUPAC bases to 'N' in both VCF and GVCF output, use --keep_iupac_bases to keep the IUPAC bases (#153). 4. Added --use_longphase_for_intermediate_phasing, --use_whatshap_for_intermediate_phasing, --use_longphase_for_final_output_phasing, --use_whatshap_for_final_output_phasing, --use_whatshap_for_final_output_haplotagging options to support intermediate phasing and final VCF phasing(using WhatsHap or LongPhase) (#164). 5. Fixed shell issue in docker host user mode (#175).

v0.1-r12 (Aug 19) : 1. CRAM input is supported (#117). 2. Bumped up dependencies' version to "Python 3.9" (#96), "TensorFlow 2.8", "Samtools 1.15.1", "WhatsHap 1.4". 3. VCF DP tag now shows raw coverage for both pileup and full-alignment calls (before r12, sub-sampled coverage was shown for pileup calls if average DP > 144, (#128). 4. Fixed Illumina representation unification out-of-range error in training (#110). 5. Updated package longphase from v1.0 to v1.3 (on Sept 27th, included in all installation options labeled v0.1-r12).

v0.1-r11 minor 2 (Apr 16) : 1. fixed a bug in GVCF output that occasionally caused missing of non-variant positions at chunk boundaries. 2. fixed a bug in GVCF output that consumes too much memory for caching, now GVCF output mode takes amount of memory similar to VCF (#88).

v0.1-r11 (Apr 4) : 1. Variant calling ~2.5x faster than v0.1-r10 tested with ONT Q20 data, with feature generation in both pileup and full-alignment now implemented in C (co-contributors @cjw85, @ftostevin-ont, @EpiSlim). 2. Added the lightning-fast longphase as an option for phasing. Enable using longphase with option --longphase_for_phasing. New option disabled by default to align with the default behavior of the previous versions, but we recommend enable when calling human variants with ≥20x long-reads). 3. Added --min_coverage and --min_mq options (#83). 4. Added --min_contig_size option to skip calling variants in short contigs when using genome assembly as input. 4. Reads haplotagging after phasing before full-alignment calling now integrated into full-alignment calling to avoid generating an intermediate BAM file. 5. Supported .csi BAM index for large references (#90). For more speedup details, please check Notes on r11.

v0.1-r10 (Jan 13, 2022) : 1. Added a new ONT Guppy5 model (r941_prom_sup_g5014). Click here for some benchmarking results. This sup model is also applicable to reads called using the hac and fast mode. The old r941_prom_sup_g506 model that was fine-tuned from the Guppy3,4 model is obsoleted. 2. Added --var_pct_phasing option to control the percentage of top ranked heterozygous pile-up variants used for WhatsHap phasing.

v0.1-r9 (Dec 1) : Added the --enable_long_indel option to output indel variant calls >50bp (#64), Click here to see more benchmarking results.

v0.1-r8 (Nov 11) : 1. Added the --enable_phasing option that adds a step after Clair3 calling to output variants phased by WhatsHap (#63). 2. Fixed unexpected program termination on successful runs.

v0.1-r7 (Oct 18) : 1. Increased var_pct_full in ONT mode from 0.3 to 0.7. Indel F1-score increased ~0.2%, but took ~30 minutes longer to finish calling a ~50x ONT dataset. 2. Expand fall through to next most likely variant if network prediction has insufficient read coverage (#53 commit 09a7d185, contributor @ftostevin-ont), accuracy improved on complex Indels. 3. Streamized pileup and full-alignment training workflows. Reduce diskspace demand in model training (#55 commit 09a7d185, contributor @ftostevin-ont). 4. Added mini_epochs option in Train.py, performance slightly improved in training a model for ONT Q20 data using mini-epochs(#60, contributor @ftostevin-ont). 5. Massively reduced disk space demand when outputting GVCF. Now compressing GVCF intermediate files with lz4, five times smaller with little speed penalty. 6. Added --remove_intermediate_dirto remove intermediate files as soon as no longer needed (#48). 7. Renamed ONT pre-trained models with Medaka's naming convention. 8. Fixed training data spilling over to validation data (#57).

ONT-provided Models (Sep 23): ONT also provides Clair3 models for specific chemistries and basecallers through Rerio.

v0.1-r6 (Sep 4) : 1. Reduced memory footprint at the SortVcf stage(#45). 2. Reduced ulimit -n (number of files simultaneously opened) requirement (#45, #47). 3. Added Clair3-Illumina package in bioconda(#42).

v0.1-r5 (July 19) : 1. Modified data generator in model training to avoid memory exhaustion and unexpected segmentation fault by Tensorflow (contributor @ftostevin-ont ). 2. Simplified dockerfile workflow to reuse container caching (contributor @amblina). 3. Fixed ALT output for reference calls (contributor @wdecoster). 4. Fixed a bug in multi-allelic AF computation (AF of [ACGT]Del variants was wrong before r5). 5. Added AD tag to the GVCF output. 6. Added the --call_snp_only option to only call SNP only (#40). 7. Added pileup and full-alignment output validity check to avoid workflow crashing (#32, #38).

v0.1-r4 (June 28) : 1. Install via bioconda. 2. Added an ONT Guppy2 model to the images (ont_guppy2). Click here for more benchmarking results. The results show you have to use the Guppy2 model for Guppy2 or earlier data. 3. Added google colab notebooks for quick demo. 4. Fixed a bug when there are too few variant candidates (#28).

v0.1-r3 (June 9) : 1. Added ulimit -u (max user processes) check (lowers the THREADS if the resource is insufficient) and automatic retries on failed jobs (#20, #23, #24). 2. Added an ONT Guppy5 model to the images (ont_guppy5). Click here for more benchmarks on the Guppy5 model and data.

v0.1-r2 (May 23) : 1. Fixed BED file out of range error (#12). 2. Added support for both .bam.bai and .bai BAM index filename (#10). 3. Added some boundary checks on inputs. 4. Added version checks on required packages and utilities. 5. Increased pipeline robusity.

v0.1-r1 (May 18) : 1. Support relative path in Conda, but Docker and Singularity still require absolute path (#5). 2. Fix taskset CPU-core visibility and provide a Singularity image (#6).

v0.1 (May 17, 2021): Initial release.


Pre-trained Models

HKU-provided Models

Download models from here or click on the links below.

In a docker installation, models are in /opt/models/. In a bioconda installation, models are in {CONDA_PREFIX}/bin/models/.

Model name Platform Option (-p/--platform) Training samples Included in the bioconda package Included in the docker image Date
r941_prom_sup_g5014 ONT r9.4.1 ont HG002,4,5 (Guppy5_sup) Yes Yes 20220112
r941_prom_hac_g360+g422 ONT r9.4.1 ont HG001,2,4,5 Yes Yes 20210517
r941_prom_hac_g360+g422_1235 ONT r9.4.1 ont HG001,2,3,5 20210517
r941_prom_hac_g238 ONT r9.4.1 ont HG001,2,3,4 Yes 20210627
r941_prom_sup_g506 ONT r9.4.1 Base model: HG001,2,4,5 (Guppy3,4)
Fine-tuning data: HG002 (Guppy5_sup)
20210609
hifi_revio PacBio HiFi Revio hifi HG002,4 Yes Yes 20230522
hifi_sequel2 PacBio HiFi Sequel II hifi HG001,2,4,5 Yes Yes 20210517
ilmn Illumina ilmn HG001,2,4,5 Yes Yes 20210517

ONT-provided Models

ONT provides models for some latest or specific chemistries and basecallers (including both Guppy and Dorado) through Rerio. These models are tested and supported by the ONT developers.


What's New in Clair3

  • New Architecture. Clair3 integrates both pileup (summarized alignment statistics) model and full-alignment model for variant calling. While a pileup model determines the result of a majority of variant candidates, candidates with uncertain results are further processed with a more computational-intensive haplotype-resolved full-alignment model.
  • Improved Performance. Using HG003 85-fold coverage ONT data from PrecisionFDA for benchmarking, Clair3 achieved 99.69% SNP F1-score and 80.58% Indel F1-score. Compare to Clair, Clair3 reduced SNP errors by ~78%, and Indel errors by ~48%.
  • High Efficiency. Using 36 CPU cores,
    • Clair3 takes ~8 hours to process 50-fold WGS ONT data (~4x faster than PEPPER (r0.4) and ~14x faster than Medaka (v1.3.2)). Memory consumption of Clair3 is capped at 1 GB per CPU thread, which is roughly five times lower than Clair.
    • Clair3 takes ~2 hours to process 35-fold WGS PacBio HiFi data (13x faster than DeepVariant (v1.1.0)).
  • Using data from newer basecallers. Clair3 models were trained using data from Guppy version 3.6.0 and 4.2.2, please check Training Data for details and links.
  • GVCF Support. Clair3 can output GVCF using the --gvcf option, enabling downstream joint-sample genotyping and cohort merging.

Quick Demo

Run Clair3 ONT quick demo:

  • (Option 1) using Google Colab notebook:

    Open In Colab

  • (Option 2) using pre-built docker image:

cd ${HOME}
wget "http://www.bio8.cs.hku.hk/clair3/demo/clair3_ont_quick_demo.sh"
chmod +x clair3_ont_quick_demo.sh
./clair3_ont_quick_demo.sh

Check the results using less ${HOME}/clair3_ont_quickDemo/output/merge_output.vcf.gz


Installation

Option 1. Docker pre-built image

A pre-built docker image is available here. With it you can run Clair3 using a single command.

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR.

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
THREADS="[MAXIMUM_THREADS]"            # e.g. 8
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clair3:latest \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \    ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \       ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path="/opt/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR}               ## absolute output path prefix 

Check Usage for more options.

Option 2. Singularity

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR.

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
THREADS="[MAXIMUM_THREADS]"            # e.g. 8
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422

conda config --add channels defaults
conda create -n singularity-env -c conda-forge singularity -y
conda activate singularity-env

# singularity pull docker pre-built image
singularity pull docker://hkubal/clair3:latest

# run clair3 like this afterward
singularity exec \
  -B ${INPUT_DIR},${OUTPUT_DIR} \
  clair3_latest.sif \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \    ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \       ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path="/opt/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR}               ## absolute output path prefix

Option 3. Bioconda

For using Clair3 with Illumina data, install clair3-illumina package in bioconda channel instead.

# make sure channels are added in conda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# create conda environment named "clair3"
# replace clair3 by clair3-illumina for using illumina data
conda create -n clair3 -c bioconda clair3 python=3.9.0 -y
conda activate clair3

# run clair3 like this afterward
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422

run_clair3.sh \
  --bam_fn=input.bam \                 ## change your bam file name here
  --ref_fn=ref.fa \                    ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path="${CONDA_PREFIX}/bin/models/${MODEL_NAME}" \ 
  --output=${OUTPUT_DIR}               ## output path prefix 

Check Usage for more options. Pre-trained models are already included in the bioconda package.

Option 4. Build an anaconda virtual environment

Anaconda install:

Please install anaconda using the official guide or using the commands below:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x ./Miniconda3-latest-Linux-x86_64.sh 
./Miniconda3-latest-Linux-x86_64.sh

Install Clair3 using anaconda step by step:

For using Clair3 on Illumina data, additional installation steps after the following steps are mandatory. Please follow this guide for the additional steps.

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. ./input
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. ./output
THREADS="[MAXIMUM_THREADS]"            # e.g. 8

# create and activate an environment named clair3
conda create -n clair3 python=3.9.0 -y
source activate clair3

# install pypy and packages in the environemnt
conda install -c conda-forge pypy3.6 -y
pypy3 -m ensurepip
pypy3 -m pip install mpmath==1.2.1

# install python packages in environment
conda install -c conda-forge tensorflow==2.8.0 -y
conda install -c conda-forge pytables -y
conda install -c anaconda pigz cffi==1.14.4 -y
conda install -c conda-forge parallel=20191122 zstd -y
conda install -c conda-forge -c bioconda samtools=1.15.1 -y
conda install -c conda-forge -c bioconda whatshap=1.7 -y
conda install -c conda-forge xz zlib bzip2 automake curl -y
# tensorflow-addons is required in training
pip install tensorflow-addons

# clone Clair3
git clone https://github.com/HKU-BAL/Clair3.git
cd Clair3

# compile samtools, longphase and cffi library for c implement
# after building, longphase binary is in `Clair3` folder
source activate clair3 && make PREFIX=${CONDA_PREFIX}

# download pre-trained models
mkdir models
wget http://www.bio8.cs.hku.hk/clair3/clair3_models/clair3_models.tar.gz 
tar -zxvf clair3_models.tar.gz -C ./models

# run clair3
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422
./run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \    ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \       ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path=`pwd`"/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR}               ## output path prefix

Option 5. Docker Dockerfile

This is the same as option 1 except that you are building a docker image yourself. Please refer to option 1 for usage.

# clone Clair3
git clone https://github.com/hku-bal/Clair3.git
cd Clair3

# build a docker image named hkubal/clair3:latest
# might require docker authentication to build docker image 
docker build -f ./Dockerfile -t hkubal/clair3:latest .

# run clair3 docker image like option 1
docker run -it hkubal/clair3:latest /opt/bin/run_clair3.sh --help

Run Clair3 with Apple Silicon

Instructions are given as an answer to issue #149.


Usage

General Usage

Caution: Use =value for optional parameters, e.g. --bed_fn=fn.bed instead of --bed_fn fn.bed.

./run_clair3.sh \
  --bam_fn=${BAM} \
  --ref_fn=${REF} \
  --threads=${THREADS} \  		     
  --platform="ont" \               ## options: {ont,hifi,ilmn}
  --model_path=${MODEL_PREFIX} \   ## absolute model path prefix
  --output=${OUTPUT_DIR}           ## absolute output path prefix
## pileup output file: ${OUTPUT_DIR}/pileup.vcf.gz
## full-alignment output file: ${OUTPUT_DIR}/full_alignment.vcf.gz
## Clair3 final output file: ${OUTPUT_DIR}/merge_output.vcf.gz

Options

Required parameters:

  -b, --bam_fn=FILE             BAM file input. The input file must be samtools indexed.
  -f, --ref_fn=FILE             FASTA reference file input. The input file must be samtools indexed.
  -m, --model_path=STR          The folder path containing a Clair3 model (requiring six files in the folder, including pileup.data-00000-of-00002, pileup.data-00001-of-00002 pileup.index, full_alignment.data-00000-of-00002, full_alignment.data-00001-of-00002  and full_alignment.index).
  -t, --threads=INT             Max threads to be used. The full genome will be divided into small chunks for parallel processing. Each chunk will use 4 threads. The chunks being processed simultaneously is ceil($threads/4)*3. 3 is the overloading factor.
  -p, --platform=STR            Select the sequencing platform of the input. Possible options: {ont,hifi,ilmn}.
  -o, --output=PATH             VCF/GVCF output directory.

Other parameters:

Caution: Use =value for optional parameters, e.g., --bed_fn=fn.bed instead of --bed_fn fn.bed

      --bed_fn=FILE             Call variants only in the provided bed regions.
      --vcf_fn=FILE             Candidate sites VCF file input, variants will only be called at the sites in the VCF file if provided.
      --ctg_name=STR            The name of the sequence to be processed.
      --sample_name=STR         Define the sample name to be shown in the VCF file.
      --qual=INT                If set, variants with >$qual will be marked PASS, or LowQual otherwise.
      --samtools=STR            Path of samtools, samtools version >= 1.10 is required.
      --python=STR              Path of python, python3 >= 3.6 is required.
      --pypy=STR                Path of pypy3, pypy3 >= 3.6 is required.
      --parallel=STR            Path of parallel, parallel >= 20191122 is required.
      --whatshap=STR            Path of whatshap, whatshap >= 1.0 is required.
      --longphase=STR           Path of longphase, longphase >= 1.0 is required.
      --chunk_size=INT          The size of each chuck for parallel processing, default: 5Mbp.
      --pileup_only             Use the pileup model only when calling, default: disable.
      --print_ref_calls         Show reference calls (0/0) in vcf file, default: disable.
      --include_all_ctgs        Call variants on all contigs, otherwise call in chr{1..22,X,Y} and {1..22,X,Y}, default: disable.
      --gvcf                    Enable GVCF output, default: disable.
      --use_whatshap_for_intermediate_phasing
                                Phase high-quality heterozygous variants using whatshap for full-alignment model calling, default: enable.
      --use_longphase_for_intermediate_phasing
                                Phase high-quality heterozygous variants using longphase for full-alignment model calling, default: disable.
      --use_whatshap_for_final_output_phasing
                                Phase the output variants using whatshap, default: disable.
      --use_longphase_for_final_output_phasing
                                Phase the output variants using longphase, default: disable.
      --use_whatshap_for_final_output_haplotagging
                                Haplotag input BAM using output phased variants using whatshap, default: disable.
      --enable_phasing          Output phased variants using whatshap, default: disable.
      --longphase_for_phasing   Use longphase for phasing, default: enable.
      --disable_c_impl          Disable C implement with cffi for pileup and full-alignment create tensor, default: enable.
      --remove_intermediate_dir Remove intermediate directory, including intermediate phased BAM, pileup and full-alignment results. default: disable.
      --snp_min_af=FLOAT        Minimum SNP AF required for a candidate variant. Lowering the value might increase a bit of sensitivity in trade of speed and accuracy, default: ont:0.08,hifi:0.08,ilmn:0.08.
      --indel_min_af=FLOAT      Minimum INDEL AF required for a candidate variant. Lowering the value might increase a bit of sensitivity in trade of speed and accuracy, default: ont:0.15,hifi:0.08,ilmn:0.08.
      --var_pct_full=FLOAT      EXPERIMENTAL: Specify an expected percentage of low quality 0/1 and 1/1 variants called in the pileup mode for full-alignment mode calling, default: 0.3.
      --ref_pct_full=FLOAT      EXPERIMENTAL: Specify an expected percentage of low quality 0/0 variants called in the pileup mode for full-alignment mode calling, default: 0.3 for ilmn and hifi, 0.1 for ont.
      --var_pct_phasing=FLOAT   EXPERIMENTAL: Specify an expected percentage of high quality 0/1 variants used in WhatsHap phasing, default: 0.8 for ont guppy5 and 0.7 for other platforms.
      --pileup_model_prefix=STR EXPERIMENTAL: Model prefix in pileup calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index. default: pileup.
      --fa_model_prefix=STR     EXPERIMENTAL: Model prefix in full-alignment calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index, default: full_alignment.
      --min_mq=INT              EXPERIMENTAL: If set, reads with mapping quality with <$min_mq are filtered, default: 5.
      --min_coverage=INT        EXPERIMENTAL: Minimum coverage required to call a variant, default: 2.
      --min_contig_size=INT     EXPERIMENTAL: If set, contigs with contig size<$min_contig_size are filtered, default: 0.
      --fast_mode               EXPERIMENTAL: Skip variant candidates with AF <= 0.15, default: disable.
      --haploid_precise         EXPERIMENTAL: Enable haploid calling mode. Only 1/1 is considered as a variant, default: disable.
      --haploid_sensitive       EXPERIMENTAL: Enable haploid calling mode. 0/1 and 1/1 are considered as a variant, default: disable.
      --no_phasing_for_fa       EXPERIMENTAL: Call variants without whatshap phasing in full alignment calling, default: disable.
      --call_snp_only           EXPERIMENTAL: Call candidates pass SNP minimum AF only, ignore Indel candidates, default: disable.
      --enable_long_indel       EXPERIMENTAL: Call long Indel variants(>50 bp), default: disable.
      --keep_iupac_bases        EXPERIMENTAL: Keep IUPAC reference and alternate bases, default: convert all IUPAC bases to N.
      --base_err=FLOAT          EXPERIMENTAL: Estimated base error rate when enabling gvcf option, default: 0.001.
      --gq_bin_size=INT         EXPERIMENTAL: Default gq bin size for merge non-variant block when enabling gvcf option, default: 5.

Call variants in a chromosome

CONTIGS_LIST="[YOUR_CONTIGS_LIST]"     # e.g "chr21" or "chr21,chr22"
INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input  (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
THREADS="[MAXIMUM_THREADS]"            # e.g. 8
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clair3:latest \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \    ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \       ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path="/opt/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR} \             ## absolute output path prefix
  --ctg_name=${CONTIGS_LIST}

Call variants at known variant sites

KNOWN_VARIANTS_VCF="[YOUR_VCF_PATH]"   # e.g. /home/user1/known_variants.vcf.gz (absolute path needed)
INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
THREADS="[MAXIMUM_THREADS]"            # e.g. 8
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clair3:latest \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \    ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \       ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path="/opt/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR} \             ## absolute output path prefix
  --vcf_fn=${KNOWN_VARIANTS_VCF}

Call variants at specific sites or bed regions

We highly recommended using BED file to define the regions of interest like:

# define 0-based "ctg start end" if at specific sites
CONTIGS="[YOUR_CONTIGS_NAME]"          # e.g. chr22
START_POS="[YOUR_START_POS]"           # e.g. 0
END_POS="[YOUR_END_POS]"               # e.g 10000
echo -e "${CONTIGS}\t${START_POS}\t${END_POS}" > /home/user1/tmp.bed ## change directory accordingly

Then run Clair3 like this:

BED_FILE_PATH="[YOUR_BED_FILE]"        # e.g. /home/user1/tmp.bed (absolute path needed)
INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
THREADS="[MAXIMUM_THREADS]"            # e.g. 8
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clair3:latest \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \    ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \       ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path="/opt/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR} \             ## absolute output path prefix
  --bed_fn=${BED_FILE_PATH}

Call variants in non-diploid organisms (Haploid calling)

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
THREADS="[MAXIMUM_THREADS]"            # e.g. 8
MODEL_NAME="[YOUR_MODEL_NAME]"         # e.g. r941_prom_hac_g360+g422

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clair3:latest \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \    ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \       ## change your reference file name here
  --threads=${THREADS} \               ## maximum threads to be used
  --platform="ont" \                   ## options: {ont,hifi,ilmn}
  --model_path="/opt/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR} \
  --no_phasing_for_fa \                ## disable phasing for full-alignment
  --include_all_ctgs \                 ## call variants on all contigs in the reference fasta
  --haploid_precise                    ## optional(enable --haploid_precise or --haploid_sensitive) for haploid calling

Tips

Dealing with amplicon data

If you are dealing with amplicon data with excessively high depth coverage, please try setting --var_pct_full, and --ref_pct_full to 1. If you are dealing with human data, set --var_pct_phasing to 1. If you are dealing with non-human data, enable the --no_phasing_for_fa option. Please refer to discussion #160, and #240 for more details.


Postprocessing scripts

SwitchZygosityBasedOnSVCalls module

The module takes a Clair3 VCF and a Sniffle2 VCF as inputs. It switches the zygosity from homozygous to heterozygous of a Clair3 called SNP that matches the following two criteria: 1) AF<=0.7, and 2) the flanking 16bp of the SNP is inside one or more SV deletions given in the Sniffle2 VCF. The usage is as follows.

pypy3 ${CLAIR3_PATH}/clair3.py SwitchZygosityBasedOnSVCalls
      --bam_fn input.bam
      --clair3_vcf_input clair3_input.vcf.gz
      --sv_vcf_input sniffle2.vcf.gz
      --vcf_output output.vcf
      --threads 8

This postprocessing script was inspired by Philipp Rescheneder from ONT. There are heterozygous SNPs that overlap large deletion, and some of these SNPs are clinically significant. Clair3 doesn't call structural variants and might incorrectly output these SNPs as homozygous SNP but with relatively low AF and QUAL. Given a Sniffle2 SV VCF, the script relabels these SNPs as heterozygous, and adds two INFO tags: 1) SVBASEDHET flag, and 2) ORG_CLAIR3_SCORE that shows the original Clair3 QUAL score. The new QUAL of an SNP that switched zygosity will be the top QUAL of the deletions that overlapped the SNP.


Folder Structure and Submodule Descriptions

Submodules in clair3/ are for variant calling and model training. Submodules in preprocess are for data preparation.

For all the submodules listed below, you can use -h or --help for available options.

clair3/ Note: submodules under this folder are pypy incompatible, please run using python
CallVariants Call variants using a trained model and tensors of candidate variants.
CallVarBam Call variants using a trained model and a BAM file.
Train Training a model using the RectifiedAdam optimizer. We also use the Lookahead optimizer to adjust the RectifiedAdam parameters dynamically. The initial learning rate is 1e-3 with 0.1 learning rate warm-up. Input a binary containing tensors created by Tensor2Bin.
preprocess/ Note: submodules under this folder is Pypy compatible unless specified.
CheckEnvs Check the environment and validity of the input variables, preprocess the BED input if necessary, --chunk_size sets the chuck size to be processed per parallel job.
CreateTensorPileup Generate variant candidate tensors in pileup format for training or calling.
CreateTensorFullAlignment Generate variant candidate tensors in phased full-alignment format for training or calling.
GetTruth Extract the variants from a truth VCF. Input: VCF; Reference FASTA if the VCF contains asterisks in ALT field.
MergeVcf Merge pileup and full-alignment VCF/GVCF.
RealignReads Reads local realignment for Illumina platform.
SelectCandidates Select pileup candidates for full-alignment calling.
SelectHetSnp Select heterozygous SNP candidates for whatshap phasing.
SelectQual Select a quality cutoff using the pileup calling results. Variants below the cutoff are included in phasing and full-alignment calling.
SortVcf Sort VCF file.
SplitExtendBed Split BED file regions according to the contig names and extend bed region by 33bp by default for variant calling.
UnifyRepresentation Representation unification between candidate sites and true variants.
MergeBin Combine tensor binaries into a single file.
CreateTrainingTensor Create tensor binaries for pileup or full-alignment training.
Tensor2Bin Combine the variant and non-variant tensors and convert them to a binary, using blosc:lz4hc meta-compressor, the overall training memory is 10~15G (pypy incompatible).

Training Data

Clair3 trained both its pileup and full-alignment models using four GIAB samples (HG001, HG002, HG004 and HG005), excluded HG003. On ONT, we also trained a model using HG001, 2, 3, and 5, excluded HG004. All models were trained with chr20 excluded (including only chr1-19, 21, 22).

Platform Reference Aligner Training samples
ONT GRCh38_no_alt minimap2 HG001,2,(3|4),5
PacBio HiFi Sequel II GRCh38_no_alt pbmm2 HG001,2,4,5
PacBio HiFi Revio GRCh38_no_alt pbmm2 HG002,4
Illumina GRCh38 BWA-MEM/NovoAlign HG001,2,4,5

Please find more details about the training data and links at Training Data.


VCF/GVCF Output Formats

Clair3 supports both VCF and GVCF output formats. Clair3 uses VCF version 4.2 specifications. Specifically, Clair3 adds a P INFO tag to the results called using a pileup model, and a F INFO tag to the results called using a full-alignment model.

Clair3 outputs a GATK-compatible GVCF format that passes GATK's ValidateVariants module. Different from DeepVariant that uses <*> to represent any possible alternative allele, Clair3 uses <NON_REF>, the same as GATK.

Clair3 GVCF files can be merged with GLNexus. A GLNexus caller based configuration file is available Download.