Guppy5 model (r941_prom_sup_g5014)

You can use the Guppy5 model r941_prom_sup_g5014 for Guppy6 data.

Regarding to an old Guppy5 model r941_prom_sup_g506 released on 20210609 fine-tuned from the Guppy3,4 model , please refer to this document.

About `r941_prom_sup_g5014`

Jan 13, 2022. In last June during the early release of ~70x Guppy5 HG002 data, we released the first Clair3 Guppy5 model r941_prom_sup_g506 by fine-tuning the Guppy3,4 model with the new Guppy5 data. The improvement was significant when applying the fine-tuned model to Guppy5 reads. But at that moment, without much Guppy5 data, we were unable to train a new model from scratch, and fine-tuning was the only option.

To train a Guppy5 model from scratch, we basecalled four GIAB samples (HG002, 3, 4, 5) with fast5 available from the Human Pangenome Reference Consortium (HPRC) using Guppy5 and the dna_r9.4.1_450bps_sup mode. The basecalling took (quite) a while. Then, we trained a new Clair3 model r941_prom_sup_g5014 using the training procedures detailed in Pileup Model Training and Full-Alignment Model Training.

We compared the new Guppy5 trained from scratch model r941_prom_sup_g5014 against the old fine-tuned model r941_prom_sup_g506 and PEPPER r0.7. The variant calling results of HG003 with input coverage from 10x to 50x are shown below. For the sake of comparison, the results of using Guppy4 data on the Guppy3+4 model r941_prom_hac_g360+g422 are also shown. Compared to r941_prom_sup_g506, r941_prom_sup_g5014 on average has improved SNP F1-score by ~0.05% and Indel F1-score by ~3% .

HG003 WGS 10x performance:

Model/ Caller	Testing Dataset	Overall Precision	Overall Recall	Overall F1	SNP Precision	SNP Recall	SNP F1	Indel Precision	Indel Recall	Indel F1
r941_prom_sup_g5014	Guppy5 sup 10x	96.01%	90.21%	93.02%	97.11%	96.53%	96.82%	83.76%	48.51%	61.44%
r941_prom_sup_g506	Guppy5 sup 10x	97.38%	88.86%	92.93%	98.19%	95.49%	96.82%	87.48%	45.14%	59.55%
PEPPER (r0.7)	Guppy5 sup 10x	96.39%	82.91%	89.14%	97.54%	89.24%	93.20%	82.73%	41.17%	54.98%
Guppy3+4	Guppy4 hac 10x	88.96%	86.11%	87.51%	90.55%	93.35%	91.93%	69.59%	38.32%	49.43%

HG003 WGS 20x performance:

Model/ Caller	Testing Dataset	Overall Precision	Overall Recall	Overall F1	SNP Precision	SNP Recall	SNP F1	Indel Precision	Indel Recall	Indel F1
r941_prom_sup_g5014	Guppy5 sup 20x	98.17%	94.54%	96.32%	99.22%	99.42%	99.32%	88.56%	62.33%	73.17%
r941_prom_sup_g506	Guppy5 sup 20x	98.35%	93.74%	95.99%	99.44%	99.01%	99.23%	87.90%	58.95%	70.57%
PEPPER (r0.7)	Guppy5 sup 20x	98.56%	93.59%	96.01%	99.49%	99.00%	99.25%	89.35%	57.88%	70.25%
Guppy3+4	Guppy4 hac 20x	97.47%	92.79%	95.07%	99.17%	98.77%	98.97%	80.91%	53.33%	64.29%

HG003 WGS 30x performance:

Model/ Caller	Testing Dataset	Overall Precision	Overall Recall	Overall F1	SNP Precision	SNP Recall	SNP F1	Indel Precision	Indel Recall	Indel F1
r941_prom_sup_g5014	Guppy5 sup 30x	98.56%	95.55%	97.03%	99.49%	99.65%	99.57%	90.63%	68.46%	78.00%
r941_prom_sup_g506	Guppy5 sup 30x	98.27%	94.95%	96.58%	99.60%	99.42%	99.51%	86.83%	65.42%	74.62%
PEPPER (r0.7)	Guppy5 sup 30x	98.77%	94.96%	96.83%	99.68%	99.55%	99.62%	90.58%	64.68%	75.47%
Guppy3+4	Guppy4 hac 30x	98.13%	94.10%	96.07%	99.50%	99.40%	99.45%	85.40%	59.18%	69.91%

HG003 WGS 40x performance:

Model/ Caller	Testing Dataset	Overall Precision	Overall Recall	Overall F1	SNP Precision	SNP Recall	SNP F1	Indel Precision	Indel Recall	Indel F1
r941_prom_sup_g5014	Guppy5 sup 40x	98.73%	96.11%	97.40%	99.59%	99.70%	99.65%	91.71%	72.39%	80.91%
r941_prom_sup_g506	Guppy5 sup 40x	98.14%	95.59%	96.85%	99.65%	99.56%	99.61%	86.01%	69.40%	76.82%
PEPPER (r0.7)	Guppy5 sup 40x	98.82%	95.59%	97.18%	99.72%	99.65%	99.68%	91.19%	68.86%	78.47%
Guppy3+4	Guppy4 hac 40x	98.62%	94.63%	96.58%	99.61%	99.55%	99.58%	89.47%	62.16%	73.36%

HG003 WGS 50x performance:

Model/ Caller	Testing Dataset	Overall Precision	Overall Recall	Overall F1	SNP Precision	SNP Recall	SNP F1	Indel Precision	Indel Recall	Indel F1
r941_prom_sup_g5014	Guppy5 sup 50x	98.80%	96.52%	97.64%	99.63%	99.73%	99.68%	92.19%	75.32%	82.91%
r941_prom_sup_g506	Guppy5 sup 50x	97.99%	95.99%	96.98%	99.66%	99.64%	99.65%	85.22%	71.93%	78.02%
PEPPER (r0.7)	Guppy5 sup 50x	98.86%	96.01%	97.42%	99.74%	99.68%	99.71%	91.70%	71.76%	80.51%
Guppy3+4	Guppy4 sup hac 50x	98.79%	95.01%	96.86%	99.67%	99.60%	99.63%	90.86%	64.73%	75.60%

Shall we use a hac model or sup model for hac data?

Conclusion first, use the sup model for hac data. But use sup data for the best performance.

Using the sup model for hac data, the overall F1 is 1.06% higher than using the hac model for hac data. Both SNP and Indel F1 were significantly better when using the sup model for hac data. But interestingly, both SNP recall and Indel recall were higher when the hac model was used for hac data. The result indicates that while some hac-data-specific characteristics have been learned by the hac model, the relatively higher noise in the hac data has hindered other characteristics shared between sup and hac data to be better learned. How to train a model that works best for hac data needs to be further explored, but the rule of thumb at the moment is to use the sup model for hac data.

HG003 WGS 50x performance:

Model	Testing Dataset	Overall Precision	Overall Recall	Overall F1	SNP Precision	SNP Recall	SNP F1	Indel Precision	Indel Recall	Indel F1
Guppy5 sup model	Guppy5 sup 50x	98.80%	96.52%	97.64%	99.63%	99.73%	99.68%	92.19%	75.32%	82.91%
Guppy5 sup model	Guppy5 hac 50x	98.12%	95.91%	97.00%	99.64%	99.67%	99.65%	86.28%	71.12%	77.97%
Guppy5 hac model	Guppy5 hac 50x	95.01%	96.90%	95.94%	99.02%	99.77%	99.39%	71.30%	77.94%	74.47%
Guppy5 hac model	Guppy5 sup 50x	94.93%	96.80%	95.86%	99.07%	99.78%	99.42%	70.45%	77.19%	73.67%

How to use the Guppy5 model

INPUT_DIR="[YOUR_INPUT_FOLDER]"           # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"         # e.g. /home/user1/output (absolute path needed)
THREADS="[MAXIMUM_THREADS]"               # e.g. 8

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clair3:latest \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/input.bam \       ## change your bam file name here
  --ref_fn=${INPUT_DIR}/ref.fa \          ## change your reference name here
  --threads=${THREADS} \                  ## maximum threads to be used
  --platform="ont" \                       
  --model_path="/opt/models/r941_prom_sup_g5014" \
  --output=${OUTPUT_DIR}                  ## absolute output path prefix

Check Usage for more options.

Guppy5 data used for training and testing

FASTQs

http://www.bio8.cs.hku.hk/guppy5_data/

FAST5s

Sample	Reference	Aligner	Coverage	Basecaller	Training	Testing	link
HG002	GRCh38_no_alt	minimap2	70.0	Guppy v5.0.6 (dna_r9.4.1_450bps_sup_prom.cfg)	Yes		link
HG002	GRCh38_no_alt	minimap2	117.4	Guppy v5.0.14 (dna_r9.4.1_450bps_sup_prom.cfg)	Yes		link
HG003	GRCh38_no_alt	minimap2	77.1	Guppy v5.0.14 (dna_r9.4.1_450bps_sup_prom.cfg)		Yes	link
HG003	GRCh38_no_alt	minimap2	78.9	Guppy v5.0.14 (dna_r9.4.1_450bps_hac_prom.cfg)		Yes	link
HG004	GRCh38_no_alt	minimap2	79.0	Guppy v5.0.14 (dna_r9.4.1_450bps_sup_prom.cfg)	Yes		link
HG005	GRCh38_no_alt	minimap2	39.5	Guppy v5.0.14 (dna_r9.4.1_450bps_sup_prom.cfg)	Yes		link

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

guppy5_20220113.md

guppy5_20220113.md

Guppy5 model (r941_prom_sup_g5014)

Contents

About `r941_prom_sup_g5014`

HG003 WGS 10x performance:

HG003 WGS 20x performance:

HG003 WGS 30x performance:

HG003 WGS 40x performance:

HG003 WGS 50x performance:

Shall we use a hac model or sup model for hac data?

HG003 WGS 50x performance:

How to use the Guppy5 model

Guppy5 data used for training and testing

FASTQs

FAST5s

Files

guppy5_20220113.md

Latest commit

History

guppy5_20220113.md

File metadata and controls

Guppy5 model (r941_prom_sup_g5014)

Contents

About r941_prom_sup_g5014

HG003 WGS 10x performance:

HG003 WGS 20x performance:

HG003 WGS 30x performance:

HG003 WGS 40x performance:

HG003 WGS 50x performance:

Shall we use a hac model or sup model for hac data?

HG003 WGS 50x performance:

How to use the Guppy5 model

Guppy5 data used for training and testing

FASTQs

FAST5s

About `r941_prom_sup_g5014`