Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IlluminaBasecallsToSam skipping barcodes #1774

Open
nchernia opened this issue Jan 27, 2022 · 5 comments
Open

IlluminaBasecallsToSam skipping barcodes #1774

nchernia opened this issue Jan 27, 2022 · 5 comments

Comments

@nchernia
Copy link

Bug Report

Affected tool(s)

IlluminaBasecallsToSam (possibly ExtractIlluminaBarcodes)

Affected version(s)

Version:2.26.3

Description

I am using ExtractIlluminaBarcodes and IlluminaBasecallsToSam as a first demultiplexing step to separate reads by library barcode. I am comparing to doing two passes via bcl2fastq and then directly examining the index reads to demultiplex by library.

The bam files produced by IlluminaBasecallsToSam are missing barcodes that exist in bcl2fastq. These are exact matches with high quality basecall strings, e.g. FFFFFFFF. I have tested this on two different bcl files so far and in both cases there's approximately 0.1% of reads missing.

Since ExtractIlluminaBarcodes is called before IlluminaBasecallsToSam and reports the number of matches per library (which is equal to the number of reads in the corresponding library bam file), this might be an error in ExtractIlluminaBarcodes instead of IlluminaBasecallsToSam.

Steps to reproduce

ExtractIlluminaBarcodes BASECALLS_DIR="Data/Intensities/BaseCalls" TMP_DIR=. OUTPUT_DIR=. \ BARCODE_FILE=barcode_params.tsv METRICS_FILE=barcode_metrics1.tsv \ READ_STRUCTURE=50T14S10M29S10M27S9M8B50T \ 
LANE=1 NUM_PROCESSORS=4 COMPRESSION_LEVEL=1 GZIP=true

barcode_params.tsv

barcode_sequence_1      barcode_name    library_name
GCGATCTA        P1.01   P1.01
ATAGAGAG        P1.02   P1.02
AGAGGATA        P1.03   P1.03
TCTACTCT        P1.04   P1.04
CTCCTTAC        P1.05   P1.05
TATGCAGT        P1.06   P1.06
TACTCCTT        P1.07   P1.07
AGGCTTAG        P1.08   P1.08
GATTTCCA        P1.09   P1.09
ATCATGTT        P1.10   P1.10
TTTCATCA        P1.11   P1.11
AGTCCGAC        P1.12   P1.12
GCTAGAAA        P1.13   P1.13
CTTGGTTA        P1.14   P1.14
CGATACAC        P1.15   P1.15
TTGATGGA        P1.16   P1.16
IlluminaBasecallsToSam BASECALLS_DIR="Data/Intensities/BaseCalls" BARCODES_DIR=. TMP_DIR=. \
LIBRARY_PARAMS=library_params1.tsv READ_STRUCTURE=50T14S10M29S10M27S9M8B50T  \
LANE=1 RUN_BARCODE="A01253:322:HNGMYDSX2" \
SEQUENCING_CENTER="BI" NUM_PROCESSORS=4 \
IGNORE_UNEXPECTED_BARCODES=true 

library_params1.tsv

OUTPUT  SAMPLE_ALIAS    LIBRARY_NAME    BARCODE_1
P1.01.bam       P1.01   P1.01   GCGATCTA
P1.02.bam       P1.02   P1.02   ATAGAGAG
P1.03.bam       P1.03   P1.03   AGAGGATA
P1.04.bam       P1.04   P1.04   TCTACTCT
P1.05.bam       P1.05   P1.05   CTCCTTAC
P1.06.bam       P1.06   P1.06   TATGCAGT
P1.07.bam       P1.07   P1.07   TACTCCTT
P1.08.bam       P1.08   P1.08   AGGCTTAG
P1.09.bam       P1.09   P1.09   GATTTCCA
P1.10.bam       P1.10   P1.10   ATCATGTT
P1.11.bam       P1.11   P1.11   TTTCATCA
P1.12.bam       P1.12   P1.12   AGTCCGAC
P1.13.bam       P1.13   P1.13   GCTAGAAA
P1.14.bam       P1.14   P1.14   CTTGGTTA
P1.15.bam       P1.15   P1.15   CGATACAC
P1.16.bam       P1.16   P1.16   TTGATGGA

Expected behavior

All reads with matching library barcode are in appropriate bam file

Actual behavior

0.1% of reads with exact match barcodes are missing.

@nchernia
Copy link
Author

An update: I have tried running with a simpler read structure (50T99S8B50T) and this did not change the results. I have also tried deleting the barcodes files and running just IlluminaBasecallsToSam with MATCH_BARCODES_INLINE=true and this also did not change the results.

@gbggrant
Copy link
Contributor

gbggrant commented Feb 2, 2022

Hi @nchernia have you tried running this on a newer version of Picard (the latest release is 2.26.10)? We fixed a number of errors in EIB and IBCTo[Sam|Fastq] last year (although this does not sound like one of those).

Also, is it possible that the missing barcodes are those that do not exactly match the input barcodes list (have more than one error)? That would be indicated in the metrics file generated by EIB. If possible, can you upload that file too?

Thanks.

George

@nchernia
Copy link
Author

nchernia commented Feb 3, 2022

Hi George,

Thanks for getting back to me.

I've rerun with the latest jar and the results are the same.

Here is the head of the metrics file.

## htsjdk.samtools.metrics.StringHeader
# IlluminaBasecallsToSam RUN_BARCODE=A01253:322:HNGMYDSX2 SEQUENCING_CENTER=BI LIBRARY_PARAMS=/mnt/disks/big_data/211110_SL-NVV_0322_AHNGMYDSX2/library_params1.tsv NUM_PROCESSORS=0
 APPLY_EAMSS_FILTER=false IGNORE_UNEXPECTED_BARCODES=true MATCH_BARCODES_INLINE=true LANE=[1] READ_STRUCTURE=50T14S10M28S10M28S9M8B50T BASECALLS_DIR=Data/Intensities/BaseCalls METRICS_FILE=metrics1.tsv TMP_DIR=[.] MAX_RECORDS_IN_RAM=5000000    PLATFORM=ILLUMINA INCLUDE_BC_IN_RG_TAG=false ADAPTERS_TO_CHECK=[INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM] MAX_READS_IN_RAM_PER_TILE=-1 INCLUDE_NON_PF_READS=true MOLECULAR_INDEX_TAG=RX MOLECULAR_INDEX_BASE_QUALITY_TAG=QX BARCODE_POPULATION_STRATEGY=ORPHANS_ONLY INCLUDE_BARCODE_QUALITY=false SO RT=true DISTANCE_MODE=HAMMING MAX_MISMATCHES=1 MIN_MISMATCH_DELTA=1 MAX_NO_CALLS=2 MINIMUM_BASE_QUALITY=0 MINIMUM_QUALITY=2 COMPRESS_OUTPUTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Feb 02 21:51:42 UTC 2022
## METRICS CLASS        picard.illumina.BarcodeMetric
BARCODE BARCODE_WITHOUT_DELIMITER       BARCODE_NAME    LIBRARY_NAME    READS   PF_READS      PERFECT_MATCHES PF_PERFECT_MATCHES      ONE_MISMATCH_MATCHES    PF_ONE_MISMATCH_MATCHES     PCT_MATCHES     RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT  PF_PCT_MATCHES PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT       PF_NORMALIZED_MATCHES
GCGATCTA        GCGATCTA                        250571769       250571769       245862068       245862068       4709701 4709701 0.102852        1       0.102852        1       1.64564
ATAGAGAG        ATAGAGAG                        92371331        92371331        86261451        86261451        6109880 6109880 0.037916        0.368642        0.037916        0.368642    0.606652
AGAGGATA        AGAGGATA                        104698897       104698897       100114893       100114893       4584004 4584004 0.042976        0.41784 0.042976        0.41784 0.687614

This run is with IBtoS directly, but when I run EIB and IBtoS the counts are the same, except that the metrics file produced via the flag METRICS_FILE (and INLINE_BARCODES) doesn't contain the library name; moreover as another issue reported, the NNNN string is reported as all 0s when running IBtoS directly, and so the percentages are different.

Looking at the metrics file when running EIB and IBtoS, I can sum the numbers in the READS column. This total number, which includes reads that matched one of the passed in barcodes and reads that didn't, sums to the total number of reads in the Undetermined_I2 fastq file produced by bcl2fastq. The number of reads reported in the READS column also matches the number of reads in the BAM file.

It seems as if Picard is reporting these as unmatched when they appear as though they should be perfect matches and assigned to a library. I am wondering if there's some other filter that's not obvious from the options that Picard is doing?

Here is an example of a read from the above - you can see it matches the first barcode perfectly. It is not in the resulting bam file.

~>zcat Data/Intensities/BaseCalls/Undetermined_S0_L001_I2_001.fastq.gz | grep -m 1 -A3 A01253:322:HNGMYDSX2:1:1102:19859:22106 
@A01253:322:HNGMYDSX2:1:1102:19859:22106 2:N:0:0
GCGATCTA
+
FFFFFFFF

@nchernia
Copy link
Author

nchernia commented Feb 4, 2022

Hello,

I've confirmed this bug on a third BCL folder, on a different experiment (this time ChiP-seq, with a simple read structure 76T8B). I am at the Broad and happy to share results with you so you can continue to debug. It's consistently a small but persistent error, affecting a small percentage of reads. For something like ChiP-seq, maybe losing a small number of reads isn't important, but for single cell experiments we don't want to lose any reads, and this bug would force us to use bcl2fastq (which we'd prefer not to do, since we'd then have to rewrite the Picard functionality of mismatch tolerance and writing to bams).

@nchernia
Copy link
Author

Please do let me know if you'd like me to share an example BCL together with the other files. They are stored on a Google bucket. My Broad username is neva if you'd like to email instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants