Problem
Some UMIs produced by BCL Convert are prefixed with "r", indicating "reverse complement". [1]
When a run with a Unique Molecular Identifier (UMI) in the index 2 (i5) read is sequenced on a NextSeq 1000/2000 instrument, BCL Convert puts a leading "r" in front of the reverse-complemented UMI in the FASTQ header.
CopyUmiFromReadName enforces that the UMI sequence contains only valid bases (A/C/G/T/N) [2] or a delimiter between multiple UMIs (+ or -). UMIs prefixed with "r" fail this validation.
Proposed solution
I think it would be sensible to add the following features to CopyUmiFromReadName:
- --umi-delimiter (Char, default=+)
  - The default should be +, as this is the default delimiter in Illumina FASTQs. [2]
  - If this character appears in the UMI sequence, split the sequence into multiple UMIs and validate each separately.
  - Join multiple UMIs with a hyphen (-) before storing them in the RX tag, per the SAM spec. [3]
- Support reverse-complemented UMIs.
  - For each UMI, if it begins with "r", remove the "r" and (optionally?) reverse-complement the remaining sequence.
  - (NB: This could be turned off by default, e.g. with --allow-reverse-umis as a flag. @clintval raised the concern that degenerate UMIs could include r as a masked A or G [4], although this does not appear to be permitted under the current Illumina FASTQ spec. [2])
In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the
recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different
barcodes.
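The proposed behaviour could be sketched roughly as follows. This is a Python illustration only, not the tool's actual implementation: the function names (`normalize_umi`, `umi_from_read_name`) and the `reverse_complement_prefixed` switch are hypothetical, and `allow_reverse_umis` simply mirrors the flag suggested above. It also assumes the UMI is the last colon-separated field of the read name.

```python
import re

# Assumption: only A/C/G/T/N are valid UMI bases, as in the current validation.
_VALID_UMI = re.compile(r"[ACGTN]+")
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T/N sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_umi(
    raw: str,
    delimiter: str = "+",                      # mirrors the proposed --umi-delimiter
    allow_reverse_umis: bool = False,          # mirrors the proposed --allow-reverse-umis
    reverse_complement_prefixed: bool = True,  # hypothetical: the "(optionally?)" step
) -> str:
    """Split on the delimiter, validate each UMI, and join with '-' for the RX tag."""
    umis = []
    for segment in raw.split(delimiter):
        if segment.startswith("r"):
            if not allow_reverse_umis:
                raise ValueError(f"UMI '{segment}' has an 'r' prefix, which is not allowed")
            segment = segment[1:]  # drop the "r" marker
            if reverse_complement_prefixed:
                segment = reverse_complement(segment)
        if not _VALID_UMI.fullmatch(segment):
            raise ValueError(f"UMI '{segment}' contains invalid bases")
        umis.append(segment)
    return "-".join(umis)


def umi_from_read_name(name: str, **kwargs) -> str:
    """Hypothetical helper: take the last ':'-separated field as the raw UMI."""
    return normalize_umi(name.rsplit(":", 1)[-1], **kwargs)
```

With the defaults, an "r"-prefixed UMI is still rejected, so existing behaviour is unchanged unless the caller opts in. Whether the prefixed sequence should actually be reverse-complemented is left as a switch here because the proposal itself leaves that open.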
msto changed the title from "CopyUmiFromReadName should support UMI separators longer than a single character." to "CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r"" on Jan 19, 2024
Footnotes
[1] https://knowledge.illumina.com/software/general/software-general-reference_material-list/000007945
[2] https://support.illumina.com/help/BaseSpace_Sequence_Hub_OLH_009008_2/Source/Informatics/BS/FileFormat_FASTQ-files_swBS.htm
[3] https://samtools.github.io/hts-specs/SAMtags.pdf
[4] https://www.bioinformatics.org/sms/iupac.html