CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r" #957

msto · 2024-01-19T16:32:14Z

Problem

Some UMIs produced by BCL Convert are prefixed with "r", indicating that the UMI has been reverse-complemented.¹

When sequencing a run with Unique Molecular Identifier (UMI) situated in the index2 (i5) on a NextSeq1000/2000 instrument, BCL Convert will put a leading "r" in front of the reverse-complemented UMI in the FASTQ header.

CopyUmiFromReadName enforces that the UMI sequence contains only valid bases (A/C/G/T/N²) or a delimiter between multiple UMIs (+ or -). UMIs prefixed with "r" fail this validation.
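To illustrate the failure, here is a minimal Python sketch of a check along these lines. It is not the tool's actual implementation; the regular expression and the name `is_valid_umi` are assumptions made purely for illustration.

```python
import re

# Assumed approximation of the check described above: one or more UMIs made of
# A/C/G/T/N, separated by '+' or '-'. The real validation may differ in detail.
VALID_UMI = re.compile(r"[ACGTN]+(?:[+-][ACGTN]+)*")


def is_valid_umi(umi: str) -> bool:
    """Return True if the whole string is bases plus allowed delimiters."""
    return VALID_UMI.fullmatch(umi) is not None


print(is_valid_umi("ACGT+TTGA"))   # True
print(is_valid_umi("ACGT+rTTGA"))  # False: the leading "r" is not a valid base
```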

Proposed solution

I think it would be sensible to add the following features to CopyUmiFromReadName:

--umi-delimiter (Char, default=+)
- The default should be +, as this is the default delimiter in Illumina FASTQs.²
- If this character appears in the UMI sequence, split the sequence into multiple UMIs and validate each separately.
- Join multiple UMIs with a hyphen (-) before storing them in the RX tag, per the SAM spec.³

Support reverse-complemented UMIs
- For each UMI, if it begins with "r", remove the "r" and (optionally?) reverse-complement the remaining sequence.
- (NB: This could be off by default and enabled with a flag, e.g. --allow-reverse-umis. @clintval raised the concern that degenerate UMIs could include "r" as a masked A or G (IUPAC code R)⁴, although this does not appear to be permitted under the current Illumina FASTQ spec.²)
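The proposed behaviour could be sketched roughly as follows. This is a Python illustration only, not the tool's code: the function `normalize_umi` and its parameters (`delimiter`, `rc_prefix`, `reverse_complement_prefixed`) are hypothetical names standing in for the options discussed above, and rejecting anything outside A/C/G/T/N is an assumption taken from the current validation rule.

```python
# Hypothetical sketch of the proposal: split on the delimiter, handle an "r"
# prefix per UMI, validate each UMI, then join with '-' for the RX tag.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
_VALID_BASES = frozenset("ACGTN")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_umi(
    raw: str,
    delimiter: str = "+",
    rc_prefix: str = "r",
    reverse_complement_prefixed: bool = True,
) -> str:
    """Convert a read-name UMI field into an RX-style value."""
    parts = []
    for umi in raw.split(delimiter):
        if umi.startswith(rc_prefix):
            umi = umi[len(rc_prefix):]  # drop the prefix
            if reverse_complement_prefixed:
                umi = reverse_complement(umi)
        # Validate each UMI separately, after the prefix has been removed.
        if not umi or not set(umi) <= _VALID_BASES:
            raise ValueError(f"Invalid UMI {umi!r} in {raw!r}")
        parts.append(umi)
    return "-".join(parts)


print(normalize_umi("ACGT+rTTGA"))                                     # ACGT-TCAA
print(normalize_umi("ACGT+rTTGA", reverse_complement_prefixed=False))  # ACGT-TTGA
```

Validating after the prefix is stripped means a lowercase "r" anywhere else in the UMI is still rejected, which sidesteps the masked-purine ambiguity for every position except the first.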

Footnotes

¹ https://knowledge.illumina.com/software/general/software-general-reference_material-list/000007945

² https://support.illumina.com/help/BaseSpace_Sequence_Hub_OLH_009008_2/Source/Informatics/BS/FileFormat_FASTQ-files_swBS.htm
"Restricted characters: A/T/G/C/N"
"UMI sequences for Read 1 and Read 2, separated by a plus [+]."

³ https://samtools.github.io/hts-specs/SAMtags.pdf
"In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different barcodes."

⁴ https://www.bioinformatics.org/sms/iupac.html

msto mentioned this issue Jan 19, 2024

feat: add --umi-prefix to CopyUmiFromReadName #958

Merged

msto changed the title ~~CopyUmiFromReadName should support UMI separators longer than a single character.~~ CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r" Jan 19, 2024
