CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r" #957

Open
msto opened this issue Jan 19, 2024 · 0 comments

Problem

Some UMIs produced by BCL Convert are prefixed with "r", indicating "reverse complement." [1]

When sequencing a run with Unique Molecular Identifier (UMI) situated in the index2 (i5) on a NextSeq1000/2000 instrument, BCL Convert will put a leading "r" in front of the reverse-complemented UMI in the FASTQ header.

CopyUmiFromReadName enforces that the UMI sequence contains only valid bases (A/C/G/T/N [2]) or a delimiter between multiple UMIs (+ or -). UMIs prefixed with "r" fail this validation.
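
To make the failure concrete, here is a minimal Python sketch of the check as described above. It is not fgbio's actual code: the regex, both function names, and the read names are hypothetical, and it assumes the UMI is the last colon-delimited field of the read name.

```python
import re

# Assumption: this mirrors the rule described in this issue (valid bases
# A/C/G/T/N plus the "+" and "-" delimiters); it is not fgbio's code.
VALID_UMI = re.compile(r"[ACGTN+-]+")


def umi_from_read_name(name: str) -> str:
    """Return the last colon-delimited field of a read name."""
    return name.rsplit(":", 1)[-1]


def is_valid_umi(umi: str) -> bool:
    """True if the UMI contains only valid bases and delimiters."""
    return VALID_UMI.fullmatch(umi) is not None


# Hypothetical read names; the second carries an "r"-prefixed i5 UMI.
plain = "VH00123:1:AAATESTXX:1:1101:1000:2000:ACGTACGT+TTGACCAA"
prefixed = "VH00123:1:AAATESTXX:1:1101:1000:2000:ACGTACGT+rTTGACCAA"

print(is_valid_umi(umi_from_read_name(plain)))     # True
print(is_valid_umi(umi_from_read_name(prefixed)))  # False
```

The second read name is rejected only because of the leading "r"; the bases that follow it are all valid.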

Proposed solution

I think it would be sensible to add the following features to CopyUmiFromReadName:

  • --umi-delimiter (Char, default=+)

    • The default should be +, as this is the default delimiter in Illumina FASTQs. [2]
    • If this character appears in the UMI sequence, split the sequence into multiple UMIs and validate each separately.
    • Join multiple UMIs with a hyphen (-) before storing them in the RX tag, per the SAM spec. [3]
  • Support reverse-complemented UMIs.

    • For each UMI, if it begins with "r", remove the "r" and (optionally?) reverse-complement the remaining sequence.
    • (NB: This could be turned off by default, e.g. with --allow-reverse-umis as a flag. @clintval raised the concern that degenerate UMIs could include "r" as a masked A or G [4], although this does not appear to be permitted under the current Illumina FASTQ spec. [2])
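
A rough Python sketch of the proposed behavior, for discussion only. fgbio is written in Scala, so this is not a patch. The function and parameter names are hypothetical: `delimiter` and `allow_reverse_umis` stand in for the proposed `--umi-delimiter` and `--allow-reverse-umis` options, and `revcomp_prefixed` covers the open question of whether to reverse-complement after stripping the "r".

```python
# Complement table for the bases the Illumina FASTQ spec allows in a UMI.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
_VALID_BASES = set("ACGTN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence of A/C/G/T/N bases."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_umi(
    raw: str,
    delimiter: str = "+",              # hypothetical stand-in for --umi-delimiter
    allow_reverse_umis: bool = False,  # hypothetical stand-in for --allow-reverse-umis
    revcomp_prefixed: bool = True,     # open question: revcomp after stripping "r"?
) -> str:
    """Split on the delimiter, validate each UMI, and join with '-' for RX."""
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    parts = []
    for umi in raw.split(delimiter):
        is_reversed = umi.startswith("r")
        if is_reversed:
            if not allow_reverse_umis:
                raise ValueError(f"'r'-prefixed UMI not allowed: {umi!r}")
            umi = umi[1:]  # drop the "r" marker before validating
        if not umi or not set(umi) <= _VALID_BASES:
            raise ValueError(f"invalid UMI: {umi!r}")
        if is_reversed and revcomp_prefixed:
            umi = reverse_complement(umi)
        parts.append(umi)
    # The SAM spec recommends a hyphen between multiple UMIs in the RX tag.
    return "-".join(parts)


print(normalize_umi("ACGT+rTTGA", allow_reverse_umis=True))  # ACGT-TCAA
print(normalize_umi("ACGT+rTTGA", allow_reverse_umis=True,
                    revcomp_prefixed=False))                 # ACGT-TTGA
```

With the defaults, an "r"-prefixed UMI still raises an error, which matches the suggestion that the feature could be off unless explicitly enabled.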

Footnotes

  1. https://knowledge.illumina.com/software/general/software-general-reference_material-list/000007945

  2. https://support.illumina.com/help/BaseSpace_Sequence_Hub_OLH_009008_2/Source/Informatics/BS/FileFormat_FASTQ-files_swBS.htm

    Restricted characters: A/T/G/C/N
    UMI sequences for Read 1 and Read 2, separated by a plus [+].

  3. https://samtools.github.io/hts-specs/SAMtags.pdf

    In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the
    recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different
    barcodes.

  4. https://www.bioinformatics.org/sms/iupac.html

@msto changed the title from "CopyUmiFromReadName should support UMI separators longer than a single character." to "CopyUmiFromReadName should support reverse-complemented UMIs prefixed by "r"" on Jan 19, 2024