Shuffle identical alignments pseudo-randomly on query name instead? #413

Open
ksahlin opened this issue Mar 30, 2024 · 0 comments


ksahlin commented Mar 30, 2024

Hi @marcelm (CC @Itolstoganov)

Currently we shuffle on chunk_ID, which makes read mappings differ depending on the number of threads, or on which chunk a read happens to land in.

IIRC, BWA-MEM derives its pseudo-random placement from the read name. Is it possible to do this instead of shuffling on chunk ID, without noticeable computational overhead? I don't think it's worth implementing if the code becomes complex or if it increases runtime.
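To make the idea concrete, here is a minimal sketch of tie-breaking seeded by the read name. It is an illustration only, not strobealign's or BWA-MEM's actual code: the function name `pick_among_ties` is hypothetical, CRC32 is just one cheap stable hash chosen for the example, and candidates are reduced to plain positions.

```python
import random
import zlib


def pick_among_ties(read_name: str, candidates: list[int]) -> int:
    """Choose one of several equally scoring positions for a read.

    The choice depends only on the read name and the set of candidates,
    not on thread count or on which chunk the read was placed in.
    (Hypothetical helper for illustration.)
    """
    # A stable hash of the name; Python's built-in hash() is salted per
    # process for strings, so it would not be reproducible across runs.
    seed = zlib.crc32(read_name.encode("utf-8"))
    rng = random.Random(seed)
    # Sort so the result does not depend on the order in which the
    # candidates were discovered.
    ordered = sorted(candidates)
    return ordered[rng.randrange(len(ordered))]


# The two tied chrY positions reported in this issue.
chosen = pick_among_ties("simulated.308", [29094803, 29091249])
print(chosen)
```

The extra cost per read is one hash over the name plus one PRNG draw, and it is only needed when there actually is a tie, so the overhead should be small; whether it is negligible in practice would need measuring.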

I noticed this when running an experiment comparing symmetric and asymmetric seeds, with reads simulated from either chrX or chrY and mapped to only chrX and chrY from CHM13.

When using asymmetric seeds (2*hash_s1 - hash_s2), the read below aligns to position 29094803 on chrY when aligned as the only read, but to position 29091249 on chrY when aligned as part of a file of 100k reads (using -t 2). In both cases it has CIGAR 114=1X161=1X3=1X42=1X8=1X64=1X30=1X24=1X16=1X8=1X3=1X1=1X14= and alignment score 900. The full simulated file is too large to attach here; I can provide it elsewhere if needed.

@simulated.308
TTCCTTTTGACTCCATTTCATTCGATTCCATTCCATTCCATTAATTTCCATTCCATTCGAGACCTTTCCATTGCAGTCTTTTCCCTTCGAGTCCATTCCGTTCGATTCCCTTCCTTTCGATTCCATTCCATTGGAGTCCGTACCAGTCGAGTCCATTCTATTCCAGTCCATTAGTTTCGACTCCATTGCATTCGAGTGCATTCCATTCCGTGGCTGTCCATTCCATTCCGTTTGATGCCATTCCATACGATTCCATTCAATTCGAGACCATTCTATACCTCTCCATTCCTTGTGGTTCGATTCCATTTCACTCTAGTCCATTCAATTCCATTGAATTCCATTCGACTCTATTCCGTTCCATTCAATTCCATTCCATTCGATTCCATTTTTTTCGAGATCCTTCCATTACACTCCCTTCCATTCCAGTGAATTCCATTCCAGTCTCTTCAGTTCTATTCCATTCCATTCGTATCGATTCCATTCAACTCCAGCCCATTCCA
+
HHIIIIHIHIHIHIHHIHIHIHIIIHHIHHIIHHHIHGGHHHHHIIHGIGGGIGIHIIIGHGHGIIIIFIGIIIHIIHGHCIHGIDIGGIHIHGIHIGGHIIIHIIIFHGIIHHGIIDIIIIHGHIHIFGIDIFIIIIIIFGFEFHIIEIIHHGDEIEEFIBFHIHIDIEIIHIIEGIIIIFDHIIGHFHIIIEHDIII>HIIDFIIIIEDHIFE@IICEDF@DIHFII?EDIIGHACIGBGHAIIIHDIIIDHIAIIHIBEFIID@IIHIGICDI6III>>BICGGIG:IIIIIBIHICBDGIIIIIBIHI@CIEICIIIICIEIIIBIIIGIIDADFA=>HAICI@IIABII<D=IBIIIIIIFIDIIGIDCBGII<ICI8IIBC9<IFIHFEIH@ID@;ICHDII;FIIACIIHIII?4AI@I;EIFII9IIIIFI:<II<HI<IG>I8DIAHIII6GE1=IIF<IIIIIBIII>IIIHI=I>CI<I<@=FEIII;@

Btw, for symmetric seeds (as currently used), the read aligns to position 44832808 on chrY with alignment score 1000 and CIGAR 223=1X38=1X237=.
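As a quick sanity check on the CIGAR strings quoted in this issue, the sketch below sums the operation lengths per operation type. The helper `cigar_summary` is a hypothetical name for this example; it only handles the standard SAM CIGAR operation letters.

```python
import re


def cigar_summary(cigar: str) -> dict[str, int]:
    """Return the total length per CIGAR operation (hypothetical helper)."""
    counts: dict[str, int] = {}
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        counts[op] = counts.get(op, 0) + int(length)
    return counts


asymmetric = "114=1X161=1X3=1X42=1X8=1X64=1X30=1X24=1X16=1X8=1X3=1X1=1X14="
symmetric = "223=1X38=1X237="

for label, cigar in (("asymmetric", asymmetric), ("symmetric", symmetric)):
    summary = cigar_summary(cigar)
    print(label, summary, "aligned length:", sum(summary.values()))
```

Both CIGARs cover the same aligned length; the asymmetric-seed alignment has 12 mismatches and the symmetric-seed alignment has 2.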

ksahlin changed the title from "Shuffle pseudo-random on query name instead?" to "Shuffle identical alignments pseudo-randomly on query name instead?" on Mar 30, 2024