Shuffle identical alignments pseudo-randomly on query name instead? #413

ksahlin · 2024-03-30T11:21:55Z

Currently we shuffle on chunk_ID, which means a read's mapping can differ depending on the number of threads, or on which chunk the read ends up in.

IIRC, BWA-MEM derives the pseudo-random placement from the read name. Is it possible to do this instead of using the chunk ID, without noticeable computational overhead? I don't think it's worth implementing if the code becomes complex or if it increases runtime.
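To make the idea concrete, here is a minimal sketch of what I have in mind. It is not based on the current code: `hash_query_name` and `pick_tied_candidate` are hypothetical names, and FNV-1a is just one cheap hash that would do. The point is only that the choice among equally scoring alignments depends on the query name alone, so it stays the same regardless of thread count or chunk boundaries.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: 64-bit FNV-1a hash of the query name.
// std::hash is avoided here because its result is implementation-defined,
// which would make output differ between standard libraries.
inline std::uint64_t hash_query_name(std::string_view name) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
    for (unsigned char c : name) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV-1a prime
    }
    return h;
}

// Hypothetical helper: choose one of n_tied equally good candidates.
// The result depends only on the query name, not on chunk ID or thread count.
// The modulo introduces a tiny bias, which should not matter for tie-breaking.
inline std::size_t pick_tied_candidate(std::string_view query_name, std::size_t n_tied) {
    if (n_tied == 0) {
        return 0;  // nothing to choose from; caller is assumed to check this
    }
    return static_cast<std::size_t>(hash_query_name(query_name) % n_tied);
}
```

The cost is one pass over the read name per read, which should be negligible next to alignment. If a full shuffle is needed rather than a single pick, the same hash could be used as the seed of a random engine instead of the chunk ID.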

I noticed this when running an experiment comparing symmetric and asymmetric seeds, with reads simulated from either chrX or chrY and mapped to only chrX and chrY from CHM13.

When using asymmetric seeds (2*hash_s1 - hash_s2), the read below aligns to position 29094803 on chrY when aligned as the only read, but to position 29091249 on chrY when aligned as part of a file of 100k reads (using -t 2). In both cases it has CIGAR 114=1X161=1X3=1X42=1X8=1X64=1X30=1X24=1X16=1X8=1X3=1X1=1X14= and alignment score 900. The full simulated file is too large to attach here, but I can provide it elsewhere if needed.

@simulated.308
TTCCTTTTGACTCCATTTCATTCGATTCCATTCCATTCCATTAATTTCCATTCCATTCGAGACCTTTCCATTGCAGTCTTTTCCCTTCGAGTCCATTCCGTTCGATTCCCTTCCTTTCGATTCCATTCCATTGGAGTCCGTACCAGTCGAGTCCATTCTATTCCAGTCCATTAGTTTCGACTCCATTGCATTCGAGTGCATTCCATTCCGTGGCTGTCCATTCCATTCCGTTTGATGCCATTCCATACGATTCCATTCAATTCGAGACCATTCTATACCTCTCCATTCCTTGTGGTTCGATTCCATTTCACTCTAGTCCATTCAATTCCATTGAATTCCATTCGACTCTATTCCGTTCCATTCAATTCCATTCCATTCGATTCCATTTTTTTCGAGATCCTTCCATTACACTCCCTTCCATTCCAGTGAATTCCATTCCAGTCTCTTCAGTTCTATTCCATTCCATTCGTATCGATTCCATTCAACTCCAGCCCATTCCA
+
HHIIIIHIHIHIHIHHIHIHIHIIIHHIHHIIHHHIHGGHHHHHIIHGIGGGIGIHIIIGHGHGIIIIFIGIIIHIIHGHCIHGIDIGGIHIHGIHIGGHIIIHIIIFHGIIHHGIIDIIIIHGHIHIFGIDIFIIIIIIFGFEFHIIEIIHHGDEIEEFIBFHIHIDIEIIHIIEGIIIIFDHIIGHFHIIIEHDIII>HIIDFIIIIEDHIFE@IICEDF@DIHFII?EDIIGHACIGBGHAIIIHDIIIDHIAIIHIBEFIID@IIHIGICDI6III>>BICGGIG:IIIIIBIHICBDGIIIIIBIHI@CIEICIIIICIEIIIBIIIGIIDADFA=>HAICI@IIABII<D=IBIIIIIIFIDIIGIDCBGII<ICI8IIBC9<IFIHFEIH@ID@;ICHDII;FIIACIIHIII?4AI@I;EIFII9IIIIFI:<II<HI<IG>I8DIAHIII6GE1=IIF<IIIIIBIII>IIIHI=I>CI<I<@=FEIII;@

Btw, with symmetric seeds (as currently used), the read aligns with alignment score 1000 and CIGAR 223=1X38=1X237= to position 44832808 on chrY.

ksahlin changed the title ~~Shuffle pseudo-random on query name instead?~~ Shuffle identical alignments pseudo-randomly on query name instead? Mar 30, 2024
