Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about SyncmerIterator #463

Open
drtconway opened this issue Dec 2, 2024 · 4 comments
Open

Question about SyncmerIterator #463

drtconway opened this issue Dec 2, 2024 · 4 comments

Comments

@drtconway
Copy link
Contributor

Hi,

I'm studying your implementation of SyncmerIterator, and I'm a bit confused. It's possible there's a bug in the code, but I think it's much more likely that I just don't understand the intended semantics correctly.

The code runs accumulators for for the current k-mer and the current s-mer, keeping both the forward and reverse complement *-mers, which makes sense.

It then makes a canonical choice of which s-mer to hash, and adds that to the queue.

What follows is a set of cases that evaluate or re-evaluate the position of the minimising s-mer.

The first question I have is whether it matters that the branch at

if (qs.size() == k - s + 1) { // We are at the last s-mer within the first k-mer, need to decide if we add it
and the branch at
for (int j = qs.size() - 1; j >= 0; j--) { //Iterate in reverse to choose the rightmost minimizer in a window
iterate in opposite directions? In the case of equal hashes, they will select a different minimum position, which could result in the selection of different strobemers. It is a very corner case, I concede, since the former branch in the code handles the very first k-mer, but unless I'm wrong, this could result in inconsistent strobmer choice. This is really a very minor detail, but I mention it since I have been examining the area of the code in detail.

The more important question pertains to the canonical choice of k-mers. I note that the code at

uint64_t yk = std::min(xk[0], xk[1]);
makes a canonical choice between the forward and reverse complement sequences, and reports that hash as the syncmer.

However, if I present the code with the reverse complement of the original sequence, the same set of s-mers will be derived, but in the opposite order, so that the position of the minimiser will be at the opposite end of the queue, and so the test will not pass.

My understanding of the intended semantics is that the set of syncmers would be the same regardless of which strand the original sequence is on, and certainly by reporting the hash of the min k-mer, the code creates that impression.

I'd love to understand where my reasoning has gone off the rails!

Thanks,

Tom.

@marcelm
Copy link
Collaborator

marcelm commented Dec 3, 2024

Hi and thanks for doing this analysis. I’ve so far only had time to look into the first issue, and I think it’s a bug that the very first k-mer is treated differently. What I’m wondering about is whether the same applies here:

} else if (hash_s < qs_min_val) { // the new value added to queue is the new minimum

This should also only find the first minimum.

Please give me some time for looking into your other question/observation.

marcelm added a commit that referenced this issue Dec 3, 2024
This makes the two for loops that find the minimum hash identical.

Thanks to @drtconway for noticing the discrepancy!

See #463
@drtconway
Copy link
Contributor Author

I think you're right, that perhaps the test on the line you cite should be <=.

To a certain extent, it doesn't necessarily matter exactly which precise definition you use for syncmers, so long as the implementation expresses that definition on all code paths.

The other issue is probably more important - if I give the sequence to be indexed on the reverse complement strand, will I get the same sequence of syncmers (in reverse order)? Should that be the case? (I think it should!)

What I haven't thought through carefully is how the strandedness issue intersects with the formation of strobemers.

I note that neither the Syncmer nor Strobemer papers mention strandedness.

@marcelm
Copy link
Collaborator

marcelm commented Dec 4, 2024

To a certain extent, it doesn't necessarily matter exactly which precise definition you use for syncmers, so long as the implementation expresses that definition on all code paths.

Yes, and syncmers just need to be generated in the same way for the reference and the query. Even if the comparisons are a bit inconsistent, it doesn’t matter so much as long as it’s inconsistent in the same way.

The other issue is probably more important - if I give the sequence to be indexed on the reverse complement strand, will I get the same sequence of syncmers (in reverse order)? Should that be the case? (I think it should!)

I looked into this now as well and as a first exercise, wrote a test to ensure that syncmers are canonical (see #465/#464). You’re correct that the order of s-mers is reversed in the vector, but this does not matter because the s-mer that determines whether the syncmer is chosen or not is the one in the middle. The t parameter is chosen in that way. See also

throw BadParameter("(k - s) must be an even number to create canonical syncmers. Please set s to e.g. k-2, k-4, k-6, ...");

So the relevant s-mer is in the same position both in forward and reverse direction.

However, there’s a slight inconsistency once again due to the way how ties are handled: Because the syncmer was chosen if the leftmost minimum was at that middle position, this is different in the reverse complemented version of the k-mer because the leftmost minimum would then be the rightmost one. The changes I made in #464 happen to fix that issue as well.

I note that neither the Syncmer nor Strobemer papers mention strandedness.

Yes, I think this also needs a bit more research. I’d like to switch to non-canonical syncmers and randstrobes because my impression is that it would both simplify the code in many places, speed up mapping, and make it easier to think about what happens, but this loses some accuracy, see the discussion in #405.

@ksahlin
Copy link
Owner

ksahlin commented Dec 10, 2024

Hi,

As mentioned elsewhere I have been 'off' github because an old university email address was discontinued so I haven’t gotten any notifications from GitHub for about three weeks.

I now get the context behind the commit to sample all syncmers in homopolymer and other STR regions: d7c5cbe#comments

Given this discussion I feel more inclined to accept sampling such regions more frequently if it means that syncmers are identical in forward and reverse directions. Initially I was sceptical because some seeds may become very repetitive, causing lower accuracy and more runtime. But I am inclined to see a working implementation of canonical strobemers, which makes it justified.

marcelm added a commit that referenced this issue Dec 19, 2024
This makes the two for loops that find the minimum hash identical.

Thanks to @drtconway for noticing the discrepancy!

See #463
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants