Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comments regarding speedup to O(n+m+s^2) #5

Open
RagnarGrootKoerkamp opened this issue Mar 31, 2022 · 3 comments
Open

Comments regarding speedup to O(n+m+s^2) #5

RagnarGrootKoerkamp opened this issue Mar 31, 2022 · 3 comments

Comments

@RagnarGrootKoerkamp
Copy link

The section on the speedup to O(n+m+s^2) should have a remark on whether you are assuming a finite or infinite alphabet.
Implicitly in the context of pairwise alignment, I would assume a finite alphabet.

However, The Myers'86 paper already writes that building the suffix tree takes linear time for finite alphabets, giving the required O(N+M+D^2) in their terms. They don't explicitly mention this (I think), because they are supposedly doing LCS in the context of diff algorithms, where lines of code actually give an infinite alphabet.

Similarly, Landau-Vishkin'89 writes that the complexity is O(nk) for finding all matches with up to k mistakes of a pattern in a text for finite alphabets, reduced from O(nm) by using suffix trees as well. Their paper is not about global alignment, but applying their result to Ukkonen'85s global alignment algorithm directly gives the O(n+m+s^2) runtime.

@RagnarGrootKoerkamp RagnarGrootKoerkamp changed the title Comments regarding speedup Comments regarding speedup to O(n+m+s^2) Mar 31, 2022
@jeizenga
Copy link
Owner

The log factor in Myers' O(ND) paper actually comes from the least common ancestor data structure rather than the suffix tree. However, since publishing the preprint, I had discovered that the Landau-Vishkin algorithm uses a similar approach. Considering that it didn't provide a practical speed up, I think I will probably remove that aspect from the manuscript.

@RagnarGrootKoerkamp
Copy link
Author

RagnarGrootKoerkamp commented Mar 31, 2022

The log factor in Myers' O(ND) paper actually comes from the least common ancestor data structure rather than the suffix tree.

They write:

Second, a recent RAM-based algorithm for answering Q on-line queris of the LCA of vertices in a fixed V-vertex tree takes O(V+Q) time [6].

and they basically repeat this later, so my understanding is still that the (N+M) log(N+M)+D^2 logarithm is only needed for the suffix tree construction for infinite alphabets.

Considering that it didn't provide a practical speed up, I think I will probably remove that aspect from the manuscript.

Right makes sense. Nethertheless, do you have an idea how large n or s need to be before this becomes worthwhile?

@jeizenga
Copy link
Owner

Oh, nevermind then, I must be misremembering.

My expectation is that the tradeoff has less to do with N or S than with the properties of the sequences. Because the constants are so much higher on the suffix tree algorithm, there would only be a substantial speed-up if there are many long matches that can be reduced to O(1) queries. In high entropy sequences, the lengths of spurious matches are very short (expected length ~4/3 in uniform random sequences). To have many long matches, the sequences need to have a high rate of internal repetition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants