Comments regarding speedup to `O(n+m+s^2)` #5

RagnarGrootKoerkamp · 2022-03-31T03:42:13Z

The section on the speedup to O(n+m+s^2) should have a remark on whether you are assuming a finite or infinite alphabet.
Implicitly in the context of pairwise alignment, I would assume a finite alphabet.

However, The Myers'86 paper already writes that building the suffix tree takes linear time for finite alphabets, giving the required O(N+M+D^2) in their terms. They don't explicitly mention this (I think), because they are supposedly doing LCS in the context of diff algorithms, where lines of code actually give an infinite alphabet.

Similarly, Landau-Vishkin'89 writes that the complexity is O(nk) for finding all matches with up to k mistakes of a pattern in a text for finite alphabets, reduced from O(nm) by using suffix trees as well. Their paper is not about global alignment, but applying their result to Ukkonen'85s global alignment algorithm directly gives the O(n+m+s^2) runtime.

The text was updated successfully, but these errors were encountered:

jeizenga · 2022-03-31T16:02:50Z

The log factor in Myers' O(ND) paper actually comes from the least common ancestor data structure rather than the suffix tree. However, since publishing the preprint, I had discovered that the Landau-Vishkin algorithm uses a similar approach. Considering that it didn't provide a practical speed up, I think I will probably remove that aspect from the manuscript.

RagnarGrootKoerkamp · 2022-03-31T16:19:06Z

The log factor in Myers' O(ND) paper actually comes from the least common ancestor data structure rather than the suffix tree.

They write:

Second, a recent RAM-based algorithm for answering Q on-line queris of the LCA of vertices in a fixed V-vertex tree takes O(V+Q) time [6].

and they basically repeat this later, so my understanding is still that the (N+M) log(N+M)+D^2 logarithm is only needed for the suffix tree construction for infinite alphabets.

Considering that it didn't provide a practical speed up, I think I will probably remove that aspect from the manuscript.

Right makes sense. Nethertheless, do you have an idea how large n or s need to be before this becomes worthwhile?

jeizenga · 2022-03-31T17:28:40Z

Oh, nevermind then, I must be misremembering.

My expectation is that the tradeoff has less to do with N or S than with the properties of the sequences. Because the constants are so much higher on the suffix tree algorithm, there would only be a substantial speed-up if there are many long matches that can be reduced to O(1) queries. In high entropy sequences, the lengths of spurious matches are very short (expected length ~4/3 in uniform random sequences). To have many long matches, the sequences need to have a high rate of internal repetition.

RagnarGrootKoerkamp changed the title ~~Comments regarding speedup~~ Comments regarding speedup to O(n+m+s^2) Mar 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments regarding speedup to `O(n+m+s^2)` #5

Comments regarding speedup to `O(n+m+s^2)` #5

RagnarGrootKoerkamp commented Mar 31, 2022

jeizenga commented Mar 31, 2022

RagnarGrootKoerkamp commented Mar 31, 2022 •

edited

Loading

jeizenga commented Mar 31, 2022

Comments regarding speedup to O(n+m+s^2) #5

Comments regarding speedup to O(n+m+s^2) #5

Comments

RagnarGrootKoerkamp commented Mar 31, 2022

jeizenga commented Mar 31, 2022

RagnarGrootKoerkamp commented Mar 31, 2022 • edited Loading

jeizenga commented Mar 31, 2022

Comments regarding speedup to `O(n+m+s^2)` #5

Comments regarding speedup to `O(n+m+s^2)` #5

RagnarGrootKoerkamp commented Mar 31, 2022 •

edited

Loading