Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rescue using partial hits, even in non-MCS mode #476

Open
wants to merge 2 commits into
base: base-hash
Choose a base branch
from

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Jan 15, 2025

Credit for this idea goes to @Itolstoganov.

PR #472 changes the index so that the first syncmer is always chosen as the base. This improves accuracy in MCS, but reduces mapping rate (and thus accuracy) in non-MCS mode. Apparently, some reads with many errors no longer get any hits at all.

However, we can easily detect when a query doesn’t get any hits and then - even if MCS isn’t explicitly enabled - just use the method we already have of looking up partial seeds, as a different kind of MCS rescue.

To be clear, only using partial seeds when no regular hits can be found was an idea by @Itolstoganov.

This actually works much better than I would have expected: Both accuracy and speed get improve!

Details below.

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 15, 2025

(Comparisons are still running, I will update the tables below when the results are in.)

Comparing accuracy (paf) - sim3

8f79881 Version 0.15.0

library 8f79881 this PR difference
sim3-drosophila-50-se 83.4170 85.9487 +2.5317
sim3-drosophila-75-se 87.3502 87.7274 +0.3772
sim3-drosophila-100-se 88.5679 88.7693 +0.2014
sim3-drosophila-150-se 90.3653 90.3935 +0.0282
sim3-drosophila-200-se 91.4271 91.4321 +0.0050
sim3-drosophila-300-se 92.7857 92.7790 -0.0067
sim3-maize-50-se 43.7357 44.4181 +0.6824
sim3-maize-75-se 56.7183 56.7283 +0.0100
sim3-maize-100-se 65.1711 65.1622 -0.0089
sim3-maize-150-se 77.0898 77.1823 +0.0925
sim3-maize-200-se 83.4425 83.5704 +0.1279
sim3-maize-300-se 89.1560 89.3088 +0.1528
sim3-CHM13-50-se 81.2211 82.8875 +1.6664
sim3-CHM13-75-se 87.4537 87.8153 +0.3616
sim3-CHM13-100-se 89.6358 89.8031 +0.1673
sim3-CHM13-150-se 91.8078 91.8012 -0.0066
sim3-CHM13-200-se 92.7195 92.6373 -0.0822
sim3-CHM13-300-se 93.5527 93.3979 -0.1548
sim3-rye-50-se 41.0999 41.5645 +0.4646
sim3-rye-75-se 54.9693 55.0105 +0.0412
sim3-rye-100-se 64.1215 64.0279 -0.0936
sim3-rye-150-se 76.7345 76.8148 +0.0803
sim3-rye-200-se 83.1314 83.2770 +0.1456
sim3-rye-300-se 88.4104 88.6000 +0.1896
sim3-ecoli50-50-se 11.7713 11.9962 +0.2249
sim3-ecoli50-75-se 14.8981 14.9894 +0.0913
sim3-ecoli50-100-se 17.4152 17.4677 +0.0525
sim3-ecoli50-150-se 22.1174 22.1550 +0.0376
sim3-ecoli50-200-se 25.6747 25.6448 -0.0299
sim3-ecoli50-300-se 31.1758 31.1882 +0.0124

Average difference se: +0.2454

Comparing accuracy (paf) - sim5

8f79881 Version 0.15.0

library 8f79881 this PR difference
sim5-drosophila-50-se 76.9207 82.8214 +5.9007
sim5-drosophila-75-se 83.4332 85.4542 +2.0210
sim5-drosophila-100-se 84.9742 86.9654 +1.9912
sim5-drosophila-150-se 87.5069 89.0132 +1.5063
sim5-drosophila-200-se 89.2447 90.0349 +0.7902
sim5-drosophila-300-se 91.2737 91.3069 +0.0332
sim5-maize-50-se 39.0947 40.5819 +1.4872
sim5-maize-75-se 51.1997 51.7045 +0.5048
sim5-maize-100-se 58.7611 59.3002 +0.5391
sim5-maize-150-se 70.7927 71.3318 +0.5391
sim5-maize-200-se 77.6704 78.0476 +0.3772
sim5-maize-300-se 84.6004 84.7826 +0.1822
sim5-CHM13-50-se 74.2728 78.0569 +3.7841
sim5-CHM13-75-se 82.7277 84.3564 +1.6287
sim5-CHM13-100-se 85.3533 86.9793 +1.6260
sim5-CHM13-150-se 88.6346 89.8783 +1.2437
sim5-CHM13-200-se 90.4095 91.0122 +0.6027
sim5-CHM13-300-se 92.1489 92.0702 -0.0787
sim5-rye-50-se 36.5498 37.5834 +1.0336
sim5-rye-75-se 49.1617 49.4974 +0.3357
sim5-rye-100-se 57.3582 57.6925 +0.3343
sim5-rye-150-se 70.0147 70.4496 +0.4349
sim5-rye-200-se 77.0137 77.3454 +0.3317
sim5-rye-300-se 83.8602 84.0447 +0.1845
sim5-ecoli50-50-se 10.7429 11.2117 +0.4688
sim5-ecoli50-75-se 13.6476 13.8009 +0.1533
sim5-ecoli50-100-se 15.8012 15.9856 +0.1844
sim5-ecoli50-150-se 20.1191 20.3361 +0.2170
sim5-ecoli50-200-se 23.3852 23.4603 +0.0751
sim5-ecoli50-300-se 28.3011 28.3274 +0.0263

Average difference se: +0.9486

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 15, 2025

Here are some initial runtime measurements.

Comparing user time in seconds (paf) - sim5

8f79881 Version 0.15.0

library 8f79881 this PR difference
sim3-drosophila-50-se 14 12 -2
sim3-drosophila-75-se 17 13 -3
sim3-drosophila-100-se 18 15 -4
sim3-drosophila-150-se 22 18 -4
sim3-drosophila-200-se 26 22 -5
sim3-drosophila-300-se 38 30 -8
sim3-maize-50-se 186 167 -19
sim3-maize-75-se 185 175 -11
sim3-maize-100-se 180 163 -17
sim3-maize-150-se 177 170 -8
sim3-maize-200-se 194 180 -15
sim3-maize-300-se 244 223 -21
sim3-CHM13-50-se 194 186 -8
sim3-CHM13-75-se 193 193 -1
sim3-CHM13-100-se 202 194 -8
sim3-CHM13-150-se 190 194 +5
sim3-CHM13-200-se 201 198 -3
sim3-CHM13-300-se 210 206 -4
sim3-rye-50-se 455 441 -14
sim3-rye-75-se 472 463 -8
sim3-rye-100-se 446 431 -15
sim3-rye-150-se 435 460 +25
sim3-rye-200-se 484 445 -38
sim3-rye-300-se 533 490 -43
sim3-ecoli50-50-se 33 24 -9
sim3-ecoli50-75-se 43 29 -14
sim3-ecoli50-100-se 48 33 -14
sim3-ecoli50-150-se 57 40 -17
sim3-ecoli50-200-se 68 47 -21
sim3-ecoli50-300-se 98 64 -34

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 15, 2025

Here’s a comparison between v0.15.0 using --mcs and this PR (without --mcs) - the difference isn’t that big, that is, we get most of the MCS benefits by using this different strategy of only looking up partial hits if we find no full hits at all for a query.

Comparing accuracy (paf) - sim3

library v0.15.0 + --mcs this PR difference
sim3-drosophila-50-se 85.4675 85.9487 +0.4812
sim3-drosophila-75-se 87.7179 87.7274 +0.0095
sim3-drosophila-100-se 88.8573 88.7693 -0.0880
sim3-drosophila-150-se 90.4459 90.3935 -0.0524
sim3-drosophila-200-se 91.4787 91.4321 -0.0466
sim3-drosophila-300-se 92.8118 92.7790 -0.0328
sim3-maize-50-se 44.4397 44.4181 -0.0216
sim3-maize-75-se 56.9325 56.7283 -0.2042
sim3-maize-100-se 65.3415 65.1622 -0.1793
sim3-maize-150-se 77.3521 77.1823 -0.1698
sim3-maize-200-se 83.6409 83.5704 -0.0705
sim3-maize-300-se 89.2354 89.3088 +0.0734
sim3-CHM13-50-se 82.5333 82.8875 +0.3542
sim3-CHM13-75-se 87.8095 87.8153 +0.0058
sim3-CHM13-100-se 89.8613 89.8031 -0.0582
sim3-CHM13-150-se 91.8827 91.8012 -0.0815
sim3-CHM13-200-se 92.7431 92.6373 -0.1058
sim3-CHM13-300-se 93.5715 93.3979 -0.1736
sim3-rye-50-se 41.6418 41.5645 -0.0773
sim3-rye-75-se 55.1969 55.0105 -0.1864
sim3-rye-100-se 64.2993 64.0279 -0.2714
sim3-rye-150-se 76.9284 76.8148 -0.1136
sim3-rye-200-se 83.3421 83.2770 -0.0651
sim3-rye-300-se 88.5048 88.6000 +0.0952
sim3-ecoli50-50-se 11.9889 11.9962 +0.0073
sim3-ecoli50-75-se 15.0408 14.9894 -0.0514
sim3-ecoli50-100-se 17.5144 17.4677 -0.0467
sim3-ecoli50-150-se 22.2572 22.1550 -0.1022
sim3-ecoli50-200-se 25.7903 25.6448 -0.1455
sim3-ecoli50-300-se 31.2731 31.1882 -0.0849

Average difference se: -0.0467

@marcelm marcelm force-pushed the mcs-rescue branch 2 times, most recently from f230178 to 0026eba Compare January 15, 2025 12:58
@ksahlin
Copy link
Owner

ksahlin commented Jan 15, 2025

Really nice -- great idea @Itolstoganov !

Am I interpreting it correctly that:

  1. This PR enables us to use the same index for non-mcs and mcs, thus we can safely merge this into main.
  2. For paper benchmarks, it is then not correct to benchmark latest main in non-mcs and mcs modes, because both versions use the "mcs feature", they are thus two different variants of mcs. So we still need to invoke v0.15.0 in the benchmarks.

Also, because of this very interesting result, should we now benchmark three versions in the paper:

  • v0.15.0
  • mcs approach 1 (current approach in paper)
  • mcs approach 2 (@Itolstoganov's suggestion)

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 15, 2025

(To clarify: A while ago, you mentioned that Ivan tried this approach in order to speed up MCS mode - at least that is how I understood it at the time. That is where I got it from.)

Am I interpreting it correctly that:

1. This PR enables us to use the same index for non-mcs and mcs, thus we can safely merge this into main.

Yes. (But more accurately, we wouldn’t even have a "pure" non-MCS mode anymore.)

Also, because of this very interesting result, should we now benchmark three versions in the paper:

* v0.15.0
* mcs approach 1 (current approach in paper)
* mcs approach 2 (@Itolstoganov's suggestion)

My first thought is that this is too much - we already have too many plots. The difference wouldn’t be visible anyway (maybe except for read length 50). We should focus on one MCS approach and present that. Maybe add a sentence or perhaps a single plot that shows what the difference is in results between approach 1 and 2.

Alternatively, maybe we can find a combined approach that is (nearly) as fast as the first approach and as accurate as the second.

2. For paper benchmarks, it is then not correct to benchmark latest main in non-mcs and mcs modes, because both versions use the "mcs feature", they are thus two different variants of mcs. So we still need to invoke v0.15.0 in the benchmarks.

Yes, but we could also choose to abandon approach 1 entirely and change --mcs to mean "enable approach 2", giving us again the non-MCS and MCS modes.

(However, for the next release, we would definitely want to enable MCS approach 1 by default, but we could do that afterwards.)

@ksahlin
Copy link
Owner

ksahlin commented Jan 15, 2025

We should focus on one MCS approach and present that. Maybe add a sentence or perhaps a single plot that shows what the difference is in results between approach 1 and 2.

Ok, I agree.

Alternatively, maybe we can find a combined approach that is (nearly) as fast as the first approach and as accurate as the second.

You mean the opposite, right? That is, "as fast as the second approach and as accurate as the first"?

for the next release, we would definitely want to enable MCS approach 1 by default

hmm, you mean "definitely want to enable MCS approach 2 by default", right? because according to this PR (which is approach 2), approach 2 is both faster and more accurate. Approach 1 (the "previous" mcs in the paper) still has the accuracy-speed tradeoff, making it a more subjective decition.

@marcelm
Copy link
Collaborator Author

marcelm commented Jan 15, 2025

Alternatively, maybe we can find a combined approach that is (nearly) as fast as the first approach and as accurate as the second.

You mean the opposite, right? That is, "as fast as the second approach and as accurate as the first"?

Yeah, I mixed them up.

for the next release, we would definitely want to enable MCS approach 1 by default

hmm, you mean "definitely want to enable MCS approach 2 by default", right?

Yes, same mixup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants