Rescue using partial hits, even in non-MCS mode #476

marcelm · 2025-01-15T09:33:04Z

Credit for this idea goes to @Itolstoganov.

PR #472 changes the index so that the first syncmer is always chosen as the base. This improves accuracy in MCS, but reduces mapping rate (and thus accuracy) in non-MCS mode. Apparently, some reads with many errors no longer get any hits at all.

However, we can easily detect when a query doesn’t get any hits and then - even if MCS isn’t explicitly enabled - just use the method we already have of looking up partial seeds, as a different kind of MCS rescue.

To be clear, only using partial seeds when no regular hits can be found was an idea by @Itolstoganov.

This actually works much better than I would have expected: Both accuracy and speed get improve!

Details below.

marcelm · 2025-01-15T09:34:30Z

~~(Comparisons are still running, I will update the tables below when the results are in.)~~

Comparing accuracy (paf) - sim3

8f79881 Version 0.15.0

library	`8f79881`	this PR	difference
sim3-drosophila-50-se	83.4170	85.9487	+2.5317
sim3-drosophila-75-se	87.3502	87.7274	+0.3772
sim3-drosophila-100-se	88.5679	88.7693	+0.2014
sim3-drosophila-150-se	90.3653	90.3935	+0.0282
sim3-drosophila-200-se	91.4271	91.4321	+0.0050
sim3-drosophila-300-se	92.7857	92.7790	-0.0067
sim3-maize-50-se	43.7357	44.4181	+0.6824
sim3-maize-75-se	56.7183	56.7283	+0.0100
sim3-maize-100-se	65.1711	65.1622	-0.0089
sim3-maize-150-se	77.0898	77.1823	+0.0925
sim3-maize-200-se	83.4425	83.5704	+0.1279
sim3-maize-300-se	89.1560	89.3088	+0.1528
sim3-CHM13-50-se	81.2211	82.8875	+1.6664
sim3-CHM13-75-se	87.4537	87.8153	+0.3616
sim3-CHM13-100-se	89.6358	89.8031	+0.1673
sim3-CHM13-150-se	91.8078	91.8012	-0.0066
sim3-CHM13-200-se	92.7195	92.6373	-0.0822
sim3-CHM13-300-se	93.5527	93.3979	-0.1548
sim3-rye-50-se	41.0999	41.5645	+0.4646
sim3-rye-75-se	54.9693	55.0105	+0.0412
sim3-rye-100-se	64.1215	64.0279	-0.0936
sim3-rye-150-se	76.7345	76.8148	+0.0803
sim3-rye-200-se	83.1314	83.2770	+0.1456
sim3-rye-300-se	88.4104	88.6000	+0.1896
sim3-ecoli50-50-se	11.7713	11.9962	+0.2249
sim3-ecoli50-75-se	14.8981	14.9894	+0.0913
sim3-ecoli50-100-se	17.4152	17.4677	+0.0525
sim3-ecoli50-150-se	22.1174	22.1550	+0.0376
sim3-ecoli50-200-se	25.6747	25.6448	-0.0299
sim3-ecoli50-300-se	31.1758	31.1882	+0.0124

Average difference se: +0.2454

Comparing accuracy (paf) - sim5

8f79881 Version 0.15.0

library	`8f79881`	this PR	difference
sim5-drosophila-50-se	76.9207	82.8214	+5.9007
sim5-drosophila-75-se	83.4332	85.4542	+2.0210
sim5-drosophila-100-se	84.9742	86.9654	+1.9912
sim5-drosophila-150-se	87.5069	89.0132	+1.5063
sim5-drosophila-200-se	89.2447	90.0349	+0.7902
sim5-drosophila-300-se	91.2737	91.3069	+0.0332
sim5-maize-50-se	39.0947	40.5819	+1.4872
sim5-maize-75-se	51.1997	51.7045	+0.5048
sim5-maize-100-se	58.7611	59.3002	+0.5391
sim5-maize-150-se	70.7927	71.3318	+0.5391
sim5-maize-200-se	77.6704	78.0476	+0.3772
sim5-maize-300-se	84.6004	84.7826	+0.1822
sim5-CHM13-50-se	74.2728	78.0569	+3.7841
sim5-CHM13-75-se	82.7277	84.3564	+1.6287
sim5-CHM13-100-se	85.3533	86.9793	+1.6260
sim5-CHM13-150-se	88.6346	89.8783	+1.2437
sim5-CHM13-200-se	90.4095	91.0122	+0.6027
sim5-CHM13-300-se	92.1489	92.0702	-0.0787
sim5-rye-50-se	36.5498	37.5834	+1.0336
sim5-rye-75-se	49.1617	49.4974	+0.3357
sim5-rye-100-se	57.3582	57.6925	+0.3343
sim5-rye-150-se	70.0147	70.4496	+0.4349
sim5-rye-200-se	77.0137	77.3454	+0.3317
sim5-rye-300-se	83.8602	84.0447	+0.1845
sim5-ecoli50-50-se	10.7429	11.2117	+0.4688
sim5-ecoli50-75-se	13.6476	13.8009	+0.1533
sim5-ecoli50-100-se	15.8012	15.9856	+0.1844
sim5-ecoli50-150-se	20.1191	20.3361	+0.2170
sim5-ecoli50-200-se	23.3852	23.4603	+0.0751
sim5-ecoli50-300-se	28.3011	28.3274	+0.0263

Average difference se: +0.9486

marcelm · 2025-01-15T09:37:31Z

Here are some initial runtime measurements.

Comparing user time in seconds (paf) - sim5

8f79881 Version 0.15.0

library	`8f79881`	this PR	difference
sim3-drosophila-50-se	14	12	-2
sim3-drosophila-75-se	17	13	-3
sim3-drosophila-100-se	18	15	-4
sim3-drosophila-150-se	22	18	-4
sim3-drosophila-200-se	26	22	-5
sim3-drosophila-300-se	38	30	-8
sim3-maize-50-se	186	167	-19
sim3-maize-75-se	185	175	-11
sim3-maize-100-se	180	163	-17
sim3-maize-150-se	177	170	-8
sim3-maize-200-se	194	180	-15
sim3-maize-300-se	244	223	-21
sim3-CHM13-50-se	194	186	-8
sim3-CHM13-75-se	193	193	-1
sim3-CHM13-100-se	202	194	-8
sim3-CHM13-150-se	190	194	+5
sim3-CHM13-200-se	201	198	-3
sim3-CHM13-300-se	210	206	-4
sim3-rye-50-se	455	441	-14
sim3-rye-75-se	472	463	-8
sim3-rye-100-se	446	431	-15
sim3-rye-150-se	435	460	+25
sim3-rye-200-se	484	445	-38
sim3-rye-300-se	533	490	-43
sim3-ecoli50-50-se	33	24	-9
sim3-ecoli50-75-se	43	29	-14
sim3-ecoli50-100-se	48	33	-14
sim3-ecoli50-150-se	57	40	-17
sim3-ecoli50-200-se	68	47	-21
sim3-ecoli50-300-se	98	64	-34

marcelm · 2025-01-15T11:22:46Z

Here’s a comparison between v0.15.0 using --mcs and this PR (without --mcs) - the difference isn’t that big, that is, we get most of the MCS benefits by using this different strategy of only looking up partial hits if we find no full hits at all for a query.

Comparing accuracy (paf) - sim3

library	v0.15.0 + `--mcs`	this PR	difference
sim3-drosophila-50-se	85.4675	85.9487	+0.4812
sim3-drosophila-75-se	87.7179	87.7274	+0.0095
sim3-drosophila-100-se	88.8573	88.7693	-0.0880
sim3-drosophila-150-se	90.4459	90.3935	-0.0524
sim3-drosophila-200-se	91.4787	91.4321	-0.0466
sim3-drosophila-300-se	92.8118	92.7790	-0.0328
sim3-maize-50-se	44.4397	44.4181	-0.0216
sim3-maize-75-se	56.9325	56.7283	-0.2042
sim3-maize-100-se	65.3415	65.1622	-0.1793
sim3-maize-150-se	77.3521	77.1823	-0.1698
sim3-maize-200-se	83.6409	83.5704	-0.0705
sim3-maize-300-se	89.2354	89.3088	+0.0734
sim3-CHM13-50-se	82.5333	82.8875	+0.3542
sim3-CHM13-75-se	87.8095	87.8153	+0.0058
sim3-CHM13-100-se	89.8613	89.8031	-0.0582
sim3-CHM13-150-se	91.8827	91.8012	-0.0815
sim3-CHM13-200-se	92.7431	92.6373	-0.1058
sim3-CHM13-300-se	93.5715	93.3979	-0.1736
sim3-rye-50-se	41.6418	41.5645	-0.0773
sim3-rye-75-se	55.1969	55.0105	-0.1864
sim3-rye-100-se	64.2993	64.0279	-0.2714
sim3-rye-150-se	76.9284	76.8148	-0.1136
sim3-rye-200-se	83.3421	83.2770	-0.0651
sim3-rye-300-se	88.5048	88.6000	+0.0952
sim3-ecoli50-50-se	11.9889	11.9962	+0.0073
sim3-ecoli50-75-se	15.0408	14.9894	-0.0514
sim3-ecoli50-100-se	17.5144	17.4677	-0.0467
sim3-ecoli50-150-se	22.2572	22.1550	-0.1022
sim3-ecoli50-200-se	25.7903	25.6448	-0.1455
sim3-ecoli50-300-se	31.2731	31.1882	-0.0849

Average difference se: -0.0467

ksahlin · 2025-01-15T14:53:44Z

Really nice -- great idea @Itolstoganov !

Am I interpreting it correctly that:

This PR enables us to use the same index for non-mcs and mcs, thus we can safely merge this into main.
For paper benchmarks, it is then not correct to benchmark latest main in non-mcs and mcs modes, because both versions use the "mcs feature", they are thus two different variants of mcs. So we still need to invoke v0.15.0 in the benchmarks.

Also, because of this very interesting result, should we now benchmark three versions in the paper:

v0.15.0
mcs approach 1 (current approach in paper)
mcs approach 2 (@Itolstoganov's suggestion)

marcelm · 2025-01-15T20:00:01Z

(To clarify: A while ago, you mentioned that Ivan tried this approach in order to speed up MCS mode - at least that is how I understood it at the time. That is where I got it from.)

Am I interpreting it correctly that:

1. This PR enables us to use the same index for non-mcs and mcs, thus we can safely merge this into main.

Yes. (But more accurately, we wouldn’t even have a "pure" non-MCS mode anymore.)

Also, because of this very interesting result, should we now benchmark three versions in the paper:
* v0.15.0
* mcs approach 1 (current approach in paper)
* mcs approach 2 (@Itolstoganov's suggestion)

My first thought is that this is too much - we already have too many plots. The difference wouldn’t be visible anyway (maybe except for read length 50). We should focus on one MCS approach and present that. Maybe add a sentence or perhaps a single plot that shows what the difference is in results between approach 1 and 2.

Alternatively, maybe we can find a combined approach that is (nearly) as fast as the first approach and as accurate as the second.

2. For paper benchmarks, it is then not correct to benchmark latest main in non-mcs and mcs modes, because both versions use the "mcs feature", they are thus two different variants of mcs. So we still need to invoke v0.15.0 in the benchmarks.

Yes, but we could also choose to abandon approach 1 entirely and change --mcs to mean "enable approach 2", giving us again the non-MCS and MCS modes.

(However, for the next release, we would definitely want to enable MCS approach 1 by default, but we could do that afterwards.)

ksahlin · 2025-01-15T21:02:55Z

We should focus on one MCS approach and present that. Maybe add a sentence or perhaps a single plot that shows what the difference is in results between approach 1 and 2.

Ok, I agree.

Alternatively, maybe we can find a combined approach that is (nearly) as fast as the first approach and as accurate as the second.

You mean the opposite, right? That is, "as fast as the second approach and as accurate as the first"?

for the next release, we would definitely want to enable MCS approach 1 by default

hmm, you mean "definitely want to enable MCS approach 2 by default", right? because according to this PR (which is approach 2), approach 2 is both faster and more accurate. Approach 1 (the "previous" mcs in the paper) still has the accuracy-speed tradeoff, making it a more subjective decition.

marcelm · 2025-01-15T21:36:18Z

Alternatively, maybe we can find a combined approach that is (nearly) as fast as the first approach and as accurate as the second.

You mean the opposite, right? That is, "as fast as the second approach and as accurate as the first"?

Yeah, I mixed them up.

for the next release, we would definitely want to enable MCS approach 1 by default

hmm, you mean "definitely want to enable MCS approach 2 by default", right?

Yes, same mixup.

marcelm force-pushed the base-hash branch from c0667ff to 43a74d6 Compare January 15, 2025 12:50

marcelm force-pushed the mcs-rescue branch 2 times, most recently from f230178 to 0026eba Compare January 15, 2025 12:58

Rescue using partial hits, even in non-MCS mode

3c568dc

marcelm force-pushed the mcs-rescue branch from 0026eba to 3c568dc Compare January 15, 2025 20:18

marcelm force-pushed the base-hash branch from 43a74d6 to 9ea6fa5 Compare January 15, 2025 20:18

Refactor: Convert hits_fw/hits_rc into array

229d0df

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rescue using partial hits, even in non-MCS mode #476

Rescue using partial hits, even in non-MCS mode #476

marcelm commented Jan 15, 2025

marcelm commented Jan 15, 2025 •

edited

Loading

marcelm commented Jan 15, 2025 •

edited

Loading

marcelm commented Jan 15, 2025 •

edited

Loading

ksahlin commented Jan 15, 2025

marcelm commented Jan 15, 2025

ksahlin commented Jan 15, 2025

marcelm commented Jan 15, 2025

Rescue using partial hits, even in non-MCS mode #476

Are you sure you want to change the base?

Rescue using partial hits, even in non-MCS mode #476

Conversation

marcelm commented Jan 15, 2025

marcelm commented Jan 15, 2025 • edited Loading

Comparing accuracy (paf) - sim3

Comparing accuracy (paf) - sim5

marcelm commented Jan 15, 2025 • edited Loading

Comparing user time in seconds (paf) - sim5

marcelm commented Jan 15, 2025 • edited Loading

Comparing accuracy (paf) - sim3

ksahlin commented Jan 15, 2025

marcelm commented Jan 15, 2025

ksahlin commented Jan 15, 2025

marcelm commented Jan 15, 2025

marcelm commented Jan 15, 2025 •

edited

Loading

marcelm commented Jan 15, 2025 •

edited

Loading

marcelm commented Jan 15, 2025 •

edited

Loading