Faster bootstrap by not doing fsync during bootstrap #21432

horschi · 2024-11-04T15:45:03Z

Hi,

I think I don't have to explain that bootstrap is slow with scylla. I hope this does not make me a heretic :-)

I think it would be pretty safe for scylla to not fsync during bootstrap, since the node is not fully joined yet anyway. Only a single fsync would be needed, at the end of the bootstrap.

This improves bootstrap performance dramatically and should be very low-hanging fruit.

The next level could be to also optimize normal repair. If a repair session creates multiple sstables, there should be no need to fsync them individually; one fsync at the end of the repair session should be enough.

I think this would help a lot. For HDDs unsafe-fsync brings down bootstrap times by a factor of 20.

mykaul · 2024-11-04T15:54:33Z

@horschi - can you clarify a bit - are you using Raft for the bootstrap process, etc.? Do you have specific points where fsync is an issue during the bootstrap process, that you think we should remove, or is it a general statement?

horschi · 2024-11-04T17:01:31Z

@mykaul

Bootstrap creates many thousands of tiny files, which are very short-lived and will be compacted very quickly.

Last week I bootstrapped into an existing cluster on spinning disks, with and without fsync. Bootstrapping with fsync took a day with multiple timeouts, while bootstrapping with unsafe-fsync took only an hour. It definitely makes a huge difference.

Sorry, I am not into the scylla code to be able to point you to any specific fsync calls, but I saw the difference when using unsafe-fsync.
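The gap described above can be reproduced on a small scale. The following is a minimal Python sketch, not Scylla code: it writes many tiny files, once with an fsync per file and once with a single os.sync() at the end. The helper name write_small_files, the file count, and the payload size are assumptions chosen for illustration, and absolute timings depend heavily on the storage device.

```python
import os
import tempfile
import time


def write_small_files(directory, count, payload, fsync_each):
    """Write `count` small files and return the elapsed time in seconds.

    fsync_each=True  -> flush every file individually (one flush per file).
    fsync_each=False -> defer durability to a single os.sync() at the end.
    """
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"file-{i}.bin")
        with open(path, "wb") as f:
            f.write(payload)
            if fsync_each:
                f.flush()             # push Python's buffer to the kernel
                os.fsync(f.fileno())  # force this file to stable storage
    if not fsync_each:
        os.sync()                     # one flush for everything written above
    return time.perf_counter() - start


if __name__ == "__main__":
    payload = b"x" * 4096
    with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
        per_file = write_small_files(d1, 500, payload, fsync_each=True)
        batched = write_small_files(d2, 500, payload, fsync_each=False)
    print(f"fsync per file: {per_file:.3f}s, single sync at end: {batched:.3f}s")
```

Note that os.sync() flushes every mounted filesystem, not only the directory being written, so the batched variant is a rough stand-in for "one flush at the end" rather than a precise equivalent.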

mykaul · 2024-11-05T10:46:16Z

@horschi - I'm a bit surprised bootstrap creates (1) many thousands of files and (2) tiny files.
@kbr-scylla - thoughts?

horschi · 2024-11-05T10:52:29Z

Bootstrapping a 30GB node creates many thousands of sstables. It's basically not possible to bootstrap with fsync on spinning disks any more.

You get tons of these during bootstrap:

Nov 05 09:07:14 os-d-L311-3 scylla[25200]:  [shard 1:strm] repair - repair[554ed947-7e50-4fd3-acf2-df1a5fa9f631]: Started to repair 65 out of 89 tables in keyspace=pc, table=srlvltcnf, table_id=37614459-81a4-a32a-2fae-000000000000, repair_reason=bootstrap
Nov 05 09:07:14 os-d-L311-3 scylla[25200]:  [shard 1:strm] compaction - [Reshape pc.scr 5922c0f0-9b55-11ef-9d31-204eebdbfe46] Reshaping [/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_3kdww2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_3x8vk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_46ods2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_5i9cg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_0kkqo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_18snk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_238c02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_2ig682e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_31b6o2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_39ge82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_3dba82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_3z6bk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_4ee5s2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_57yz42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_02cvk2e20wb
z22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_08scw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_0kd0w2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_0tsj42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_1sxyo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_29ntc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_2ndn42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_2vqkg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_36vsw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_3qdyo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_4igrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_5b6ps2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_5x1r42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_0tktc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_142wg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_1d2z42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e8843800000
0000000/me-3gkz_0pbw_268cw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_2zt682e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_3ndxs2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_3zlr42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_4lw802e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_59wf42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_5oh402e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_097sg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_15d742e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_22swg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_2fvkw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_3elkw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_3u8uo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_4abk02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_4ktn42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_4yjgw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/
pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_5cwg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_04i1c2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_0cuyo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_17an42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_1g3002e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_28sy82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_2x0v42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_3blk02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_3jqrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_443sg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_5bm5c2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_0axio2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_1fva82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_1vq9s2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_2i0qo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_2r8j42e20wbz22ak3q-big-Data.db:level=0:ori
gin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_31iwg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_3uoa82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_4iohc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_5cwg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_5sjps2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_01xg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_0hd002e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_0q5cw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_0zd5c2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_1axtc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_22dgw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_2j3bk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_2xo0g2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_3c8pc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_3gygg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_3uoa82e20w
bz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_43w2o2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_4igrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_51bs02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_5gbwg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_5rh4w2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_0tsj42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_198342e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_1l0gw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_23nrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_2jy6o2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_2z60w2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_438xc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_4mjdc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_4vjg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_59wf42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e884380000
00000000/me-3gkz_0pc1_5uwlc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc2_08scw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc2_0iv0g2e20wbz22ak3q-big-Data.db:level=0:origin=repair]

Scrolling down a few lines, I found an even longer compaction log line, so long that I cannot copy it out because it breaks my text editor. (Edit: My text editor still hangs, but it seems it's a compaction merging over 670 sstables.)

kbr-scylla · 2024-11-05T11:00:42Z

> @horschi - I'm a bit surprised bootstrap creates (1) many thousands of files and (2) tiny files.

@raphaelsc @denesb these are probably sstables created by streaming/repair and getting compacted by off-strategy compaction, right?

> I think it would be pretty safe for scylla to not fsync during bootstrap, since the node is not fully joined yet anyway. Only a single fsync would be needed, at the end of the bootstrap.

It's an interesting idea. In raft-based topology, if a node crashes or otherwise fails bootstrap, we permanently ban it from the cluster, it cannot be restarted, unless you purge it (at which point it basically becomes a completely new node). So at the end of repair/streaming phase, we could fsync everything in a batch. I think it's viable to consider. At least for vnodes mode. cc @gleb-cloudius

But for tablet migrations it might be harder to reason about, because we migrate tablets only to a normal node (which successfully bootstrapped). But perhaps it would still be possible -- we wouldn't fsync sstables created from migration until the end; we don't read from pending replica so it should be safe. If migration fails we have to perform it from scratch. cc @tgrabiec

avikivity · 2024-11-05T11:27:24Z

With tablets, bootstrap only transfers cluster metadata, so we don't save much there (and that metadata must be properly fsynced).

horschi · 2024-11-05T11:28:07Z

I guess with tablets the number of sstables will be very low. But with vnodes it's bad.

denesb · 2024-11-05T11:44:15Z

With vnodes, it is a well-known problem that bootstrap (or indeed any streaming and repair) will create many files. The number of files scales with the number of nodes and the number of tables, and it can easily get into the thousands.

mykaul · 2024-11-05T11:51:20Z

> With vnodes, it is a well-known problem that bootstrap (or indeed any streaming and repair) will create many files. The number of files scales with the number of nodes and the number of tables, and it can easily get into the thousands.

But the question of fsync (and when) is still valid and interesting.

avikivity · 2024-11-05T12:26:50Z

fsync is hard to consolidate because each file needs to be individually fsynced, and there's no good way to batch them. If any write sneaks in between two fsyncs, it results in two separate disk flushes.

michoecho · 2024-11-05T12:30:24Z

> fsync is hard to consolidate because each file needs to be individually fsynced

Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?

avikivity · 2024-11-05T13:40:10Z

> > fsync is hard to consolidate because each file needs to be individually fsynced
>
> Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?

To call syncfs, you have to collect all affected filesystems. This can be tricky with bind mounts and soft links. It can be done but isn't trivial.

sync() syncs too much, so if you're writing to a floppy in parallel it wouldn't work well, but practically speaking it should work.

We'd also need to check the protocol for acknowledging sstable writes; right now it relies on fsync order. We'd need some sort of super-transaction that spans many sstable creations.

I don't think it's worth the effort, with tablets obsoleting all that.
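To make the syncfs point concrete, here is a hedged Python sketch of collecting the filesystems behind a set of paths. It resolves symlinks and groups by the device id reported by stat, then calls syncfs() through ctypes, falling back to os.sync() when the symbol is unavailable. The function names are hypothetical, the ctypes lookup assumes a Linux libc, and grouping by st_dev is a simplification that may not hold on every filesystem layout.

```python
import ctypes
import os
import tempfile


def group_by_filesystem(paths):
    """Map each filesystem device id (st_dev) to one directory that lives on it.

    Symlinks are resolved first, so a link pointing at another mount is
    attributed to the filesystem that really holds the data.
    """
    representatives = {}
    for path in paths:
        real = os.path.realpath(path)
        directory = real if os.path.isdir(real) else os.path.dirname(real)
        representatives.setdefault(os.stat(directory).st_dev, directory)
    return representatives


def sync_filesystems(paths):
    """Flush every filesystem touched by `paths`; return how many were found."""
    groups = group_by_filesystem(paths)
    libc = ctypes.CDLL(None, use_errno=True)
    syncfs = getattr(libc, "syncfs", None)  # Linux-specific; may be absent
    for directory in groups.values():
        fd = os.open(directory, os.O_RDONLY)
        try:
            if syncfs is None or syncfs(fd) != 0:
                os.sync()  # fall back to flushing everything
        finally:
            os.close(fd)
    return len(groups)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(sync_filesystems([d]), "filesystem(s) synced")
```

This only covers the "which filesystems" half of the problem. The ordering concern raised above, where acknowledging sstable writes relies on fsync order, is not addressed by the sketch.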

horschi · 2024-12-30T17:07:20Z

During bootstrap scylla creates many hundreds of thousands of files. Even with a floppy being used on the server, it would still be faster than it currently is ;-)

Alternatively, I could imagine that not doing an fsync at all during bootstrap would also be acceptable. I think Linux flushes dirty pages in the background anyway, waking up every 5 seconds and writing out everything older than 30 seconds, or something like that.
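The 5-second and 30-second figures match the usual Linux defaults for the vm.dirty_writeback_centisecs and vm.dirty_expire_centisecs tunables (500 and 3000). A small Python sketch for checking what a given host actually uses follows; the function name is hypothetical and the /proc path is Linux-only. Background writeback is best-effort and is not a durability guarantee in the way fsync is.

```python
import os

VM_DIR = "/proc/sys/vm"  # Linux-only location of the writeback tunables


def read_writeback_settings(vm_dir=VM_DIR):
    """Return the flusher wake-up interval and dirty-page expiry in seconds.

    Values are None when a file is missing (for example on non-Linux hosts).
    """
    settings = {}
    for name in ("dirty_writeback_centisecs", "dirty_expire_centisecs"):
        try:
            with open(os.path.join(vm_dir, name)) as f:
                settings[name] = int(f.read().strip()) / 100.0  # centiseconds -> s
        except (OSError, ValueError):
            settings[name] = None
    return settings


if __name__ == "__main__":
    for key, seconds in read_writeback_settings().items():
        print(key, "=", seconds, "seconds")
```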

I doubt that building a complex mechanism for this is reasonable.

Another thought: Scylla uses fsync(int fd). Couldn't scylla instead use fdatasync(int fd) ?

> fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see [inode(7)](https://man7.org/linux/man-pages/man7/inode.7.html)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say [ftruncate(2)](https://man7.org/linux/man-pages/man2/ftruncate.2.html)), would require a metadata flush.
>
> The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.

raphaelsc · 2024-12-30T17:18:40Z

> During bootstrap scylla creates many hundreds of thousands of files. Even with a floppy being used on the server, it would still be faster than it currently is ;-)
>
> Alternatively, I could imagine that not doing an fsync at all during bootstrap would also be acceptable. I think Linux flushes dirty pages in the background anyway, waking up every 5 seconds and writing out everything older than 30 seconds, or something like that.
>
> I doubt that building a complex mechanism for this is reasonable.
>
> Another thought: Scylla uses fsync(int fd). Couldn't scylla instead use fdatasync(int fd) ?
>
> > fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see [inode(7)](https://man7.org/linux/man-pages/man7/inode.7.html)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say [ftruncate(2)](https://man7.org/linux/man-pages/man2/ftruncate.2.html)), would require a metadata flush.
> >
> > The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.

Although the interface says fsync (metric, config for bypassing, etc.), scylla actually uses fdatasync to guarantee the integrity of data written into files. See posix_file_impl::flush().
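For readers unfamiliar with the two calls, the difference can be sketched in a few lines of Python. This is an illustration of the system calls only, not of how Seastar's posix_file_impl::flush() is written; the helper name write_durably is hypothetical.

```python
import os
import tempfile


def write_durably(path, data, data_only=True):
    """Write `data` to `path` and flush it to stable storage.

    data_only=True uses fdatasync(), which may skip metadata such as mtime
    but still flushes the file size when it changed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may write only part of the buffer
            written = os.write(fd, view)
            view = view[written:]
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)              # data plus the metadata needed to read it back
        else:
            os.fsync(fd)                  # data plus all file metadata
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_durably(os.path.join(d, "example.db"), b"payload")
        print(os.listdir(d))
```

Neither call by itself makes the new directory entry durable; that needs a separate fsync on the containing directory.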

mykaul · 2024-12-30T17:57:59Z

> > fsync is hard to consolidate because each file needs to be individually fsynced
>
> Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?

You'd also need some code NOT to fsync() while in bootstrap in ALL relevant code paths.

michoecho · 2024-12-30T18:31:02Z

> > > fsync is hard to consolidate because each file needs to be individually fsynced
> >
> > Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?
>
> You'd also need some code NOT to fsync() while in bootstrap in ALL relevant code paths.

@mykaul Call seastar::engine().set_bypass_fsync(true) when you start streaming tables, call seastar::engine().set_bypass_fsync(false); co_await sync(); after you finish streaming tables and before you commit the bootstrap. To avoid working with any state possibly corrupted due to violated persistence guarantees, add an atomic way to detect that non-synced files possibly exist (e.g. some fsynced empty file in /var/lib/scylla that you create before the set_bypass_fsync(true) and remove after the sync(), whatever) and add a check at the beginning of scylla_main() which will exit immediately if that's the case, directing the administrator to wipe out (nuke, if you will) /var/lib/scylla and redo the bootstrap.
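The marker-file part of this proposal can be sketched as follows. This is a hedged Python illustration of the protocol only: the marker name, the function names, and the error message are all hypothetical, and the Seastar bypass calls mentioned above are represented just by comments.

```python
import os
import tempfile

MARKER = "UNSYNCED_BOOTSTRAP"  # hypothetical marker file name


def fsync_dir(directory):
    """Make directory entries (creations and removals) durable."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def begin_unsynced_phase(data_dir):
    """Durably record that unsynced files may exist from now on."""
    fd = os.open(os.path.join(data_dir, MARKER), os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    fsync_dir(data_dir)
    # ...the fsync bypass would be enabled here and streaming would start...


def end_unsynced_phase(data_dir):
    """Flush everything, then durably clear the marker."""
    # ...the fsync bypass would be disabled here...
    os.sync()
    os.remove(os.path.join(data_dir, MARKER))
    fsync_dir(data_dir)


def check_on_startup(data_dir):
    """Refuse to start on a data directory left behind by an unsynced bootstrap."""
    if os.path.exists(os.path.join(data_dir, MARKER)):
        raise RuntimeError(
            f"{data_dir} may contain unsynced files; wipe it and redo the bootstrap"
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        begin_unsynced_phase(d)
        end_unsynced_phase(d)
        check_on_startup(d)  # passes: the marker was removed after the sync
```

The ordering is what matters: the marker becomes durable before any unsynced write, and it is removed only after the final sync has completed.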

mykaul · 2024-12-30T18:56:53Z

@michoecho - I was under the impression it's not just the streaming (there are no tablets involved in this case, yet). I was under the impression it's the whole bootstrap process.

michoecho · 2024-12-30T19:03:20Z

> @michoecho - I was under the impression it's not just the streaming (there are no tablets involved in this case, yet). I was under the impression it's the whole bootstrap process.

Streaming/repair is the part that creates many files. But it doesn't matter, you can extend the sync bypass over the entire bootstrap process if you want. The important part is that you end the bypass and sync the filesystem before committing the bootstrap to raft, and that the node refuses to work with /var/lib/scylla constructed by an unsynced bootstrap.

avikivity · 2025-01-01T14:24:52Z

But doesn't the raft part come quite early? And after that, any raft changes must be synced.

We could skip syncing the non-raft changes, but it becomes hairy.

mykaul added the enhancement label Nov 4, 2024
