Faster bootstrap by not doing fsync during bootstrap #21432

Open
horschi opened this issue Nov 4, 2024 · 19 comments

Comments

@horschi

horschi commented Nov 4, 2024

Hi,

I think I don't have to explain that bootstrap is slow with scylla. I hope this does not make me a heretic :-)

I think it would be pretty safe for Scylla not to fsync during bootstrap, since the node has not fully joined yet anyway. Only a single fsync is needed, at the end of the bootstrap.

This would improve bootstrap performance dramatically and should be very low-hanging fruit.

The next level could be to optimize normal repair as well. If a repair session creates multiple sstables, there should be no need to fsync them individually; one fsync at the end of the repair session should be enough.

I think this would help a lot. For HDDs unsafe-fsync brings down bootstrap times by a factor of 20.
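
The proposal above boils down to deferring durability from "every file" to "once per batch". A minimal sketch of the two write patterns in Python, not Scylla code: the function name `write_files`, the `sync_each` flag and the file naming are made up for illustration, and the sketch ignores directory fsyncs.

```python
import os


def write_files(directory, count, payload=b"x" * 4096, sync_each=True):
    """Write `count` small files; fsync each one, or sync once at the end."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"part-{i}.db")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()                 # push Python's buffer to the kernel
            if sync_each:
                os.fsync(f.fileno())  # one disk flush per file
        paths.append(path)
    if not sync_each:
        os.sync()                     # one system-wide flush for the whole batch
    return paths
```

Timing `write_files(d, n, sync_each=True)` against `sync_each=False` on a spinning disk should show the kind of gap described here, although the exact ratio depends on the hardware and filesystem. Note that `os.sync()` flushes every filesystem on the machine, not only the target directory.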

@mykaul
Contributor

mykaul commented Nov 4, 2024

@horschi - can you clarify a bit - are you using Raft for the bootstrap process, etc.? Do you have specific points where fsync is an issue during the bootstrap process, that you think we should remove, or is it a general statement?

@horschi
Author

horschi commented Nov 4, 2024

@mykaul

Bootstrap creates many thousands of tiny files, which are very short-lived and will be compacted very quickly.

Last week I bootstrapped nodes into an existing cluster on spinning disks, with and without fsync. Bootstrapping with fsync took a day with multiple timeouts, while bootstrapping with unsafe-fsync took only an hour. It definitely makes a huge difference.

Sorry, I don't know the Scylla code well enough to point you to any specific fsync calls, but I saw the difference when using unsafe-fsync.

@mykaul
Contributor

mykaul commented Nov 5, 2024

@horschi - I'm a bit surprised bootstrap creates (1) many thousand and (2) tiny files.
@kbr-scylla - thoughts?

@horschi
Author

horschi commented Nov 5, 2024

Bootstrapping a 30GB node creates many thousands of sstables. It's basically not possible to bootstrap with fsync on spinning disks any more.

You get tons of these during bootstrap:

Nov 05 09:07:14 os-d-L311-3 scylla[25200]:  [shard 1:strm] repair - repair[554ed947-7e50-4fd3-acf2-df1a5fa9f631]: Started to repair 65 out of 89 tables in keyspace=pc, table=srlvltcnf, table_id=37614459-81a4-a32a-2fae-000000000000, repair_reason=bootstrap
Nov 05 09:07:14 os-d-L311-3 scylla[25200]:  [shard 1:strm] compaction - [Reshape pc.scr 5922c0f0-9b55-11ef-9d31-204eebdbfe46] Reshaping [/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_3kdww2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_3x8vk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_46ods2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbt_5i9cg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_0kkqo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_18snk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_238c02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_2ig682e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_31b6o2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_39ge82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_3dba82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_3z6bk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_4ee5s2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbu_57yz42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_02cvk2e20wb
z22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_08scw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_0kd0w2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_0tsj42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_1sxyo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_29ntc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_2ndn42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_2vqkg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_36vsw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_3qdyo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_4igrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_5b6ps2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbv_5x1r42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_0tktc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_142wg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_1d2z42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e8843800000
0000000/me-3gkz_0pbw_268cw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_2zt682e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_3ndxs2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_3zlr42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_4lw802e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_59wf42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbw_5oh402e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_097sg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_15d742e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_22swg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_2fvkw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_3elkw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_3u8uo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_4abk02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_4ktn42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_4yjgw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/
pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbx_5cwg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_04i1c2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_0cuyo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_17an42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_1g3002e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_28sy82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_2x0v42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_3blk02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_3jqrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_443sg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pby_5bm5c2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_0axio2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_1fva82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_1vq9s2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_2i0qo2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_2r8j42e20wbz22ak3q-big-Data.db:level=0:ori
gin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_31iwg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_3uoa82e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_4iohc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_5cwg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pbz_5sjps2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_01xg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_0hd002e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_0q5cw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_0zd5c2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_1axtc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_22dgw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_2j3bk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_2xo0g2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_3c8pc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_3gygg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_3uoa82e20w
bz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_43w2o2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_4igrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_51bs02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_5gbwg2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc0_5rh4w2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_0tsj42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_198342e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_1l0gw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_23nrk2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_2jy6o2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_2z60w2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_438xc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_4mjdc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_4vjg02e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc1_59wf42e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e884380000
00000000/me-3gkz_0pc1_5uwlc2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc2_08scw2e20wbz22ak3q-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/pc/scr-f46af460f18311e88438000000000000/me-3gkz_0pc2_0iv0g2e20wbz22ak3q-big-Data.db:level=0:origin=repair]

Scrolling down a few lines, I found an even longer compaction log line, so long that I cannot copy it out because it breaks my text editor. (Edit: My text editor still hangs, but it seems it's a compaction merging over 670 sstables.)

@kbr-scylla
Contributor

> @horschi - I'm a bit surprised bootstrap creates (1) many thousand and (2) tiny files.

@raphaelsc @denesb these are probably sstables created by streaming/repair and getting compacted by off-strategy compaction, right?

> I think it would be pretty safe for Scylla not to fsync during bootstrap, since the node has not fully joined yet anyway. Only a single fsync is needed, at the end of the bootstrap.

It's an interesting idea. In raft-based topology, if a node crashes or otherwise fails bootstrap, we permanently ban it from the cluster: it cannot be restarted unless you purge it (at which point it basically becomes a completely new node). So at the end of the repair/streaming phase, we could fsync everything in a batch. I think it's viable to consider, at least for vnodes mode. cc @gleb-cloudius

But for tablet migrations it might be harder to reason about, because we migrate tablets only to a normal node (which successfully bootstrapped). But perhaps it would still be possible -- we wouldn't fsync sstables created from migration until the end; we don't read from pending replica so it should be safe. If migration fails we have to perform it from scratch. cc @tgrabiec

@avikivity
Member

With tablets, bootstrap only transfers cluster metadata, so we don't save much there (and that metadata must be properly fsynced).

@horschi
Author

horschi commented Nov 5, 2024

I guess with tablets the number of sstables will be very low. But with vnodes it's bad.

@denesb
Contributor

denesb commented Nov 5, 2024

With vnodes, it is a well-known problem that bootstrap (or indeed any streaming and repair) will create many files. The number of files scales with the number of nodes and the number of tables, and it can easily get into the thousands.

@mykaul
Contributor

mykaul commented Nov 5, 2024

> With vnodes, it is a well-known problem that bootstrap (or indeed any streaming and repair) will create many files. The number of files scales with the number of nodes and the number of tables, and it can easily get into the thousands.

But the question of fsync (and when) is still valid and interesting.

@avikivity
Member

fsync is hard to consolidate because each file needs to be individually fsynced, and there's no good way to batch them. If any write sneaks in between two fsyncs, it results in two separate disk flushes.

@michoecho
Contributor

michoecho commented Nov 5, 2024

> fsync is hard to consolidate because each file needs to be individually fsynced

Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?

@avikivity
Member

> > fsync is hard to consolidate because each file needs to be individually fsynced
>
> Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?

To call syncfs, you have to collect all affected filesystems. This can be tricky with bind mounts and soft links. It can be done but isn't trivial.

sync() syncs too much, so if you're writing to a floppy in parallel it wouldn't work well, but practically speaking it should work.

We'd also need to check the protocol for acknowledging sstable writes; right now it relies on fsync order. We'd need some sort of super-transaction that spans many sstable creations.

I don't think it's worth the effort, with tablets obsoleting all that.
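
The "collect all affected filesystems" step mentioned above can be illustrated with a small sketch. It is an illustration in Python under stated assumptions, not what Scylla does: the helper name `filesystems_for` is hypothetical, it identifies a filesystem by the `st_dev` device ID, and it does not treat bind mounts specially. Python's standard library has no `syncfs()` wrapper, so the sketch stops at the grouping step; a real implementation would call `syncfs(2)` once on a file descriptor from each group.

```python
import os


def filesystems_for(paths):
    """Group paths by the device ID of the filesystem that holds them.

    os.stat() follows symlinks, so a soft link pointing into another
    mount is attributed to the filesystem that actually stores the data.
    """
    by_device = {}
    for path in paths:
        dev = os.stat(path).st_dev
        by_device.setdefault(dev, []).append(path)
    return by_device
```

If every data directory resolves to the same device ID, a single `syncfs()` call would cover them; if the result has several keys, one call per key is needed.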

@horschi
Author

horschi commented Dec 30, 2024

During bootstrap Scylla creates many hundreds of thousands of files. Even with a floppy being used on the server, it would still be faster than it currently is ;-)

Alternatively, I could imagine that not doing an fsync at all during bootstrap would also be acceptable. I think Linux triggers a background writeback every 5 seconds, writing out everything that has been dirty for more than 30 seconds or something like that.
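
The 5 and 30 second figures match the usual defaults of the `vm.dirty_writeback_centisecs` and `vm.dirty_expire_centisecs` sysctls (500 and 3000 centiseconds), though distributions and administrators can change them. A small Python sketch to read the values on a given machine; the function name and the `vm_dir` parameter are illustrative only:

```python
import os


def read_writeback_settings(vm_dir="/proc/sys/vm"):
    """Return the kernel's periodic writeback settings, in seconds."""
    def read_centisecs(name):
        with open(os.path.join(vm_dir, name)) as f:
            return int(f.read().strip()) / 100.0

    return {
        # how often the flusher threads wake up
        "writeback_interval_s": read_centisecs("dirty_writeback_centisecs"),
        # how old dirty data must be before it is written out
        "expire_age_s": read_centisecs("dirty_expire_centisecs"),
    }
```

This background writeback is best effort: unlike fsync, it gives no guarantee that a particular file is on disk at any given moment.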

I doubt that building a complex mechanism for this is reasonable.

Another thought: Scylla uses fsync(int fd). Couldn't Scylla use fdatasync(int fd) instead?

fdatasync() is similar to fsync(), but does not flush modified
       metadata unless that metadata is needed in order to allow a
       subsequent data retrieval to be correctly handled.  For example,
       changes to st_atime or st_mtime (respectively, time of last
       access and time of last modification; see [inode(7)](https://man7.org/linux/man-pages/man7/inode.7.html)) do not
       require flushing because they are not necessary for a subsequent
       data read to be handled correctly.  On the other hand, a change
       to the file size (st_size, as made by say [ftruncate(2)](https://man7.org/linux/man-pages/man2/ftruncate.2.html)), would
       require a metadata flush.

       The aim of fdatasync() is to reduce disk activity for
       applications that do not require all metadata to be synchronized
       with the disk.
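
For readers unfamiliar with the call, here is a minimal Python sketch of writing with `fdatasync()` instead of `fsync()`. The helper name `append_durably` is made up for illustration; on platforms where `os.fdatasync` is missing, the sketch falls back to `os.fsync`.

```python
import os

# os.fdatasync is not available everywhere (e.g. macOS); fall back to fsync.
_datasync = getattr(os, "fdatasync", os.fsync)


def append_durably(path, data):
    """Append `data` to `path` and flush the file contents to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:          # os.write may write only part of the buffer
            written = os.write(fd, view)
            view = view[written:]
        _datasync(fd)                 # data (and size), but not timestamps
    finally:
        os.close(fd)
```

As the man page excerpt says, an append changes the file size, so that piece of metadata is still flushed; the saving comes from skipping timestamp-only updates.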

@raphaelsc
Member

> Another thought: Scylla uses fsync(int fd). Couldn't Scylla use fdatasync(int fd) instead?

Although the interface says fsync (metric, config for bypassing, etc.), Scylla actually uses fdatasync to guarantee the integrity of data written into files. See posix_file_impl::flush().

@mykaul
Contributor

mykaul commented Dec 30, 2024

> > fsync is hard to consolidate because each file needs to be individually fsynced
>
> Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?

You'd also need some code NOT to fsync() while in bootstrap in ALL relevant code paths.

@michoecho
Contributor

michoecho commented Dec 30, 2024

> > > fsync is hard to consolidate because each file needs to be individually fsynced
> >
> > Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft?
>
> You'd also need some code NOT to fsync() while in bootstrap in ALL relevant code paths.

@mykaul Call seastar::engine().set_bypass_fsync(true) when you start streaming tables, call seastar::engine().set_bypass_fsync(false); co_await sync(); after you finish streaming tables and before you commit the bootstrap. To avoid working with any state possibly corrupted due to violated persistence guarantees, add an atomic way to detect that non-synced files possibly exist (e.g. some fsynced empty file in /var/lib/scylla that you create before the set_bypass_fsync(true) and remove after the sync(), whatever) and add a check at the beginning of scylla_main() which will exit immediately if that's the case, directing the administrator to wipe out (nuke, if you will) /var/lib/scylla and redo the bootstrap.
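
The marker-file part of this scheme can be sketched independently of Seastar. The Python below is an illustration under stated assumptions, not Scylla code: the marker name `UNSYNCED_BOOTSTRAP` and all function names are hypothetical, and the step that actually disables per-file syncs (the `set_bypass_fsync` calls above) is left out.

```python
import os

MARKER = "UNSYNCED_BOOTSTRAP"  # hypothetical marker file name


def _fsync_dir(directory):
    """Persist directory entries (file creation or removal)."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def begin_unsynced_phase(data_dir):
    """Durably record that unsynced files may exist from now on."""
    fd = os.open(os.path.join(data_dir, MARKER), os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    _fsync_dir(data_dir)


def end_unsynced_phase(data_dir):
    """Flush everything, then durably clear the marker."""
    os.sync()
    os.remove(os.path.join(data_dir, MARKER))
    _fsync_dir(data_dir)


def check_startup(data_dir):
    """Refuse to start on a directory left behind by an interrupted phase."""
    if os.path.exists(os.path.join(data_dir, MARKER)):
        raise RuntimeError(
            f"{data_dir} may contain unsynced files; wipe it and redo the bootstrap"
        )
```

The ordering is what matters: the marker is made durable before any unsynced write happens, and it is removed only after the global sync has returned, so a crash at any point in between leaves the marker in place for `check_startup` to find.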

@mykaul
Contributor

mykaul commented Dec 30, 2024

@michoecho - I was under the impression it's not just the streaming (there are no tablets involved in this case, yet). I was under the impression it's the whole bootstrap process.

@michoecho
Contributor

michoecho commented Dec 30, 2024

> @michoecho - I was under the impression it's not just the streaming (there are no tablets involved in this case, yet). I was under the impression it's the whole bootstrap process.

Streaming/repair is the part that creates many files. But it doesn't matter, you can extend the sync bypass over the entire bootstrap process if you want. The important part is that you end the bypass and sync the filesystem before committing the bootstrap to raft, and that the node refuses to work with /var/lib/scylla constructed by an unsynced bootstrap.

@avikivity
Member

But doesn't the raft part come quite early? And after that, any raft changes must be synced.

We could skip syncing the non-raft changes, but it becomes hairy.
