Faster bootstrap by not doing fsync during bootstrap #21432
Comments
@horschi - can you clarify a bit - are you using Raft for the bootstrap process, etc.? Do you have specific points where fsync is an issue during the bootstrap process, that you think we should remove, or is it a general statement?
Bootstrap creates many thousands of tiny files, which are very short-lived and will be compacted very quickly. Last week I bootstrapped into an existing cluster on spinning disks, with and without fsync. Bootstrapping with fsync took a day with multiple timeouts, while bootstrapping with unsafe-fsync took only an hour. It definitely makes a huge difference. Sorry, I don't know the Scylla code well enough to point you to any specific fsync calls, but I saw the difference when using unsafe-fsync.
@horschi - I'm a bit surprised bootstrap creates (1) many thousands of and (2) tiny files.
Bootstrapping a 30GB node creates many thousands of sstables. It's basically not possible to bootstrap with fsync on spinning disks any more. You get tons of these during bootstrap:
Scrolling down a few lines I found an even longer compaction log line, so long that I cannot copy it out, as it breaks my text editor. (Edit: My text editor still hangs, but it seems it's a compaction merging over 670 sstables)
@raphaelsc @denesb these are probably sstables created by streaming/repair and getting compacted by off-strategy compaction, right?
It's an interesting idea. In raft-based topology, if a node crashes or otherwise fails bootstrap, we permanently ban it from the cluster, it cannot be restarted, unless you purge it (at which point it basically becomes a completely new node). So at the end of repair/streaming phase, we could fsync everything in a batch. I think it's viable to consider. At least for vnodes mode. cc @gleb-cloudius But for tablet migrations it might be harder to reason about, because we migrate tablets only to a normal node (which successfully bootstrapped). But perhaps it would still be possible -- we wouldn't fsync sstables created from migration until the end; we don't read from pending replica so it should be safe. If migration fails we have to perform it from scratch. cc @tgrabiec
With tablets, bootstrap only transfers cluster metadata, so we don't save much there (and that metadata must be properly fsynced).
I guess with tablets the number of sstables will be very low. But with vnodes it's bad.
With vnodes, it is a well-known problem that bootstrap (or indeed any streaming and repair) will create many files. The number of files scales with the number of nodes and the number of tables, and it can easily get into the thousands.
But the question of fsync (and when) is still valid and interesting.
fsync is hard to consolidate because each file needs to be individually fsynced, and there's no good way to batch them. If any write sneaks in between two fsyncs, it results in two separate disk flushes.
Can't you just call sync()/syncfs() on the entire filesystem between the end of bootstrap and the commit to raft? |
To call syncfs, you have to collect all affected filesystems. This can be tricky with bind mounts and soft links. It can be done but isn't trivial. sync() syncs too much, so if you're writing to a floppy in parallel it wouldn't work well, but practically speaking it should work. We'd also need to check the protocol for acknowledging sstable writes; right now it relies on fsync order. We'd need some sort of super-transaction that spans many sstable creations. I don't think it's worth the effort, with tablets obsoleting all that.
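To make the "collect all affected filesystems" step concrete, here is a minimal Python sketch, not Scylla code. The helper names filesystems_for and sync_all are hypothetical. It assumes that resolving soft links with realpath and grouping by st_dev is a good enough way to identify a filesystem, which may not hold for every setup. As far as I know the Python standard library does not expose syncfs(), so the sketch falls back to the coarse os.sync().

```python
import os

def filesystems_for(paths):
    """Map each distinct filesystem (keyed by st_dev) to one resolved path.

    Hypothetical helper: grouping by st_dev is an assumption about how to
    tell filesystems apart, not something taken from Scylla.
    """
    seen = {}
    for p in paths:
        real = os.path.realpath(p)    # resolve soft links first
        dev = os.stat(real).st_dev    # device ID of the containing filesystem
        seen.setdefault(dev, real)    # keep the first path seen per device
    return seen

def sync_all(paths):
    """Coarse fallback: flush everything, the way sync() would."""
    affected = filesystems_for(paths)
    os.sync()                         # flushes all filesystems, not only `affected`
    return affected
```

With a real syncfs() binding, one would open one representative path per entry of the returned mapping and sync only those filesystems instead of calling os.sync().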
During bootstrap Scylla creates many hundreds of thousands of files. Even with a floppy being used on the server, it would still be faster than it currently is ;-) Alternatively, I could imagine that not doing an fsync at all during bootstrap would also be acceptable. I think Linux flushes dirty pages every 5 seconds, writing out everything older than 30 seconds, or something like that. Building a complex mechanism for this seems unreasonable to me. Another thought: Scylla uses fsync(int fd). Couldn't Scylla instead use fdatasync(int fd)?
Although the interface says fsync (metric, config for bypassing, etc.), Scylla actually uses fdatasync to guarantee the integrity of data written into files. See posix_file_impl::flush().
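For readers unfamiliar with the distinction: fdatasync flushes a file's data plus only the metadata needed to read it back, while fsync also flushes the remaining metadata such as timestamps. The Python sketch below only illustrates the call itself; write_durably is a hypothetical helper and says nothing about how posix_file_impl::flush() is implemented.

```python
import os

# os.fdatasync is not available on every platform; fall back to os.fsync there.
_data_sync = getattr(os, "fdatasync", os.fsync)

def write_durably(path, payload):
    """Write payload to path and flush its data to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:                   # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        _data_sync(fd)                # data flush, skipping non-essential metadata
    finally:
        os.close(fd)
    return len(payload)
```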
You'd also need some code NOT to fsync() while in bootstrap in ALL relevant code paths. |
@mykaul Call |
@michoecho - I was under the impression it's not just the streaming (there are no tablets involved in this case, yet). I was under the impression it's the whole bootstrap process.
Streaming/repair is the part that creates many files. But it doesn't matter, you can extend the sync bypass over the entire bootstrap process if you want. The important part is that you end the bypass and sync the filesystem before committing the bootstrap to raft, and that the node refuses to work with |
But doesn't the raft part come quite early? And after that, any raft changes must be synced. We could skip syncing the non-raft changes, but it becomes hairy.
Hi,
I think I don't have to explain that bootstrap is slow with Scylla. I hope this does not make me a heretic :-)
I think it would be pretty safe for Scylla to not fsync during bootstrap, since the node is not fully joined yet anyway. Only one fsync is needed, at the end of the bootstrap.
This would improve bootstrap performance dramatically and should be very low-hanging fruit.
The next level could be to also optimize normal repair. If a repair session creates multiple sstables, there should be no need to fsync each one as it is written; the syncing could be done once at the end of the repair session.
I think this would help a lot. For HDDs unsafe-fsync brings down bootstrap times by a factor of 20.
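The proposal above (no syncing while the session runs, one sync pass at the end) could look roughly like this. It is a minimal Python sketch under stated assumptions, not Scylla code: write_session is a hypothetical name, and calling fsync on a read-only descriptor and on a directory is assumed to work as it does on Linux.

```python
import os

def write_session(directory, files):
    """Write every file without syncing, then sync all of them at the end."""
    paths = []
    for name, payload in files.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(payload)          # no fsync here; durability is deferred
        paths.append(path)

    # End of session: make everything durable in one pass.
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    # Sync the directory too, so the new directory entries are durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return paths
```

Note that this still calls fsync once per file. What changes is that the calls are issued back to back at the end instead of being interleaved with writes, and that nothing written during the session may be treated as durable until the final pass completes.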