feat: database garbage collection #2638
Conversation
Ran benchmarks with the entire snapshot in the "old" partition and it doesn't affect performance in any way I can measure.
Alright!
I'll add issues for the following improvements:
- Load data directly into the old space rather than the new space.
- Show progress during garbage collection (and chain exporting).
- Possibly get rid of the `Store` trait altogether.
- Yield during GC to allow Forest to shut down properly.
It'd be fantastic to make a blogpost about it :)
let db_garbage_collector = {
    let db = db.clone();
    let chain_store = chain_store.clone();
    let get_tipset = move || chain_store.heaviest_tipset().as_ref().clone();
    Arc::new(DbGarbageCollector::new(db, get_tipset))
};

#[allow(clippy::redundant_async_block)]
services.spawn({
    let db_garbage_collector = db_garbage_collector.clone();
    async move { db_garbage_collector.collect_loop_passive().await }
});
#[allow(clippy::redundant_async_block)]
services.spawn({
    let db_garbage_collector = db_garbage_collector.clone();
    async move { db_garbage_collector.collect_loop_event().await }
});
It'd be nice to move this to a separate method. To my understanding, those calls generally need each other.
`collect_loop_passive` and `collect_loop_event` work independently; either or both can be disabled.
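A rough sketch of what the suggested extraction could look like, with either loop skippable since they are independent (the helper name, the `JoinSet<()>` item type, and the loops returning `()` are all assumptions, not the PR's actual types):

```rust
use std::future::Future;
use tokio::task::JoinSet;

// Hypothetical helper grouping the two GC loops; pass `None` for a loop that
// is disabled. The concrete `services` type is assumed to be a `JoinSet<()>`.
fn spawn_gc_loops<P, E>(services: &mut JoinSet<()>, passive: Option<P>, event: Option<E>)
where
    P: Future<Output = ()> + Send + 'static,
    E: Future<Output = ()> + Send + 'static,
{
    if let Some(fut) = passive {
        services.spawn(fut);
    }
    if let Some(fut) = event {
        services.spawn(fut);
    }
}
```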
forest/daemon/src/daemon.rs
Outdated
@@ -583,28 +600,28 @@ mod test {

 use super::*;

-#[tokio::test]
+#[tokio::test(flavor = "multi_thread")]
Why do we need it?
Snapshot import spawns a task for writing to the DB now, so a single thread becomes insufficient.
Why is the single threaded runtime insufficient? Surely it can run multiple tasks in parallel.
The test deadlocked, so I made the change.
I think we need to debug why that happened. We shouldn't be doing anything blocking.
Just reverted the multi_threading change and the deadlock is gone
350fd3f
The last time I encountered a deadlock with tokio, it was because there were multiple runtimes. It was a bug. :)
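For illustration, this is the classic way a test on the default current-thread runtime deadlocks when the only worker thread is blocked while waiting on another task; it is a generic sketch (assuming the `futures` crate is available), not necessarily the actual cause in this PR:

```rust
#[tokio::test] // current_thread flavor by default
async fn deadlocks_on_a_single_worker() {
    let (tx, rx) = tokio::sync::oneshot::channel::<()>();

    // The spawned task can only make progress when the single worker thread
    // is free to poll it.
    tokio::spawn(async move {
        tx.send(()).unwrap();
    });

    // Blocking the worker thread here (instead of `.await`ing) means the
    // spawned task is never polled, so the message never arrives: deadlock.
    // With `#[tokio::test(flavor = "multi_thread")]` another worker would
    // pick up the spawned task and the test would pass.
    futures::executor::block_on(rx).unwrap();
}
```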
//! used to speed up the database write operation
//!
//! ## Scheduling
//! 1. GC is triggered automatically when total DB size is greater than 2x of
It'd be nice to add an actual mainnet example.
Do you mean an example output in the log?
Just example numbers with mainnet.
Example numbers added
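To make the scheduling rule concrete, here is a minimal sketch of a size-based trigger of the shape the module docs describe; the excerpt above is cut off, so comparing against the estimated reachable data size is an assumption here:

```rust
// Hypothetical trigger check: run GC once the database has grown to more than
// twice the size of the data we expect to keep.
fn should_trigger_gc(total_db_size_bytes: u64, estimated_reachable_bytes: u64) -> bool {
    total_db_size_bytes > 2 * estimated_reachable_bytes
}
```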
node/db/src/rolling/gc.rs
Outdated
.ok_or_else(|| anyhow::anyhow!("Cid {cid} not found in blockstore"))?;

let pair = (cid, block.clone());
// Key size is 32 bytes in paritydb
So, does this work at all with RocksDb? If not, how does Forest behave when RocksDb is used?
All functionality works with RocksDB; however, I did not test the crash-resilience story for RocksDB, so it might not be able to recover when Forest is shut down improperly. I believe we had crash-resilience issues with RocksDB before, so the potential regression is not introduced by this change.
So you tested that the GC works with rocksdb? How long does it take? I'm asking because the key size value is hardcoded here.
The time cost is similar. The key size is used for estimating the reachable data size in the GC trigger condition. However, since RocksDB is compressed while ParityDB is (mostly) not, the GC trigger interval for RocksDB tends to be longer.
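A minimal sketch of how a fixed key size feeds into that reachable-size estimate while the graph is walked (the constant and function names here are illustrative, not the PR's exact code):

```rust
// ParityDB keys are a fixed 32 bytes, so each record's on-disk footprint can be
// approximated as key size plus block size without querying the database.
const PARITYDB_KEY_SIZE: u64 = 32;

fn estimate_reachable_size<'a>(reachable_blocks: impl Iterator<Item = &'a [u8]>) -> u64 {
    reachable_blocks
        .map(|block| PARITYDB_KEY_SIZE + block.len() as u64)
        .sum()
}
```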
//! This module contains a concurrent, semi-space garbage collector. The garbage
//! collector is guaranteed to be non-blocking and can be expected to run with a
//! fixed memory overhead and require disk space proportional to the size of the
//! reachable graph. For example, if the size of the reachable graph is 100GiB,
//! expect this garbage collector to use 3x100GiB = 300GiB of storage.
Before merging this, we should update the specs of our DO droplets.
Agreed, 320GB is not sufficient if Docker and the Rust toolchain need to be installed.
But we could mount a volume of ~500GB on the fly and symlink the data folder.
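For readers unfamiliar with the term, here is a self-contained toy illustration of the semi-space copying idea from the module docs, with in-memory maps standing in for the on-disk "old" and "new" partitions (all names are illustrative):

```rust
use std::collections::{HashMap, HashSet};

type Key = u64;
type Block = Vec<u8>;

/// Toy semi-space collection: walk the graph from the roots, copy every
/// reachable block from the old space into a fresh new space, then drop the
/// old space. Anything unreachable is never copied and is thereby reclaimed.
fn collect(
    old_space: HashMap<Key, (Block, Vec<Key>)>, // block data + outgoing links
    roots: &[Key],
) -> HashMap<Key, (Block, Vec<Key>)> {
    let mut new_space = HashMap::new();
    let mut visited = HashSet::new();
    let mut stack: Vec<Key> = roots.to_vec();

    while let Some(key) = stack.pop() {
        if !visited.insert(key) {
            continue;
        }
        if let Some((block, links)) = old_space.get(&key) {
            // Copy the reachable block into the new space and follow its links.
            new_space.insert(key, (block.clone(), links.clone()));
            stack.extend(links.iter().copied());
        }
    }
    // `old_space` is dropped here, reclaiming everything that was not copied.
    new_space
}
```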
}
}
self.put_many_keyed(buffer)?;
info!(
Should this be debug instead? Also, I'm not sure starting timers by default makes sense in a performance-related method.
This method is called either during snapshot import or during database garbage collection, which take 1-2h on a laptop or 3-5h on a DO droplet. The info log should never flood the screen, but it's quite useful for collecting metrics. I would prefer to keep it at info level so that it's available in the log by default. What do you think?
If it's helpful, perhaps let's have a proper Prometheus metric? Like, average speed or something like this.
In this case, I think a log of individual runs is more helpful than aggregated metrics for understanding the details and nature of the database and for troubleshooting. How about keeping it as it is, given it only emits 2-3 lines every 1.5-2 days?
Ok
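For reference, the kind of Prometheus metric suggested above could look roughly like this (a sketch assuming the `prometheus` and `once_cell` crates; the metric name is made up, and the PR kept the info-level log instead):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};

// Hypothetical histogram; average flush speed could be derived from it.
static DB_FLUSH_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "forest_db_flush_duration_seconds",
        "Time spent flushing a buffer of blocks to the database"
    )
    .expect("failed to register metric")
});

fn flush_with_metric<E>(flush: impl FnOnce() -> Result<(), E>) -> Result<(), E> {
    // The timer records the elapsed time into the histogram when stopped.
    let timer = DB_FLUSH_SECONDS.start_timer();
    let result = flush();
    timer.observe_duration();
    result
}
```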
Co-authored-by: Hubert <[email protected]>
Summary of changes
Changes introduced in this pull request:
ChainStore
forest-cli
(Blocked by #2635)
Reference issue to close (if applicable)
Closes
#2292
#1708
Other information and links
Change checklist