
Discussion: use Rocksdb as a StateStore #5684

Closed
lmatz opened this issue Oct 3, 2022 · 4 comments
lmatz commented Oct 3, 2022

The goals of integrating RocksDB are:

  1. providing a performance baseline for RisingWave with a non-in-memory state store.
  2. Also, RocksDB is the popular state backend used by Flink.

It may also reveal additional insights for optimizing the stream executors and storage, but we are not 100% sure.

So we are looking for a low-effort approach to integrate RocksDB into our system, as we currently don't know whether it would indeed help a lot.

We used to have RocksDB as one of the StateStore implementations, but it:

  1. required extra effort to maintain whenever the StateStore interface changed.
  2. slowed down compilation a lot.

Now (1) seems much less critical, as our interface is relatively stable.

Approaches:

  1. Maintain an independent branch that hardcodes the use of RocksDB, if we don't want it to interfere with our normal development. As we may not run it very often, we can just manually merge the branch with main weekly, bi-weekly, or on demand.
  2. Or use dynamic linking, as suggested by @BugenZhao.

Reference: https://github.com/BugenZhao/humrock
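For concreteness, a minimal sketch of what a RocksDB-backed state store could look like using the rocksdb crate. The trait here is a simplified stand-in for illustration, not RisingWave's actual StateStore interface (which is async and much richer):

```rust
use std::sync::Arc;

use rocksdb::{Options, DB};

// Simplified stand-in for the StateStore interface, for illustration only;
// the real trait is async and includes iterators, epochs, etc.
pub trait SimpleStateStore: Send + Sync {
    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, rocksdb::Error>;
    fn put(&self, key: &[u8], value: &[u8]) -> Result<(), rocksdb::Error>;
}

pub struct RocksDbStateStore {
    // One shared instance for the entire compute node.
    db: Arc<DB>,
}

impl RocksDbStateStore {
    pub fn open(path: &str) -> Result<Self, rocksdb::Error> {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        Ok(Self {
            db: Arc::new(DB::open(&opts, path)?),
        })
    }
}

impl SimpleStateStore for RocksDbStateStore {
    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        self.db.get(key)
    }

    fn put(&self, key: &[u8], value: &[u8]) -> Result<(), rocksdb::Error> {
        self.db.put(key, value)
    }
}
```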

@lmatz lmatz added the type/perf label Oct 3, 2022
@github-actions github-actions bot added this to the release-0.1.14 milestone Oct 3, 2022
hzxa21 commented Oct 9, 2022

If we are looking for a low-effort approach to try RocksDB and see whether it can help with streaming executor optimization, how about porting the deleted RocksDB state store code to an independent branch and trying it out first?

Note that we can mimic both one RocksDB instance per CN and one RocksDB instance per executor.
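A rough sketch of how the two layouts could be mimicked; the directory layout and the executor-id parameter here are made up for illustration:

```rust
use rocksdb::{Options, DB};

fn open_db(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    DB::open(&opts, path)
}

// One instance per compute node: every executor shares this handle.
fn open_per_cn(data_dir: &str) -> Result<DB, rocksdb::Error> {
    open_db(&format!("{data_dir}/rocksdb"))
}

// One instance per executor: each executor gets its own directory.
fn open_per_executor(data_dir: &str, executor_id: u64) -> Result<DB, rocksdb::Error> {
    open_db(&format!("{data_dir}/rocksdb-{executor_id}"))
}
```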

jon-chuang commented Nov 17, 2022

Summary

My suggested path forward:

  1. Use a shared instance for the entire compute node.
  2. Do not consider recovery or creating snapshots. Recovery is inherited from local disk/EBS persistence.
  3. OK with using EBS initially.
  4. Build: use the rocksdb crate, which vendors the RocksDB source code. Put it behind a feature flag and run CI for that flag periodically (every month; how to achieve this still needs to be defined). We should not need to do much maintenance except to update periodically when the StateStore interface changes or to bump the rocksdb crate version.
  5. (As per the previous implementation) use the tokio blocking thread pool to deal with the sync/async boundary.

Thus, my main tasks now are to update the previous implementation to the current interface, merge it behind a feature flag into main, and ask the CI team how to schedule a periodic CI run (it should not interfere with the release build and should only alert us to update the code behind the feature flag if necessary).

It would also be nice if we could trigger this CI run for a particular PR.

Thereafter, we can run some benchmarks using EBS storage as part of our performance and benchmarking efforts and see where we go from there.

Updated tasks:

  • Update the previous implementation
  • CI/cron script that tries to merge main into the rocksdb branch

Details

one RocksDB instance per executor.

Is this a good idea, @hzxa21? It seems like overkill, but we can test it anyway. I guess it could work, since the state does not overlap between executors.

If we use a shared instance instead, we can even use Column Families, which may provide better performance (similar to our compaction groups). Anyway, we do not need to over-optimize for now.

In any case, at least the old implementation used a single instance for the entire compute node. It seems Flink does too (e.g. see: https://towardsdatascience.com/heres-how-flink-stores-your-state-7b37fbb60e1a)
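For illustration, a sketch of the shared-instance layout with one column family per state table; the column family names here are hypothetical:

```rust
use rocksdb::{ColumnFamilyDescriptor, Options, DB};

// One shared RocksDB instance per compute node, with one column family
// per state table, loosely mirroring our compaction groups.
fn open_shared_instance(path: &str) -> Result<DB, rocksdb::Error> {
    let mut db_opts = Options::default();
    db_opts.create_if_missing(true);
    db_opts.create_missing_column_families(true);

    let cfs = vec![
        ColumnFamilyDescriptor::new("state_table_1", Options::default()),
        ColumnFamilyDescriptor::new("state_table_2", Options::default()),
    ];
    DB::open_cf_descriptors(&db_opts, path, cfs)
}

fn put_to_table(db: &DB, table: &str, key: &[u8], value: &[u8]) -> Result<(), rocksdb::Error> {
    // `cf_handle` returns None if the column family was never created.
    let cf = db.cf_handle(table).expect("unknown state table");
    db.put_cf(cf, key, value)
}
```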

Scaling/Recovery

Anyway, I guess there is no path to scaling or recovery for this RocksDB local store at the moment, so it will behave similarly to the in-memory store, except that we persist to disk. In the case of Flink, they actually periodically persist snapshots (incremental or full) to remote storage. I'm not sure we will do the same, but I guess as a first iteration we will ignore this. (See: RocksDB backend in Flink.)

Our objective is just to benchmark the performance at a steady state, assuming no faults... However, I guess not creating and uploading snapshots will be a key difference from Flink's implementation and will affect performance. But we will ignore this for now.

Build

Approach:

maintain an independent branch that hardcodes the use of RocksDB if we don't want it to interfere with our normal development. As we may not run it very often, we can just manually merge the branch with main weekly, bi-weekly, or on demand.
or use dynamic linking as suggested by @BugenZhao

As for the approach, I would rather maintain it as a feature on main; we may or may not run a basic CI on it (so it doesn't get forgotten). I don't think the maintenance burden will be too high. And anyway, a feature flag can also be used to conditionally include dependencies.

Regarding lazily updating, I am on board with this. I think we can keep it outside of CI but maintain it on main behind a feature flag. We probably want to sync it at least every month or so. So perhaps we can run CI with this feature flag every time we do a release (just to have a periodic check)? Suggestions welcome.
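As a sketch of what the feature gating could look like; the `rocksdb-backend` feature name, module, and enum here are made up, and the Cargo.toml side (shown in the comments) is assumed:

```rust
// Assumed Cargo.toml side:
//   [dependencies]
//   rocksdb = { version = "0.19", optional = true }
//   [features]
//   rocksdb-backend = ["dep:rocksdb"]

// The module (and thus the rocksdb dependency) is only compiled when the
// feature is enabled, so normal builds avoid the RocksDB compile-time cost.
#[cfg(feature = "rocksdb-backend")]
mod rocksdb_state_store {
    // The RocksDB-backed StateStore implementation would live here.
}

pub enum StateStoreImpl {
    InMemory,
    Hummock,
    #[cfg(feature = "rocksdb-backend")]
    RocksDb,
}

// Parse the backend selection from config; "rocksdb" is only accepted
// when the binary was built with the feature enabled.
pub fn parse_backend(s: &str) -> Option<StateStoreImpl> {
    match s {
        "in-memory" => Some(StateStoreImpl::InMemory),
        "hummock" => Some(StateStoreImpl::Hummock),
        #[cfg(feature = "rocksdb-backend")]
        "rocksdb" => Some(StateStoreImpl::RocksDb),
        _ => None,
    }
}
```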

As for dynamic linking, the previously used Rust library rust-rocksdb actually vendors the RocksDB source code and builds it with clang, so the crate contains the RocksDB sources, if I'm not wrong. According to tikv/tikv#4338, I don't think it's worth looking too much into dynamic linking; it may not even produce better compile times...

RocksDB + Async

It seems RocksDB will block the thread, as it does not have async APIs. The old implementation used the blocking thread pool in tokio. I think this is acceptable for our use case.
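A minimal sketch of bridging the sync/async boundary with tokio's blocking pool, roughly in the spirit of the old implementation; the function shape here is assumed:

```rust
use std::sync::Arc;

use rocksdb::DB;

// RocksDB calls are synchronous and may touch disk, so run each one on
// tokio's blocking thread pool instead of the async worker threads.
async fn get_async(db: Arc<DB>, key: Vec<u8>) -> anyhow::Result<Option<Vec<u8>>> {
    tokio::task::spawn_blocking(move || {
        // Runs on the blocking pool; `db` and `key` are moved in so the
        // closure is 'static.
        db.get(&key).map_err(anyhow::Error::from)
    })
    .await? // propagates a JoinError if the blocking task panicked
}
```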

Using the RocksDB backend with EBS v.s. Locally-attached SSD

We should probably try out both. Some people seem to say not to use network-attached storage such as EBS due to performance (see: https://stackoverflow.com/questions/61655220/how-to-store-checkpoint-into-remote-rocksdb-in-apache-flink). However, Ververica itself seems to use EBS in some cases (https://www.ververica.com/blog/the-impact-of-disks-on-rocksdb-state-backend-in-flink-a-case-study).

Further, EBS has higher fault tolerance than local disk, so we may be able to more justifiably ignore the lack of fault tolerance in our implementation of this state backend.

Some additional info about configuring EBS in TiKV's case: https://www.pingcap.com/blog/best-practices-for-tidb-on-aws-cloud/

Anyway, this is on the operational/user side of things and has no actual bearing on the RocksDB backend.

hzxa21 commented Nov 17, 2022

My suggested path forward:

  1. Use a shared instance for the entire compute node.
  2. Do not consider recovery or creating snapshots. Recovery is inherited from local disk/EBS persistence.
  3. OK with using EBS initially.
  4. Build: use the rocksdb crate, which vendors the RocksDB source code. Put it behind a feature flag and run CI for that flag periodically (every month; how to achieve this still needs to be defined). We should not need to do much maintenance except to update periodically when the StateStore interface changes or to bump the rocksdb crate version.
  5. (As per the previous implementation) use the tokio blocking thread pool to deal with the sync/async boundary.

+1. Basically I think we should keep the implementation as simple as possible so there is no need to consider checkpointing/failover.

Is this a good idea, @hzxa21? It seems like overkill, but we can test it anyway. I guess it could work, since the state does not overlap between executors.

If we use a shared instance instead, we can even use Column Families (https://github.com/EighteenZi/rocksdb_wiki/blob/master/Column-Families.md), which may provide better performance (similar to our compaction groups). Anyway, we do not need to over-optimize for now.

To keep things simple, we can start with one shared rocksdb instance per CN process (consistent with what Flink does) without using column families to see how things go.

As for the approach, I would rather maintain it as a feature on main; we may or may not run a basic CI on it (so it doesn't get forgotten). I don't think the maintenance burden will be too high. And anyway, a feature flag can also be used to conditionally include dependencies.

Maintaining a RocksDB state store in the main branch is okay as long as we don't compile and run it very often. But I think including the RocksDB state store in CI will increase CI running time and cause unnecessary distraction from code development, which hurts developer experience, so it doesn't sound like a good idea. I would prefer that we only compile it when we really need to run tests on it, fixing things on demand if it is broken.

Using the RocksDB backend with EBS v.s. Locally-attached SSD

This is deployment related and won't affect implementation so I am okay with both.

lmatz commented Nov 17, 2022

cause unnecessary distraction from code development, which hurts developer experience

Agree, an independent rocksdb branch is created: https://github.com/risingwavelabs/risingwave/tree/rocksdb

merge behind a feature flag to main, and talk to CI team to ask how to schedule a periodic CI

We can set up a daily/weekly cron GitHub workflow that automatically opens a pull request trying to merge the commits from the main branch into the rocksdb branch.
reference: https://github.com/risingwavelabs/risingwave/blob/rocksdb/ci/workflows/main-cron.yml
reference: https://github.com/marketplace/actions/create-pull-request

@fuyufjh fuyufjh removed this from the release-0.1.14 milestone Nov 21, 2022
@xxchan xxchan closed this as not planned May 19, 2024