Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: consider backward compatibility #41

Open
xxchan opened this issue Jan 27, 2023 · 15 comments
Open

Discussion: consider backward compatibility #41

xxchan opened this issue Jan 27, 2023 · 15 comments

Comments

@xxchan
Copy link
Member

xxchan commented Jan 27, 2023

I think we should start thinking backward compatibility during development given that we have users.
https://risingwave-labs.slack.com/archives/C03L8DNMDDW/p1674026516400229?thread_ts=1674020631.181329&cid=C03L8DNMDDW

previous breaking-change https://github.com/risingwavelabs/risingwave/issues?q=label%3Abreaking-change

There are many aspects to consider for compatibility

However

Maintaining compatibility is hard and may not worth it. If there is a need to upgrade version for the already deployed cases, we can write a simple tool to migrate old metadata to the new one on demand.
risingwavelabs/risingwave#7281 (comment)

So I just create an issue first to collect ideas :)

@TennyZhuang
Copy link
Contributor

TennyZhuang commented Jan 28, 2023

That's a good question, and I've thought a lot about compatibility, so let's take it case by case.

SQL syntax

After our first stable release, it's better to always keep SQL compatibility. Generally speaking, this won't be too hard, since most of our SQL is just following how Postgres behaves.

We can mark a stabilized syntax as deprecated, but never have to remove it. I guess it doesn't take much effort.

For very rare design flaws, used by few users, break changes are allowed. We should discuss them case by case.

Example: risingwavelabs/risingwave#4503 doesn't break anything, and we may still find the better property naming convention in the future.

Data format in storage

SST format

All objects stored in our storage backend are in our private SST format. There is an SstableMeta in the header of each SST file, with 4 bytes of version stored in it. I think it's flexible enough for further break changes.

Row Encoding

The SST data blocks are organized as rows of key/value pairs.

We use memcomparable encoding for the key and value encoding for the value. There's a lot of design space, and we've even made several break changes now. After we stabilize, we need to ensure that any old-version key/value pairs can be decoded correctly.

Fortunately, we can ensure that all key/value pairs in the same SST file have the same encoding version, so we don't need to store the version per row.

IMHO, a good practice to maintain encoding compatibility is to use separate crates and versions for the row encodings. Based on the power of cargo, we can use different versions of the same crate in one project.

For example:

memcomparable_0_1 = { package = "memcomparable", version = "0.1.2" }
memcomparable_0_2 = { package = "memcomparable", version = "0.2.3" }
memcomparable_1 = { package = "memcomparable", version = "1.4" }
value_encoding_0_1 = { package = "value_encoding", version = "0.1" }
value_encoding_1 = { package = "value_encoding", version = "1.2" }
value_encoding_2 = { package = "value_encoding", version = "2.1" }

And dispatch the implementation statically while decoding the SSTs.

Pseudocode:

fn decode<const SST_VERSION: usize>(buf: &Bytes) -> (Key, Value) {
  match SST_VERSION {
    1 => (memcomparable_0_1.decode(buf), value_encoding_0_1.decode(buf))
    2 => (memcomparable_0_2.decode(buf), value_encoding_1.decode(buf))
    3 => (memcomparable_0_2.decode(buf), value_encoding_2.decode(buf))
    4 => (memcomparable_1.decode(buf), value_encoding_2.decode(buf)),
    _ => unreachable!("corrupted data")
  }
}

The encoding crate version should follow the semver rules, so that we can release some patches (e.g. bump deps) to the older versions.

Upgrade During Compacting

When a new storage version (both SST and row encoding) is released, no data is changed until the SST is compacted. Compacting will rebuild the SST file so that we can upgrade the version in the meantime.

Compaction always reconstructs an entire SST file, so we can ensure that all key/value pairs in the same SST file have the same encoding version.

Data Format in Meta

Data in the meta is organized as a tree structure, and currently the major meta backend is etcd. The meta values are encoded by protobuf encoding.

If a change only adds/reduces fields in the meta, then the protobuf can ensure that nothing is broken.

If a change must reorganize the tree structure itself, then a break change must be made.

Since the meta-store data is never too large, a migration script is preferred for such a break change.

If we implement a SQL-based metastore later, we can use some SQL like migrations/5_to_6.sql to do the migration, but currently a Rust snippet is recommended.

Protocols between nodes

In general, the protobuf can handle the situation well enough.

To make things easier, we can add a version field to each request, so that an older compute node can always reject some DDL operations from a newer frontend.

For CN-CN communication (mostly exchange), protobuf compatibility should be handled carefully by developers.

CLI args/config file

No ideas, we should create a good enough configuration framework before the first release.

@arkbriar
Copy link

In production, there will always be cases in which we upgrade a RisingWave cluster. The problem is that upgrading isn't an atomic process, meaning multiple node versions can exist. One simple solution is to fully stop the cluster and start the newer version afterward, but the connections from our user side will be disconnected. Apparently, it hurts user experience and thus shouldn't be the first choice in most cases.

Compatibility of the communication protocols should be carefully kept and tested for those promised compatible versions, i.e., versions with the same major (and minor) numbers in semantic versions.

For the CLI args and config parameters, I'd like to propose that the unrecognized parameters should trigger warnings instead of panics. This makes the cloud easier to manage the config templates. Also, critical args should be carefully handled because the operator can not distinguish from versions. If the operator also applies the arg changes (e.g., supports a new argument or deprecates an old one), the older RisingWave versions won't be deployable anymore. That has already happened several times.

@wangrunji0408
Copy link

What about testing. Is it possible to setup some kind of compatibility test in CI? 👀

@TennyZhuang
Copy link
Contributor

TennyZhuang commented Jan 28, 2023

In production, there will always be cases in which we upgrade a RisingWave cluster. The problem is that upgrading isn't an atomic process, meaning multiple node versions can exist. One simple solution is to fully stop the cluster and start the newer version afterward, but the connections from our user side will be disconnected. Apparently, it hurts user experience and thus shouldn't be the first choice in most cases.

I'd prefer always rolling-update the whole cluster, with specified orders. Two versions of nodes should only co-exist in several seconds to minutes, so we can reject some older requests directly.

@TennyZhuang
Copy link
Contributor

What about testing. Is it possible to setup some kind of compatibility test in CI? 👀

Not ready, we haven't implemented all mechanisms, just some thoughts.

@arkbriar
Copy link

I'd prefer always rolling-update the whole cluster, with specified orders. Two versions of nodes should only co-exist in several seconds to minutes, so we can reject some older requests directly.

I agree, but it could take a long time to upgrade from a really old version.

@st1page

This comment was marked as resolved.

@neverchanje
Copy link

neverchanje commented Jan 30, 2023

#41 (comment)

What @TennyZhuang mentioned is absolutely essential that we should ensure the extensibility of our major components. But the reality is, even some trivial configuration changes could break the compatibility: risingwavelabs/risingwave#7527. Without actually testing out an upgrade, we'll never know what could break. @arkbriar @sumittal Let's include upgrade testing in our release process.

@arkbriar
Copy link

@neverchanje I've commented under risingwavelabs/risingwave#7530 (comment) and proposed to keep the backward compatibility for at least one patch version so that the change won't block any others at the exact time it is made to make others life easier.

@huangjw806
Copy link

Let's include upgrade testing in our release process.

Of course we can add an upgrade test during the release process, but what if the test fails? Whether to roll back the kernel or upgrade the cloud, such break changes will cause a lot of trouble, and even affect customers, because the cloud operator is public in eks, and not all problems can be solved by upgrading the operator. We should formulate more detailed compatibility rules.

@fuyufjh

This comment was marked as resolved.

@neverchanje
Copy link

but what if the test fails?

@huangjw806 We shouldn't release any stable versions with upgradability issues, whether or not it's a patch release or a major release. If the migration tools are not ready, we should revert the changes that cause the issues.

@xxchan

This comment was marked as resolved.

@st1page
Copy link
Contributor

st1page commented Jan 31, 2023

draft a RFC to discuss how to maintain and manage the Backward Compatibility of the stream plan and SQL features. need more discussion #43

@xxchan
Copy link
Member Author

xxchan commented Feb 10, 2023

#43 also contains valuable discussions about the requirements for rolling upgrade.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants