RFC: Backward Compatibility of Stream Plan #43

st1page · 2023-01-31T10:47:54Z

No description provided.

rfcs/0043-backward-compatibility-of-stream-plan.md

yuhao-su · 2023-01-31T11:24:44Z

rfcs/0043-backward-compatibility-of-stream-plan.md

+To be discussed: what is the proper format of those comments in proto files and how to check all plan node should have one in CI check?
+
+### Copy-on-Write Style Changes on Stable Plan Node Protobuf
+How to maintain the compatibility of the plan node's protobuf? If developer want to do any changes on a stable plan node, he should add a new plan node protobuf definition. For example, if he want to add a new field in `StreamHashAgg`, he must define a new protobuf struct `StreamHashAggV2` and add the field on that. Notice that there are multi versions protobuf but they can share the same implementation.  


Should we keep all StreamHashAgg versions even after the release?
If we have two versions of StreamHashAggV2 in a release, will CN also handle V1?

Yes, because the any old meta store can store a stream plan with StreamHashAggV1

hmm in this case, will add a version field in StreamHashAgg a valid choice?

it is a special case that we just add a new field and I am not sure if it is general enough

yuhao-su · 2023-01-31T11:24:48Z

rfcs/0043-backward-compatibility-of-stream-plan.md

+
+In https://github.com/risingwavelabs/rfcs/issues/41, we discuss the backward compatibility. And the protobuf structure of stream plan nodes is a special part. 
+- the plan node's structure usually modified more frequently than other protobuf structure such as catalog, especially when we are developing new SQL features and we even do not know how to do it right. The plan node's changes are not only adding some optional field(which can be solved by protobuf) but also of meaning and behaviors of the operator. For example, our state table information of streamAgg having breaking changed in 0.1.13 and in 0.1.16, the source executor is no longer responsible for generating row_id. And we do not confirm the sort and overAgg's format so far.
+- in other databases, the plan node is just used as a communicating protocol between frontend and compute node. So the compute node can only support the latest version's plan node format and reject all the requests with unknown plan node. But our stream plan should be persistent in meta store which means that a compute node must be compatible with all versions of old plans' protobuf format.


But our stream plan should be persistent in meta store which means that a compute node must be compatible with all versions of old plans' protobuf format.

I'm thinking if it is reasonable to force users rebuild the plan when incompatible🤣 Making CN compatible with all versions can be challenging.

that's what I want to do on MV with the nightly SQL features. But it is not good enough. consider some data have been consumed and despaired in the source. how to reuse these state data,

in other databases, the plan node is just used as a communicating protocol between frontend and compute node. So the compute node can only support the latest version's plan node format and reject all the requests with unknown plan node

Our batch can follow the same approach? Maybe we can discuss this issue together in this RFC for completeness and comparison 🤔

If the consideration is only for rolling-update. And we only want to reject old requests, does it mean that we don't even need protobuf compatibility for communication? (Just let gRPC report error for the client...)

One problem is the request from new client to old server. But we can avoid it by updating servers first.

What's on earth the requirements we need to meet for rolling-update? 🥵 cc @hzxa21 @arkbriar for a comment.

In general, the rolling upgrade requires that the cluster finally works and breaks nothing (temporarily unavailable is fine for some services). Ideally, we should consider each version combination to prevent unexpected behaviors. But that would be too much. We could add constraints like the compute must be upgraded after the meta, proposed by @hzxa21 in another discussion. We can add these constraints by either updating the deployment tools (enforce the order) or the kernel (deny of service when mismatch). If upgrading compute nodes requires rebuilding the streaming plan and its states, we should ensure that's worth it and try to automate it, e.g., fixing a buggy operator and the states as well. Otherwise, we should keep it compatible.

I can understand the spirit "breaks nothing (but temporarily unavailable is fine)", but it's still a little big vague to me.

To be more specific, it is understandable that things need to be persisted (like stream plan) should be taken care of. But what confused me is that is it ok to break all other things, i.e., the communication protocol? 🤔 IMHO that would just make the service temporarily unavailable, and will come back to normal after upgrade finishes. Correct me if I'm wrong.

What's on earth the requirements we need to meet for rolling-update? 🥵 cc @hzxa21 @arkbriar for a comment.

IMO, in production, the requirements are:

Be able to rollback: we must provide a mechnism to rollback the upgrade in case there is any performance/corrcetness issue observered by user during rolling upgrade. Note that upgrading to a new version doesn't mean the users have adopted the new features brought by this version. We normally don't need to ensure the ability to rollback after users have adopted the new features because it is hard and not realistic.

Minimize downtime: ideally user wants zero performance/correctness impacts during (and after) upgrade. There are normally two approaches to minimize the downtime:
a. Ensure two versions of codes are compatible during the upgrade period so the number of out-of-service nodes are controllable.
b. Speed up the upgrade and the warm-up/recovery period after the upgrade.

I can understand the spirit "breaks nothing (but temporarily unavailable is fine)", but it's still a little big vague to me.

To be more specific, it is understandable that things need to be persisted (like stream plan) should be taken care of. But what confused me is that is it ok to break all other things, i.e., the communication protocol? 🤔 IMHO that would just make the service temporarily unavailable, and will come back to normal after upgrade finishes. Correct me if I'm wrong.

It all depends on how fast the upgrade is and whether we allow rollback as I mentioned above. Sometimes fast upgrade contradicts with the ability to rollback, especially when breaking changes are introduced. Rolling out breaking changes as fast as possible can minimize unavailbiliy but hurt ability to rollback if we don't maintain compatibility because we won't be able to rollback after upgrade finishes. However, since we are in early production stage and in rapid development, I think it is okay to relax the requirement and introduce breaking changes without considering too much about ability to rollback.

But what confused me is that is it ok to break all other things, i.e., the communication protocol? 🤔 IMHO that would just make the service temporarily unavailable, and will come back to normal after upgrade finishes.

Yes, it's ok in most cases. If you think of upgrading the simplest service, a stateless web service, there will be a time that users will get errors when the old services are shut down one by one, but it recovers quickly, thanks to the LB. This kind of outage is allowed and accepted when upgrading non-availability-critical applications by most users.

Besides, I totally agree with what @hzxa21 said, the ability to rollback and the downtime matter greatly. At least one of these two should be considered carefully if not both can be satisfied.

fuyufjh · 2023-02-01T07:48:55Z

rfcs/0043-backward-compatibility-of-stream-plan.md

+## Design
+
+### Nightly and Stable SQL Features 
+Distinguish the nightly and stable feature when publishing release version. RW will do not ensure compatibility for the streaming jobs with the nightly features in following releases. For example, if we release the "emit on close" as a nightly feature in the release v0.1.17 and user create a mv with that feature on a v0.1.17 cluster. The v0.1.18 and following version's RW can not ensure it can run successfully on the existing streaming jobs. User can drop the MVs with the nightly feature before they upgrade the cluster. For those nightly feature users really what to upgrade, we can write helper scripts too. And the stable features will be tested with new released compute node on old version streaming plans. Also, with the convinced stable feature list, we can test the backward compatibility more easily. 


I prefer the word "experimental" than "nightly" because it's more straightforward to common people.

BTW, when delivering the delta join feature (by Dylan), we had discussed this and decided to mark it as "experimental", especially on the user docs.

xxchan · 2023-02-02T12:16:31Z

rfcs/0043-backward-compatibility-of-stream-plan.md

+
+To be discussed: what is the proper format of those comments in proto files and how to check all plan node should have one in CI check?
+
+### Copy-on-Write Style Changes on Stable Plan Node Protobuf


So basically you mean immutable and versioned protobuf messages? 🤔

BTW we can assume all fields to be required afterwards?

Yes, I think so. And I think our current stream execution expect that in fact(so many unwrap in from_proto).

hzxa21 · 2023-02-09T14:37:27Z

rfcs/0043-backward-compatibility-of-stream-plan.md

+To be discussed: What is the proper format of these comments in the proto files and how to check that all plan nodes have one in CI checks?
+
+### Copy-on-Write Style Changes on Stable Plan Node Protobuf
+How to maintain compatibility of the plan node's protobuf? If a developer wants to make changes to a stable plan node, he should add a new plan node protobuf definition. For example, if they want to add a new field in `StreamHashAgg`, they must define a new protobuf struct `StreamHashAggV2` and add the field to it. Note that there can be multiple versions of protobuf, but they can share the same implementation.


If the only change allowed is to add new fields, we can just add a version field in the protobuf message and add new fields to the same message defintion instead of creating multiple protobuf definitions.

It’s discussed here #43 (comment)

adding field is just one of the cases🤔 Any change is allowed

st1page added 2 commits January 31, 2023 18:47

add rfc

b8ecade

rename pr id

548cc72

st1page mentioned this pull request Jan 31, 2023

Discussion: consider backward compatibility #41

Open

yuhao-su reviewed Jan 31, 2023

View reviewed changes

fix typo

cc5dbd4

fuyufjh reviewed Feb 1, 2023

View reviewed changes

xxchan reviewed Feb 2, 2023

View reviewed changes

Update 0043-backward-compatibility-of-stream-plan.md (#47)

b0bc692

hzxa21 reviewed Feb 9, 2023

View reviewed changes

xxchan mentioned this pull request Dec 7, 2023

fix: misuse of props and options in CREATE SOURCE risingwavelabs/risingwave#13762

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Backward Compatibility of Stream Plan #43

RFC: Backward Compatibility of Stream Plan #43

st1page commented Jan 31, 2023

yuhao-su Jan 31, 2023

st1page Jan 31, 2023

yuhao-su Jan 31, 2023

st1page Feb 2, 2023

yuhao-su Jan 31, 2023

st1page Jan 31, 2023

xxchan Feb 2, 2023

xxchan Feb 2, 2023

xxchan Feb 9, 2023

arkbriar Feb 9, 2023

xxchan Feb 9, 2023 •

edited

Loading

hzxa21 Feb 9, 2023

hzxa21 Feb 9, 2023

arkbriar Feb 10, 2023

fuyufjh Feb 1, 2023 •

edited

Loading

xxchan Feb 2, 2023 •

edited

Loading

xxchan Feb 2, 2023

st1page Feb 5, 2023

hzxa21 Feb 9, 2023

xxchan Feb 9, 2023


		To be discussed: what is the proper format of those comments in proto files and how to check all plan node should have one in CI check?

		### Copy-on-Write Style Changes on Stable Plan Node Protobuf

RFC: Backward Compatibility of Stream Plan #43

Are you sure you want to change the base?

RFC: Backward Compatibility of Stream Plan #43

Conversation

st1page commented Jan 31, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xxchan Feb 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fuyufjh Feb 1, 2023 • edited Loading

Choose a reason for hiding this comment

xxchan Feb 2, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xxchan Feb 9, 2023 •

edited

Loading

fuyufjh Feb 1, 2023 •

edited

Loading

xxchan Feb 2, 2023 •

edited

Loading