RFC: Backward Compatibility of Stream Plan #43

---
feature: Backward Compatibility of Stream Plan
authors:
- "st1page"
start_date: "2023/1/31"
---

# Backward Compatibility of Stream Plan

## Summary

- Distinguish nightly and stable SQL features and stream plan node protobuf definitions.
- Use a copy-on-write style when changing stable stream plan node protobuf definitions.

## Motivation

In https://github.com/risingwavelabs/rfcs/issues/41, we discussed backward compatibility in general; the protobuf structure of stream plan nodes is a special case.
- A plan node's structure is usually modified more frequently than other protobuf structures such as the catalog, especially while we are developing new SQL features and do not yet know the right design. Changes to plan nodes are not only additions of optional fields (which protobuf itself can handle) but also changes to the meaning and behavior of the operator. For example, the state table information of StreamAgg was changed in a breaking way in 0.1.13, and since 0.1.16 the source executor is no longer responsible for generating row_id. We also have not finalized the formats of Sort and OverAgg so far.
- In other databases, the plan node is only used as a communication protocol between the frontend and the compute node, so the compute node can support only the latest plan node format and reject all requests with unknown plan nodes. But our stream plans are persisted in the meta store, which means a compute node must be compatible with all versions of old plans' protobuf format.

> But our stream plan should be persisted in the meta store, which means that a compute node must be compatible with all versions of old plans' protobuf format.

I'm wondering whether it is reasonable to force users to rebuild the plan when incompatible 🤣 Making the CN compatible with all versions can be challenging.

Contributor Author:

That's what I want to do for MVs with nightly SQL features, but it is not good enough. Consider that some data has already been consumed and has disappeared from the source; how do we reuse that state data?

Member:

> In other databases, the plan node is only used as a communication protocol between the frontend and the compute node, so the compute node can support only the latest plan node format and reject all requests with unknown plan nodes.

Could our batch plans follow the same approach? Maybe we can discuss this issue together in this RFC for completeness and comparison 🤔

Member:

If the consideration is only for rolling updates, and we only want to reject old requests, does it mean that we don't even need protobuf compatibility for communication? (Just let gRPC report an error to the client...)

One problem is a request from a new client to an old server, but we can avoid it by updating the servers first.

Member:

What on earth are the requirements we need to meet for rolling updates? 🥵 cc @hzxa21 @arkbriar for a comment.


In general, a rolling upgrade requires that the cluster eventually works and breaks nothing (being temporarily unavailable is fine for some services). Ideally, we should consider each version combination to prevent unexpected behaviors, but that would be too much. We could add constraints such as "the compute nodes must be upgraded after the meta node", as proposed by @hzxa21 in another discussion. We can enforce these constraints by either updating the deployment tools (enforcing the order) or the kernel (denying service on mismatch). If upgrading compute nodes requires rebuilding the streaming plan and its states, we should ensure that it's worth it and try to automate it, e.g., fixing a buggy operator and its states as well. Otherwise, we should keep it compatible.

@xxchan (Member), Feb 9, 2023:

I can understand the spirit of "breaks nothing (but temporarily unavailable is fine)", but it's still a little bit vague to me.

To be more specific, it is understandable that things that need to be persisted (like the stream plan) should be taken care of. But what confuses me is: is it OK to break all other things, i.e., the communication protocol? 🤔 IMHO that would just make the service temporarily unavailable, and it will come back to normal after the upgrade finishes. Correct me if I'm wrong.


> What on earth are the requirements we need to meet for rolling updates? 🥵 cc @hzxa21 @arkbriar for a comment.

IMO, in production, the requirements are:

1. Be able to roll back: we must provide a mechanism to roll back the upgrade in case any performance/correctness issue is observed by the user during the rolling upgrade. Note that upgrading to a new version doesn't mean the users have adopted the new features brought by this version. We normally don't need to ensure the ability to roll back after users have adopted the new features, because that is hard and not realistic.
2. Minimize downtime: ideally users want zero performance/correctness impact during (and after) the upgrade. There are normally two approaches to minimize the downtime:
   a. Ensure the two versions of code are compatible during the upgrade period, so that the number of out-of-service nodes is controllable.
   b. Speed up the upgrade and the warm-up/recovery period after the upgrade.


> I can understand the spirit of "breaks nothing (but temporarily unavailable is fine)", but it's still a little bit vague to me.
>
> To be more specific, it is understandable that things that need to be persisted (like the stream plan) should be taken care of. But what confuses me is: is it OK to break all other things, i.e., the communication protocol? 🤔 IMHO that would just make the service temporarily unavailable, and it will come back to normal after the upgrade finishes. Correct me if I'm wrong.

It all depends on how fast the upgrade is and whether we allow rollback, as I mentioned above. Sometimes a fast upgrade contradicts the ability to roll back, especially when breaking changes are introduced. Rolling out breaking changes as fast as possible can minimize unavailability, but it hurts the ability to roll back if we don't maintain compatibility, because we won't be able to roll back after the upgrade finishes. However, since we are at an early production stage and in rapid development, I think it is okay to relax the requirement and introduce breaking changes without worrying too much about the ability to roll back.


> But what confuses me is: is it OK to break all other things, i.e., the communication protocol? 🤔 IMHO that would just make the service temporarily unavailable, and it will come back to normal after the upgrade finishes.

Yes, it's OK in most cases. If you think of upgrading the simplest kind of service, a stateless web service, there will be a period when users get errors as the old instances are shut down one by one, but it recovers quickly thanks to the LB. This kind of outage is allowed and accepted by most users when upgrading non-availability-critical applications.

Besides, I totally agree with what @hzxa21 said: the ability to roll back and the downtime matter greatly. At least one of these two should be considered carefully if not both can be satisfied.


In conclusion, we need to find a way to achieve a balance between rapid development and backward compatibility, especially for stream plan nodes.

## Design

### Nightly and Stable SQL Features
Distinguish between nightly and stable features when publishing a release. RW will not ensure compatibility in later releases for streaming jobs that use nightly features. For example, if we release "emit on close" as a nightly feature in v0.1.17 and a user creates an MV with that feature on a v0.1.17 cluster, v0.1.18 and later versions of RW cannot guarantee that the existing streaming job will keep running. Users can drop the MVs with the nightly feature before they upgrade the cluster, and for nightly-feature jobs that users really want to carry over, we can also provide helper scripts. Stable features will be tested by running the newly released compute node on old-version streaming plans. Also, with a confirmed list of stable features, we can test backward compatibility more easily.

@fuyufjh (Member), Feb 1, 2023:

I prefer the word "experimental" over "nightly" because it's more straightforward to most people.

BTW, when delivering the delta join feature (by Dylan), we discussed this and decided to mark it as "experimental", especially in the user docs.


### Nightly and Stable Stream Plan Node
How do we know whether a SQL feature is stable? Developers should write a compatibility annotation as a comment on the protobuf struct of every stream plan node (like annotations in Java). The annotation contains entries such as "nightly v0.1.14", "stable v0.1.15", or "deprecated v0.1.16". A plan node starts with a nightly annotation. When developers are confident that the plan node struct is stable enough, a stable annotation should be added to the protobuf struct's comments. When developers confirm that the frontend will no longer generate the plan node, a deprecated annotation should be added. A SQL feature is stable when all the stream plan nodes generated for it by any version's optimizer are stable.
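
As a rough sketch of what such an annotation might look like in a proto file (the message name, fields, and comment syntax here are all hypothetical; the exact format is one of the open questions below):

```protobuf
syntax = "proto3";

package stream_plan;

// Hypothetical compatibility annotation (exact format to be decided):
// compat: nightly since v0.1.17
// It would later become e.g. "compat: stable since v0.1.18" once the struct is
// frozen, and "compat: deprecated since v0.1.19" once the frontend stops
// generating this node.
message EowcSortNode {
  // Illustrative fields only; not the real definition.
  uint32 state_table_id = 1;
  uint32 sort_column_index = 2;
}
```

A CI check could then parse these comments and fail if any stream plan node message lacks an annotation, or if a message already marked stable is modified in place.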

To be discussed: what is the proper format of these comments in the proto files, and how do we check in CI that every plan node has one?

### Copy-on-Write Style Changes on Stable Plan Node Protobuf
@xxchan (Member), Feb 2, 2023:

So basically you mean immutable and versioned protobuf messages? 🤔

Member:

BTW, can we assume all fields to be required afterwards?

Contributor Author:

Yes, I think so. And I think our current stream execution in fact already expects that (so many unwraps in from_proto).

How do we maintain the compatibility of the plan node's protobuf? If a developer wants to make any change to a stable plan node, they should add a new plan node protobuf definition. For example, to add a new field to `StreamHashAgg`, they must define a new protobuf struct `StreamHashAggV2` and add the field there. Note that although there are then multiple protobuf versions, they can share the same implementation.
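
A minimal sketch of this copy-on-write style (the field names and version numbers below are made up for illustration; the real `StreamHashAgg` definition has more fields):

```protobuf
syntax = "proto3";

package stream_plan;

// compat: stable since v0.1.15 -- frozen; old persisted plans may still carry this node.
message HashAggNode {
  repeated uint32 group_key = 1;
  repeated uint32 agg_call_state_table_ids = 2;
}

// compat: nightly since v0.1.17 -- a copy of HashAggNode plus the newly added field.
message HashAggNodeV2 {
  repeated uint32 group_key = 1;
  repeated uint32 agg_call_state_table_ids = 2;
  bool emit_on_window_close = 3; // the new field that motivated the copy
}
```

On the compute node, both versions can be converted into the same internal representation, e.g., a `HashAggNode` is upgraded to the `V2` shape with a default value for the missing field before the executor is built, so only one executor implementation needs to be maintained.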


Should we keep all StreamHashAgg versions even after the release?
If a release contains StreamHashAggV2, will the CN also handle V1?

Contributor Author:

Yes, because any old meta store can still contain a stream plan with StreamHashAggV1.


Hmm, in this case, would adding a version field to StreamHashAgg be a valid choice?

Contributor Author:

It is a special case where we just add a new field, and I am not sure whether it is general enough.


Why make it so complicated? Why not just rely on protobuf's built-in compatibility? To achieve compatibility, protobuf effectively makes every field optional. When a protobuf struct is used as an RPC interface, the caller provides some combination of those optional fields, and the callee should try its best to handle every meaningful combination or return an error. Based on the following facts, I think copy-on-write style changes are better (see the sketch after this list).
- The changes to a stable plan node are limited: we can still make breaking changes arbitrarily within the same release, and a stable plan node will not be modified very often afterwards, so there will not be too many duplicated plan node definitions.
- Here, "return an error" is unacceptable for us, because if we cannot resolve the stored streaming plan, the cluster cannot boot up at all. So we must make sure that the compute node can accept any combination of fields that appeared in historical versions. Storing each of these combinations as a separate versioned plan node definition helps us maintain the compatibility; otherwise the knowledge lives only in the compute node's code and is easily forgotten.
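
A sketch of what "storing all these combinations" could look like at the top level, reusing the hypothetical `HashAggNode` / `HashAggNodeV2` messages from the earlier sketch and assuming a `oneof`-style node body similar to the existing stream plan proto (variant names and tags are illustrative):

```protobuf
syntax = "proto3";

package stream_plan;

// Every historical version stays registered here, so a plan persisted by any
// past release can still be decoded and dispatched to the shared executor code.
message StreamNode {
  oneof node_body {
    HashAggNode hash_agg = 1;      // stable since v0.1.15, kept while old plans may exist
    HashAggNodeV2 hash_agg_v2 = 2; // added later with the new field
    // ... other plan nodes ...
  }
}
```

Dropping a variant from this `oneof` would only be safe once the corresponding plan node is marked deprecated and no supported meta store snapshot can still contain it.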