fix: misuse of props and options in CREATE SOURCE
#13762
Conversation
Codecov Report

@@ Coverage Diff @@
## main #13762 +/- ##
==========================================
- Coverage 68.05% 67.96% -0.10%
==========================================
  Files 1536 1537 +1
  Lines 265406 265680 +274
==========================================
- Hits 180628 180569 -59
- Misses 84778 85111 +333

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
let properties = WithOptions::new(prost.properties.clone()).into_inner();
let options = WithOptions::new(prost.options.clone()).into_inner();
We might need to rename the type alias now... 🥵
Besides, properties and options are not meaningful names. Can we change them to sth like with_options and encode_format_options / row_options?
Also in all other places, including proto.
How about with_properties and row_options?
Most of the fields in the options are related to parser behavior, so I prefer encode_option or parser_option.
I have been calling them format options for sink 😂
parser_options
Wait a minute, the syntax seems to be FORMAT ... ENCODE ... (...), so at least it should be format_encode_options 🤣
From the syntax, the options indeed look like they are for ENCODE 🤔.
From the parser's definition it seems to be called row_options, so I guess this is what it was meant to be when it was designed (and not only for ENCODE). So it might be worth asking @st1page what's the best name for it.
risingwave/src/sqlparser/src/ast/statement.rs
Lines 338 to 342 in b5c3f5d
pub struct ConnectorSchema {
    pub format: Format,
    pub row_encode: Encode,
    pub row_options: Vec<SqlOption>,
}
Most of the fields in the options are related to parser behavior, so I prefer encode_option or parser_option.

I'm wondering whether parser_options will be confusing to users. We need to consider this in the error messages (maybe also the syntax reference, but currently it also doesn't have a name).
c.c. @tabVersion
I think it is the historical reason... format_encode_options LGTM.
Btw, we might need some options in the KEY ENCODE in the future.
format upsert
encode avro (
schema.registry = 'http://message_queue:8081/',
schema.registry.name.strategy = 'typo')
key encode avro (
schema.registry = 'http://message_queue:8081/',
schema.registry.name.strategy = 'typo')
)
or maybe this syntax
format upsert
encode avro (
schema.registry = 'http://message_queue:8081/',
schema.registry.name.strategy = 'typo')
INCLUDE KEY AS kafka_key ENCODE avro (
schema.registry = 'http://message_queue:8081/',
schema.registry.name.strategy = 'typo')
)
Considering the unknown part of the other options... I think format_encode_options is the most specific and clear name.
I think it is the historical reason... format_encode_options LGTM.
Btw, we might need some options in the KEY ENCODE in the future.
+1 for format_encode_options
For KEY ENCODE, I think it is OK to add some fields in the options, because the clause is optional in SQL and we don't have expectations for it as strong as for the value part.
Thanks for the work. Can you explain a little about what's changed in the code? Especially:
- Whether the separated options are used
- Whether and how backwards compatibility is achieved
with_properties: &HashMap<String, String>,
row_options: &mut BTreeMap<String, String>,
The difference between the two types might be confusing (especially the mut). Better to add some comments.
We use &mut for row_options because in the bind_columns_from_source function the fields of the options are consumed (i.e. row_options.remove(&key)), and we check options.is_empty() in the end.
https://github.com/risingwavelabs/risingwave/blob/main/src/frontend/src/handler/create_source.rs#L572-L584
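Roughly, the pattern looks like this (a minimal sketch with a made-up key and error handling, not the actual code in create_source.rs):

use std::collections::BTreeMap;

/// Minimal sketch of the consume-and-check pattern: every handler removes the
/// keys it understands from the map, and whatever is left at the end is
/// reported as unrecognized.
fn bind_format_encode_options(
    format_encode_options: &mut BTreeMap<String, String>,
) -> Result<Option<String>, String> {
    // A handler consumes the key it cares about.
    let schema_registry = format_encode_options.remove("schema.registry");

    // After all handlers have run, leftover keys are unknown options.
    if !format_encode_options.is_empty() {
        let unknown: Vec<_> = format_encode_options.keys().cloned().collect();
        return Err(format!("unrecognized FORMAT ... ENCODE options: {unknown:?}"));
    }
    Ok(schema_registry)
}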
The difference between the two types might be confusing (especially the mut).
We hope to consume both configs here, but it seems impossible for connector props. Let's keep format_encode_options safe from unknown fields.
Besides, how is #12974 related?
The options are separated and persisted in meta, but are not further used for now. But as @tabVersion mentioned, there is a plan to support options in KEY ENCODE.
There is no convenient or absolutely correct way to do this. I just carefully went through the code from frontend to meta and checked the docs to determine whether a key is supposed to be in the options rather than the props, and to ensure the options are consumed correctly.
let properties = WithOptions::new(prost.properties.clone()).into_inner();
let options = WithOptions::new(prost.options.clone()).into_inner();
BTW, should we consider backward compatibility here?
For now, legacy code that tries to find keys of format_encode_options in with_properties has been identified and corrected, so unmerging format_encode_options from with_properties will not result in key-not-found-like errors. Is this the backward compatibility you suggested?
IMO the most important thing is: already-created sources can still work. It seems the compute node doesn't use format_encode_options at all and thus is not broken? (I'm not very sure. Correct me if I'm wrong.)
Besides that, there is the semantics change on the frontend. I'm still not sure whether anything has changed (i.e., could work previously but can't work now, and vice versa). But actually that's acceptable to me...
Glad to hear others' thoughts about detailed backward compatibility concerns. @BugenZhao zkss
already-created sources can still work.
Yes!
It seems the compute node doesn't use format_encode_options at all and thus is not broken?
Also the meta service part: does it handle the case where the new fields do not exist?
TBH, I'm not that familiar with the connector part. 🤣 Practice is the sole criterion for testing truth.
I just added a backwards-compat test for invalid WITH options #13824. Can we or do we need to add similar stuff for this PR?
IMO the most important thing is: already-created sources can still work.
Do you mean that, for example, a user creates a source using RW 1.4 and then upgrades to RW 1.5, so the code in RW 1.5 should recognize it as a legacy source and try to find the option keys in the properties? Is this the hot-update issue you are concerned about?
Yes
Got it. We do need to add more code to handle this issue.
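One possible shape for that handling, as a hedged sketch only (the helper name and key list are made up for illustration): if a source persisted by an older version comes with an empty options map, the known format/encode keys could be copied back out of the merged WITH properties.

use std::collections::{BTreeMap, HashMap};

/// Hypothetical sketch for legacy sources: older versions merged the
/// FORMAT ... ENCODE keys into the WITH properties, so if the separated
/// options map is empty we look the known keys up in the legacy properties.
fn recover_legacy_format_encode_options(
    with_properties: &HashMap<String, String>,
    format_encode_options: &mut BTreeMap<String, String>,
) {
    // Illustrative key list only; the real set would come from the handlers.
    const LEGACY_KEYS: &[&str] = &["schema.registry", "schema.location", "message"];

    if format_encode_options.is_empty() {
        for key in LEGACY_KEYS {
            if let Some(value) = with_properties.get(*key) {
                format_encode_options.insert((*key).to_string(), value.clone());
            }
        }
    }
}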
FYI 😇 this is my solution for rejecting unknown WITH options:
2e39b51#diff-a80dd5cfb3d2b493a15182ad5c447741b10820fa16b8eea2719b9c0ae6f003a4R415-R418
fix the doc mismatch: https://github.com/risingwavelabs/risingwave-docs/pull/1627
generally LGTM, good work!
)))
}

fn consume_aws_config_from_options(format_encode_options: &mut BTreeMap<String, String>) {
- fn consume_aws_config_from_options(format_encode_options: &mut BTreeMap<String, String>) {
+ fn remove_aws_config_from_options(format_encode_options: &mut BTreeMap<String, String>) {
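For reference, a hedged sketch of what a helper with that name would be expected to do (the AWS key list is assumed from the discussion below, not taken from the PR, and the real function presumably also reads these values into an AWS config before dropping them):

use std::collections::BTreeMap;

/// Sketch only: drop the AWS keys used for loading the schema definition from
/// S3 out of the format/encode options, so they are not reported as
/// unrecognized later. The key list here is illustrative.
fn remove_aws_config_from_options(format_encode_options: &mut BTreeMap<String, String>) {
    for key in ["region", "endpoint_url", "access_key", "secret_key"] {
        format_encode_options.remove(key);
    }
}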
The rest LGTM
PREFIXES.iter().any(|prefix| key.starts_with(prefix))
    || (key == "endpoint" && !connector.eq_ignore_ascii_case(KINESIS_CONNECTOR))
endpoint is also used (and only used) in the Kinesis properties, so we move endpoint to format_encode_options if the connector is not Kinesis.
pub struct KinesisCommon {
    #[serde(rename = "stream", alias = "kinesis.stream.name")]
    pub stream_name: String,
    #[serde(rename = "aws.region", alias = "kinesis.stream.region")]
    pub stream_region: String,
    #[serde(rename = "endpoint", alias = "kinesis.endpoint")]
    pub endpoint: Option<String>,
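To make the routing rule concrete, here is a small self-contained sketch (the prefix list is shortened and only assumed for illustration; it is not the actual PREFIXES constant):

const KINESIS_CONNECTOR: &str = "kinesis";

/// Illustrative sketch of the quoted predicate: a key found in WITH is treated
/// as belonging to format_encode_options if it matches a known prefix, or if
/// it is `endpoint` and the connector is not Kinesis (where `endpoint` is a
/// genuine connector property).
fn belongs_to_format_encode_options(key: &str, connector: &str) -> bool {
    // Shortened, assumed prefix list for the example.
    const PREFIXES: &[&str] = &["schema.registry", "schema.location"];
    PREFIXES.iter().any(|prefix| key.starts_with(prefix))
        || (key == "endpoint" && !connector.eq_ignore_ascii_case(KINESIS_CONNECTOR))
}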
- I have no experience with the AWS config names used for connecting to Kinesis (endpoint / kinesis.endpoint) vs. the names used for reading a schema definition from S3 (endpoint_url). Would rely on @tabVersion, who has more experience, to make sure the names are listed correctly, and with proper tests.
- My previous comment used the word copy instead of move. I am not sure if some names in the legacy with_properties are actually used by both with and format encode. Moving them away from with_properties may break things. Again, whether this is a real issue depends on the meaning / usage of specific names, which has been obscure for a long time every time we talk about with properties. Just bringing the idea up and would rely on your judgement.
Studied serde rename and alias, and it turns out the only name shared by both is endpoint, which would not be moved in the case of Kinesis. And existing Kinesis connectors are not using format encode from S3 (or we would need to copy endpoint even when it is Kinesis).
All other names for reading format encode from S3 have already been listed above properly.
Just for clarity, this is a concrete example of the current situation:
create [source | table] t with (
connector = 'kinesis',
stream = 't',
aws.region = 'r0',
endpoint = 'e0',
aws.credentials.access_key_id = 'ak0',
aws.credentials.secret_access_key = 'sk0'
) format plain encode avro (
schema.location = 's3://foo/bar',
region = 'r1',
access_key = 'ak1',
secret_key = 'sk1'
);
It is unfortunate we seem to have no test or doc example about reading schema definition from s3.
Rest LGTM.
Built an image with
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Resolve #13365 #13367.
Properties in the WITH clause and options in FORMAT ... ENCODE ... are misused: options are merged into the props, and parameters defined in the options are then looked up within the props. This may result in the following consequences (see WITH options #13654). Here are some keys that are supposed to be in the options instead of the props:
Also found a misuse in the Kafka doc: https://docs.risingwave.com/docs/current/ingest-from-kafka/#connector-parameters. schema.registry.username and schema.registry.password should be defined in FORMAT ... ENCODE ..., not in WITH. cc @neverchanje.
Checklist
./risedev check (or alias, ./risedev c)

Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.