Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(sink): Use s3 sink to replace the original snowflake backend implementation #18996

Merged
merged 9 commits into from
Nov 7, 2024

Conversation

wcy-fdu
Copy link
Contributor

@wcy-fdu wcy-fdu commented Oct 18, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

As the backend implementation of snowflake sink is almost the same as file sink, I think it's ok to only keep the frontend part of snowflake sink.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@wcy-fdu wcy-fdu marked this pull request as draft October 18, 2024 06:31
@wcy-fdu wcy-fdu marked this pull request as ready for review October 21, 2024 05:28
@wcy-fdu wcy-fdu requested review from xxhZs and hzxa21 October 21, 2024 05:29
@wcy-fdu
Copy link
Contributor Author

wcy-fdu commented Oct 21, 2024

If we are going to implement sink into snowflake directly, we need to change the backend implementation but keep the frontend part for compatibility. Thus I think the current PR is safe because it is invisible to the user.

The only one concern is that previously the snowflake sink is paid feature, we now implement it using file sink and have to make it non-paid.

@graphite-app graphite-app bot requested a review from a team October 21, 2024 05:59
Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xxhZs is more familiar with how to set up snowflake environment. You may ask him to help for the testing.

@@ -199,12 +200,12 @@ def _table():
test_timestamp,
test_timestamptz
from {_table()} WITH (
connector = 's3',
connector = 'snowflake',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is weird to test snowflake sink in fs sink test. Let's keep the previous codes instead.

e2e_test/s3/fs_sink.py Outdated Show resolved Hide resolved
@@ -897,6 +897,9 @@ static CONNECTORS_COMPATIBLE_FORMATS: LazyLock<HashMap<String, HashMap<Format, V
FileSink::<S3Sink>::SINK_NAME => hashmap!(
Format::Plain => vec![Encode::Parquet, Encode::Json],
),
FileSink::<SnowflakeSink>::SINK_NAME => hashmap!(
Format::Plain => vec![Encode::Parquet, Encode::Json],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this PR, if FORMAT ... ENCODE ... is not specified by user (See the example here), json encode will be used while after this PR, it will be an error. Let's keep the behavior backward compatible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the encoder of the snowflake sink was previously hardcoded as a json encoder, json is the default here. In other words, even without this pr, when the snowflake sink supports other encodings in the future, it will also need to add the FORMAT ... ENCODE ... field. Can we regard this pr as an enhancement to the snowflake sink, that is, changing the encoding it supports from json to json and parquet, so users need to add the FORMAT ... ENCODE ... field.

Let's keep the behavior backward compatible

This means under this pr, we need to give a default value for FileSink's format_desc, at least the default value needs to be used when the sink is snowflake, which I think is a bit strange.

Copy link
Contributor Author

@wcy-fdu wcy-fdu Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to give a default value for FileSink's format_desc when connector is snowflake

There are actually two dimensions of compatibility here.
In the frontend syntax part, we can achieve full compatibility with the previous snowflake sink; but in the implementation part, since the file sink implements batching, the new snowflake sink is still different from the previous one.

It's not hard to implement full compatibility in frontend syntax .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it paid and modify this in a separate PR if needed.

src/connector/src/sink/mod.rs Outdated Show resolved Hide resolved
@wcy-fdu
Copy link
Contributor Author

wcy-fdu commented Oct 21, 2024

@xxhZs is more familiar with how to set up snowflake environment. You may ask him to help for the testing.

Talked with xinhao before, I think we don't have snowflake sink test in CI, as snowflake is not very well integrated. So I chose to use the test logic of File Sink. cc @xxhZs for confirm.

@hzxa21
Copy link
Collaborator

hzxa21 commented Oct 21, 2024

@xxhZs is more familiar with how to set up snowflake environment. You may ask him to help for the testing.

Talked with xinhao before, I think we don't have snowflake sink test in CI, as snowflake is not very well integrated. So I chose to use the test logic of File Sink. cc @xxhZs for confirm.

We have a demo integration test for snowflake sink. We should at least manually run this demo to make sure the steps are still working if we don't have a CI test.

Copy link
Contributor

@xxhZs xxhZs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!I'll be testing to see if it's working these days.

@@ -46,7 +46,6 @@ macro_rules! for_all_features {
{ TestPaid, Paid, "A dummy feature that's only available on paid tier for testing purposes." },
{ TimeTravel, Paid, "Query historical data within the retention period."},
{ GlueSchemaRegistry, Paid, "Use Schema Registry from AWS Glue rather than Confluent." },
{ SnowflakeSink, Paid, "Delivering data to SnowFlake." },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to set sw sink as a paid feature

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, paid or non-paid is a bit strange, I don’t have any preference.

Copy link
Contributor Author

@wcy-fdu wcy-fdu Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think twice, we may keep it paid, because user does not aware of this change. cc @neverchanje @hzxa21 for decision.

Copy link
Contributor

@xxhZs xxhZs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested to work properly, also need to modify the relevant configuration in integration_tests/snowflake_sink

@wcy-fdu wcy-fdu enabled auto-merge November 7, 2024 05:17
@wcy-fdu wcy-fdu added this pull request to the merge queue Nov 7, 2024
Merged via the queue into main with commit e6e1ca9 Nov 7, 2024
28 of 29 checks passed
@wcy-fdu wcy-fdu deleted the wcy/remove_snowflake_sink.pr branch November 7, 2024 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants