Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically apply schema changes on sinks when the parent table changes schema. #16839

Open
Tracked by #16834
neverchanje opened this issue May 20, 2024 · 6 comments
Open
Tracked by #16834
Assignees

Comments

@neverchanje
Copy link
Contributor

neverchanje commented May 20, 2024

See #16834 for the background.

This proposal aims to enable the automatic synchronization of schema changes from source tables to destination tables, such as a StarRocks table, as illustrated below.

CREATE TABLE users (...);

CREATE SINK sink_users FROM users WITH (
  connector = 'starrocks',
  starrocks.table = 'sr_users',
  auto.schema.evolution = 'true'
);

-- The SR table `sr_users` will adapt this change.
ALTER TABLE users ADD COLUMN address VARCHAR;

When auto.schema.evolution is set to true, the sink_users sink will handle schema change events emitted from the users table by executing ALTER ADD COLUMN on sr_users in StarRocks.

Specifically,

  • When the schema change is ADD COLUMN, the sink will execute ALTER TABLE ADD COLUMN.
  • When the schema change is DROP COLUMN, the sink will update its local metadata by tagging the column as "dropped", but not removing it. Instead, it fills the column with null.
  • Other events like type changes will be omitted for now, since we don't yet support them.

When auto.schema.evolution is enabled, the downstream table is expected to be fully managed by RW. We need to warn users to be cautious when executing manual schema changes. From this viewpoint, we can also let auto.schema.evolution auto-create the table (see #15586) if absent, thus offering users a fully-automated ELT experience.

It should be noted that the E2E auto schema change is not allowed for SINK AS SELECT or SINK FROM <mv>, given that queries are not able to change schema. It's only available to SINK FROM <table> or <source>.

@hzxa21
Copy link
Collaborator

hzxa21 commented May 22, 2024

Generally I think this is doable but let's talk about one specific case: not all databases support instant ALTER TABLE ADD/DROP COLUMN like we do. If ALTER TABLE ADD/DROP COLUMN takes a long time (due to locking / large data volume / ...) in the external sink, is auto schema evolution still beneficial or should we automatically pause/resume the stream for schema evolution?

@neverchanje
Copy link
Contributor Author

neverchanje commented May 22, 2024

We don't have to provide this option for databases where ALTER TABLE is time-consuming.
Let's support it on a connector-by-connector basis. In the worst case, users can simply disable the option if the latency is not desired.

@hzxa21
Copy link
Collaborator

hzxa21 commented May 22, 2024

Generally I think this is doable but let's talk about one specific case: not all databases support instant ALTER TABLE ADD/DROP COLUMN like we do. If ALTER TABLE ADD/DROP COLUMN takes a long time (due to locking / large data volume / ...) in the external sink, is auto schema evolution still beneficial or should we automatically pause/resume the stream for schema evolution?

Discussed offline, auto schema change can be supported case by case. For example, if the downstream system is slow in ALTER statement, we won't support auto schema evolution on it. Note that for DROP COLUMN, we don't necessarily need to issue a ALTER DROP COLUMN statement and instead we can just sink NULL for the dropped column.

@fuyufjh
Copy link
Member

fuyufjh commented May 31, 2024

For better understanding, this task can be split into 2 parts.

  1. The support of following schema change from upstream
  2. The support of doing schema change (i.e. ADD COLUMN) in downstream.
  3. The support of altering a streaming job, which is a SINK here.

For the 1st point, the challenge is majorly for CDC connectors. Let's discuss it in separated issue.

For the 2nd point, as mentioned in Patrick's comment, this work is largely case by case.

  • If the downstream is structured, had better to check the ADD COLUMN can be done instantly
  • If the downstream is semi- or unstructured e.g. JSON, no need to do anything
  • If the downstream is semi-structured and managed by schema registry, need to update the schema registry.

For the 3rd point, note that even for CREATE SINK sink_users FROM users, which might be the simplest form I can tell, it's still a streaming job comprising 2 or 3 executors:

MaterializeExecutor --> [ExhcnageExecutor]   -->    SinkExecutor
(table)          (only if sink_key != table_pk)

When a new column is added, all the channels and executors in the pipeline need to be aware and change accordingly.

This problem is exactly altering MV (no rebuilding) that we have discussed a lot before, but the boundary of the problem can be well-defined and fairly limited. cc. @BugenZhao

@neverchanje
Copy link
Contributor Author

@fuyufjh Do you think it's technically feasible?
Since the problem scope is greatly reduced, I think it is quite different from the alter-mv that we discussed before.

@fuyufjh
Copy link
Member

fuyufjh commented Jun 4, 2024

@fuyufjh Do you think it's technically feasible? Since the problem scope is greatly reduced, I think it is quite different from the alter-mv that we discussed before.

From my understanding, the essential work is similar: changing the definition of the existing actors. But the scope is much more limited.

@fuyufjh fuyufjh assigned StrikeW and BugenZhao and unassigned StrikeW Jun 7, 2024
@BugenZhao BugenZhao modified the milestones: release-2.0, release-2.1 Aug 19, 2024
@StrikeW StrikeW removed their assignment Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants