Automatically apply schema changes on sinks when the parent table changes schema. #16839

neverchanje · 2024-05-20T11:56:03Z

See #16834 for the background.

This proposal aims to enable the automatic synchronization of schema changes from source tables to destination tables, such as a StarRocks table, as illustrated below.

CREATE TABLE users (...);

CREATE SINK sink_users FROM users WITH (
  connector = 'starrocks',
  starrocks.table = 'sr_users',
  auto.schema.evolution = 'true'
);

-- The SR table `sr_users` will adapt this change.
ALTER TABLE users ADD COLUMN address VARCHAR;

When auto.schema.evolution is set to true, the sink_users sink will handle schema change events emitted from the users table by executing ALTER ADD COLUMN on sr_users in StarRocks.

Specifically,

When the schema change is ADD COLUMN, the sink will execute ALTER TABLE ADD COLUMN.
When the schema change is DROP COLUMN, the sink will update its local metadata by tagging the column as "dropped", but not removing it. Instead, it fills the column with null.
Other events like type changes will be omitted for now, since we don't yet support them.

When auto.schema.evolution is enabled, the downstream table is expected to be fully managed by RW. We need to warn users to be cautious when executing manual schema changes. From this viewpoint, we can also let auto.schema.evolution auto-create the table (see #15586) if absent, thus offering users a fully-automated ELT experience.

It should be noted that the E2E auto schema change is not allowed for SINK AS SELECT or SINK FROM <mv>, given that queries are not able to change schema. It's only available to SINK FROM <table> or <source>.

The text was updated successfully, but these errors were encountered:

hzxa21 · 2024-05-22T09:07:07Z

Generally I think this is doable but let's talk about one specific case: not all databases support instant ALTER TABLE ADD/DROP COLUMN like we do. If ALTER TABLE ADD/DROP COLUMN takes a long time (due to locking / large data volume / ...) in the external sink, is auto schema evolution still beneficial or should we automatically pause/resume the stream for schema evolution?

neverchanje · 2024-05-22T10:42:19Z

We don't have to provide this option for databases where ALTER TABLE is time-consuming.
Let's support it on a connector-by-connector basis. In the worst case, users can simply disable the option if the latency is not desired.

hzxa21 · 2024-05-22T10:42:48Z

Generally I think this is doable but let's talk about one specific case: not all databases support instant ALTER TABLE ADD/DROP COLUMN like we do. If ALTER TABLE ADD/DROP COLUMN takes a long time (due to locking / large data volume / ...) in the external sink, is auto schema evolution still beneficial or should we automatically pause/resume the stream for schema evolution?

Discussed offline, auto schema change can be supported case by case. For example, if the downstream system is slow in ALTER statement, we won't support auto schema evolution on it. Note that for DROP COLUMN, we don't necessarily need to issue a ALTER DROP COLUMN statement and instead we can just sink NULL for the dropped column.

fuyufjh · 2024-05-31T08:47:38Z

For better understanding, this task can be split into 2 parts.

The support of following schema change from upstream
The support of doing schema change (i.e. ADD COLUMN) in downstream.
The support of altering a streaming job, which is a SINK here.

For the 1st point, the challenge is majorly for CDC connectors. Let's discuss it in separated issue.

For the 2nd point, as mentioned in Patrick's comment, this work is largely case by case.

If the downstream is structured, had better to check the ADD COLUMN can be done instantly
If the downstream is semi- or unstructured e.g. JSON, no need to do anything
If the downstream is semi-structured and managed by schema registry, need to update the schema registry.

For the 3rd point, note that even for CREATE SINK sink_users FROM users, which might be the simplest form I can tell, it's still a streaming job comprising 2 or 3 executors:

MaterializeExecutor --> [ExhcnageExecutor]   -->    SinkExecutor
(table)          (only if sink_key != table_pk)

When a new column is added, all the channels and executors in the pipeline need to be aware and change accordingly.

This problem is exactly altering MV (no rebuilding) that we have discussed a lot before, but the boundary of the problem can be well-defined and fairly limited. cc. @BugenZhao

neverchanje · 2024-05-31T09:10:08Z

@fuyufjh Do you think it's technically feasible?
Since the problem scope is greatly reduced, I think it is quite different from the alter-mv that we discussed before.

fuyufjh · 2024-06-04T08:57:07Z

@fuyufjh Do you think it's technically feasible? Since the problem scope is greatly reduced, I think it is quite different from the alter-mv that we discussed before.

From my understanding, the essential work is similar: changing the definition of the existing actors. But the scope is much more limited.

neverchanje mentioned this issue May 20, 2024

Discussion: End-to-end schema change #16834

Open

2 tasks

github-actions bot added this to the release-1.10 milestone May 20, 2024

fuyufjh assigned StrikeW and BugenZhao and unassigned StrikeW Jun 7, 2024

neverchanje modified the milestones: release-1.10, release-1.11 Jul 10, 2024

BugenZhao modified the milestones: release-2.0, release-2.1 Aug 19, 2024

StrikeW modified the milestones: release-2.1, future-release-2.2 Oct 11, 2024

StrikeW removed their assignment Jan 8, 2025

BugenZhao modified the milestones: release-2.2, future-release-2.4 Jan 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatically apply schema changes on sinks when the parent table changes schema. #16839

Automatically apply schema changes on sinks when the parent table changes schema. #16839

neverchanje commented May 20, 2024 •

edited

Loading

hzxa21 commented May 22, 2024

neverchanje commented May 22, 2024 •

edited

Loading

hzxa21 commented May 22, 2024

fuyufjh commented May 31, 2024 •

edited

Loading

neverchanje commented May 31, 2024

fuyufjh commented Jun 4, 2024 •

edited

Loading

Automatically apply schema changes on sinks when the parent table changes schema. #16839

Automatically apply schema changes on sinks when the parent table changes schema. #16839

Comments

neverchanje commented May 20, 2024 • edited Loading

hzxa21 commented May 22, 2024

neverchanje commented May 22, 2024 • edited Loading

hzxa21 commented May 22, 2024

fuyufjh commented May 31, 2024 • edited Loading

neverchanje commented May 31, 2024

fuyufjh commented Jun 4, 2024 • edited Loading

neverchanje commented May 20, 2024 •

edited

Loading

neverchanje commented May 22, 2024 •

edited

Loading

fuyufjh commented May 31, 2024 •

edited

Loading

fuyufjh commented Jun 4, 2024 •

edited

Loading