refactor(nexmark): unify three sources into one #6800

Merged
merged 5 commits into from
Dec 9, 2022

Conversation

lmatz
Contributor

@lmatz lmatz commented Dec 8, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

#6747

Use materialized views instead of views to re-create the three sources (person, bid, auction) on top of the unified source. With views, limitations described in #6801 or #6161 prevent RW from finding the columns referenced in the nexmark queries.

After #6817, we will use views to split the nexmark source into three separate views, so the nexmark queries will not need to change.
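
As a rough sketch of the approach described above, the three logical sources could be re-created as materialized views over a unified source named nexmark, using the schema shown in the release note below. The mapping of event_type values (0 = person, 1 = auction, 2 = bid) is an assumption not stated in this PR, and the statements in the PR itself may differ.

```sql
-- Sketch only. Assumes a unified source named nexmark with the
-- event_type / person / auction / bid columns from the release note.
-- Assumption: event_type 0 = person, 1 = auction, 2 = bid.

CREATE MATERIALIZED VIEW person AS
SELECT (person).id, (person).name, (person).email_address,
       (person).credit_card, (person).city, (person).state,
       (person).date_time, (person).extra
FROM nexmark
WHERE event_type = 0;

CREATE MATERIALIZED VIEW auction AS
SELECT (auction).id, (auction).item_name, (auction).description,
       (auction).initial_bid, (auction).reserve, (auction).date_time,
       (auction).expires, (auction).seller, (auction).category,
       (auction).extra
FROM nexmark
WHERE event_type = 1;

CREATE MATERIALIZED VIEW bid AS
SELECT (bid).auction, (bid).bidder, (bid).price, (bid).channel,
       (bid).url, (bid).date_time, (bid).extra
FROM nexmark
WHERE event_type = 2;
```

Because each materialized view exposes the struct fields as flat columns, existing nexmark queries that refer to person, auction, and bid by name should keep working unchanged.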

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Connector (sources & sinks)

Release note

Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.

RW's in-memory data generator, nexmark, now supports another mode: generating all three types of events in one source (the unified source).

CREATE SOURCE nexmark (
  event_type BIGINT,
  person STRUCT<"id" BIGINT,
                "name" VARCHAR,
                "email_address" VARCHAR,
                "credit_card" VARCHAR,
                "city" VARCHAR,
                "state" VARCHAR,
                "date_time" TIMESTAMP,
                "extra" VARCHAR>,
  auction STRUCT<"id" BIGINT,
                 "item_name" VARCHAR,
                 "description" VARCHAR,
                 "initial_bid" BIGINT,
                 "reserve" BIGINT,
                 "date_time" TIMESTAMP,
                 "expires" TIMESTAMP,
                 "seller" BIGINT,
                 "category" BIGINT,
                 "extra" VARCHAR>,
  bid STRUCT<"auction" BIGINT,
             "bidder" BIGINT,
             "price" BIGINT,
             "channel" VARCHAR,
             "url" VARCHAR,
             "date_time" TIMESTAMP,
             "extra" VARCHAR>
) WITH (
    connector = 'nexmark',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '1000000'
) ROW FORMAT JSON;

In this mode, nexmark.table.type is not specified.
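
For contrast, a single-type source sets nexmark.table.type. The sketch below is illustrative only: the value 'Bid', the source name bid_only, and the flat column list (copied from the bid struct above) are assumptions that this PR does not confirm.

```sql
-- Sketch only: a single-type nexmark source that emits bid events.
-- Assumptions: 'Bid' is an accepted value for nexmark.table.type, and
-- the flat columns mirror the fields of the bid struct above.
CREATE SOURCE bid_only (
  "auction" BIGINT,
  "bidder" BIGINT,
  "price" BIGINT,
  "channel" VARCHAR,
  "url" VARCHAR,
  "date_time" TIMESTAMP,
  "extra" VARCHAR
) WITH (
    connector = 'nexmark',
    nexmark.table.type = 'Bid',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '1000000'
) ROW FORMAT JSON;
```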

Refer to a related PR or issue link (optional)

@lmatz lmatz marked this pull request as draft December 8, 2022 08:28
Contributor

@github-actions github-actions bot left a comment

license-eye has totally checked 2490 files.

Valid   Invalid   Ignored   Fixed
1191    1         1298      0

Invalid file list:
  • src/connector/src/source/nexmark/source/combined_event.rs

@BugenZhao
Member

This sounds like a good idea. 🤔 When we want a single source for some trivial tests, we can directly specify the name and ingest it at full speed. When we need to mimic the real case in production, we can use this unified source to get precise control of the event ratio, just like what Flink does.

@lmatz
Contributor Author

lmatz commented Dec 9, 2022

This sounds like a good idea. 🤔 When we want a single source for some trivial tests, we can directly specify the name and ingest it at full speed. When we need to mimic the real case in production, we can use this unified source to get precise control of the event ratio, just like what Flink does.

I will still use the single type source generator for the simulation tests.

I tried to integrate the unified source generator into it, but:

  1. There seem to be some bugs, as shown in https://buildkite.com/risingwavelabs/pull-request/builds/13500#0184f29a-39f0-490c-9279-f8ea35723996. More specifically, one is "local stats lost!", which I have no clue about. The other is that https://github.com/risingwavelabs/risingwave/blob/main/src/tests/simulation/tests/nexmark_chaos.rs#L57 fails. I suspect this is due to using a materialized view instead of a view (all or most of the rows are flushed into the nexmark MV at once, or within a short period of fake time). I will create an issue for this.

@BugenZhao
Member

BugenZhao commented Dec 9, 2022

  1. More specifically, one is "local stats lost!", which I have no clue about.

I think we can ignore this. This is a notice in production and should be normal in tests. 🤔

I'll fix it here:

https://github.com/risingwavelabs/risingwave/pull/6758/files#diff-84f1e2deaafccffea31065447d255aa66db7ee28a7b134681ce25567007d7454

@codecov

codecov bot commented Dec 9, 2022

Codecov Report

Merging #6800 (523ae0e) into main (5d0ebdf) will decrease coverage by 0.01%.
The diff coverage is 29.41%.

@@            Coverage Diff             @@
##             main    #6800      +/-   ##
==========================================
- Coverage   73.23%   73.21%   -0.02%     
==========================================
  Files        1025     1026       +1     
  Lines      164193   164212      +19     
==========================================
- Hits       120252   120235      -17     
- Misses      43941    43977      +36     
Flag   Coverage Δ
rust   73.21% <29.41%> (-0.02%) ⬇️

Impacted Files                                          Coverage Δ
...nector/src/source/nexmark/source/combined_event.rs   0.00% <0.00%> (ø)
src/connector/src/source/nexmark/source/message.rs      53.12% <7.69%> (-31.88%) ⬇️
src/connector/src/source/nexmark/mod.rs                 93.84% <83.33%> (-2.59%) ⬇️
src/connector/src/source/base.rs                        77.88% <100.00%> (ø)
src/connector/src/source/nexmark/source/reader.rs       85.91% <100.00%> (+0.62%) ⬆️
...frontend/src/scheduler/hummock_snapshot_manager.rs   58.29% <0.00%> (-0.51%) ⬇️
src/storage/src/memory.rs                               93.27% <0.00%> (+0.17%) ⬆️

@lmatz lmatz marked this pull request as ready for review December 9, 2022 07:38
@lmatz lmatz added the user-facing-changes Contains changes that are visible to users label Dec 9, 2022
@lmatz lmatz requested a review from KeXiangWang December 9, 2022 07:51
Contributor

@wangrunji0408 wangrunji0408 left a comment

LGTM!

@lmatz
Contributor Author

lmatz commented Dec 9, 2022

On my MacBook with the default configuration (./risedev d) and a release build:

Using the unified source

CREATE SOURCE nexmark (
  event_type BIGINT,
  person STRUCT<"id" BIGINT,
                "name" VARCHAR,
                "email_address" VARCHAR,
                "credit_card" VARCHAR,
                "city" VARCHAR,
                "state" VARCHAR,
                "date_time" TIMESTAMP,
                "extra" VARCHAR>,
  auction STRUCT<"id" BIGINT,
                 "item_name" VARCHAR,
                 "description" VARCHAR,
                 "initial_bid" BIGINT,
                 "reserve" BIGINT,
                 "date_time" TIMESTAMP,
                 "expires" TIMESTAMP,
                 "seller" BIGINT,
                 "category" BIGINT,
                 "extra" VARCHAR>,
  bid STRUCT<"auction" BIGINT,
             "bidder" BIGINT,
             "price" BIGINT,
             "channel" VARCHAR,
             "url" VARCHAR,
             "date_time" TIMESTAMP,
             "extra" VARCHAR>
) WITH (
    connector = 'nexmark',
    nexmark.split.num = '8',
    nexmark.min.event.gap.in.ns = '0'
) ROW FORMAT JSON;

and

create sink s1 as select * from nexmark with ( connector = 'blackhole' );

We have:
[Screenshot: SCR-20221209-m76]

Member

@BugenZhao BugenZhao left a comment

LGTM 🚀

@mergify
Contributor

mergify bot commented Dec 9, 2022

Hey @lmatz, this pull request failed to merge and has been dequeued from the merge train. If you believe your PR failed in the merge train because of a flaky test, requeue it by clicking "Update branch" or pushing an empty commit with git commit --allow-empty -m "rerun" && git push.

@mergify mergify bot merged commit 279fb77 into main Dec 9, 2022
@mergify mergify bot deleted the lz/nexmark branch December 9, 2022 11:02
@hengm3467
Contributor

@lmatz Do you think we should document the nexmark connector for generating mock data? We already documented datagen.

@lmatz
Contributor Author

lmatz commented Dec 20, 2022

@lmatz Do you think we should document the nexmark connector for generating mock data? We already documented datagen.

It's fine, not a must right now.

Labels
type/refactor user-facing-changes Contains changes that are visible to users
4 participants