Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

main, view: Pair view builder drain with its start #21909

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dawmd
Copy link
Contributor

@dawmd dawmd commented Dec 12, 2024

This PR reverts a patch that solved a problem related to an initialization
order. Unfortunately, after merging it, we ran into another one -- after
changing the place where group0_service was initialized, it ended up
being destroyed AFTER the CDC generation service would. Since CDC
generations are accessed in storage_service::topology_state_load():

for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
    rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
    co_await _cdc_gens.local().handle_cdc_generation(gen_id);

we started getting the following failure:

Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.

To solve the issue, we revert the changes and approach the problem
in another way: before that patch, we only drained the view builder
when stopping it, which was paired with its construction. However,
since the view builder depends on group0_service, which is initialized
AFTER the view builder is constructed, that could lead to segmentation
faults. To amend it, we pair draining the view builder with its start, while
leaving the lambda destroying it paired with its construction. This way
we should have a proper initialization order, avoid bugs, and reinforce
our understanding of how Scylla starts up.

Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too.

Fixes #20772
Fixes #21534

The patch solved a problem related to an initialization order, but we
ran into another one -- after moving the initialization of group0_service,
it ended up being destroyed AFTER the CDC generation service would.
Since CDC generations are accessed in `storage_service::topology_state_load()`:

```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
    rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
    co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```

we started getting the following failure:

```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```

We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.

This reverts commit 7bad837.
@dawmd dawmd requested review from nyh and piodul as code owners December 12, 2024 18:50
@dawmd dawmd requested a review from xemul December 12, 2024 18:50
@dawmd dawmd self-assigned this Dec 12, 2024
@dawmd dawmd added the backport/6.2 should be backported to 6.2 label Dec 12, 2024
@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

Build Details:

  • Duration: 28 min
  • Builder: i-0648fc04d5a267f56 (m5d.8xlarge)

@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Docker Test
✅ - dtest with topology changes
✅ - dtest
✅ - dtest with tablets
✅ - Offline-installer Artifact Tests
❌ - Unit Tests

Failed Tests (1/36079):

Build Details:

  • Duration: 3 hr 53 min
  • Builder: spider8.cloudius-systems.com

@dawmd
Copy link
Contributor Author

dawmd commented Dec 16, 2024

Looks like #21628. Rerunning CI

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Docker Test
✅ - dtest with tablets
✅ - dtest
✅ - Offline-installer Artifact Tests
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 5 hr 34 min
  • Builder: i-05fc31621ab76222d (m5d.8xlarge)

Copy link
Contributor

@piodul piodul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I have only one comment.

db/view/view.cc Outdated
Comment on lines 2098 to 2104
future<> view_builder::stop() {
vlogger.info("Stopping view builder");
return drain();
co_return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: put an assertion here which checks that drain() was indeed run (it needs to be run in order for the view builder to be properly stopped) - for example, you could check if _as.abort_requested() is true

Before these changes, we only drained the view builder when stopping it,
which was paired with its *construction*. However, since the view
builder depends on group0_service, which is initialized AFTER the view
builder is constructed, that could lead to segmentation faults.
To amend it, we pair draining the view builder with its start, while
leaving the lambda destroying it paired with its construction. This way
we should have a proper initialization order, avoid bugs, and reinforce
our understanding of how Scylla starts up.

Fixes scylladb#20772
@dawmd
Copy link
Contributor Author

dawmd commented Dec 17, 2024

Version 2:

  • Added an assert checking if we've drained a view builder before stopping it.

@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Docker Test
✅ - dtest with topology changes
✅ - Offline-installer Artifact Tests
✅ - dtest with tablets
✅ - dtest
❌ - Unit Tests

Failed Tests (2/36177):

Build Details:

  • Duration: 8 hr 4 min
  • Builder: i-004ca5bdcc17620cf (m5ad.8xlarge)

@dawmd
Copy link
Contributor Author

dawmd commented Dec 19, 2024

* [test_mv_topology_change.dev.2](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/13897/testReport/junit/topology_custom/test_mv_topology_change/Tests___Unit_Tests___test_mv_topology_change_dev_2) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_mv_topology_change.dev.2)

* [test_mv_topology_change.dev.3](https://jenkins.scylladb.com//job/scylla-master/job/scylla-ci/13897/testReport/junit/topology_custom/test_mv_topology_change/Tests___Unit_Tests___test_mv_topology_change_dev_3) [🔍](https://github.com/scylladb/scylladb/issues?q=is:issue+is:open+test_mv_topology_change.dev.3)

#21912

Rerunning CI

@scylladb-promoter
Copy link
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Docker Test
✅ - dtest with tablets
✅ - Offline-installer Artifact Tests
✅ - dtest with topology changes
❌ - dtest
✅ - Unit Tests

Failed Tests (1/36344):

Build Details:

  • Duration: 8 hr 13 min
  • Builder: spider4.cloudius-systems.com

@piodul
Copy link
Contributor

piodul commented Dec 20, 2024

Failed Tests (1/36344):

I believe it's #15715. Retriggering CI

@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Docker Test
✅ - dtest
✅ - Offline-installer Artifact Tests
✅ - dtest with tablets
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 3 hr 47 min
  • Builder: spider2.cloudius-systems.com

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/6.2 should be backported to 6.2
Projects
None yet
4 participants