storage: adopt offset_translator #16892

andrwng · 2024-03-05T02:23:38Z

In the effort to make our local storage implementation more easily evolvable,
this PR aims to clarify the complex dance between Raft and storage as it
pertains to offset translation. To that end, this PR moves ownership of the
offset_translator into the disk_log_impl, refactoring some of the interactions
between Raft and storage in a way that encapsulates Rafts requirements on
translation state explicitly.

The log now has a start() interface that callers can use to provide additional
metadata to truncate, reset, and sync the log with respect to offset
translation state, as was being done by Raft on startup.

I also began removing dependencies on the offset_translator_state by exposing
delta, to_log_offset, and from_log_offset as methods of the log. I stopped
short though of ridding the log public interface of the offset_translator_state
since it will require deeper changes to the read path (the offset translator
state is a part of the translating_reader to allow tiered storage to expose
translation state built from the remote segments).

Backports Required

Release Notes

none

andrwng · 2024-03-05T02:25:37Z

/ci-repeat

vbotbuildovich · 2024-03-05T04:56:02Z

new failures in https://buildkite.com/redpanda/redpanda/builds/45667#018e0cb5-f264-47f1-9f95-c6ad22c346a2:

"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_and_upgrade"

new failures in https://buildkite.com/redpanda/redpanda/builds/45843#018e1c3d-62ec-41e4-8834-1fdd7267114e:

"rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_segment_deletion.cloud_storage_enabled=True.truncate_point=one_below_high_watermark"

new failures in https://buildkite.com/redpanda/redpanda/builds/45917#018e215e-fa46-4003-bd6e-443b712a76ed:

"rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_segment_deletion.cloud_storage_enabled=False.truncate_point=one_below_high_watermark"

new failures in https://buildkite.com/redpanda/redpanda/builds/45927#018e2219-dcc2-4a43-9ea9-699d60941fd5:

"rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_segment_deletion.cloud_storage_enabled=True.truncate_point=one_below_high_watermark"

new failures in https://buildkite.com/redpanda/redpanda/builds/46097#018e374e-2a5d-4237-857f-0f5e6189ba52:

"rptest.tests.consumer_group_balancing_test.ConsumerGroupBalancingTest.test_coordinator_nodes_balance"

new failures in https://buildkite.com/redpanda/redpanda/builds/46550#018e6052-3701-48ce-8789-2eee633bb705:

"rptest.tests.control_character_flag_test.ControlCharacterPermittedAfterUpgrade.test_upgrade_from_pre_v23_2.initial_version=.23.1.1"

new failures in https://buildkite.com/redpanda/redpanda/builds/46550#018e6061-7c88-4f22-865f-93a621325640:

"rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_relaxed_acks.write_caching=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/46584#018e627e-4ae9-4401-b4a8-85c1db6ae276:

"rptest.tests.retention_policy_test.ShadowIndexingCloudRetentionTest.test_cloud_time_based_retention.cloud_storage_type=CloudStorageType.S3"

andrwng · 2024-03-08T02:41:38Z

/ci-repeat

andrwng · 2024-03-09T02:40:03Z

/ci-repeat

andrwng · 2024-03-09T06:29:46Z

/ci-repeat

andrwng · 2024-03-13T01:26:46Z

/ci-repeat

andrwng · 2024-03-13T08:58:04Z

/ci-repeat

vbotbuildovich · 2024-03-13T11:38:00Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46097#018e374e-2a5d-4237-857f-0f5e6189ba52

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46235#018e3f9a-7703-4e16-b4dd-5f551a9d89d4

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46317#018e4517-684c-452a-816d-cf11e61d9e08

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46550#018e6052-3701-48ce-8789-2eee633bb705

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46550#018e6176-5349-4dbe-95b9-ad71c39d14ce

andrwng · 2024-03-13T19:16:42Z

/ci-repeat

andrwng · 2024-03-13T19:58:40Z

/ci-repeat

andrwng · 2024-03-14T15:40:52Z

/cdt

dotnwat · 2024-03-15T22:40:49Z

(the offset translator
state is a part of the translating_reader to allow tiered storage to expose
translation state built from the remote segments).

is TS the only remaining place that will need cleand up?

andrwng · 2024-03-15T23:02:34Z

(the offset translator
state is a part of the translating_reader to allow tiered storage to expose
translation state built from the remote segments).

is TS the only remaining place that will need cleand up?

TLDR pretty much yes.

My end-goal is to replace translating_reader with just an option in log_reader that is able to return runs translated offsets. But translating_reader's translation state is used also for point-translations of some transactional metadata along the fetch path. TS exposes its translation state for this reason, and that's what I'd like to clean up.

dotnwat

wow, awesome pr.

i'm still trying to digest the commit near the end, storage: migrate Raft ownership of offset translation there was a lot going on in that. but generally yeh this change set is great.

definitely want to get reviews from @ztlpn and @mmaslankaprv and @bharathv too.

src/v/model/record_batch_types.h

src/v/storage/offset_translator.h

src/v/raft/consensus.cc

src/v/storage/offset_translator.cc

src/v/storage/log_manager.h

src/v/raft/consensus.cc

src/v/redpanda/admin/debug.cc

src/v/archival/ntp_archiver_service.cc

The offset translator currently uses the storage::api to interact with the storage layer. With it moving to the storage layer, it's more convenient to have it instead use storage classes directly. For instance, I intend on moving offset_translator into the disk_log_impl, where storage_api doesn't exist. This commit makes the translator use the kvstore and storage_resources directly, and introduces a new constructor that avoids storage_api.

This check is already done by Raft to tell whether it should reset offset translator state. With the offset translator moving into the storage layer, it makes sense for this check to be done directly in storage. Note that Raft also previously checked the value of `_last_snapshot_index`. This does not seem required, as the last snapshot index is only hydrated in hydrate_snapshot() during do_start() after Raft checks whether the log is new.

Raft currently has the capability of force truncating the offset translator state when there is a gap in the log, using offset_translator::prefix_truncate_reset(). It does this in the unexpected event that the kvstore doesn't match with the contents of the log. Rather than having this special case of offset-translator truncation, this commit introduces an truncation option to allow for this forced truncation, expecting the caller to supply an offset delta to use. This commit replaces the implementation so that prefix_truncate_reset() is just a call to the more expected prefix_truncate(). This will make it easier to have storage::disk_log_impl own the offset translator, without having to worry about special Raft-specific usage.

andrwng · 2024-03-21T17:43:36Z

Latest push was another rebase to fix a conflict

This commit instantiates a currently unused instance of a raft::offset_translator in the disk_log_impl. This will be required as offset translation is pulled into the storage layer. This change alone, while mostly non-functional, at least allows subsequent commits to make further changes that depend on the new storage-layer translator without switching over to using it.

Long term, this will help us avoid exposing implementation details like the offset_translator_state to users of the log. Note that since the storage-layer translator isn't plugged into anything, this commit alone means that the new interface always returns incorrect results.

The storage layer shouldn't make decisions about what types are translated and what arent. This context currently comes from Raft; as such this adds some plumbing for that information to make its way down to storage.

Some tests use the translator in the log builder to do translation, and will need to be able to supply a reasonable set of batch types to translate. Since the on-disk storage of the offset translator is in the kvstore keyed by group_id, this also adds an optional group_id to the builder.

Adds a new start() method to the disk_log_impl intended for use by Raft. For the disk_log_impl, this initializes the offset_translator with any state that may exist from the Raft layer (snapshots, kv-store, etc). For reference, follow the code paths in consensus::do_start() for all calls into the offset translator and log, and that is what disk_log_impl::start() is meant to encapsulate.

An upcoming change will rewrite the initial snapshot hydration at startup to use the new log::start() interface. To that end, this pulls some logic from snapshot hydration into some reusable methods. Note, there is a mechanical difference in behavior that in practice will result in the same behavior: Previously, we always truncated the offset translator after adding the latest config to the config manager, and then conditionally truncated the log (roughly, we wouldn't truncate if this is the controller partition). With this change, the truncation of the offset translator will only happen if the log is also going to be truncated. In practice, this behavioral change doesn't matter because not truncating the log implies this is the controller partition, which doesn't translate offsets anyway (its translated batch types is empty).

This commit moves the management code of the lifecycle of the offset_translator from raft::consensus to storage::disk_log_impl. Truncation and prefix truncation are straightforward: corresponding calls into the offset translator are simply now done in the storage layer. What's trickier is the initial loading of offset translator state, which happens at Raft start time, and pulls metadata from the Raft snapshot and the kv store. To accommodate this, this commit refactors some methods for reuse to rejigger consensus::start() in a way that preserves the original behavior, but defers offset translator initialization and any initial truncation to the storage layer via log::start(), introduced in an earlier commit. Most callers still access the offset translator via Raft. A subsequent commit will make the shift to begin using the storage interface.

Replaces the computation of a delta for truncation with the generation of the truncation config that would be created with that delta. This is more intuitive, since it's clearer the logic is for a truncation.

In the effort to have storage own offset translation, this commit removes many direct usages of offset_translator_state, opting instead for the storage::log::delta and to/from_log_offset interfaces. By not exposting the translator as an implementation detail, this will allow the storage implementation to evolve more easily. Also removes the direct interface available in Raft. Note, this doesn't remove all usages of the offset translator state: offset_translator_state is still baked into the interfaces required to translate aborted transaction. Until those are adapted to use the storage interfaces, the offset_translator_state will remain as a part of the storage::log interface.

The test previously manually truncated the log without truncating Raft. Now that the offset translator actions are tied to the log, this meant that offset translator state couldn't translate the Raft start offset. This updates the test to use the more complete partition prefix truncation, which also truncates Raft.

Offset translator state is stored in the kvstore and keyed by group id. A default argument is provided for logs' group ids, which makes this somewhat a footgun. To mitigate, this adds a check that when a set of types are provided to be translated, a valid group id is also passed. Note that when there are no translated types, offset translation doesn't occur and this shouldn't matter.

Just delta() isn't necessarily clear what it is a delta of.

andrwng · 2024-03-21T18:34:01Z

Latest push was to add raft/fundamental.h to storage/log.h, now that the impl needs a group_id

dotnwat

this looks like it is pretty much ready to go. just some questions here, but nothing that i'm worried about

src/v/raft/consensus.cc

src/v/storage/disk_log_impl.cc

dotnwat · 2024-03-21T20:24:39Z

src/v/storage/disk_log_impl.cc

@@ -2509,6 +2516,11 @@ ss::future<> disk_log_impl::truncate_prefix(truncate_prefix_config cfg) {
          })
          .discard_result();
    });
+    // We truncate the segments before truncating offset translator to wait for
+    // readers that started reading from the start of the log before we advanced
+    // the start offset and thus can still need offset translation info.


why doesn't the same logic about failure scenarios (described in the comment below in ::truncate() apply here?

I think the truncate() comment is related to how we sync the offset translator state with the log at startup. Basically, we'll read through the log starting right after the end of what we have in our offset translator state, and then use that to rebuild state. Truncating the offset translator state is required to allow us to detect this state, since I think it's much harder to detect if the offset translation state passes the end of the log.

For prefix truncation OTOH, prefix truncating the offset translator state last means that our translation state extends below the beginning of the log, which doesn't affect correctness

dotnwat · 2024-03-21T21:29:58Z

src/v/raft/consensus.cc

    return _log
      ->truncate_prefix(storage::truncate_prefix_config(
-        model::next_offset(_last_snapshot_index), _scheduling.default_iopc))
+        model::next_offset(_last_snapshot_index),


this is invoked by hydrate_snapshot, but there the prefix truncate reset was being called with _last_snapshot_index rather than next_offset(_last_snapshot_index). is that a material difference?

Note that the log impl's prefix truncate will call offset_translator_reset() with prev(truncate_at), which accounts for the difference, unless you noticed that and there's still a material difference

src/v/storage/log_manager.h

dotnwat · 2024-03-21T21:57:15Z

src/v/raft/consensus.cc

+        std::optional<storage::truncate_prefix_config> start_truncate_cfg;
+        auto snapshot_units = co_await _snapshot_lock.get_units();
+        auto metadata = co_await read_snapshot_metadata();
+        if (metadata.has_value()) {
+            load_from_metadata(metadata.value());
+            co_await _configuration_manager.add(
+              _last_snapshot_index, std::move(metadata->latest_configuration));
+            _probe->configuration_update();
+
+            auto delta = delta_for_truncation(metadata.value());
+            if (delta.has_value()) {
+                start_truncate_cfg = storage::truncate_prefix_config(
+                  model::next_offset(_last_snapshot_index),
+                  _scheduling.default_iopc,
+                  delta);
+
+                _flushed_offset = std::max(
+                  _last_snapshot_index, _flushed_offset);
+                co_await _configuration_manager.prefix_truncate(
+                  _last_snapshot_index);
+            }
+            _snapshot_size = co_await _snapshot_mgr.get_snapshot_size();
+        }
+        co_await _log->start(start_truncate_cfg);
+        snapshot_units.return_all();


regarding the commit message from the pervious commit

An upcoming change will rewrite the initial snapshot hydration at
startup to use the new log::start() interface. To that end, this pulls
some logic from snapshot hydration into some reusable methods.

this is the user of those reusable methods? i guess i was a bit surprised that the logic here is different than what hydrate_snapshot had been doing. what is core difference? is it just that you need to do some of the rehydration before the log starts, because starting the log depends on state from the snapshot?

That's right. We hydrate a bit more state and do a few more things, notably starting and syncing the offset translator state in this call to start().

andrwng · 2024-03-21T22:51:15Z

CI failure: #11269

Now that the build pieces have landed for storage:: taking ownership of the offset_translator, this updates its namespace to follow suit. This is follow-up to redpanda-data#16892

github-actions bot added the area/redpanda label Mar 5, 2024

andrwng force-pushed the storage-offset-translation branch from 5551e91 to b39752c Compare March 8, 2024 02:26

andrwng force-pushed the storage-offset-translation branch from b39752c to 8aea219 Compare March 9, 2024 02:28

andrwng force-pushed the storage-offset-translation branch from 8aea219 to 28a0dac Compare March 9, 2024 06:27

andrwng force-pushed the storage-offset-translation branch 3 times, most recently from be24c9f to 38c747d Compare March 13, 2024 08:56

andrwng force-pushed the storage-offset-translation branch 2 times, most recently from 3911a83 to 4503310 Compare March 13, 2024 19:08

andrwng force-pushed the storage-offset-translation branch from 4503310 to 24c811a Compare March 13, 2024 19:44

andrwng force-pushed the storage-offset-translation branch from 24c811a to 8ae6a97 Compare March 14, 2024 23:46

andrwng marked this pull request as ready for review March 14, 2024 23:57

andrwng requested review from ztlpn, dotnwat and mmaslankaprv March 15, 2024 04:33

dotnwat reviewed Mar 15, 2024

View reviewed changes

andrwng requested a review from bharathv March 16, 2024 00:18

andrwng added 3 commits March 21, 2024 10:43

andrwng force-pushed the storage-offset-translation branch from e5eb21f to 4f8cd41 Compare March 21, 2024 17:43

andrwng added 13 commits March 21, 2024 11:32

storage: plumb translator batch types from log_manager

90d9b2d

The storage layer shouldn't make decisions about what types are translated and what arent. This context currently comes from Raft; as such this adds some plumbing for that information to make its way down to storage.

raft: refactor delta_for_truncation() away

1050ae5

Replaces the computation of a delta for truncation with the generation of the truncation config that would be created with that delta. This is more intuitive, since it's clearer the logic is for a truncation.

raft: rename load_from_metadata -> update_offset_from_snapshot

3347fe3

storage: rename delta() to offset_delta()

ba937fd

Just delta() isn't necessarily clear what it is a delta of.

andrwng force-pushed the storage-offset-translation branch from 4f8cd41 to ba937fd Compare March 21, 2024 18:32

dotnwat reviewed Mar 21, 2024

View reviewed changes

dotnwat approved these changes Mar 21, 2024

View reviewed changes

andrwng requested a review from ztlpn March 22, 2024 06:20

ztlpn approved these changes Mar 22, 2024

View reviewed changes

andrwng merged commit 70ffeb0 into redpanda-data:dev Mar 22, 2024
17 checks passed

andrwng mentioned this pull request Mar 26, 2024

storage: adopt offset_translator in namespace #17382

Merged

6 tasks

andrwng mentioned this pull request Apr 2, 2024

Move the offset translator state into disk_log_impl #16812

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: adopt offset_translator #16892

storage: adopt offset_translator #16892

andrwng commented Mar 5, 2024 •

edited

Loading

andrwng commented Mar 5, 2024

vbotbuildovich commented Mar 5, 2024 •

edited

Loading

andrwng commented Mar 8, 2024

andrwng commented Mar 9, 2024

andrwng commented Mar 9, 2024

andrwng commented Mar 13, 2024

andrwng commented Mar 13, 2024

vbotbuildovich commented Mar 13, 2024 •

edited

Loading

andrwng commented Mar 13, 2024

andrwng commented Mar 13, 2024

andrwng commented Mar 14, 2024

dotnwat commented Mar 15, 2024

andrwng commented Mar 15, 2024

dotnwat left a comment

andrwng commented Mar 21, 2024

andrwng commented Mar 21, 2024

dotnwat left a comment

dotnwat Mar 21, 2024

andrwng Mar 21, 2024

dotnwat Mar 21, 2024

andrwng Mar 21, 2024

dotnwat Mar 21, 2024

andrwng Mar 21, 2024

andrwng commented Mar 21, 2024

storage: adopt offset_translator #16892

storage: adopt offset_translator #16892

Conversation

andrwng commented Mar 5, 2024 • edited Loading

Backports Required

Release Notes

andrwng commented Mar 5, 2024

vbotbuildovich commented Mar 5, 2024 • edited Loading

andrwng commented Mar 8, 2024

andrwng commented Mar 9, 2024

andrwng commented Mar 9, 2024

andrwng commented Mar 13, 2024

andrwng commented Mar 13, 2024

vbotbuildovich commented Mar 13, 2024 • edited Loading

andrwng commented Mar 13, 2024

andrwng commented Mar 13, 2024

andrwng commented Mar 14, 2024

dotnwat commented Mar 15, 2024

andrwng commented Mar 15, 2024

dotnwat left a comment

Choose a reason for hiding this comment

andrwng commented Mar 21, 2024

andrwng commented Mar 21, 2024

dotnwat left a comment

Choose a reason for hiding this comment

dotnwat Mar 21, 2024

Choose a reason for hiding this comment

andrwng Mar 21, 2024

Choose a reason for hiding this comment

dotnwat Mar 21, 2024

Choose a reason for hiding this comment

andrwng Mar 21, 2024

Choose a reason for hiding this comment

dotnwat Mar 21, 2024

Choose a reason for hiding this comment

andrwng Mar 21, 2024

Choose a reason for hiding this comment

andrwng commented Mar 21, 2024

andrwng commented Mar 5, 2024 •

edited

Loading

vbotbuildovich commented Mar 5, 2024 •

edited

Loading

vbotbuildovich commented Mar 13, 2024 •

edited

Loading