-
Notifications
You must be signed in to change notification settings - Fork 599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: adopt offset_translator #16892
storage: adopt offset_translator #16892
Conversation
/ci-repeat |
new failures in https://buildkite.com/redpanda/redpanda/builds/45667#018e0cb5-f264-47f1-9f95-c6ad22c346a2:
new failures in https://buildkite.com/redpanda/redpanda/builds/45843#018e1c3d-62ec-41e4-8834-1fdd7267114e:
new failures in https://buildkite.com/redpanda/redpanda/builds/45917#018e215e-fa46-4003-bd6e-443b712a76ed:
new failures in https://buildkite.com/redpanda/redpanda/builds/45927#018e2219-dcc2-4a43-9ea9-699d60941fd5:
new failures in https://buildkite.com/redpanda/redpanda/builds/46097#018e374e-2a5d-4237-857f-0f5e6189ba52:
new failures in https://buildkite.com/redpanda/redpanda/builds/46550#018e6052-3701-48ce-8789-2eee633bb705:
new failures in https://buildkite.com/redpanda/redpanda/builds/46550#018e6061-7c88-4f22-865f-93a621325640:
new failures in https://buildkite.com/redpanda/redpanda/builds/46584#018e627e-4ae9-4401-b4a8-85c1db6ae276:
|
5551e91
to
b39752c
Compare
/ci-repeat |
b39752c
to
8aea219
Compare
/ci-repeat |
8aea219
to
28a0dac
Compare
/ci-repeat |
1 similar comment
/ci-repeat |
be24c9f
to
38c747d
Compare
/ci-repeat |
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46097#018e374e-2a5d-4237-857f-0f5e6189ba52 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46235#018e3f9a-7703-4e16-b4dd-5f551a9d89d4 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46317#018e4517-684c-452a-816d-cf11e61d9e08 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46550#018e6052-3701-48ce-8789-2eee633bb705 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46550#018e6176-5349-4dbe-95b9-ad71c39d14ce |
3911a83
to
4503310
Compare
/ci-repeat |
4503310
to
24c811a
Compare
/ci-repeat |
/cdt |
24c811a
to
8ae6a97
Compare
is TS the only remaining place that will need cleand up? |
TLDR pretty much yes. My end-goal is to replace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, awesome pr.
i'm still trying to digest the commit near the end, storage: migrate Raft ownership of offset translation
there was a lot going on in that. but generally yeh this change set is great.
definitely want to get reviews from @ztlpn and @mmaslankaprv and @bharathv too.
The offset translator currently uses the storage::api to interact with the storage layer. With it moving to the storage layer, it's more convenient to have it instead use storage classes directly. For instance, I intend on moving offset_translator into the disk_log_impl, where storage_api doesn't exist. This commit makes the translator use the kvstore and storage_resources directly, and introduces a new constructor that avoids storage_api.
This check is already done by Raft to tell whether it should reset offset translator state. With the offset translator moving into the storage layer, it makes sense for this check to be done directly in storage. Note that Raft also previously checked the value of `_last_snapshot_index`. This does not seem required, as the last snapshot index is only hydrated in hydrate_snapshot() during do_start() after Raft checks whether the log is new.
Raft currently has the capability of force truncating the offset translator state when there is a gap in the log, using offset_translator::prefix_truncate_reset(). It does this in the unexpected event that the kvstore doesn't match with the contents of the log. Rather than having this special case of offset-translator truncation, this commit introduces an truncation option to allow for this forced truncation, expecting the caller to supply an offset delta to use. This commit replaces the implementation so that prefix_truncate_reset() is just a call to the more expected prefix_truncate(). This will make it easier to have storage::disk_log_impl own the offset translator, without having to worry about special Raft-specific usage.
e5eb21f
to
4f8cd41
Compare
Latest push was another rebase to fix a conflict |
This commit instantiates a currently unused instance of a raft::offset_translator in the disk_log_impl. This will be required as offset translation is pulled into the storage layer. This change alone, while mostly non-functional, at least allows subsequent commits to make further changes that depend on the new storage-layer translator without switching over to using it.
Long term, this will help us avoid exposing implementation details like the offset_translator_state to users of the log. Note that since the storage-layer translator isn't plugged into anything, this commit alone means that the new interface always returns incorrect results.
The storage layer shouldn't make decisions about what types are translated and what arent. This context currently comes from Raft; as such this adds some plumbing for that information to make its way down to storage.
Some tests use the translator in the log builder to do translation, and will need to be able to supply a reasonable set of batch types to translate. Since the on-disk storage of the offset translator is in the kvstore keyed by group_id, this also adds an optional group_id to the builder.
Adds a new start() method to the disk_log_impl intended for use by Raft. For the disk_log_impl, this initializes the offset_translator with any state that may exist from the Raft layer (snapshots, kv-store, etc). For reference, follow the code paths in consensus::do_start() for all calls into the offset translator and log, and that is what disk_log_impl::start() is meant to encapsulate.
An upcoming change will rewrite the initial snapshot hydration at startup to use the new log::start() interface. To that end, this pulls some logic from snapshot hydration into some reusable methods. Note, there is a mechanical difference in behavior that in practice will result in the same behavior: Previously, we always truncated the offset translator after adding the latest config to the config manager, and then conditionally truncated the log (roughly, we wouldn't truncate if this is the controller partition). With this change, the truncation of the offset translator will only happen if the log is also going to be truncated. In practice, this behavioral change doesn't matter because not truncating the log implies this is the controller partition, which doesn't translate offsets anyway (its translated batch types is empty).
This commit moves the management code of the lifecycle of the offset_translator from raft::consensus to storage::disk_log_impl. Truncation and prefix truncation are straightforward: corresponding calls into the offset translator are simply now done in the storage layer. What's trickier is the initial loading of offset translator state, which happens at Raft start time, and pulls metadata from the Raft snapshot and the kv store. To accommodate this, this commit refactors some methods for reuse to rejigger consensus::start() in a way that preserves the original behavior, but defers offset translator initialization and any initial truncation to the storage layer via log::start(), introduced in an earlier commit. Most callers still access the offset translator via Raft. A subsequent commit will make the shift to begin using the storage interface.
Replaces the computation of a delta for truncation with the generation of the truncation config that would be created with that delta. This is more intuitive, since it's clearer the logic is for a truncation.
In the effort to have storage own offset translation, this commit removes many direct usages of offset_translator_state, opting instead for the storage::log::delta and to/from_log_offset interfaces. By not exposting the translator as an implementation detail, this will allow the storage implementation to evolve more easily. Also removes the direct interface available in Raft. Note, this doesn't remove all usages of the offset translator state: offset_translator_state is still baked into the interfaces required to translate aborted transaction. Until those are adapted to use the storage interfaces, the offset_translator_state will remain as a part of the storage::log interface.
The test previously manually truncated the log without truncating Raft. Now that the offset translator actions are tied to the log, this meant that offset translator state couldn't translate the Raft start offset. This updates the test to use the more complete partition prefix truncation, which also truncates Raft.
Offset translator state is stored in the kvstore and keyed by group id. A default argument is provided for logs' group ids, which makes this somewhat a footgun. To mitigate, this adds a check that when a set of types are provided to be translated, a valid group id is also passed. Note that when there are no translated types, offset translation doesn't occur and this shouldn't matter.
Just delta() isn't necessarily clear what it is a delta of.
4f8cd41
to
ba937fd
Compare
Latest push was to add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like it is pretty much ready to go. just some questions here, but nothing that i'm worried about
@@ -2509,6 +2516,11 @@ ss::future<> disk_log_impl::truncate_prefix(truncate_prefix_config cfg) { | |||
}) | |||
.discard_result(); | |||
}); | |||
// We truncate the segments before truncating offset translator to wait for | |||
// readers that started reading from the start of the log before we advanced | |||
// the start offset and thus can still need offset translation info. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why doesn't the same logic about failure scenarios (described in the comment below in ::truncate()
apply here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the truncate()
comment is related to how we sync the offset translator state with the log at startup. Basically, we'll read through the log starting right after the end of what we have in our offset translator state, and then use that to rebuild state. Truncating the offset translator state is required to allow us to detect this state, since I think it's much harder to detect if the offset translation state passes the end of the log.
For prefix truncation OTOH, prefix truncating the offset translator state last means that our translation state extends below the beginning of the log, which doesn't affect correctness
src/v/raft/consensus.cc
Outdated
return _log | ||
->truncate_prefix(storage::truncate_prefix_config( | ||
model::next_offset(_last_snapshot_index), _scheduling.default_iopc)) | ||
model::next_offset(_last_snapshot_index), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is invoked by hydrate_snapshot, but there the prefix truncate reset was being called with _last_snapshot_index
rather than next_offset(_last_snapshot_index)
. is that a material difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the log impl's prefix truncate will call offset_translator_reset() with prev(truncate_at)
, which accounts for the difference, unless you noticed that and there's still a material difference
src/v/raft/consensus.cc
Outdated
std::optional<storage::truncate_prefix_config> start_truncate_cfg; | ||
auto snapshot_units = co_await _snapshot_lock.get_units(); | ||
auto metadata = co_await read_snapshot_metadata(); | ||
if (metadata.has_value()) { | ||
load_from_metadata(metadata.value()); | ||
co_await _configuration_manager.add( | ||
_last_snapshot_index, std::move(metadata->latest_configuration)); | ||
_probe->configuration_update(); | ||
|
||
auto delta = delta_for_truncation(metadata.value()); | ||
if (delta.has_value()) { | ||
start_truncate_cfg = storage::truncate_prefix_config( | ||
model::next_offset(_last_snapshot_index), | ||
_scheduling.default_iopc, | ||
delta); | ||
|
||
_flushed_offset = std::max( | ||
_last_snapshot_index, _flushed_offset); | ||
co_await _configuration_manager.prefix_truncate( | ||
_last_snapshot_index); | ||
} | ||
_snapshot_size = co_await _snapshot_mgr.get_snapshot_size(); | ||
} | ||
co_await _log->start(start_truncate_cfg); | ||
snapshot_units.return_all(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding the commit message from the pervious commit
An upcoming change will rewrite the initial snapshot hydration at
startup to use the new log::start() interface. To that end, this pulls
some logic from snapshot hydration into some reusable methods.
this is the user of those reusable methods? i guess i was a bit surprised that the logic here is different than what hydrate_snapshot had been doing. what is core difference? is it just that you need to do some of the rehydration before the log starts, because starting the log depends on state from the snapshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. We hydrate a bit more state and do a few more things, notably starting and syncing the offset translator state in this call to start()
.
CI failure: #11269 |
Now that the build pieces have landed for storage:: taking ownership of the offset_translator, this updates its namespace to follow suit. This is follow-up to redpanda-data#16892
Now that the build pieces have landed for storage:: taking ownership of the offset_translator, this updates its namespace to follow suit. This is follow-up to redpanda-data#16892
In the effort to make our local storage implementation more easily evolvable,
this PR aims to clarify the complex dance between Raft and storage as it
pertains to offset translation. To that end, this PR moves ownership of the
offset_translator into the disk_log_impl, refactoring some of the interactions
between Raft and storage in a way that encapsulates Rafts requirements on
translation state explicitly.
The log now has a start() interface that callers can use to provide additional
metadata to truncate, reset, and sync the log with respect to offset
translation state, as was being done by Raft on startup.
I also began removing dependencies on the offset_translator_state by exposing
delta, to_log_offset, and from_log_offset as methods of the log. I stopped
short though of ridding the log public interface of the offset_translator_state
since it will require deeper changes to the read path (the offset translator
state is a part of the translating_reader to allow tiered storage to expose
translation state built from the remote segments).
Backports Required
Release Notes