Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: adopt offset_translator #16892

Merged
merged 20 commits into from
Mar 22, 2024

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Mar 5, 2024

In the effort to make our local storage implementation more easily evolvable,
this PR aims to clarify the complex dance between Raft and storage as it
pertains to offset translation. To that end, this PR moves ownership of the
offset_translator into the disk_log_impl, refactoring some of the interactions
between Raft and storage in a way that encapsulates Rafts requirements on
translation state explicitly.

The log now has a start() interface that callers can use to provide additional
metadata to truncate, reset, and sync the log with respect to offset
translation state, as was being done by Raft on startup.

I also began removing dependencies on the offset_translator_state by exposing
delta, to_log_offset, and from_log_offset as methods of the log. I stopped
short though of ridding the log public interface of the offset_translator_state
since it will require deeper changes to the read path (the offset translator
state is a part of the translating_reader to allow tiered storage to expose
translation state built from the remote segments).

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@andrwng
Copy link
Contributor Author

andrwng commented Mar 5, 2024

/ci-repeat

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Mar 5, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/45667#018e0cb5-f264-47f1-9f95-c6ad22c346a2:

"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_and_upgrade"

new failures in https://buildkite.com/redpanda/redpanda/builds/45843#018e1c3d-62ec-41e4-8834-1fdd7267114e:

"rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_segment_deletion.cloud_storage_enabled=True.truncate_point=one_below_high_watermark"

new failures in https://buildkite.com/redpanda/redpanda/builds/45917#018e215e-fa46-4003-bd6e-443b712a76ed:

"rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_segment_deletion.cloud_storage_enabled=False.truncate_point=one_below_high_watermark"

new failures in https://buildkite.com/redpanda/redpanda/builds/45927#018e2219-dcc2-4a43-9ea9-699d60941fd5:

"rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_segment_deletion.cloud_storage_enabled=True.truncate_point=one_below_high_watermark"

new failures in https://buildkite.com/redpanda/redpanda/builds/46097#018e374e-2a5d-4237-857f-0f5e6189ba52:

"rptest.tests.consumer_group_balancing_test.ConsumerGroupBalancingTest.test_coordinator_nodes_balance"

new failures in https://buildkite.com/redpanda/redpanda/builds/46550#018e6052-3701-48ce-8789-2eee633bb705:

"rptest.tests.control_character_flag_test.ControlCharacterPermittedAfterUpgrade.test_upgrade_from_pre_v23_2.initial_version=.23.1.1"

new failures in https://buildkite.com/redpanda/redpanda/builds/46550#018e6061-7c88-4f22-865f-93a621325640:

"rptest.tests.simple_e2e_test.SimpleEndToEndTest.test_relaxed_acks.write_caching=False"

new failures in https://buildkite.com/redpanda/redpanda/builds/46584#018e627e-4ae9-4401-b4a8-85c1db6ae276:

"rptest.tests.retention_policy_test.ShadowIndexingCloudRetentionTest.test_cloud_time_based_retention.cloud_storage_type=CloudStorageType.S3"

@andrwng andrwng force-pushed the storage-offset-translation branch from 5551e91 to b39752c Compare March 8, 2024 02:26
@andrwng
Copy link
Contributor Author

andrwng commented Mar 8, 2024

/ci-repeat

@andrwng andrwng force-pushed the storage-offset-translation branch from b39752c to 8aea219 Compare March 9, 2024 02:28
@andrwng
Copy link
Contributor Author

andrwng commented Mar 9, 2024

/ci-repeat

@andrwng andrwng force-pushed the storage-offset-translation branch from 8aea219 to 28a0dac Compare March 9, 2024 06:27
@andrwng
Copy link
Contributor Author

andrwng commented Mar 9, 2024

/ci-repeat

1 similar comment
@andrwng
Copy link
Contributor Author

andrwng commented Mar 13, 2024

/ci-repeat

@andrwng andrwng force-pushed the storage-offset-translation branch 3 times, most recently from be24c9f to 38c747d Compare March 13, 2024 08:56
@andrwng
Copy link
Contributor Author

andrwng commented Mar 13, 2024

/ci-repeat

@andrwng andrwng force-pushed the storage-offset-translation branch 2 times, most recently from 3911a83 to 4503310 Compare March 13, 2024 19:08
@andrwng
Copy link
Contributor Author

andrwng commented Mar 13, 2024

/ci-repeat

@andrwng andrwng force-pushed the storage-offset-translation branch from 4503310 to 24c811a Compare March 13, 2024 19:44
@andrwng
Copy link
Contributor Author

andrwng commented Mar 13, 2024

/ci-repeat

@andrwng
Copy link
Contributor Author

andrwng commented Mar 14, 2024

/cdt

@andrwng andrwng force-pushed the storage-offset-translation branch from 24c811a to 8ae6a97 Compare March 14, 2024 23:46
@andrwng andrwng marked this pull request as ready for review March 14, 2024 23:57
@dotnwat
Copy link
Member

dotnwat commented Mar 15, 2024

(the offset translator
state is a part of the translating_reader to allow tiered storage to expose
translation state built from the remote segments).

is TS the only remaining place that will need cleand up?

@andrwng
Copy link
Contributor Author

andrwng commented Mar 15, 2024

(the offset translator
state is a part of the translating_reader to allow tiered storage to expose
translation state built from the remote segments).

is TS the only remaining place that will need cleand up?

TLDR pretty much yes.

My end-goal is to replace translating_reader with just an option in log_reader that is able to return runs translated offsets. But translating_reader's translation state is used also for point-translations of some transactional metadata along the fetch path. TS exposes its translation state for this reason, and that's what I'd like to clean up.

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, awesome pr.

i'm still trying to digest the commit near the end, storage: migrate Raft ownership of offset translation there was a lot going on in that. but generally yeh this change set is great.

definitely want to get reviews from @ztlpn and @mmaslankaprv and @bharathv too.

src/v/model/record_batch_types.h Show resolved Hide resolved
src/v/storage/offset_translator.h Outdated Show resolved Hide resolved
src/v/raft/consensus.cc Show resolved Hide resolved
src/v/storage/offset_translator.cc Outdated Show resolved Hide resolved
src/v/storage/log_manager.h Outdated Show resolved Hide resolved
src/v/raft/consensus.cc Outdated Show resolved Hide resolved
src/v/raft/consensus.cc Show resolved Hide resolved
src/v/raft/consensus.cc Outdated Show resolved Hide resolved
src/v/redpanda/admin/debug.cc Show resolved Hide resolved
src/v/archival/ntp_archiver_service.cc Outdated Show resolved Hide resolved
@andrwng andrwng requested a review from bharathv March 16, 2024 00:18
andrwng added 3 commits March 21, 2024 10:43
The offset translator currently uses the storage::api to interact with
the storage layer. With it moving to the storage layer, it's more
convenient to have it instead use storage classes directly. For
instance, I intend on moving offset_translator into the disk_log_impl,
where storage_api doesn't exist.

This commit makes the translator use the kvstore and storage_resources
directly, and introduces a new constructor that avoids storage_api.
This check is already done by Raft to tell whether it should reset
offset translator state. With the offset translator moving into the
storage layer, it makes sense for this check to be done directly in
storage.

Note that Raft also previously checked the value of
`_last_snapshot_index`. This does not seem required, as the last
snapshot index is only hydrated in hydrate_snapshot() during do_start()
after Raft checks whether the log is new.
Raft currently has the capability of force truncating the offset
translator state when there is a gap in the log, using
offset_translator::prefix_truncate_reset(). It does this in the
unexpected event that the kvstore doesn't match with the contents of the
log.

Rather than having this special case of offset-translator truncation,
this commit introduces an truncation option to allow for this forced
truncation, expecting the caller to supply an offset delta to use.

This commit replaces the implementation so that prefix_truncate_reset()
is just a call to the more expected prefix_truncate(). This will make it
easier to have storage::disk_log_impl own the offset translator, without
having to worry about special Raft-specific usage.
@andrwng andrwng force-pushed the storage-offset-translation branch from e5eb21f to 4f8cd41 Compare March 21, 2024 17:43
@andrwng
Copy link
Contributor Author

andrwng commented Mar 21, 2024

Latest push was another rebase to fix a conflict

andrwng added 13 commits March 21, 2024 11:32
This commit instantiates a currently unused instance of a
raft::offset_translator in the disk_log_impl. This will be required as
offset translation is pulled into the storage layer.

This change alone, while mostly non-functional, at least allows
subsequent commits to make further changes that depend on the new
storage-layer translator without switching over to using it.
Long term, this will help us avoid exposing implementation details like
the offset_translator_state to users of the log.

Note that since the storage-layer translator isn't plugged into
anything, this commit alone means that the new interface always returns
incorrect results.
The storage layer shouldn't make decisions about what types are
translated and what arent. This context currently comes from Raft; as
such this adds some plumbing for that information to make its way down
to storage.
Some tests use the translator in the log builder to do translation, and
will need to be able to supply a reasonable set of batch types to
translate.

Since the on-disk storage of the offset translator is in the kvstore
keyed by group_id, this also adds an optional group_id to the builder.
Adds a new start() method to the disk_log_impl intended for use by Raft.
For the disk_log_impl, this initializes the offset_translator with any
state that may exist from the Raft layer (snapshots, kv-store, etc).

For reference, follow the code paths in consensus::do_start() for all
calls into the offset translator and log, and that is what
disk_log_impl::start() is meant to encapsulate.
An upcoming change will rewrite the initial snapshot hydration at
startup to use the new log::start() interface. To that end, this pulls
some logic from snapshot hydration into some reusable methods.

Note, there is a mechanical difference in behavior that in practice will
result in the same behavior:

Previously, we always truncated the offset translator after adding the
latest config to the config manager, and then conditionally truncated
the log (roughly, we wouldn't truncate if this is the controller
partition).

With this change, the truncation of the offset translator will only
happen if the log is also going to be truncated. In practice, this
behavioral change doesn't matter because not truncating the log implies
this is the controller partition, which doesn't translate offsets
anyway (its translated batch types is empty).
This commit moves the management code of the lifecycle of the
offset_translator from raft::consensus to storage::disk_log_impl.

Truncation and prefix truncation are straightforward: corresponding
calls into the offset translator are simply now done in the storage
layer.

What's trickier is the initial loading of offset translator state, which
happens at Raft start time, and pulls metadata from the Raft snapshot
and the kv store. To accommodate this, this commit refactors some
methods for reuse to rejigger consensus::start() in a way that preserves
the original behavior, but defers offset translator initialization and
any initial truncation to the storage layer via log::start(), introduced
in an earlier commit.

Most callers still access the offset translator via Raft. A subsequent
commit will make the shift to begin using the storage interface.
Replaces the computation of a delta for truncation with the generation
of the truncation config that would be created with that delta. This is
more intuitive, since it's clearer the logic is for a truncation.
In the effort to have storage own offset translation, this commit
removes many direct usages of offset_translator_state, opting instead
for the storage::log::delta and to/from_log_offset interfaces. By not
exposting the translator as an implementation detail, this will allow
the storage implementation to evolve more easily.

Also removes the direct interface available in Raft.

Note, this doesn't remove all usages of the offset translator state:
offset_translator_state is still baked into the interfaces required to
translate aborted transaction. Until those are adapted to use the
storage interfaces, the offset_translator_state will remain as a part of
the storage::log interface.
The test previously manually truncated the log without truncating Raft.
Now that the offset translator actions are tied to the log, this meant
that offset translator state couldn't translate the Raft start offset.

This updates the test to use the more complete partition prefix
truncation, which also truncates Raft.
Offset translator state is stored in the kvstore and keyed by group id.
A default argument is provided for logs' group ids, which makes this
somewhat a footgun. To mitigate, this adds a check that when a set of
types are provided to be translated, a valid group id is also passed.

Note that when there are no translated types, offset translation doesn't
occur and this shouldn't matter.
Just delta() isn't necessarily clear what it is a delta of.
@andrwng andrwng force-pushed the storage-offset-translation branch from 4f8cd41 to ba937fd Compare March 21, 2024 18:32
@andrwng
Copy link
Contributor Author

andrwng commented Mar 21, 2024

Latest push was to add raft/fundamental.h to storage/log.h, now that the impl needs a group_id

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like it is pretty much ready to go. just some questions here, but nothing that i'm worried about

src/v/raft/consensus.cc Show resolved Hide resolved
src/v/raft/consensus.cc Outdated Show resolved Hide resolved
src/v/storage/disk_log_impl.cc Show resolved Hide resolved
@@ -2509,6 +2516,11 @@ ss::future<> disk_log_impl::truncate_prefix(truncate_prefix_config cfg) {
})
.discard_result();
});
// We truncate the segments before truncating offset translator to wait for
// readers that started reading from the start of the log before we advanced
// the start offset and thus can still need offset translation info.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why doesn't the same logic about failure scenarios (described in the comment below in ::truncate() apply here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the truncate() comment is related to how we sync the offset translator state with the log at startup. Basically, we'll read through the log starting right after the end of what we have in our offset translator state, and then use that to rebuild state. Truncating the offset translator state is required to allow us to detect this state, since I think it's much harder to detect if the offset translation state passes the end of the log.

For prefix truncation OTOH, prefix truncating the offset translator state last means that our translation state extends below the beginning of the log, which doesn't affect correctness

return _log
->truncate_prefix(storage::truncate_prefix_config(
model::next_offset(_last_snapshot_index), _scheduling.default_iopc))
model::next_offset(_last_snapshot_index),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is invoked by hydrate_snapshot, but there the prefix truncate reset was being called with _last_snapshot_index rather than next_offset(_last_snapshot_index). is that a material difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the log impl's prefix truncate will call offset_translator_reset() with prev(truncate_at), which accounts for the difference, unless you noticed that and there's still a material difference

src/v/storage/log_manager.h Outdated Show resolved Hide resolved
Comment on lines 1389 to 1413
std::optional<storage::truncate_prefix_config> start_truncate_cfg;
auto snapshot_units = co_await _snapshot_lock.get_units();
auto metadata = co_await read_snapshot_metadata();
if (metadata.has_value()) {
load_from_metadata(metadata.value());
co_await _configuration_manager.add(
_last_snapshot_index, std::move(metadata->latest_configuration));
_probe->configuration_update();

auto delta = delta_for_truncation(metadata.value());
if (delta.has_value()) {
start_truncate_cfg = storage::truncate_prefix_config(
model::next_offset(_last_snapshot_index),
_scheduling.default_iopc,
delta);

_flushed_offset = std::max(
_last_snapshot_index, _flushed_offset);
co_await _configuration_manager.prefix_truncate(
_last_snapshot_index);
}
_snapshot_size = co_await _snapshot_mgr.get_snapshot_size();
}
co_await _log->start(start_truncate_cfg);
snapshot_units.return_all();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding the commit message from the pervious commit

An upcoming change will rewrite the initial snapshot hydration at
startup to use the new log::start() interface. To that end, this pulls
some logic from snapshot hydration into some reusable methods.

this is the user of those reusable methods? i guess i was a bit surprised that the logic here is different than what hydrate_snapshot had been doing. what is core difference? is it just that you need to do some of the rehydration before the log starts, because starting the log depends on state from the snapshot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. We hydrate a bit more state and do a few more things, notably starting and syncing the offset translator state in this call to start().

@andrwng
Copy link
Contributor Author

andrwng commented Mar 21, 2024

CI failure: #11269

@andrwng andrwng requested a review from ztlpn March 22, 2024 06:20
@andrwng andrwng merged commit 70ffeb0 into redpanda-data:dev Mar 22, 2024
17 checks passed
andrwng added a commit to andrwng/redpanda that referenced this pull request Mar 26, 2024
Now that the build pieces have landed for storage:: taking ownership of
the offset_translator, this updates its namespace to follow suit.

This is follow-up to redpanda-data#16892
andrwng added a commit to andrwng/redpanda that referenced this pull request Mar 26, 2024
Now that the build pieces have landed for storage:: taking ownership of
the offset_translator, this updates its namespace to follow suit.

This is follow-up to redpanda-data#16892
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants