Retry tasks during schedule phase of active listener #1349

Open. Wants to merge 1 commit into base: main.
Conversation

@allada (Member) commented Sep 13, 2024

If an item stays in the schedule phase for too long, it will now be retried after a fixed amount of time to ensure any downstream database hasn't forgotten about it. This addresses a known issue in Redis, where the secondary index might not index a task if the index was created while the task was being inserted (a Redis bug). For backends without this kind of issue, the retry should be mostly benign.
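To make the mechanism concrete, here is a minimal sketch of the pattern, not the PR's actual code: `TaskEvent`, `listen_with_retry`, and the channel-based event source are hypothetical stand-ins, and a Tokio runtime is assumed.

    use std::time::Duration;
    use tokio::sync::mpsc;
    use tokio::time::sleep;

    // Hypothetical stand-in for the scheduler's real state-change events.
    #[derive(Debug)]
    struct TaskEvent;

    async fn listen_with_retry(
        mut events: mpsc::Receiver<TaskEvent>,
        no_event_action_timeout: Duration,
    ) {
        let mut timeouts_triggered = 0u32;
        loop {
            tokio::select! {
                maybe_event = events.recv() => {
                    match maybe_event {
                        // A real event arrived; handle it as usual.
                        Some(event) => println!("got event: {event:?}"),
                        // All senders dropped; the listener is finished.
                        None => break,
                    }
                }
                _ = sleep(no_event_action_timeout) => {
                    // No event within the window: assume the downstream
                    // database may have lost the task and retry the
                    // schedule action instead of waiting forever.
                    timeouts_triggered += 1;
                    println!("retrying schedule, timeout #{timeouts_triggered}");
                }
            }
        }
    }

The fixed sleep turns the retry into a periodic safety net rather than an error path, which is why it stays mostly benign for backends that never lose a task: they only see an occasional redundant re-check.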



@adam-singer (Contributor) left a comment


:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, all files reviewed, CI pending on several platforms, and 2 discussions need to be resolved.


nativelink-scheduler/src/simple_scheduler_state_manager.rs line 253 at r1 (raw file):

                _ = (self.now_fn)().sleep(self.no_event_action_timeout) => {
                    // Timeout happened, do additional checks below.
                    timeouts_triggered += 1;

Would a trace-level log statement here be too much overhead? I'm concerned that if we ever needed to debug a timeout, we might not have visibility into it. Does that state percolate out anywhere else?
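For reference, a trace statement there might look like the sketch below; it assumes the `tracing` facade is available, and the field names are illustrative rather than taken from the PR.

    _ = (self.now_fn)().sleep(self.no_event_action_timeout) => {
        // Timeout happened, do additional checks below.
        timeouts_triggered += 1;
        // Hypothetical trace-level log so timeouts are visible when debugging.
        tracing::trace!(
            timeouts_triggered,
            "no event within no_event_action_timeout; re-checking task"
        );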


nativelink-scheduler/src/simple_scheduler_state_manager.rs line 416 at r1 (raw file):

                .checked_add(self.no_event_action_timeout)
                .err_tip(|| {
                    "Timestamp too big in SimpleSchedulerStateManager::timeout_operation_id"

Let's capture the operation id value in this error tip.
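As a sketch of that suggestion (assuming `err_tip` accepts a closure returning anything `Into<String>`, and that an `operation_id` implementing `Display` is in scope at this point):

    .checked_add(self.no_event_action_timeout)
    .err_tip(|| {
        // Include the operation id so the error identifies which
        // operation hit the timestamp overflow.
        format!(
            "Timestamp too big in SimpleSchedulerStateManager::timeout_operation_id for operation {operation_id}"
        )
    })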
