Retry tasks during schedule phase of active listener #1349

Open. Wants to merge 1 commit into base: main.
Conversation

@allada (Member) commented Sep 13, 2024

If an item stays in the schedule phase for too long, it will now be retried after a fixed amount of time to ensure any downstream database hasn't forgotten about it. This addresses a known issue in Redis, where the secondary index might not index a task if the index was created while the task was being inserted (a Redis bug). For backends without this kind of issue, the retry should be mostly benign.
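To make the mechanism concrete, here is a minimal sketch of the pattern, not the PR's actual code: `TaskEvent`, `listen_with_retry`, and the channel-based event source are hypothetical stand-ins, and a Tokio runtime is assumed.

    use std::time::Duration;
    use tokio::sync::mpsc;
    use tokio::time::sleep;

    // Hypothetical stand-in for the scheduler's real state-change events.
    #[derive(Debug)]
    struct TaskEvent;

    async fn listen_with_retry(
        mut events: mpsc::Receiver<TaskEvent>,
        no_event_action_timeout: Duration,
    ) {
        let mut timeouts_triggered = 0u32;
        loop {
            tokio::select! {
                maybe_event = events.recv() => {
                    match maybe_event {
                        // A real event arrived; handle it as usual.
                        Some(event) => println!("got event: {event:?}"),
                        // All senders dropped; the listener is finished.
                        None => break,
                    }
                }
                _ = sleep(no_event_action_timeout) => {
                    // No event within the window: assume the downstream
                    // database may have lost the task and retry the
                    // schedule action instead of waiting forever.
                    timeouts_triggered += 1;
                    println!("retrying schedule, timeout #{timeouts_triggered}");
                }
            }
        }
    }

The fixed sleep turns the retry into a periodic safety net rather than an error path, which is why it stays mostly benign for backends that never lose a task: they only see an occasional redundant re-check.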



@adam-singer (Contributor) left a comment


:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, all files reviewed, CI pending on several platforms, and 2 discussions need to be resolved.


nativelink-scheduler/src/simple_scheduler_state_manager.rs line 253 at r1 (raw file):

                _ = (self.now_fn)().sleep(self.no_event_action_timeout) => {
                    // Timeout happened, do additional checks below.
                    timeouts_triggered += 1;

Would a trace-level log statement here be too much overhead? I'm concerned that if we ever needed to debug a timeout, we might not have visibility into it. Does that state percolate out anywhere else?
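For reference, a trace statement there might look like the sketch below; it assumes the `tracing` facade is available, and the field names are illustrative rather than taken from the PR.

    _ = (self.now_fn)().sleep(self.no_event_action_timeout) => {
        // Timeout happened, do additional checks below.
        timeouts_triggered += 1;
        // Hypothetical trace-level log so timeouts are visible when debugging.
        tracing::trace!(
            timeouts_triggered,
            "no event within no_event_action_timeout; re-checking task"
        );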


nativelink-scheduler/src/simple_scheduler_state_manager.rs line 416 at r1 (raw file):

                .checked_add(self.no_event_action_timeout)
                .err_tip(|| {
                    "Timestamp too big in SimpleSchedulerStateManager::timeout_operation_id"

Let's capture the operation id value in this error tip.
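As a sketch of that suggestion (assuming `err_tip` accepts a closure returning anything `Into<String>`, and that an `operation_id` implementing `Display` is in scope at this point):

    .checked_add(self.no_event_action_timeout)
    .err_tip(|| {
        // Include the operation id so the error identifies which
        // operation hit the timestamp overflow.
        format!(
            "Timestamp too big in SimpleSchedulerStateManager::timeout_operation_id for operation {operation_id}"
        )
    })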
