Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce errors sent to broker #919

Closed
ansoncfit opened this issue Dec 19, 2023 · 2 comments
Closed

Reduce errors sent to broker #919

ansoncfit opened this issue Dec 19, 2023 · 2 comments

Comments

@ansoncfit
Copy link
Member

When workers encounter an exception on a regional task, they send a String error message back to the broker: https://github.com/conveyal/r5/blob/v7.0/src/main/java/com/conveyal/r5/analyst/cluster/AnalysisWorker.java#L294-L300

Many workers sending many error messages at once can overwhelm the broker. If there is an exception, it will likely affect or invalidate all origins, so most of these messages are duplicative. Some ideas for revising worker behavior:

  • Only report one error per batch of tasks received
  • Stop returning results on a job after a certain number of errors have been reported
@abyrd
Copy link
Member

abyrd commented Dec 20, 2023

It would be possible to limit error reporting on the worker side. But the workers are intended to report at most one batch (4*n_cores) of errored tasks before the backend stops delivering tasks to them. The fact that worker error reporting became a problem revealed an underlying problem in job cancellation (see #887 and #921).

Workers do not currently associate tasks with the batch they were received in, so the first suggestion would add a bit of complexity. The second suggestion would only involve keeping a set of errored jobs, which could be global to the worker or just one set local to each polling operation (I'm now realizing maybe this is what you meant by your first suggestion). This would be relatively simple, but it does add one more step to the reporting process (thus one more thing that can go wrong).

I'm inclined to just get the job cancellation working properly, in which case we should only ever get one batch of errors before an errored job stops. We could then reinforce it with the second suggestion (perhaps limited to the scope of a single polling operation) if we feel there's still risk to be managed.

@abyrd
Copy link
Member

abyrd commented Jan 3, 2024

Closing this because we merged #918 to the default branch. In the future I'll try to more regularly set the associated issues on pull requests, which should auto-close them upon merge.

@abyrd abyrd closed this as completed Jan 3, 2024
christophfink added a commit to DigitalGeographyLab/r5 that referenced this issue Jan 16, 2024
* fix comment on NETWORK_FORMAT_VERSION

* Do not record opportunities 120 minutes away

* Fix freeform guardrails

Correct logic for checking oneToOne analyses
Check limit on number of destinations for path analyses in broker

* only record job when assembler created, fixes conveyal#887

* record errors (stopping job) before other checks

addresses conveyal#887

* report worker errors as much shorter stack trace

move filterStackTrace method into shared utility class

* record only one error per job in broker

addresses conveyal#919 and conveyal#887

* Add final modifier to PathResult constants

* Use specific AnalysisServerException

BAD_REQUEST instead of UNKNOWN

* Filter stack traces sent to UI

* build filtered stack trace directly from throwable

conveyal#918 (comment)

* Update to 2020 Census geometries

* Update seamless-census test fixtures

with 2020 Census geometries

* do not include stacktrace in message

rethrow AnalysisServerException since they already have clear messages

* Implement Timo Jaakkonen’s crossing delays (#4), upload package to DGL GitHub organisation

* Jaakkonen (2013)’s values

* float -> int delays

* let’s try this

* enough for today

* too many changes for such a small patch 😬

* fixed test

* shield against undefined date

* 😬

* am I allowed to push to DGL’s packages?

* linted, moved package destination to my own fork

* moved repo to fork

* build package on pull request, publish to dgl’s repo

* build package on pull request, publish to dgl’s repo

* build package on pull request, publish to dgl’s repo

* build package on pull request, publish to dgl’s repo

* build package on pull request, publish to dgl’s repo

* Run on older image, as ubuntu-latest has gradle-8, which fails (#5)

* Freeze gradle version (#6)

* Run on older image, as ubuntu-latest has gradle-8, which fails

* freeze gradle version

* fix build workflow  (#7)

* Run on older image, as ubuntu-latest has gradle-8, which fails

* freeze gradle version

* refactor workflow, correct repo package address

* fix build workflow (#8)

* Run on older image, as ubuntu-latest has gradle-8, which fails

* freeze gradle version

* refactor workflow, correct repo package address

* clean-up

* clean-up

* token

* print debug info

* configurable (per-repo) package registry (#9)

* Run on older image, as ubuntu-latest has gradle-8, which fails

* freeze gradle version

* refactor workflow, correct repo package address

* clean-up

* clean-up

* token

* print debug info

* flexible package repo address

* remove debug statement

* configurable package registry

* debugging GH actions

* debugging ii

* fix build system (#10)

* Run on older image, as ubuntu-latest has gradle-8, which fails

* freeze gradle version

* refactor workflow, correct repo package address

* clean-up

* clean-up

* token

* print debug info

* flexible package repo address

* remove debug statement

* configurable package registry

* debugging GH actions

* debugging ii

* asfd

* remove debug statements

* read custom OSM tags, use them for bike routing (#11)

* read custom OSM tags, use them for bike routing

* allocate memory

* write on correct edge

* Latest 7.x.x gradle

* Fix bicycle speed tag reading

* Allow publish step to fail (in pull requests)

* Add default speeds that represent the speed limits on Finnish roads (#13)

* revert Jaakkola implementation (#16)

* disable turning penalties for bicycle and car, do not apply perceived length multiplier for bicycle routing (#17)

* Use per-edge speeds for cycling (#19)

* Use per-edge speeds for cycling

* Timo Jakkola’s times (#21)

* save shapes of GTFS patterns (#22)

* 23 roll back jaakkola congestion timings (#24)

* save shapes of GTFS patterns

* Roll back Jaakkola congestion penalties

* make MAX_PATH_DESTINATIONS *not* final

---------

Co-authored-by: abyrd <[email protected]>
Co-authored-by: ansons <[email protected]>
Co-authored-by: Anson Stewart <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants