sink to BigQuery error: Request 'AppendRows' from role 'cloud-dataengine-globalrouting' throttled: Task is overloaded (memory-protection) go/tr-t. #17214

Closed
BugenZhao opened this issue Jun 12, 2024 · 3 comments
BugenZhao (Member) commented Jun 12, 2024

A user frequently encountered the following error when backfilling historical data at high throughput and sinking it into downstream BigQuery:

Actor 114514 exited unexpectedly: Executor error: Sink error:
  BigQuery error:
    status: Unavailable,
    message: "Request 'AppendRows' from role 'cloud-dataengine-globalrouting' throttled:
      Task is overloaded (memory-protection) go/tr-t.",
    details: [],
    metadata: MetadataMap { headers: {} };

The error indicates that the request is being throttled by the external system. Instead of throwing an error and causing the actor to fail, we should retry the request.
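For illustration, here is a minimal sketch (not RisingWave's actual sink code) of classifying such a failure as retryable, based on the gRPC status code and the throttling hint in the message above; the helper name is hypothetical:

```rust
use tonic::{Code, Status};

/// Hypothetical helper: decide whether a failed `AppendRows` call hit a
/// transient condition (like the memory-protection throttling above) and
/// should be retried instead of failing the actor.
fn is_retryable_append_error(status: &Status) -> bool {
    match status.code() {
        // Transient server-side conditions: retry with backoff.
        Code::Unavailable | Code::ResourceExhausted | Code::Aborted => true,
        // Some throttling errors surface as `Internal`; check the message.
        Code::Internal => status.message().contains("throttled"),
        _ => false,
    }
}
```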

After enabling sink_decouple, the error is retried by the log store writer and no longer triggers cluster recovery. However, the effective write throughput turns out to be quite limited: the user has to frequently rescale, or pause and resume the cluster, to increase the throughput, which is unexpected.


Questions:

  1. Shall we have connector-specific retry logic for such throttling errors? (See the sketch after this list for one possible shape.)
  2. Why does pausing and resuming the cluster increase the throughput?
  3. There seems to be no stress test that reflects the production workload. Shall we improve this? The same applies to other connectors.
  4. Based on the documentation of the BigQuery Write API:
    • Shall we adopt the "pending type" stream instead of the default stream (to potentially improve performance)?
    • What can we do to minimize the possibility of hitting the quota / rate limit?
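For question 1, a minimal sketch of what connector-specific retrying could look like, assuming a hypothetical `append_rows` closure that performs one AppendRows RPC and reusing the `is_retryable_append_error` helper sketched above; the backoff parameters are illustrative, not tuned values:

```rust
use std::time::Duration;
use tonic::Status;

const MAX_RETRIES: u32 = 5;
const BASE_DELAY: Duration = Duration::from_millis(500);

/// Retry a single AppendRows request with exponential backoff on transient errors.
async fn append_rows_with_retry<F, Fut>(mut append_rows: F) -> Result<(), Status>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), Status>>,
{
    let mut attempt = 0;
    loop {
        match append_rows().await {
            Ok(()) => return Ok(()),
            Err(status) if attempt < MAX_RETRIES && is_retryable_append_error(&status) => {
                // Exponential backoff: 0.5s, 1s, 2s, 4s, 8s.
                tokio::time::sleep(BASE_DELAY * 2u32.pow(attempt)).await;
                attempt += 1;
            }
            Err(status) => return Err(status),
        }
    }
}
```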
fuyufjh (Member) commented Jun 12, 2024

Perhaps our write throughput/frequency is too high for these OLAP systems. I think we may want to always use sink_decouple and batch writes to reduce the load on the target system.

Also +1 for retrying on these transient errors.
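To make the batching idea above concrete, a minimal sketch (the structure and thresholds are illustrative assumptions, not the actual sink implementation): buffer serialized rows and only issue an AppendRows request once a row-count or byte-size threshold is reached, which reduces request frequency to BigQuery.

```rust
/// Illustrative row batcher: accumulate serialized rows and flush them as one
/// AppendRows request once a size threshold is reached.
struct AppendBatcher {
    rows: Vec<Vec<u8>>, // serialized rows waiting to be sent
    bytes: usize,
    max_rows: usize,
    max_bytes: usize,
}

impl AppendBatcher {
    fn new(max_rows: usize, max_bytes: usize) -> Self {
        Self { rows: Vec::new(), bytes: 0, max_rows, max_bytes }
    }

    /// Buffer one row; returns a full batch to send when a threshold is hit.
    fn push(&mut self, row: Vec<u8>) -> Option<Vec<Vec<u8>>> {
        self.bytes += row.len();
        self.rows.push(row);
        if self.rows.len() >= self.max_rows || self.bytes >= self.max_bytes {
            self.bytes = 0;
            Some(std::mem::take(&mut self.rows))
        } else {
            None
        }
    }
}
```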

xxhZs (Contributor) commented Jun 12, 2024

So it seems that implementing decoupled commit for BigQuery can solve this problem?

xxhZs (Contributor) commented Jun 19, 2024

Retry added in #17237.

xxhZs closed this as completed Jun 19, 2024