sink to BigQuery error: Request 'AppendRows' from role 'cloud-dataengine-globalrouting' throttled: Task is overloaded (memory-protection) go/tr-t. #17214
Closed · BugenZhao opened this issue on Jun 12, 2024 · 3 comments
A user frequently encountered the following error when backfilling historical data at high throughput into a downstream BigQuery sink:
Actor 114514 exited unexpectedly: Executor error: Sink error:
BigQuery error:
status: Unavailable,
message: "Request 'AppendRows' from role 'cloud-dataengine-globalrouting' throttled:
Task is overloaded (memory-protection) go/tr-t.",
details: [],
metadata: MetadataMap { headers: {} };
The error indicates that the request is being throttled by the external system. Instead of raising an error and causing the actor to fail, we should retry the request.
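Not the actual sink implementation, but a minimal sketch of what connector-side retry could look like, assuming a hypothetical `append_rows` call that surfaces the gRPC status: classify `Unavailable` as transient and back off exponentially instead of failing the actor.

```rust
use std::time::Duration;

use tonic::{Code, Status};

const MAX_RETRIES: u32 = 5;
const BASE_BACKOFF: Duration = Duration::from_millis(500);

/// Retry `AppendRows` on transient throttling instead of failing the actor.
async fn append_rows_with_retry(rows: &[Row]) -> Result<(), Status> {
    let mut attempt = 0;
    loop {
        match append_rows(rows).await {
            Ok(()) => return Ok(()),
            // `Unavailable` covers throttling responses like
            // "Task is overloaded (memory-protection)": back off and retry.
            Err(status) if status.code() == Code::Unavailable && attempt < MAX_RETRIES => {
                attempt += 1;
                tokio::time::sleep(BASE_BACKOFF * 2u32.pow(attempt)).await;
            }
            // Other errors (or exhausted retries) still propagate to the caller.
            Err(status) => return Err(status),
        }
    }
}

// Hypothetical placeholders so the sketch stands alone; the real sink would
// call the BigQuery Storage Write API here.
struct Row;
async fn append_rows(_rows: &[Row]) -> Result<(), Status> {
    Ok(())
}
```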
By enabling sink_decouple, the error is retried by the log store writer and no longer triggers cluster recovery. However, it turns out that the effective write throughput is quite limited: the user has to frequently rescale, or pause and resume, the cluster to increase the throughput, which is odd.
Questions:
1. Shall we have connector-specific retry logic for such throttling errors?
2. Why does pausing and resuming increase the throughput?
3. There seems to be no stress test that reflects the production workload. Shall we improve this? The same applies to other connectors.
Perhaps our write throughput/frequency is too high for these OLAP systems. I think we may always enable sink_decouple and batch writes to reduce the load on the target system.
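As a rough illustration of the batching idea (with hypothetical `Row` and `append_rows` placeholders and made-up thresholds): buffer incoming rows and flush them in larger, less frequent `AppendRows` requests so the request rate seen by the target system drops.

```rust
use std::time::{Duration, Instant};

const MAX_BATCH_ROWS: usize = 1024;
const MAX_BATCH_DELAY: Duration = Duration::from_millis(200);

struct Batcher {
    buffer: Vec<Row>,
    last_flush: Instant,
}

impl Batcher {
    fn new() -> Self {
        Self { buffer: Vec::new(), last_flush: Instant::now() }
    }

    /// Buffer a row; flush only once the batch is large or old enough,
    /// so each `AppendRows` request carries more data.
    async fn write(&mut self, row: Row) -> Result<(), tonic::Status> {
        self.buffer.push(row);
        if self.buffer.len() >= MAX_BATCH_ROWS || self.last_flush.elapsed() >= MAX_BATCH_DELAY {
            self.flush().await?;
        }
        Ok(())
    }

    async fn flush(&mut self) -> Result<(), tonic::Status> {
        if !self.buffer.is_empty() {
            append_rows(&self.buffer).await?;
            self.buffer.clear();
        }
        self.last_flush = Instant::now();
        Ok(())
    }
}

// Hypothetical placeholders so the sketch stands alone.
struct Row;
async fn append_rows(_rows: &[Row]) -> Result<(), tonic::Status> {
    Ok(())
}
```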