
bug: frontend needs restart to recover from breakdown #9239

Closed
zwang28 opened this issue Apr 17, 2023 · 6 comments
Assignees
Labels
component/batch Batch related issue. type/bug Something isn't working

Comments

zwang28 (Contributor) commented Apr 17, 2023

Describe the bug

We encountered a frontend breakdown, as shown below:

dev=> set visibility_mode to all;
SET_VARIABLE
dev=> select * from xxx limit 1;
ERROR:  QueryError: internal error: error trying to connect: deadline has elapsed
dev=> set visibility_mode to all;
SET_VARIABLE
dev=> select * from xxx limit 1;
ERROR:  QueryError: internal error: error trying to connect: dns error: failed to lookup address information: Name or service not known

After restarting the frontend, the breakdown is gone.

To Reproduce

No response

Expected behavior

No response

Additional context

Compute nodes have been leaving/joining the cluster. Could this be related to the frontend's compute node client pool?

zwang28 added the type/bug and component/batch labels Apr 17, 2023
github-actions bot added this to the release-0.19 milestone Apr 17, 2023
ZENOTME (Contributor) commented Apr 19, 2023

Is it possible that the following happens 🤔: a compute node leaves but the worker_node_manager has not been updated yet, and then the query tries to get an RPC client for the outdated worker node, so the connection fails.
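The hypothesized failure mode can be sketched in a few lines. This is a minimal illustration, not the actual risingwave_rpc_client API: `ComputeClientPool`, `ComputeClient`, and the method names here are hypothetical. The point is that a pool keyed by worker address keeps handing out a cached client for a node that has left the cluster unless something explicitly evicts it, which would match the reported symptom of the error persisting until restart.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a compute-node RPC client.
#[derive(Clone, Debug, PartialEq)]
struct ComputeClient {
    addr: String,
}

// Minimal client pool keyed by worker address (names illustrative).
struct ComputeClientPool {
    clients: HashMap<String, ComputeClient>,
}

impl ComputeClientPool {
    fn new() -> Self {
        Self { clients: HashMap::new() }
    }

    // Return a cached client, creating one on first use.
    fn get(&mut self, addr: &str) -> &ComputeClient {
        self.clients
            .entry(addr.to_string())
            .or_insert_with(|| ComputeClient { addr: addr.to_string() })
    }

    // Evict the cached client when a worker leaves the cluster (or when a
    // connection attempt fails), so later queries do not keep reusing a
    // stale entry pointing at a node that no longer exists.
    fn invalidate(&mut self, addr: &str) {
        self.clients.remove(addr);
    }

    fn contains(&self, addr: &str) -> bool {
        self.clients.contains_key(addr)
    }
}

fn main() {
    let mut pool = ComputeClientPool::new();
    pool.get("cn-1:5688");
    assert!(pool.contains("cn-1:5688"));

    // Worker cn-1 leaves the cluster. Without this invalidation step the
    // pool would keep returning the dead client on every query.
    pool.invalidate("cn-1:5688");
    assert!(!pool.contains("cn-1:5688"));
    println!("stale client evicted");
}
```

Under this hypothesis, a fix would wire cluster-membership updates (or connection errors) into an eviction path like `invalidate`, instead of relying on a frontend restart to flush the pool.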

zwang28 (Contributor, Author) commented Apr 19, 2023

Is it possible that the following happens 🤔: a compute node leaves but the worker_node_manager has not been updated yet, and then the query tries to get an RPC client for the outdated worker node, so the connection fails.

The error persists for a long time, until the frontend is restarted.

A comment by @ZENOTME was marked as resolved.

ZENOTME (Contributor) commented Apr 21, 2023

Do we have more concrete logs to reproduce this bug? For example, what reschedule happened before a compute node joined or left the cluster?

hzxa21 (Collaborator) commented May 19, 2023

Do we have more concrete logs to reproduce this bug? For example, what reschedule happened before a compute node joined or left the cluster?

cc @zwang28 Can any more information be provided?

Assign to @ZENOTME first. Feel free to reassign.

zwang28 (Contributor, Author) commented May 19, 2023

Can any more information be provided?

I informed @ZENOTME when it occurred again recently.
No other information from my side.

zwang28 removed this from the release-0.19 milestone Jul 14, 2023
zwang28 closed this as not planned Apr 8, 2024
Projects
None yet
Development

No branches or pull requests

3 participants