The first query failed right after the RW started #15730

Closed
cyliu0 opened this issue Mar 18, 2024 · 8 comments

Comments

@cyliu0
Collaborator

cyliu0 commented Mar 18, 2024

Describe the bug

We have hit this error many times on the first query of our daily and weekly pipelines.
image

I don't think it's an environmental issue, because every pipeline failed only at its first query, right after the RW cluster started. It's also highly reproducible: it happened in many of our builds.

We filed #15650 for this earlier, and it stayed fixed for only one or two days.

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3300
https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3302
https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3304
https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3303
https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3307
https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3311

Error message/log

ERROR:  Failed to run the query
Caused by these errors (recent errors listed first):
  1: gRPC request to meta service failed: Internal error
  2: failed to inject barrier

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240314

Additional context

No response

@fuyufjh
Member

fuyufjh commented Mar 18, 2024

Might be caused by source_manager: new discovered splits and the consequent recovery.

The frontend error happened at 2024-03-18T01:22:11 (+08:00), which is 2024-03-17T17:22:11 UTC, the same second as the meta logs further down.

2024-03-18T01:22:11+08:00	WITH ( connector = 'blackhole', type = 'append-only' );}: risingwave_frontend::session: failed to handle sql error=gRPC request to meta service failed: Internal error: failed to inject barrier sql=CREATE SINK nexmark_q5 AS SELECT AuctionBids.auction, AuctionBids.num FROM (SELECT bid.auction, count(*) AS num, window_start AS starttime FROM HOP(bid, date_time, INTERVAL '2' SECOND, INTERVAL '10' SECOND) GROUP BY bid.auction, window_start) AS AuctionBids JOIN (SELECT max(CountBids.num) AS maxn, CountBids.starttime_c FROM (SELECT count(*) AS num, window_start AS starttime_c FROM HOP(bid, date_time, INTERVAL '2' SECOND, INTERVAL '10' SECOND) GROUP BY bid.auction, window_start) AS CountBids GROUP BY CountBids.starttime_c) AS MaxBids ON AuctionBids.starttime = MaxBids.starttime_c AND AuctionBids.num >= MaxBids.maxn EMIT ON WINDOW CLOSE WITH (connector = 'blackhole', type = 'append-only')

Meanwhile, in Meta:

2024-03-17T17:22:11.147377535Z  INFO risingwave_meta::stream::source_manager: spawning new watcher for source 1
2024-03-17T17:22:11.147484269Z  INFO risingwave_connector::source::kafka: kafka client starts with: ClientConfig { conf_map: {"bootstrap.servers": "benchmark-kafka.nexmark-bs-watermark-daily-20240317:9092", "isolation.level": "read_committed", "enable.sasl.oauthbearer.unsecure.jwt": "true"}, log_level: Debug }
2024-03-17T17:22:11.147503376Z  INFO risingwave_connector::source::kafka::private_link: [consumer] rewrite map None
2024-03-17T17:22:11.250509584Z  INFO risingwave_common_metrics::monitor::connection: monitoring connector hyper::client::connect::http::HttpConnector<risingwave_common_metrics::monitor::connection::MonitoredGaiResolver> with type grpc-stream-client
2024-03-17T17:22:11.251822713Z DEBUG hyper::client::connect::http: connecting to 10.0.32.69:5688
2024-03-17T17:22:11.252193991Z DEBUG hyper::client::connect::http: connected to 10.0.32.69:5688
2024-03-17T17:22:11.265197369Z  INFO risingwave_meta::stream::source_manager: new discovered splits fragment_id=4 new_discovered_splits={"0", "1", "2", "3", "4", "5", "6", "7"}
2024-03-17T17:22:11.265804758Z  INFO failure_recovery{error=unconnected worker node: Some(HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }) prev_epoch=6125055504023552}: risingwave_meta::barrier::recovery: recovery start!
2024-03-17T17:22:11.267691005Z  INFO risingwave_meta::manager::catalog::fragment: cleaning dirty downstream merge nodes for table sink
2024-03-17T17:22:11.269470439Z  INFO risingwave_meta::manager::sink_coordination::manager: sink manager worker start cleaning up
2024-03-17T17:22:11.269478734Z  INFO risingwave_meta::manager::sink_coordination::manager: sink manager worker finished cleaning up
2024-03-17T17:22:11.269482556Z  INFO failure_recovery{error=unconnected worker node: Some(HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }) prev_epoch=6125055504023552}: risingwave_meta::manager::sink_coordination::manager: successfully stop coordinator: None
2024-03-17T17:22:11.269487417Z  INFO failure_recovery{error=unconnected worker node: Some(HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }) prev_epoch=6125055504023552}: risingwave_meta::barrier::recovery: recovering mview progress
2024-03-17T17:22:11.288739243Z  INFO failure_recovery{error=unconnected worker node: Some(HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }) prev_epoch=6125055504023552}: risingwave_meta::barrier::recovery: recovered mview progress
2024-03-17T17:22:11.289951092Z  INFO failure_recovery{error=unconnected worker node: Some(HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }) prev_epoch=6125055504023552}: risingwave_meta::barrier::recovery: recovery success epoch=6125055592955904 paused=None
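
The frontend log line is stamped with a `+08:00` offset while the meta log lines are in UTC, so the two are easy to misread as a day apart. A minimal sketch for lining them up, using timestamps copied from the logs above (the meta timestamp is truncated to microseconds, since Python's `datetime` does not keep nanoseconds):

```python
from datetime import datetime, timezone

# Frontend timestamp of the failed CREATE SINK (local offset +08:00).
frontend_ts = datetime.fromisoformat("2024-03-18T01:22:11+08:00")

# Meta timestamp of "recovery start!", truncated to microseconds (UTC).
meta_recovery_ts = datetime.fromisoformat("2024-03-17T17:22:11.265804+00:00")

# Normalize the frontend timestamp to UTC before comparing.
frontend_utc = frontend_ts.astimezone(timezone.utc)
gap_seconds = (meta_recovery_ts - frontend_utc).total_seconds()

print(frontend_utc.isoformat())  # 2024-03-17T17:22:11+00:00
print(gap_seconds)               # sub-second gap between the two events
```

The frontend timestamp only has second resolution, so the gap is only meaningful to within one second.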

Explore-logs-2024-03-18 14_00_50.txt
Explore-logs-2024-03-18 14_00_18.txt

Explore more in Grafana

@fuyufjh
Member

fuyufjh commented Mar 18, 2024

Just confirmed with @tabVersion that new discovered splits should not cause recovery. However, there aren't any other errors in the log. 🤔

Also note that the new discovered splits message was output by source_manager, which means the problem might not be related to the currently executing CREATE SINK; the recovery may have just happened to interrupt this query.

cc. @shanicky as well

@tabVersion
Contributor

> Just confirmed with @tabVersion that new discovered splits should not cause recovery. However, there aren't any other errors in the log. 🤔
>
> Also note that the new discovered splits message was output by source_manager, which means the problem might not be related to the currently executing CREATE SINK; the recovery may have just happened to interrupt this query.
>
> cc. @shanicky as well

One more strange thing is that new_discovered_splits contains {"0", "1", "2", "3", "4", "5", "6", "7"} which seems to be all splits belonging to the topic. It means RisingWave did not discover ANY split when creating the source.
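
To illustrate why a full set looks suspicious, here is a minimal sketch of split diffing. It is not RisingWave's actual source_manager code; `newly_discovered` and both scenarios are hypothetical and only model "discovered minus already assigned":

```python
from __future__ import annotations


def newly_discovered(assigned: set[str], discovered: set[str]) -> set[str]:
    """Splits reported by the connector that are not assigned yet (hypothetical helper)."""
    return discovered - assigned


# All eight partitions of the topic, as listed in new_discovered_splits above.
topic_splits = {str(i) for i in range(8)}

# Hypothetical case 1: every split was already assigned when the source was
# created, so a later discovery tick has nothing new to report.
print(sorted(newly_discovered(topic_splits, topic_splits)))

# Hypothetical case 2: nothing was assigned at creation time, so the first
# discovery tick reports the whole topic as new, which matches the log line.
print(sorted(newly_discovered(set(), topic_splits)))
```

Under this model, a diff equal to the whole topic can only come from an empty assigned set.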

@wenym1
Copy link
Contributor

wenym1 commented Mar 18, 2024


2024-03-16T16:24:17.632358367Z ERROR risingwave_meta::barrier::rpc: fail to resolve worker node address attempt=1 backoff_delay=500ms err=failed to create RPC client to benchmark-risingwave-compute-c-0.benchmark-risingwave-compute:5688: transport error: error trying to connect: dns error: failed to lookup address information: Name or service not known node_host=HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }
2024-03-16T16:24:18.134110238Z  INFO risingwave_common_metrics::monitor::connection: monitoring connector hyper::client::connect::http::HttpConnector<risingwave_common_metrics::monitor::connection::MonitoredGaiResolver> with type grpc-stream-client
2024-03-16T16:24:18.135005588Z ERROR risingwave_meta::barrier::rpc: fail to resolve worker node address attempt=2 backoff_delay=3s err=failed to create RPC client to benchmark-risingwave-compute-c-0.benchmark-risingwave-compute:5688: transport error: error trying to connect: dns error: failed to lookup address information: Name or service not known node_host=HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }
2024-03-16T16:24:21.135379123Z  INFO risingwave_common_metrics::monitor::connection: monitoring connector hyper::client::connect::http::HttpConnector<risingwave_common_metrics::monitor::connection::MonitoredGaiResolver> with type grpc-stream-client
2024-03-16T16:24:21.136371008Z ERROR risingwave_meta::barrier::rpc: fail to resolve worker node address attempt=3 backoff_delay=3s err=failed to create RPC client to benchmark-risingwave-compute-c-0.benchmark-risingwave-compute:5688: transport error: error trying to connect: dns error: failed to lookup address information: Name or service not known node_host=HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }
2024-03-16T16:24:24.138360427Z  INFO risingwave_common_metrics::monitor::connection: monitoring connector hyper::client::connect::http::HttpConnector<risingwave_common_metrics::monitor::connection::MonitoredGaiResolver> with type grpc-stream-client
2024-03-16T16:24:24.141879918Z ERROR risingwave_meta::barrier::rpc: fail to resolve worker node address attempt=4 backoff_delay=3s err=failed to create RPC client to benchmark-risingwave-compute-c-0.benchmark-risingwave-compute:5688: transport error: error trying to connect: dns error: failed to lookup address information: Name or service not known node_host=HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }
2024-03-16T16:24:27.142651872Z  INFO risingwave_common_metrics::monitor::connection: monitoring connector hyper::client::connect::http::HttpConnector<risingwave_common_metrics::monitor::connection::MonitoredGaiResolver> with type grpc-stream-client
2024-03-16T16:24:27.143557051Z ERROR risingwave_meta::barrier::rpc: fail to resolve worker node address attempt=5 backoff_delay=3s err=failed to create RPC client to benchmark-risingwave-compute-c-0.benchmark-risingwave-compute:5688: transport error: error trying to connect: dns error: failed to lookup address information: Name or service not known node_host=HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }
2024-03-16T16:24:30.144338505Z ERROR risingwave_meta::barrier::rpc: fail to create worker node after retry node_host=HostAddress { host: "benchmark-risingwave-compute-c-0.benchmark-risingwave-compute", port: 5688 }

Got the same DNS error.
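
For a sense of how long meta kept trying before giving up, the `attempt` and `backoff_delay` values in the log above can be summed and compared with the timestamps. A small sketch, with the values copied by hand from the log and truncated to microseconds:

```python
from datetime import datetime

# (attempt, backoff_delay in seconds) as logged by risingwave_meta::barrier::rpc.
backoffs = [(1, 0.5), (2, 3.0), (3, 3.0), (4, 3.0), (5, 3.0)]

# First "fail to resolve worker node address" and the final
# "fail to create worker node after retry" timestamps.
first_error = datetime.fromisoformat("2024-03-16T16:24:17.632358")
gave_up = datetime.fromisoformat("2024-03-16T16:24:30.144338")

total_backoff = sum(delay for _, delay in backoffs)
elapsed = (gave_up - first_error).total_seconds()

print(total_backoff)      # 12.5
print(round(elapsed, 3))  # wall-clock time between the two log lines
```

The two numbers agree to within a few milliseconds, so in this run the compute node's DNS name had to become resolvable within about 12.5 seconds of the first attempt.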

@fuyufjh
Member

fuyufjh commented Mar 18, 2024

> Got the same DNS error.

A DNS error only means that a node was down and other nodes couldn't reach it. Please look for logs from before the node went down.

@wenym1
Contributor

wenym1 commented Mar 18, 2024

> > Got the same DNS error.
>
> A DNS error only means that a node was down and other nodes couldn't reach it. Please look for logs from before the node went down.

This happened during cluster initialization, when the CNs registered with meta and meta tried to create a control stream with each CN. There seems to be some DNS propagation latency in the cluster, so meta failed to resolve the DNS name of the newly registered CN. It might be the same problem as #15650.
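
As an illustration of the failure mode only, and not of anything RisingWave or the benchmark pipeline actually does, a caller can tolerate slow DNS propagation by polling name resolution until a deadline. `wait_for_dns` and its parameters are hypothetical; the only real API used is `socket.getaddrinfo`:

```python
import socket
import time


def wait_for_dns(host, port, timeout_s=60.0, interval_s=1.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll DNS until `host` resolves or `timeout_s` elapses (hypothetical helper).

    `resolve`, `sleep` and `clock` are injectable so the loop can be
    exercised without real DNS or real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return resolve(host, port)
        except socket.gaierror:
            # The name is not resolvable yet; give up only after the deadline.
            if clock() >= deadline:
                raise
            sleep(interval_s)


# Simulated propagation delay: the first two lookups fail, the third succeeds.
calls = {"n": 0}


def fake_resolve(host, port):
    calls["n"] += 1
    if calls["n"] < 3:
        raise socket.gaierror("Name or service not known")
    return [("resolved", host, port)]


result = wait_for_dns("compute-0.example", 5688,
                      resolve=fake_resolve, sleep=lambda _s: None)
print(result, calls["n"])
```

The difference from the log above is a time budget rather than a fixed attempt count, which is one possible design choice when the propagation delay is not known in advance.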

@cyliu0
Collaborator Author

cyliu0 commented Mar 20, 2024

@wenym1 All the scheduled pipelines failed while starting the RW cluster last night, for example https://buildkite.com/risingwave-test/backfill/builds/407#018e584d-7aa2-4415-923b-5f17de0cb641.

The compute node restarted several times, and the RW cluster never became ready.

According to the Grafana logs

The compute node got some warnings.
image

The meta node got some errors.
image

@cyliu0
Collaborator Author

cyliu0 commented Mar 21, 2024

Closing this one since we didn't hit it last night. Were any additional fixes merged for this? @wenym1
