Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to retrieve the new metasrv leader in a safe time window #2496

Closed
WenyXu opened this issue Sep 26, 2023 · 1 comment
Closed

Failed to retrieve the new metasrv leader in a safe time window #2496

WenyXu opened this issue Sep 26, 2023 · 1 comment
Labels
C-bug Category Bugs O-chaos Found by chaos tests

Comments

@WenyXu
Copy link
Member

WenyXu commented Sep 26, 2023

What type of bug is this?

Incorrect result, Unexpected error

What subsystems are affected?

Datanode, Meta

What happened?

2023-09-26T07:15:47.457664476Z stdout F �[2m2023-09-26T07:15:47.457491Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Error while reading heartbeat response �[3merr.msg�[0m�[2m=�[0mh2 protocol error: error reading a body from connection: stream closed because of a broken pipe �[3merr.code�[0m�[2m=�[0mInternal
2023-09-26T07:15:47.457671422Z stdout F �[2m2023-09-26T07:15:47.457612Z�[0m �[32m INFO�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Heartbeat handling loop exit.
2023-09-26T07:15:47.508332234Z stdout F �[2m2023-09-26T07:15:47.508200Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_alter�[0m�[2m:�[0m Schema of region 5527622909952(1287, 0) is altered from 0 to 1
2023-09-26T07:15:47.508417632Z stdout F �[2m2023-09-26T07:15:47.508341Z�[0m �[32m INFO�[0m �[2mmito2::compaction::twcs�[0m�[2m:�[0m Compaction window for region 5527622909952(1287, 0) is not present, inferring from files: 3600
2023-09-26T07:15:47.544975462Z stdout F �[2m2023-09-26T07:15:47.544889Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to send heartbeat to metasrv, error: SendHeartbeat { err_msg: "channel closed", location: Location { file: "/home/weny/Projects/greptimedb-chaos/greptimedb/src/meta-client/src/client/heartbeat.rs", line: 56, column: 14 } }
2023-09-26T07:15:47.549124992Z stdout F �[2m2023-09-26T07:15:47.549091Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to reconnect to metasrv! �[3merr.msg�[0m�[2m=�[0mFailed to initialize meta client �[3merr.code�[0m�[2m=�[0mInternal �[3merr.source�[0m�[2m=�[0merror trying to connect: tcp connect error: Connection refused (os error 111)
2023-09-26T07:15:50.546725726Z stdout F �[2m2023-09-26T07:15:50.546539Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to send heartbeat to metasrv, error: SendHeartbeat { err_msg: "channel closed", location: Location { file: "/home/weny/Projects/greptimedb-chaos/greptimedb/src/meta-client/src/client/heartbeat.rs", line: 56, column: 14 } }
2023-09-26T07:15:50.713988271Z stdout F �[2m2023-09-26T07:15:50.713890Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_alter�[0m�[2m:�[0m Try to alter region: 5527622909952(1287, 0), request: RegionAlterRequest { schema_version: 0, kind: AddColumns { columns: [AddColumn { column_metadata: ColumnMetadata { column_schema: ColumnSchema { name: "i_1", data_type: Int32(Int32Type), is_nullable: true, is_time_index: false, default_constraint: None, metadata: {} }, semantic_type: Field, column_id: 3 }, location: Some(After { column_name: "i" }) }] } }
2023-09-26T07:15:50.713992315Z stdout F �[2m2023-09-26T07:15:50.713972Z�[0m �[33m WARN�[0m �[2mmito2::worker::handle_alter�[0m�[2m:�[0m Ignores alter request, region id:5527622909952(1287, 0), region schema version 1 is greater than request schema version 0
2023-09-26T07:15:54.54444871Z stdout F �[2m2023-09-26T07:15:54.544382Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Try to close region 5484673236992(1277, 0)
2023-09-26T07:15:54.544497386Z stdout F �[2m2023-09-26T07:15:54.544478Z�[0m �[32m INFO�[0m �[2mmito2::region�[0m�[2m:�[0m Stopped region manifest manager, region_id: 5484673236992(1277, 0)
2023-09-26T07:15:54.544555129Z stdout F �[2m2023-09-26T07:15:54.544537Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Region 5484673236992(1277, 0) closed
2023-09-26T07:15:54.544640585Z stdout F �[2m2023-09-26T07:15:54.544620Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Try to close region 5506148073473(1282, 1)
2023-09-26T07:15:54.544667096Z stdout F �[2m2023-09-26T07:15:54.544645Z�[0m �[32m INFO�[0m �[2mdatanode::region_server�[0m�[2m:�[0m Region 5484673236992(1277, 0) is deregistered from engine mito
2023-09-26T07:15:54.544681142Z stdout F �[2m2023-09-26T07:15:54.544665Z�[0m �[32m INFO�[0m �[2mmito2::region�[0m�[2m:�[0m Stopped region manifest manager, region_id: 5506148073473(1282, 1)
2023-09-26T07:15:54.544728511Z stdout F �[2m2023-09-26T07:15:54.544713Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Region 5506148073473(1282, 1) closed
2023-09-26T07:15:54.544732371Z stdout F �[2m2023-09-26T07:15:54.544725Z�[0m �[32m INFO�[0m 
�[2mdatanode::alive_keeper�[0m�[2m:�[0m Region 5484673236992(1277, 0) is closed, result: true. RegionAliveKeeper out.


...

2023-09-26T07:15:55.550208828Z stdout F �[2m2023-09-26T07:15:55.550037Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to reconnect to metasrv! �[3merr.msg�[0m�[2m=�[0mFailed to initialize meta client �[3merr.code�[0m�[2m=�[0mInternal �[3merr.source�[0m�[2m=�[0merror trying to connect: deadline has elapsed
2023-09-26T07:15:55.55032233Z stdout F �[2m2023-09-26T07:15:55.550184Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to send heartbeat to metasrv, error: SendHeartbeat { err_msg: "channel closed", location: Location { file: "/home/weny/Projects/greptimedb-chaos/greptimedb/src/meta-client/src/client/heartbeat.rs", line: 56, column: 14 } }
2023-09-26T07:15:55.593959989Z stdout F �[2m2023-09-26T07:15:55.593812Z�[0m �[32m INFO�[0m �[2mmeta_client::client::heartbeat�[0m�[2m:�[0m Success to create heartbeat stream to server: HeartbeatResponse {
2023-09-26T07:15:55.593970437Z stdout F     header: Some(
2023-09-26T07:15:55.593972115Z stdout F         ResponseHeader {
2023-09-26T07:15:55.593973356Z stdout F             protocol_version: 1,
2023-09-26T07:15:55.593974354Z stdout F             cluster_id: 0,
2023-09-26T07:15:55.593975327Z stdout F             error: None,
2023-09-26T07:15:55.593976316Z stdout F         },
2023-09-26T07:15:55.5939773Z stdout F     ),
2023-09-26T07:15:55.593978337Z stdout F     mailbox_message: None,
2023-09-26T07:15:55.593979401Z stdout F     region_lease: None,
2023-09-26T07:15:55.593980424Z stdout F }

Events:

  1. 2023-09-26T07:15:47.457664476Z Heartbeat stream is broken
  2. 2023-09-26T07:15:47.457671422Z Heartbeat handling loop exit (it didn't re-create heartbeat stream until the next req failed)
  3. 2023-09-26T07:15:47.549124992Z Failed to reconnect to metasrv! Failed to initialize meta client (it seems like it reads the dead leader address, and didn't re-create heartbeat stream until the next req failed)
  4. 2023-09-26T07:15:54.544732371Z AliveKeeper Timeout
  5. 2023-09-26T07:15:55.550208828Z Same as step 3
  6. 2023-09-26T07:15:55.593959989Z Stream is reconnected

Problems

  1. Retrieve the old metasrv address
  2. Only retry once after the heartbeat stream is broken

What operating system did you use?

Ubuntu 23.04

Relevant log output and stack trace

2023-09-26T07:15:47.457664476Z stdout F �[2m2023-09-26T07:15:47.457491Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Error while reading heartbeat response �[3merr.msg�[0m�[2m=�[0mh2 protocol error: error reading a body from connection: stream closed because of a broken pipe �[3merr.code�[0m�[2m=�[0mInternal
2023-09-26T07:15:47.457671422Z stdout F �[2m2023-09-26T07:15:47.457612Z�[0m �[32m INFO�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Heartbeat handling loop exit.
2023-09-26T07:15:47.508332234Z stdout F �[2m2023-09-26T07:15:47.508200Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_alter�[0m�[2m:�[0m Schema of region 5527622909952(1287, 0) is altered from 0 to 1
2023-09-26T07:15:47.508417632Z stdout F �[2m2023-09-26T07:15:47.508341Z�[0m �[32m INFO�[0m �[2mmito2::compaction::twcs�[0m�[2m:�[0m Compaction window for region 5527622909952(1287, 0) is not present, inferring from files: 3600
2023-09-26T07:15:47.544975462Z stdout F �[2m2023-09-26T07:15:47.544889Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to send heartbeat to metasrv, error: SendHeartbeat { err_msg: "channel closed", location: Location { file: "/home/weny/Projects/greptimedb-chaos/greptimedb/src/meta-client/src/client/heartbeat.rs", line: 56, column: 14 } }
2023-09-26T07:15:47.549124992Z stdout F �[2m2023-09-26T07:15:47.549091Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to reconnect to metasrv! �[3merr.msg�[0m�[2m=�[0mFailed to initialize meta client �[3merr.code�[0m�[2m=�[0mInternal �[3merr.source�[0m�[2m=�[0merror trying to connect: tcp connect error: Connection refused (os error 111)
2023-09-26T07:15:50.546725726Z stdout F �[2m2023-09-26T07:15:50.546539Z�[0m �[31mERROR�[0m �[2mdatanode::heartbeat�[0m�[2m:�[0m Failed to send heartbeat to metasrv, error: SendHeartbeat { err_msg: "channel closed", location: Location { file: "/home/weny/Projects/greptimedb-chaos/greptimedb/src/meta-client/src/client/heartbeat.rs", line: 56, column: 14 } }
2023-09-26T07:15:50.713988271Z stdout F �[2m2023-09-26T07:15:50.713890Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_alter�[0m�[2m:�[0m Try to alter region: 5527622909952(1287, 0), request: RegionAlterRequest { schema_version: 0, kind: AddColumns { columns: [AddColumn { column_metadata: ColumnMetadata { column_schema: ColumnSchema { name: "i_1", data_type: Int32(Int32Type), is_nullable: true, is_time_index: false, default_constraint: None, metadata: {} }, semantic_type: Field, column_id: 3 }, location: Some(After { column_name: "i" }) }] } }
2023-09-26T07:15:50.713992315Z stdout F �[2m2023-09-26T07:15:50.713972Z�[0m �[33m WARN�[0m �[2mmito2::worker::handle_alter�[0m�[2m:�[0m Ignores alter request, region id:5527622909952(1287, 0), region schema version 1 is greater than request schema version 0
2023-09-26T07:15:54.54444871Z stdout F �[2m2023-09-26T07:15:54.544382Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Try to close region 5484673236992(1277, 0)
2023-09-26T07:15:54.544497386Z stdout F �[2m2023-09-26T07:15:54.544478Z�[0m �[32m INFO�[0m �[2mmito2::region�[0m�[2m:�[0m Stopped region manifest manager, region_id: 5484673236992(1277, 0)
2023-09-26T07:15:54.544555129Z stdout F �[2m2023-09-26T07:15:54.544537Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Region 5484673236992(1277, 0) closed
2023-09-26T07:15:54.544640585Z stdout F �[2m2023-09-26T07:15:54.544620Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Try to close region 5506148073473(1282, 1)
2023-09-26T07:15:54.544667096Z stdout F �[2m2023-09-26T07:15:54.544645Z�[0m �[32m INFO�[0m �[2mdatanode::region_server�[0m�[2m:�[0m Region 5484673236992(1277, 0) is deregistered from engine mito
2023-09-26T07:15:54.544681142Z stdout F �[2m2023-09-26T07:15:54.544665Z�[0m �[32m INFO�[0m �[2mmito2::region�[0m�[2m:�[0m Stopped region manifest manager, region_id: 5506148073473(1282, 1)
2023-09-26T07:15:54.544728511Z stdout F �[2m2023-09-26T07:15:54.544713Z�[0m �[32m INFO�[0m �[2mmito2::worker::handle_close�[0m�[2m:�[0m Region 5506148073473(1282, 1) closed
2023-09-26T07:15:54.544732371Z stdout F �[2m2023-09-26T07:15:54.544725Z�[0m �[32m INFO�[0m �[2mdatanode::alive_keeper�[0m�[2m:�[0m Region 5484673236992(1277, 0) is closed, result: true. RegionAliveKeeper out.

How can we reproduce the bug?

Follows the internal chaos guides, case: kill-meta-leader

@WenyXu WenyXu added C-bug Category Bugs O-chaos Found by chaos tests labels Sep 26, 2023
@fengjiachun
Copy link
Collaborator

Can we close this issue now? @WenyXu

@WenyXu WenyXu closed this as completed Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category Bugs O-chaos Found by chaos tests
Projects
None yet
Development

No branches or pull requests

2 participants