Test flake: silent failure to initialize cockroachdb? #3889
I wanted to look at the CockroachDB SMF log files, but they're all empty. I compared with dogfood and they're empty there, too. There are no rotated files in /var/svc/log. Then I found #3713 and located some useful log files on dogfood (sled 9):
(Is there a better way to find archived log files?) These do have the stuff I expected to be there (svc.startd messages, xtrace output from the start method, etc.). We don't have these files from the failed job, unfortunately. It seems like we should add those to the pattern of logs that we have buildomat save on failure.
I would expect that the "rack" objects are inserted during rack initialization at Nexus (i.e., handoff to Nexus), not when the database is initialized. It's possible that the database was initialized but handoff never happened for some reason. In the Sled Agent log, I do see a lot of failures to hand off to Nexus, including this last one:
and I see no indication after that of a successful handoff. So I think that explains the warnings and the ultimate failure after the 10-minute timeout. Next will be to look at the Nexus logs to see why these requests are failing.
It looks like all of the initialization-complete requests went to the Nexus
and here are the details of one of those 500s:
There's not much to go on here. This looks like an error path during rack initialization that's not annotating the error object it got back from the database with anything useful about what it was doing. The string "Record not found" does not appear in Omicron, so I guessed this was coming from our use of Diesel.
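If that guess is right, the "Record not found" string would be the Display output of `diesel::result::Error::NotFound`, which is what query helpers like `.first()` and `.get_result()` return when no rows match. A minimal sketch (not Omicron code) of why the message carries no context:

```rust
// Sketch only: Diesel's NotFound error is what `.first()` / `.get_result()`
// return when a query matches no rows. Its Display impl is just the bare
// string, with nothing about which query produced it.
fn main() {
    let e = diesel::result::Error::NotFound;
    // If the guess above is right, this prints "Record not found".
    println!("{}", e);
}
```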
I started looking around for calls that could produce that error, starting with the rack initialization code path, which opens here: omicron/nexus/db-queries/src/db/datastore/rack.rs (lines 398 to 416 in 496b854)
We can infer something about where the error was generated based on the log entries that we do have for one of the failed requests. I grabbed the request id from the last 500 error and filtered all the messages in the Nexus log for that req_id:
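(As an aside: the Nexus log is bunyan-style JSON, so the filtering amounts to keeping only the lines whose req_id field matches. A rough sketch of that kind of filter, using a placeholder request id rather than the real one:)

```rust
// Sketch: keep only bunyan-format JSON log lines whose "req_id" matches.
// The id below is a placeholder, not the actual request id from the 500.
use std::io::{self, BufRead};

fn main() {
    let wanted = "REQ_ID_FROM_THE_500_ERROR";
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.expect("reading stdin");
        if let Ok(entry) = serde_json::from_str::<serde_json::Value>(&line) {
            if entry.get("req_id").and_then(|v| v.as_str()) == Some(wanted) {
                println!("{}", line);
            }
        }
    }
}
```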
We can summarize the actions and results:
That points to this code in omicron/nexus/db-queries/src/db/datastore/ip_pool.rs (lines 108 to 151 in 496b854), which is called early in rack initialization. What's responsible for setting that up? I didn't find anything in omicron/nexus/db-queries/src/db/datastore/rack.rs (lines 550 to 588 in 496b854).
In Nexus we do spawn a background task to wait for this, and that does log a message when it finishes or fails (lines 360 to 375 in 496b854):
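Roughly, the pattern is the one sketched below (illustrative names only; this is not the actual Omicron code):

```rust
// Sketch of the pattern described above: spawn a background task that runs
// the populate step and logs a message whether it succeeds or fails.
// `populate` here is a stand-in, not Omicron's real function.
use slog::{error, info, Logger};

// Called from async context (inside the Tokio runtime) during Nexus startup.
fn start_populate(log: Logger) {
    tokio::spawn(async move {
        match populate(&log).await {
            Ok(()) => info!(log, "populate complete"),
            Err(e) => error!(log, "populate failed"; "error" => %e),
        }
    });
}

async fn populate(_log: &Logger) -> anyhow::Result<()> {
    // ... insert built-in records (built-in users, IP pools, etc.) ...
    Ok(())
}
```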
I don't see that log message in this log. It seems like the "populate" step maybe got stuck? More investigation is needed here.
I just noticed that the Nexus log that I've been looking at from buildomat doesn't start at the beginning of Nexus's lifetime. Like the CockroachDB log, it must have been rotated and archived already and we don't have that earlier file. That makes it hard to say anything about which populate steps may have run, failed, finished, etc. There are several bugs/improvements that would help us debug this kind of problem the next time it happens:
We should definitely fix the buildomat job. I'm not sure if it's worth debugging this further until we do that. We might be able to find more information, but it'll probably be a lot clearer with the logs in hand.
These were archived out to the U.2s as part of #3713, I believe. They're still named by zone, so filtering by that would help. I think the dataset they end up in is unpredictable, though the archiver continues to use the same location until it runs out of space. It unfortunately does not help post-mortem, but if you're debugging a live system, the zone-bundle command can help:

$ zone-bundle --host <SLED_AGENT_IP> get --create oxz_nexus_2c05b297-cc92-4a0a-8378-946af887ec91

That command will create a tarball with all the log files, whether they're still in the zone or have already been archived out.
https://github.com/oxidecomputer/omicron/pull/3887/checks?check_run_id=15946171931 failed; the immediate symptom in the deploy job was a timeout after trying to log in for 10 minutes:
All three Nexus logs have many warnings that seem to indicate the database wasn't set up correctly, I think?
but the sled-agent logs do not report any issues with initial CRDB setup: