
Oximeter producer deletion resulted in many 404 errors with request latencies of 10+ seconds #5364

Closed
askfongjojo opened this issue Mar 31, 2024 · 12 comments


askfongjojo commented Mar 31, 2024

While investigating #5363 and customer-support#116, I saw a number of 500 errors resulting from oximeter producer removal failures, e.g.

23:27:59.435Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = failed to delete producer from collector: Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "baf56a1b-7d8a-431d-87e2-9e4671d6c5bb", "content-length": "84", "date": "Sat, 30 Mar 2024 23:27:59 GMT"}; value: Error { error_code: None, message: "Not Found", request_id: "baf56a1b-7d8a-431d-87e2-9e4671d6c5bb" }
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837
    latency_us = 9951084
    local_addr = [fd00:1122:3344:104::3]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:10b::1]:54030
    req_id = 04a46873-ca82-47d2-aef0-60c726d20c19
    response_code = 500
    uri = /instances/b91a5c67-077e-4e05-a8c5-d85b997de150

These errors may have affected the overall responsiveness of the stop-instance requests; it's unclear how they might be causing other request timeouts.

@askfongjojo
Author

There are more examples of this same error in catacomb:/staff/core/omicron-5363/oxide-nexus_default.log.1711843200: 54 occurrences from when I was deleting 75 instances with Terraform (which uses 10 concurrent threads by default).


jgallagher commented Apr 1, 2024

23:27:59.435Z INFO 20b100d0-84c3-4119-aa9b-0c632b0b6a3a (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = failed to delete producer from collector: Error Response: status: 404 Not Found; headers: {"content-type": "application/json", "x-request-id": "baf56a1b-7d8a-431d-87e2-9e4671d6c5bb", "content-length": "84", "date": "Sat, 30 Mar 2024 23:27:59 GMT"}; value: Error { error_code: None, message: "Not Found", request_id: "baf56a1b-7d8a-431d-87e2-9e4671d6c5bb" }
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837
    latency_us = 9951084
    local_addr = [fd00:1122:3344:104::3]:12221
    method = PUT
    remote_addr = [fd00:1122:3344:10b::1]:54030
    req_id = 04a46873-ca82-47d2-aef0-60c726d20c19
    response_code = 500
    uri = /instances/b91a5c67-077e-4e05-a8c5-d85b997de150

This might be caused by #5326: Oximeter now periodically refreshes its list of producers and deletes those that are gone, which means we could get this sequence from unassign_producer:

  1. Nexus: self.db_datastore.producer_endpoint_delete(opctx, id) completes, and removes the producer <-> collector assignment in CRDB
  2. Oximeter: refreshes its list of producers, sees that producer id is gone, so removes it
  3. Nexus: gets to client.producer_delete(&id) on line 185; Oximeter already removed it, so this returns a 404

I think we'd expect this to be a relatively rare race, since Oximeter would have to refresh its producer list in the window between steps 1 and 3, but it's very possible. I think we should do one of two things:

  1. Oximeter's producer_delete endpoint returns a 200-something code for deleting a nonexistent endpoint
  2. Change Nexus to catch the 404 here and not return an error to its caller

I have a moderate preference for 1, because otherwise all callers of producer_delete need to know or remember to do this (since Oximeter can asynchronously delete producers at any time, so you never know if the producer you want to delete will still be there). But I'll defer to @bnaecker.
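
For concreteness, here is a rough sketch of what option 2 could look like on the Nexus side: treat a 404 from the collector's producer_delete the same as a successful delete. This is illustrative only; it uses a bare reqwest call and a made-up helper (delete_producer_idempotent) rather than the generated oximeter client that Nexus actually uses.

use reqwest::{Client, StatusCode};
use uuid::Uuid;

/// Delete a producer from an oximeter collector, treating "already gone"
/// (404) the same as a successful delete. Hypothetical sketch only; the
/// real code would go through the generated oximeter client.
async fn delete_producer_idempotent(
    client: &Client,
    collector_base_url: &str,
    producer_id: Uuid,
) -> anyhow::Result<()> {
    let url = format!("{collector_base_url}/producers/{producer_id}");
    let response = client.delete(url).send().await?;
    let status = response.status();
    if status.is_success() || status == StatusCode::NOT_FOUND {
        // Deleted now, or already deleted by the collector's own refresh:
        // either way the producer is gone, which is all the caller cares about.
        Ok(())
    } else {
        anyhow::bail!("producer_delete failed with status {status}")
    }
}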

@askfongjojo
Author

I've uploaded the oximeter log file covering the Mar 30 22:30 to 23:30 time window to catacomb:/staff/core/omicron-5363.

@jgallagher
Contributor

This might be caused by #5326: Oximeter now periodically refreshes its list of producers and deletes those that are gone

It looks like this is what happened, although maybe my analysis of this having to happen inside a window of unassign_producer is wrong, because these timestamps are pretty far apart. Trimming down to just the relevant log lines:

  1. Oximeter starts refreshing producers
2024-03-30 23:21:30.003Z INFO oximeter/8852 (oximeter-agent) on oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b: refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    file = oximeter/collector/src/agent.rs:752
  2. One of the producers removed is b91a5c67-077e-4e05-a8c5-d85b997de150 (the instance ID that produced the 404)
2024-03-30 23:21:30.082Z DEBG oximeter/8852 (oximeter-agent) on oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b: removed collection task from set
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    producer_id = b91a5c67-077e-4e05-a8c5-d85b997de150
2024-03-30 23:21:30.082Z DEBG oximeter/8852 (oximeter-agent) on oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b: shut down collection task
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    producer_id = b91a5c67-077e-4e05-a8c5-d85b997de150
2024-03-30 23:21:30.084Z DEBG oximeter/8852 (oximeter-agent) on oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b: collection task received shutdown request
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    producer_id = b91a5c67-077e-4e05-a8c5-d85b997de150
  3. Oximeter finishes refreshing producers, pruning 82 (!). This is ~0.08 seconds from when it started in step 1.
2024-03-30 23:21:30.085Z INFO oximeter/8852 (oximeter-agent) on oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b: refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    file = oximeter/collector/src/agent.rs:800
    n_current_tasks = 100
    n_pruned_tasks = 82
  4. Six and a half minutes later, Nexus tries to delete this same producer and triggers the 404:
2024-03-30 23:27:59.433Z INFO oximeter/8852 (dropshot) on oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b: request completed
    error_message_external = Not Found
    error_message_internal = No such producer: b91a5c67-077e-4e05-a8c5-d85b997de150
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:837
    latency_us = 44
    local_addr = [fd00:1122:3344:10a::3]:12223
    method = DELETE
    remote_addr = [fd00:1122:3344:104::3]:33603
    req_id = baf56a1b-7d8a-431d-87e2-9e4671d6c5bb
    response_code = 404
    uri = /producers/b91a5c67-077e-4e05-a8c5-d85b997de150

I think this means my proposed fixes above would be correct, but this does make me ask two more questions:

  1. Why did we prune 82 producers at once? That seems like a lot if Nexus is trying to tell us to unregister instances as they're deleted.
  2. Why was the gap between steps 3 and 4 above 6.5 minutes? (This could explain the previous question too, if the time between "producer deleted from CRDB" and "Nexus tells Oximeter about it" is measured in minutes - I thought this all happened within unassign_producer, so either I'm wrong or that method was taking forever.)

@askfongjojo
Author

The oximeter communication errors (the ones encountered during customer-support#116) did not show up during mass instance deletion. I'll file a separate issue so that they don't get mixed in with @jgallagher's analysis above.

askfongjojo changed the title from "Oximeter producer deletion errors could result in request latencies of 10+ seconds" to "Oximeter producer deletion resulted in many 404 errors with request latencies of 10+ seconds" on Apr 1, 2024

bnaecker commented Apr 1, 2024

Oximeter's producer_delete endpoint returns a 200-something code for deleting a nonexistent endpoint

I agree we should do this. I thought it already was doing this, honestly!

@jgallagher
Contributor

I agree we should do this. I thought it already was doing this, honestly!

Fair, and same, until I ran into this on the GC background task recently. I think what it's doing now is more correct from an HTTP point of view; trying to DELETE a resource that doesn't exist should 404. But it's pretty annoying when none of our clients care whether they were the ones that deleted it, only whether it's definitely gone. Given that, I think a 204 is fine; I could also see an argument for a 200 with a response body that indicates whether or not the producer was present before the delete, if we think any clients might care about that.
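
For what it's worth, here is a rough sketch of what option 1 could look like as a dropshot handler that returns 204 (HttpResponseDeleted) whether or not the producer was actually registered. The CollectorContext type and its plain set of producer IDs are made-up stand-ins for illustration, not the actual oximeter agent:

use dropshot::{endpoint, HttpError, HttpResponseDeleted, Path, RequestContext};
use schemars::JsonSchema;
use serde::Deserialize;
use std::collections::BTreeSet;
use std::sync::Mutex;
use uuid::Uuid;

/// Hypothetical server context tracking the set of registered producers.
struct CollectorContext {
    producers: Mutex<BTreeSet<Uuid>>,
}

#[derive(Deserialize, JsonSchema)]
struct ProducerIdPathParams {
    producer_id: Uuid,
}

/// Delete a producer by ID, returning 204 even if the producer is unknown,
/// since callers only care that it is gone, not who removed it.
#[endpoint {
    method = DELETE,
    path = "/producers/{producer_id}",
}]
async fn producer_delete(
    rqctx: RequestContext<CollectorContext>,
    path: Path<ProducerIdPathParams>,
) -> Result<HttpResponseDeleted, HttpError> {
    let producer_id = path.into_inner().producer_id;
    // remove() returns false if the ID wasn't present; we deliberately
    // ignore that and report success either way.
    let _was_present = rqctx.context().producers.lock().unwrap().remove(&producer_id);
    Ok(HttpResponseDeleted())
}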


bnaecker commented Apr 1, 2024

Why did we prune 82 producers at once? That seems like a lot if Nexus is trying to tell us to unregister instances as they're deleted.

I agree this is confusing. It does look like Nexus stalled out for a lot of them, and they were all picked up during the periodic oximeter refresh. The confusing bit for me is why the requests from Nexus timed out. Deleting a producer in oximeter is pretty speedy, so I'm not sure what the hold up was.

@jgallagher
Contributor

  1. Why did we prune 82 producers at once? That seems like a lot if Nexus is trying to tell us to unregister instances as they're deleted.
  2. Why was the gap between steps 3 and 4 above 6.5 minutes? (This could explain the previous question too, if the time between "producer deleted from CRDB" and "Nexus tells Oximeter about it" is measured in minutes - I thought this all happened within unassign_producer, so either I'm wrong or that method was taking forever.)

Ben and I looked at this more, and believe both are explained by oxidecomputer/progenitor#755. When Oximeter refreshes its list of producers from Nexus, that bug causes it to see only the first 100 producers, so it erroneously prunes everything else. The 6.5-minute gap exists because Oximeter should not have deleted the producer in the first place; the prune at refresh time was a side effect of the pagination bug, not part of the instance-deletion path.


bnaecker commented Apr 1, 2024

It looks like at least part of the issue here is oxidecomputer/progenitor#755. As of #5326, oximeter now refreshes its entire list of producers every 10 minutes from Nexus. The linked bug implies that oximeter will only ever get the first page of producers. This is why we see lines like this in oximeter's logs:

23:21:30.085Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    file = oximeter/collector/src/agent.rs:800
    n_current_tasks = 100
    n_pruned_tasks = 82

First, it's fetching only 100 tasks, which is suspiciously the same as the page limit. Second, it's removing 82 tasks! Those should be removed as the producer itself goes away, such as when we destroy a VM instance. Removing 82 at once is possible, but definitely surprising. This leads to a bunch of thrashing, where oximeter continually prunes producers that it should still be collecting from, only because their UUIDs sort them after the first page of assigned producers.
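
To make the failure mode concrete, here is a rough sketch of the difference between taking only the first page and draining the whole paginated list. The ProducerLister trait and page shape below are hypothetical stand-ins for the generated Nexus client, but the loop-until-no-page-token structure is what the refresh needs in order to see all assigned producers:

use uuid::Uuid;

/// Hypothetical shape of one page from a paginated "list producers" call.
struct ProducerPage {
    items: Vec<Uuid>,
    next_page: Option<String>,
}

/// Hypothetical paginated client; a stand-in for the generated Nexus client.
trait ProducerLister {
    fn list_page(&self, limit: u32, page_token: Option<&str>) -> anyhow::Result<ProducerPage>;
}

/// Drain every page. The bug described above is equivalent to calling
/// list_page once and stopping, which caps the result at the page limit
/// (100 here) and makes everything after the first page look prunable.
fn list_all_producers(client: &dyn ProducerLister, limit: u32) -> anyhow::Result<Vec<Uuid>> {
    let mut all = Vec::new();
    let mut token: Option<String> = None;
    loop {
        let page = client.list_page(limit, token.as_deref())?;
        all.extend(page.items);
        match page.next_page {
            // More pages remain; keep following the token.
            Some(next) => token = Some(next),
            // No token means we have seen every assigned producer.
            None => break,
        }
    }
    Ok(all)
}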

morlandi7 added this to the 7 milestone Apr 1, 2024

bnaecker commented Apr 1, 2024

The 404s here have been explained; they are harmless and not a release 7 blocker.

In the attached oximeter log files, we see many lines like this:

17:30:02.892Z WARN oximeter (oximeter-agent): failed to receive metric results from producer
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    file = oximeter/collector/src/agent.rs:119
    producer_id = 597e0689-8e6d-40f9-a82a-f7ae378b40d7
    status_code = 404

That is pretty surprising, since it means that there is a Dropshot server listening at the address but that we're not hitting the right endpoint to fetch metric data. So what producer is that? Looking from the CRDB zone on the Dogfood rack, we see:

root@[fd00:1122:3344:105::3]:32221/omicron> select * from metric_producer where id = '597e0689-8e6d-40f9-a82a-f7ae378b40d7';
                   id                  |         time_created          |         time_modified         |  kind   |          ip           | port | interval | base_route |             oximeter_id
---------------------------------------+-------------------------------+-------------------------------+---------+-----------------------+------+----------+------------+---------------------------------------
  597e0689-8e6d-40f9-a82a-f7ae378b40d7 | 2024-03-22 14:02:34.903187+00 | 2024-03-22 14:02:34.903187+00 | service | fd00:1122:3344:107::1 | 8001 |        1 | /collect   | da510a57-3af1-4d2b-b2ed-2e8849f27d8b
(1 row)


Time: 3ms total (execution 3ms / network 0ms)

root@[fd00:1122:3344:105::3]:32221/omicron>

There isn't a service with that ID in the service table, but that's because this address is the main underlay IP for the global zone of the sled. The port of 8001 is also where ddm listens, so we're likely looking at the GZ instance of ddm. John found that this address is assigned to the sled in cubby 10:

root@oxz_switch0:~# pilot host login 10
Warning: Permanently added 'fe80::aa40:25ff:fe04:3d4%tfportrear10_0' (ED25519) to the list of known hosts.

    #####
   ##   ##
  ##   # ##  ##   ##
  ##  #  ##   ## ##     Oxide Computer Company
  ## #   ##    ###      Engineering
   ##   ##    ## ##
    #####    ##   ##    Gimlet

BRM42220009 # ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
cxgbe0/ll         addrconf ok           fe80::aa40:25ff:fe04:3d4%cxgbe0/10
cxgbe1/ll         addrconf ok           fe80::aa40:25ff:fe04:3dc%cxgbe1/10
bootstrap0/ll     addrconf ok           fe80::8:20ff:fe9e:d2ec%bootstrap0/10
bootstrap0/bootstrap6 static ok         fdb0:a840:2504:3d4::1/64
underlay0/ll      addrconf ok           fe80::8:20ff:feb1:dce7%underlay0/10
underlay0/sled6   static   ok           fd00:1122:3344:107::1/64

Looking for files that ddm has open, we can see it's listening on 8001:

BRM42220009 # pgrep -l ddm
  684 ddmd
BRM42220009 # pfiles 684
684:    /opt/oxide/mg-ddm/ddmd --with-stats --admin-port 8000 --admin-addr ::1
...
  16: S_IFSOCK mode:0666 dev:525,0 ino:2114184514 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(128000)
        sockname: AF_INET6 fd00:1122:3344:107::1  port: 8001
...

And if we look at the log files, we see this:

BRM42220009 # tail -F $(svcs -L mg-ddm) | looker
Apr 01 21:23:21.134 INFO request completed, latency_us: 209, response_code: 200, uri: /collect/810f5ac9-5019-4113-af7b-333feb0429b2, method: GET, req_id: a2e90b8e-1e04-432b-b5ef-cfa32f7366e3, remote_addr: [fd00:1122:3344:10a::3]:48637, local_addr: [fd00:1122:3344:107::1]:8001, component: dropshot, file: /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/2d9b906/dropshot/src/server.rs:849
Apr 01 21:23:21.134 INFO request completed, error_message_external: Not Found, error_message_internal: Producer ID 597e0689-8e6d-40f9-a82a-f7ae378b40d7 is not valid, expected 810f5ac9-5019-4113-af7b-333feb0429b2, latency_us: 20, response_code: 404, uri: /collect/597e0689-8e6d-40f9-a82a-f7ae378b40d7, method: GET, req_id: e3216c4a-de49-473b-bb0e-5969e68510f4, remote_addr: [fd00:1122:3344:10a::3]:52596, local_addr: [fd00:1122:3344:107::1]:8001, component: dropshot, file: /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/2d9b906/dropshot/src/server.rs:837
Apr 01 21:23:22.134 INFO request completed, error_message_external: Not Found, error_message_internal: Producer ID 31507755-c97c-4279-87f9-b847deaa17a1 is not valid, expected 810f5ac9-5019-4113-af7b-333feb0429b2, latency_us: 40, response_code: 404, uri: /collect/31507755-c97c-4279-87f9-b847deaa17a1, method: GET, req_id: 9b0bd721-9da2-4ccd-a4c7-e9414222406e, remote_addr: [fd00:1122:3344:10a::3]:41829, local_addr: [fd00:1122:3344:107::1]:8001, component: dropshot, file: /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/2d9b906/dropshot/src/server.rs:837
...

That shows that (1) there are some requests for metrics on /collect/810f5ac9-5019-4113-af7b-333feb0429b2 which are succeeding, and (2) there are others failing with a "producer mismatch" error. The latter is the source of the 404s. This is because ddm starts up with a call to ProducerRegistry::new(), which assigns a new UUID for the producer.

And that's why we have multiple producer records here, returning a 404: ddm has restarted and assigned itself a new producer ID on the same port. The Dropshot server fails any requests for the old ID, while requests for the current ID succeed.
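
As a toy illustration of why each restart shows up as a brand-new producer (this is not the real ProducerRegistry API, just the pattern): an ID minted randomly at startup differs on every run, whereas an ID handed down by Nexus, sled-agent, or a blueprint would stay stable across restarts.

use uuid::Uuid;

/// Toy stand-in for a producer registry; not the real oximeter type.
struct ToyRegistry {
    producer_id: Uuid,
}

impl ToyRegistry {
    /// Mirrors the pattern described above: a fresh random ID every time the
    /// process starts, so each ddm restart registers as a new producer.
    fn new() -> Self {
        Self { producer_id: Uuid::new_v4() }
    }

    /// A stable alternative: reuse an ID handed down by whoever manages the
    /// service, or one persisted across restarts.
    fn with_id(producer_id: Uuid) -> Self {
        Self { producer_id }
    }
}

fn main() {
    // Two "restarts" with the random constructor produce two distinct
    // producer records; requests for the old ID then 404 at the new server.
    let first_boot = ToyRegistry::new();
    let second_boot = ToyRegistry::new();
    assert_ne!(first_boot.producer_id, second_boot.producer_id);

    // With an externally assigned ID, restarts keep the same identity.
    let fixed = Uuid::parse_str("597e0689-8e6d-40f9-a82a-f7ae378b40d7").unwrap();
    assert_eq!(ToyRegistry::with_id(fixed).producer_id, fixed);
}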

This is exactly the kind of situation the forthcoming producer garbage collection RPW is designed to handle. The ddm service is completely outside the control plane at this point, and is not assigned an ID by Nexus, the sled-agent, a blueprint, or any other mechanism. Each time it restarts, it gets a new producer ID. The old ones will be pruned by that garbage collection.

Until then, the 404s here will consume an oximeter collector task, but are harmless. The current data is being collected, and until the GC merges we'll just live with the extra resource consumption.


bnaecker commented Apr 1, 2024

The request latencies here are likely explained by the bad interactions between Dogfood and racktest. We understand the 404s, which have been fixed by #5366, so I'm marking this closed.

bnaecker closed this as completed Apr 1, 2024