Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically refresh a collector's list of assigned producers #5326

Merged
merged 5 commits into from
Mar 26, 2024

Conversation

bnaecker
Copy link
Collaborator

  • Add an endpoint in Nexus's internal API for listing the assigned producers for a collector.
  • Spawn a task in the oximeter-collector which will periodically fetch the list; remove any producers not in that list; and ensure any that are.
  • Adds the time of this last refresh to the oximeter-collector server's API for fetching info about the collector
  • Remove old use of reqwest directly to register collector, opt-in to the generated Nexus client.
  • Adds some type conversions in the nexus client crate to simplify these new interfaces

@bnaecker bnaecker requested a review from jgallagher March 25, 2024 18:48
@bnaecker
Copy link
Collaborator Author

This is a part of #5284. It adds a periodic task that refreshes the list of producers that a collector is supposed to be polling. There was no Nexus-side API for this, actually, so it's a bit bigger than I expected! This task internally removes any that are not in the list and ensures any that are. That lowers to a potential update of the information about the producer in the case where the task polling it already exists.

- Add an endpoint in Nexus's internal API for listing the assigned
  producers for a collector.
- Spawn a task in the `oximeter-collector` which will periodically fetch
  the list; remove any producers not in that list; and ensure any that
  are.
- Adds the time of this last refresh to the `oximeter-collector`
  server's API for fetching info about the collector
- Remove old use of `reqwest` directly to register collector, opt-in to
  the generated Nexus client.
- Adds some type conversions in the nexus client crate to simplify these
  new interfaces
@bnaecker bnaecker force-pushed the refresh-oximeter-producer-list branch from 6e04763 to b8c720a Compare March 25, 2024 19:12
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks! None of my comments are critical (except maybe the question about Cargo.toml)

Cargo.toml Outdated Show resolved Hide resolved
clients/nexus-client/src/lib.rs Outdated Show resolved Hide resolved
nexus/db-model/src/producer_endpoint.rs Outdated Show resolved Hide resolved
nexus/src/app/oximeter.rs Show resolved Hide resolved
nexus/src/internal_api/http_entrypoints.rs Outdated Show resolved Hide resolved
oximeter/collector/src/agent.rs Outdated Show resolved Hide resolved
oximeter/collector/src/agent.rs Outdated Show resolved Hide resolved
@bnaecker
Copy link
Collaborator Author

Most things have been addressed in f44da6f. The only thing I'm still looking at is removing Nexus sending the list of producers to a collector when it registers. I think that'll be a pretty small change, but I want to add a way to start the refresh loop after we register, to reduce latency on the first startup.

The oximeter collector already requests all its producers whenever it
restarts (and periodically to stay up-to-date). Let that mechanism do
the work, rather than making many small requests from Nexus to the
collector when it registers.
@bnaecker
Copy link
Collaborator Author

@jgallagher Thanks for your eyes on this! I think I hit everything you brought up. Here is a snippet of the oximeter collector's log file after I manually restarted it, which shows that it pulls the list of producers it should have after registering with Nexus. Let me know if there's anything else!

[ Mar 26 01:35:05 Stopping because service restarting. ]
[ Mar 26 01:35:05 Executing stop method (:kill). ]
[ Mar 26 01:35:05 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/oximeter-collector/bin/oximeter run /var/svc/manifest/site/oximeter/config.toml --address [fd00:1122:3344:101::d]:12223 --id 523a2641-58cb-4525-aaf6-17e6bb2491ad &"). ]
[ Mar 26 01:35:05 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
01:35:05.066Z DEBG oximeter: registered DTrace probes
01:35:05.066Z INFO oximeter: starting oximeter server
    file = oximeter/collector/src/lib.rs:203
01:35:05.066Z INFO oximeter (DnsResolver): new DNS resolver
    addresses = [[fd00:1122:3344:1::1]:53, [fd00:1122:3344:2::1]:53, [fd00:1122:3344:3::1]:53, [fd00:1122:3344:4::1]:53, [fd00:1122:3344:5::1]:53]
    file = internal-dns/src/resolver.rs:60
01:35:05.066Z DEBG oximeter: creating ClickHouse client
01:35:05.066Z DEBG oximeter (DnsResolver): lookup srv
    dns_name = _clickhouse._tcp.control-plane.oxide.internal
01:35:05.131Z DEBG oximeter (dropshot): registered endpoint
    local_addr = [fd00:1122:3344:101::d]:12223
    method = GET
    path = /info
01:35:05.131Z DEBG oximeter (dropshot): registered endpoint
    local_addr = [fd00:1122:3344:101::d]:12223
    method = GET
    path = /producers
01:35:05.131Z DEBG oximeter (dropshot): registered endpoint
    local_addr = [fd00:1122:3344:101::d]:12223
    method = POST
    path = /producers
01:35:05.131Z DEBG oximeter (dropshot): registered endpoint
    local_addr = [fd00:1122:3344:101::d]:12223
    method = DELETE
    path = /producers/{producer_id}
01:35:05.131Z INFO oximeter (dropshot): listening
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/29ae98d/dropshot/src/server.rs:195
    local_addr = [fd00:1122:3344:101::d]:12223
01:35:05.131Z DEBG oximeter (dropshot): successfully registered DTrace USDT probes
    local_addr = [fd00:1122:3344:101::d]:12223
01:35:05.131Z DEBG oximeter: contacting nexus
01:35:05.131Z DEBG oximeter (DnsResolver): lookup_ipv6 srv
    dns_name = _nexus._tcp.control-plane.oxide.internal
01:35:05.166Z DEBG oximeter: client request
    body = Some(Body)
    method = POST
    uri = http://[fd00:1122:3344:101::a]:12221/metrics/collectors
01:35:05.184Z DEBG oximeter: client response
    result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:101::a)), port: Some(12221), path: "/metrics/collectors", query: None, fragment: None }, status: 204, headers: {"x-request-id": "54eee5c0-9545-4cab-808d-3a5270cb69b2", "date": "Tue, 26 Mar 2024 01:35:05 GMT"} })
01:35:05.185Z INFO oximeter: oximeter registered with nexus
    file = oximeter/collector/src/lib.rs:311
    id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
01:35:05.186Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    file = oximeter/collector/src/agent.rs:752
01:35:05.186Z DEBG oximeter (DnsResolver): lookup_ipv6 srv
    dns_name = _nexus._tcp.control-plane.oxide.internal
01:35:05.222Z DEBG oximeter (oximeter-agent): client request
    body = None
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    method = GET
    uri = http://[fd00:1122:3344:101::a]:12221/metrics/collectors/523a2641-58cb-4525-aaf6-17e6bb2491ad/producers?limit=100&sort_by=id_ascending
01:35:05.225Z DEBG oximeter (oximeter-agent): client response
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:101::a)), port: Some(12221), path: "/metrics/collectors/523a2641-58cb-4525-aaf6-17e6bb2491ad/producers", query: Some("limit=100&sort_by=id_ascending"), fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "e8d58c07-8e25-4616-9742-610ff5b0ed81", "content-length": "1005", "date": "Tue, 26 Mar 2024 01:35:05 GMT"} })
01:35:05.229Z DEBG oximeter (oximeter-agent): client request
    body = None
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    method = GET
    uri = http://[fd00:1122:3344:101::a]:12221/metrics/collectors/523a2641-58cb-4525-aaf6-17e6bb2491ad/producers?page_token=eyJ2IjoidjEiLCJwYWdlX3N0YXJ0Ijp7InNvcnRfYnkiOiJpZF9hc2NlbmRpbmciLCJsYXN0X3NlZW4iOiJjMWFkNzNhMC0wNjQ5LTQyYzctYWQ5OC03ZTJkYzc0ODc2YjYifX0%3D
01:35:05.233Z DEBG oximeter (oximeter-agent): client response
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:101::a)), port: Some(12221), path: "/metrics/collectors/523a2641-58cb-4525-aaf6-17e6bb2491ad/producers", query: Some("page_token=eyJ2IjoidjEiLCJwYWdlX3N0YXJ0Ijp7InNvcnRfYnkiOiJpZF9hc2NlbmRpbmciLCJsYXN0X3NlZW4iOiJjMWFkNzNhMC0wNjQ5LTQyYzctYWQ5OC03ZTJkYzc0ODc2YjYifX0%3D"), fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "70313d0c-4c81-4d94-9641-d90afcd1326d", "content-length": "29", "date": "Tue, 26 Mar 2024 01:35:05 GMT"} })
01:35:05.233Z DEBG oximeter (oximeter-agent): registered new metric producer
    address = [fd00:1122:3344:101::b]:12221
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    producer_id = 039f9110-5a07-4e8d-bf59-67e3e57dfaf0
01:35:05.233Z DEBG oximeter (oximeter-agent): registered new metric producer
    address = [fd00:1122:3344:101::1]:12345
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    producer_id = 736d8883-c2ee-4d4e-a085-cd2e0a28c457
01:35:05.233Z DEBG oximeter (oximeter-agent): registered new metric producer
    address = [fd00:1122:3344:101::c]:12221
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    producer_id = 7fd8ab65-36d7-4e50-beea-10a592cae9e8
01:35:05.233Z DEBG oximeter (oximeter-agent): registered new metric producer
    address = [fd00:1122:3344:101::a]:12221
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    producer_id = b818fd5c-f04c-4cab-9a7e-c7ba9e9d375c
01:35:05.233Z DEBG oximeter (oximeter-agent): registered new metric producer
    address = [fd00:1122:3344:101::1]:8001
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    producer_id = c1ad73a0-0649-42c7-ad98-7e2dc74876b6
01:35:05.233Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    file = oximeter/collector/src/agent.rs:800
    n_current_tasks = 5
    n_pruned_tasks = 0
01:35:05.343Z DEBG oximeter (oximeter-agent): starting oximeter collection task
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    interval = 10s
    producer_id = b818fd5c-f04c-4cab-9a7e-c7ba9e9d375c
01:35:05.351Z DEBG oximeter (oximeter-agent): starting oximeter collection task
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    interval = 10s
    producer_id = 7fd8ab65-36d7-4e50-beea-10a592cae9e8
01:35:05.426Z DEBG oximeter (oximeter-agent): starting oximeter collection task
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    interval = 10s
    producer_id = 039f9110-5a07-4e8d-bf59-67e3e57dfaf0
01:35:05.428Z DEBG oximeter (oximeter-agent): starting oximeter collection task
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    interval = 1s
    producer_id = c1ad73a0-0649-42c7-ad98-7e2dc74876b6
01:35:05.429Z DEBG oximeter (oximeter-agent): starting oximeter collection task
    collector_id = 523a2641-58cb-4525-aaf6-17e6bb2491ad
    interval = 30s
    producer_id = 736d8883-c2ee-4d4e-a085-cd2e0a28c457

@bnaecker bnaecker requested a review from jgallagher March 26, 2024 01:40
@bnaecker
Copy link
Collaborator Author

This test failure is real, and I think we can actually delete the test now. It's a regression for #4498, in which I fumbled the pagination and only sent the first page of producers to a newly-registered collector. That's not relevant any more since the collector fetches things from Nexus, not the other way around.

@bnaecker
Copy link
Collaborator Author

Talked offline with @jgallagher, removing the outdated Nexus test and setting to automerge.

@bnaecker bnaecker enabled auto-merge (squash) March 26, 2024 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants