Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logs for failed requests are missing important context (like remote server IP) #4598

Open
davepacheco opened this issue Dec 1, 2023 · 0 comments

Comments

@davepacheco
Copy link
Collaborator

I've got a system with #4595, where some crucible server components are mismatched with their clients. I managed to import an image by URL:

$ echo '{"name": "my-image", "description": "dummy", "os": "debian", "version": "11", "source": { "type": "url", "url": "http://172.20.0.5/media/cloud/debian-11-genericcloud-amd64.raw", "block_size": 512 } }' | oxide image create --name my-image --description my-description --os debian --version 11 --json-body /dev/stdin
success
Image {
    block_size: ByteCount(
        512,
    ),
    description: "dummy",
    digest: None,
    id: f99cebe2-3664-417a-8188-bec06c58c6e6,
    name: Name(
        "my-image",
    ),
    os: "debian",
    project_id: None,
    size: ByteCount(
        2147483648,
    ),
    time_created: 2023-12-01T17:42:05.524401Z,
    time_modified: 2023-12-01T17:42:05.524401Z,
    url: Some(
        "http://172.20.0.5/media/cloud/debian-11-genericcloud-amd64.raw",
    ),
    version: "11",
}

but when I tried to create an instance, it failed. That's probably not surprising given the mismatches. But I'm trying to confirm it's a mismatch issue. First I found these Nexus log entry:

root@oxz_nexus_f45dcde1-e10d-440b-ade6-a9ab6f283652:~# looker -lwarn < $(svcs -L nexus)
...
17:44:04.961Z WARN f45dcde1-e10d-440b-ade6-a9ab6f283652 (ServerContext): Region requested, not yet created. Retrying in 314.103191ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 9d805060-833f-4b77-8a06-4334795b85c1
    saga_id = da8ba743-eb8f-4af1-81dd-5d4574fdda7f
    saga_name = instance-create
17:44:04.961Z WARN f45dcde1-e10d-440b-ade6-a9ab6f283652 (ServerContext): Region requested, not yet created. Retrying in 186.191992ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = b5afc932-36ad-4491-b6df-339291703c27
    saga_id = da8ba743-eb8f-4af1-81dd-5d4574fdda7f
    saga_name = instance-create
17:44:04.961Z WARN f45dcde1-e10d-440b-ade6-a9ab6f283652 (ServerContext): Region requested, not yet created. Retrying in 270.103584ms
    file = nexus/src/app/sagas/common_storage.rs:74
    region = 204b1e81-6c7a-49e9-a084-2ab26f3f03a4
    saga_id = da8ba743-eb8f-4af1-81dd-5d4574fdda7f
    saga_name = instance-create
17:44:28.584Z ERRO f45dcde1-e10d-440b-ade6-a9ab6f283652 (ServerContext): received error from instance PUT
    error = Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "a550b4ff-4230-4232-a58e-79c2f480c303", "content-length": "124", "date": "Fri, 01 Dec 2023 17:44:28 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "a550b4ff-4230-4232-a58e-79c2f480c303" }
    file = nexus/src/app/instance.rs:1097
    instance_id = 499b6192-97b2-42eb-85c4-8a6fa186241b
17:44:28.591Z ERRO f45dcde1-e10d-440b-ade6-a9ab6f283652 (ServerContext): attempted to set instance to Failed after bad put
    file = nexus/src/app/instance.rs:1145
    instance_id = 499b6192-97b2-42eb-85c4-8a6fa186241b
    result = Ok(true)
17:45:43.968Z WARN f45dcde1-e10d-440b-ade6-a9ab6f283652 (dropshot_external): client disconnected before response returned
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:927
    local_addr = 172.30.2.6:443
    method = GET
    remote_addr = 172.20.17.110:63986
    req_id = 8bcf9942-be5e-4193-a31c-fae19f2100ff
    uri = https://recovery.sys.madrid.eng.oxide.computer/v1/vpcs/841268c9-9e92-4146-87ca-33c3945d9281
17:45:43.970Z WARN f45dcde1-e10d-440b-ade6-a9ab6f283652 (dropshot_external): client disconnected before response returned
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:927
    local_addr = 172.30.2.6:443
    method = GET
    remote_addr = 172.20.17.110:63986
    req_id = 71ae4f24-8084-4860-833e-cd1c48911ed6
    uri = https://recovery.sys.madrid.eng.oxide.computer/v1/vpc-subnets/6d76b3c6-3d39-41f0-bba0-c00eecf591ad
17:45:43.973Z WARN f45dcde1-e10d-440b-ade6-a9ab6f283652 (dropshot_external): client disconnected before response returned
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:927
    local_addr = 172.30.2.6:443
    method = GET
    remote_addr = 172.20.17.110:63986
    req_id = ce31fc78-9868-4c0e-9991-e3648255c870
    uri = https://recovery.sys.madrid.eng.oxide.computer/v1/instances/test-instance/external-ips?project=test-project
17:45:45.802Z ERRO f45dcde1-e10d-440b-ade6-a9ab6f283652 (dropshot_external): Error returned from handler: ServiceUnavailable { internal_message: "instance is in state InstanceState(Stopped) and has no active serial console server" }
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/websocket.rs:235
    local_addr = 172.30.2.6:443
    method = GET
    remote_addr = 172.20.17.110:64017
    req_id = e59d1de8-c2c3-45b6-8f8d-caeaf4c79d59
    upgrade = websocket
    uri = /v1/instances/test-instance/serial-console/stream?project=test-project&most_recent=262144

The 500 error from "instance PUT" seems like a good bet. But I can't tell from the log message what sled agent it was. I did find it by grep'ing for the req_id in all the sled agent logs and found:

17:44:28.583Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "e663bfe9-c174-4378-bac3-18e1583ae2d9", "content-length": "124", "date": "Fri, 01 Dec 2023 17:44:28 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "e663bfe9-c174-4378-bac3-18e1583ae2d9" }
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:841
    latency_us = 21004917
    local_addr = [fd00:1122:3344:104::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::4]:46406
    req_id = a550b4ff-4230-4232-a58e-79c2f480c303
    response_code = 500
    uri = /instances/499b6192-97b2-42eb-85c4-8a6fa186241b/state

Now I think this is saying that the sled agent returned a 500 error because it received a 500 error from something else. But I have no idea from this message what that other thing was. Short of grepping every log file in the system for e663bfe9-c174-4378-bac3-18e1583ae2d9, I'm not sure how to proceed.

omicron_common's Error's internal_context() (analogous to anyhow::Context) should make it easy to attach extra context here about what it was trying to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant