Pull in updated Propolis with new VM timeseries, and expunge the old #6169
Conversation
bnaecker commented Jul 27, 2024
- Pulls in oxidecomputer/propolis#724, which added sled-identifiers to the virtual machine timeseries. One part of #5267 (Want sled identifiers on most (all?) timeseries).
- Updates the required Crucible dependency.
- Expunges the previous schema and data for the virtual machine timeseries, as the new schema is incompatible.
Force-pushed from c023c84 to fb43a14
I tested this by setting up an instance following the steps in
Note that this commit also contains a ClickHouse "update" -- though that has no changes to the database schema, it does include a step which deletes all the previous virtual machine timeseries that this conflicts with.
Force-pushed from fb43a14 to 06de44a
Force-pushed from 06de44a to 4ec4f34
Hmm, this test failure seems concerning: https://buildomat.eng.oxide.computer/wg/0/details/01J3V4K4WQPQWNN5MT1QCN12HG/ouSGnRdQX1REdxzz8019goDLBwHPZdXjDj6GpTxcS0lxM6y2/01J3V4KF93P8XHH6R6R0Z5154Q#S4560
Yeah, I'm looking at that. Nothing's changed in Omicron in terms of task spawning, so I'm looking at Propolis. Both 4708ab2401aa8bfd66303515ffdf0d9b580fd52e and 0a3a26baf66b97e1666a1f566aa3e93a9a5a0c1d did have something to do with how tasks are spawned, but I'm not yet sure how or whether they're related.
Agreed with Eliza that we should look into the test failure, but the contents of this PR LGTM otherwise!
It looks like the stack trace is from sled-agent, not Propolis — note the source paths. You're right that there have been recent changes to Propolis that involve `block_on` and friends, but I'm not sure I see how those changes could make sled-agent panic?
I'm with you, and also confused. I don't see how they're really related, especially since the apparent panic site is in a call entirely within the test
Ok, I'm good and confused. I don't see how this ever worked. There are indeed two calls to `block_on`,

which is the

Which is from here: omicron/sled-agent/src/instance.rs, lines 2059 to 2062 (at 1bb75f2).

That's been there since f8c8bb1, which is from 18 Mar 2024. How has that been succeeding, if there are indeed nested calls to `block_on`? I'll continue to dig.
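For context, a minimal sketch (my illustration, not code from this PR) of why nested `block_on` calls blow up: Tokio refuses to block a thread that is already driving async tasks.

```rust
// Hypothetical reproduction of a nested block_on panic; this is not
// the PR's code, just an illustration of Tokio's behavior.
fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        // We are now on a runtime worker thread. Calling block_on
        // again from this context panics with a message like
        // "Cannot block the current thread from within a runtime".
        let handle = tokio::runtime::Handle::current();
        handle.block_on(async { 42 });
    });
}
```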
I'm not really sure what I'm looking at yet, but it seems like this might not have been working how we intended it on
So the test passed, but a call to clean up some test resources panicked, I think. That panic is here: omicron/illumos-utils/src/destructor.rs, lines 137 to 151 (at 24f7cc0).

That comment about unwrap safety seems to imply this is never expected to panic. There is a related set of comments here: omicron/illumos-utils/src/destructor.rs, lines 170 to 174 (at 24f7cc0).

I don't quite see how these are related to the main issue at hand, but it does seem surprising that these have panicked at all. I have a few questions about the unwrap safety and other pieces of that code, though I'd rather not tug on that thread just yet.
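To make the unwrap-safety concern concrete, here is a hypothetical sketch (not the real destructor.rs code) of how such an unwrap can panic once tasks are torn down out of order:

```rust
use tokio::sync::oneshot;

// Hypothetical sketch, not destructor.rs: the unwrap is "safe" only
// under the assumption that the sender always sends before it is
// dropped. Task shutdown order can silently break that invariant.
async fn wait_for_cleanup(rx: oneshot::Receiver<()>) {
    // A safety comment in code like this typically reads: "the sender
    // always sends before dropping, so this cannot fail".
    rx.await.unwrap();
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel::<()>();
    // Dropping the sender without sending violates the assumed
    // invariant; the unwrap in wait_for_cleanup then panics.
    drop(tx);
    wait_for_cleanup(rx).await; // panics with RecvError
}
```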
Ok, well I think I finally figured out why this was not panicking before. Although we set up an expectation here (omicron/sled-agent/src/instance.rs, lines 2059 to 2062 at 1bb75f2), we never indicate how many times it must be called! The other calls use

There are at least two things I still don't understand:
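For illustration, a small sketch of the difference, assuming a mockall-style mock (the `MockClient` and `register` below are hypothetical stand-ins, not the test's actual types):

```rust
use mockall::mock;

mock! {
    pub Client {
        fn register(&self) -> Result<(), String>;
    }
}

fn main() {
    let mut client = MockClient::new();

    // Without a count, mockall's default allows any number of calls,
    // including zero, so a missing call never fails the test.
    client.expect_register().returning(|| Ok(()));

    // With an explicit count, the mock panics on drop unless
    // register() was called exactly once:
    // client.expect_register().times(1).returning(|| Ok(()));

    let _ = client.register();
}
```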
Ok, I have a theory for what's going on. I believe it explains both why we never saw the Tokio panic when calling

Previously, that attempted to resolve Nexus inside that method, and then provided that address to Propolis as an SMF property. Propolis then passed this on to the

One of the main changes in this PR is that that's no longer needed. Instead, we let the producer server always use DNS internally. That uses the underlay address we create here and provide to Propolis to create a DNS resolver internally, and always re-resolves Nexus at each call to renew its registration.

So, why does this matter? Well, the whole point of this test is to see what happens when booting the Propolis zone times out. It did previously time out, but not by calling the

That explains why we did not see the panic about

Now, how to fix this? I think the best thing would be to fix the test itself. That wasn't exactly doing what it thought it was, but it has a purpose. I am going to try to make the test work more clearly, having the
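A rough sketch of the renewal behavior described above, using entirely hypothetical types (`DnsResolver`, `register_producer`, and the address are stand-ins for illustration, not Omicron's real API):

```rust
use std::net::SocketAddr;
use std::time::Duration;

// Hypothetical stand-in for a resolver backed by internal DNS over
// the underlay network.
struct DnsResolver;

impl DnsResolver {
    async fn lookup_nexus(&self) -> Result<SocketAddr, String> {
        // Placeholder address; the real lookup would query internal DNS.
        Ok("[fd00::1]:12221".parse().unwrap())
    }
}

async fn register_producer(_nexus: SocketAddr) -> Result<(), String> {
    Ok(())
}

// Renewal loop: re-resolve Nexus on every pass, so a moved or
// restarted Nexus is picked up without restarting the producer.
async fn renewal_loop(resolver: DnsResolver, interval: Duration) {
    loop {
        if let Ok(nexus) = resolver.lookup_nexus().await {
            let _ = register_producer(nexus).await;
        }
        tokio::time::sleep(interval).await;
    }
}
```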
Alright, I think this is fixed in d1ddb22. I removed the offending call to

I still think there's something to investigate with the failures we continue to see, and which continue to be swallowed by the test harness, like this:

That's a panic on a closed channel during the
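For reference, the general shape of that failure mode (an illustration, not the actual failing code): unwrapping a send on a tokio channel whose receiver has already been dropped.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<u32>(8);

    // If the receiving task has already exited (for example, torn
    // down by the test harness), the channel is closed...
    drop(rx);

    // ...and unwrapping the send panics with a "channel closed" error.
    tx.send(1).await.unwrap();
}
```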
These changes look good to me, thanks for all the investigation!
@lifning, based on my reading of the comments, it seems like the `test_instance_create_timeout_while_creating_zone` test might still be broken, for similar reasons?
Started looking into that this morning. Frustratingly enough, I got that "channel closed" traceback to repro on my local illumos machine exactly once -- the first time I ran a full

Here are my notes from trying to work out how this could be happening. The comment in

But then, that means it's not that much of a clue, is it? In fact, it's even more confusing (and the

In this test, the

However, in the timeout case set up by this test,

Is it possible that the
Something mkeeter pointed out: Crucible has seen a similar issue - https://github.com/oxidecomputer/crucible/blob/main/upstairs/src/client.rs#L2308-L2319 - with the apparent takeaway being that Tokio's behavior of terminating tasks in arbitrary order can just do that sometimes (test or not?). In light of that, it may be reasonable to just not
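If the takeaway mirrors Crucible's, the mitigation might look like this sketch (the `notify_cleanup_done` helper is hypothetical, and it assumes the panic comes from unwrapping such a send): treat a closed channel during teardown as benign instead of unwrapping.

```rust
use tokio::sync::mpsc;

// Sketch of a Crucible-style mitigation: during teardown the peer
// task may already be gone, so a failed send is expected and ignored
// rather than unwrapped.
async fn notify_cleanup_done(tx: mpsc::Sender<()>) {
    if let Err(_closed) = tx.send(()).await {
        // Receiver dropped first (tasks torn down in arbitrary order
        // at shutdown); nothing useful to do, so don't panic.
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<()>(1);
    drop(rx); // simulate the receiver task having already exited
    notify_cleanup_done(tx).await; // no panic; the error is ignored
}
```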