Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order NextItem subquery results predictably #5067

Merged
merged 1 commit into from
Feb 16, 2024
Merged

Conversation

bnaecker
Copy link
Collaborator

@bnaecker bnaecker requested a review from gjcolombo February 14, 2024 20:32
@bnaecker
Copy link
Collaborator Author

All the test failures uncovered another bug. Looking at this log, for example, we can see we're tripping a new check I added in this PR because I was paranoid. The NextItem query takes a "min shift" and a "max shift". Those are the shifts to the minimum and to the maximum of the searched range, respectively. So the max must really be >=0 and the minimum <= 0.

The MAC address query here is passing a positive number for the min shift. We've been potentially searching kind of arbitrary ranges for a while now. Why I didn't ensure this in the type system to begin with, I can't say. I am going to look into doing that now, so that the generator requires passing in two u32s, for the maximum leftward and rightward shifts, relative to the base.

@bnaecker
Copy link
Collaborator Author

Ok, I'm going to punt on changing the logic around the shifts in this particular PR. I have fixed the bug in the system MAC allocation I found. I do think the API for expressing the range of the NextItem query is difficult to use correctly, and I'd like to change it in a follow up PR. Doing so now will be more invasive than I'd like.

@bnaecker
Copy link
Collaborator Author

@gjcolombo This is ready for a review now

@bnaecker bnaecker requested a review from jgallagher February 15, 2024 21:10
Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. Thanks for fixing this up! Unfortunately the query that was reproducing random disk slot orderings for me the other day is no longer doing so, so I wasn't able to test this out directly, but the overall logic of it makes sense.

A couple of notes:

  • I did manage to find the place in the CRDB docs where row ordering (or lack thereof) is specified, if you want to refer to it in here somewhere: https://www.cockroachlabs.com/docs/stable/select-clause#sorting-and-limiting-query-results ("rows are not sorted with any consistent criteria").
  • I'm probably just being pedantic, but one thing I want to make sure to note: if/when we make a subsequent change to try to fix any unexpected slot number assignments that came from the old query, we should not touch the assigned slot numbers for existing attached disks--these need to remain stable because there may be guest data (PCI slot numbers stored on the EFI system partition) that refers to them.

nexus/db-queries/src/db/queries/next_item.rs Outdated Show resolved Hide resolved
@jgallagher
Copy link
Contributor

  • if/when we make a subsequent change to try to fix any unexpected slot number assignments that came from the old query, we should not touch the assigned slot numbers for existing attached disks--these need to remain stable because there may be guest data (PCI slot numbers stored on the EFI system partition) that refers to them.

Because I've been on the scene here: #5080 fixes unexpected slot numbers only for service vNICs, because that's the only thing causing problems for Reconfigurator. I have no plans to try to fix any other unexpected slot numbers; will defer to you folks for if/when that happens.

@bnaecker
Copy link
Collaborator Author

That makes sense about keeping disk slots as-is. I don't think we need to change them for any reason. The NICs required that because RSS was explicitly picking a slot that ended up differing from the the one the DB assigned, leading to differences with the DB records. We could end up re-assigning the same slot in that case. I don't think that can happen for the disks, since those all used the existing machinery to assign a slot. That looks like it was potentially out of the expected order, but we used the one the DB assigned in any case.

Thanks for the review @gjcolombo. I'll incorporate your suggestions now and merge.

- Fixes #5055
- Add a new column, `index` to the `NextItem` subquery, which indexes
  the shifts from 0..n.
- Add an `ORDER BY index` clause to guarantee order.
- Add test

Ensure MAC address shift is correct

Simplify MAC shift logic

Clippy, and remove local env files

Review feedback
@bnaecker bnaecker force-pushed the order-next-item-subquery branch from f3653c5 to bf0a314 Compare February 16, 2024 17:18
@bnaecker bnaecker enabled auto-merge (squash) February 16, 2024 17:18
@bnaecker bnaecker disabled auto-merge February 16, 2024 19:11
@bnaecker
Copy link
Collaborator Author

I'm holding on this for a bit. I did some digging into questions on #5085 and am concerned we'll hit some of the assertions this PR adds if users add an instance with an external IPv6 address. Doing so used to fail pretty early with a 400, but that seems to have changed since I last worked on that part of the code. I'm looking now, but don't want to merge this if Nexus can panic due to a currently-valid external request.

@bnaecker
Copy link
Collaborator Author

Ok, I've convinced myself that the concerns I had were not related to this PR specifically. Merging shortly.

@bnaecker bnaecker merged commit 47a416f into main Feb 16, 2024
20 checks passed
@bnaecker bnaecker deleted the order-next-item-subquery branch February 16, 2024 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NextItem query may not return available items in series order
3 participants