Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove nexus-test-utils build.rs #4056

Closed
wants to merge 2 commits into from
Closed

Remove nexus-test-utils build.rs #4056

wants to merge 2 commits into from

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Sep 8, 2023

We previously used a build.rs script to create a "seed" directory for CRDB, so that tests wouldn't need to duplicate the work of initializing the DB for all our tests.

This worked, but unfortunately meant that the work of CRDB construction happened at build-time, which (1) could be unnecessary, in cases where we aren't running tests, and (2) meant that we used a build.rs script to do this work, which would cause problems for rust-analyzer.

This PR replaces that build.rs script with a more dynamic approach: We still construct the seed directory, but we do so "at test time", letting the first few invocations race with each other to create a directory with a digest name based on the CRDB schema and version. "Whoever wins first" gets to populate the seed directory, and all subsequent tests rely on that version.

@smklein smklein requested a review from steveklabnik September 8, 2023 01:52
@smklein
Copy link
Collaborator Author

smklein commented Sep 8, 2023

As an example of what I see when I run cargo test -p omicron-nexus now:

# A bunch of tests all trying to populate CRDB concurrently...
🏠 /tmp/crdb-base                                                                                                      
 $ la -la                                                                                                              
total 292                                                                                                              
drwxrwxr-x  18 smklein smklein   4096 Sep  7 18:54 .                                                                   
drwxrwxrwt 895 root    root    225280 Sep  7 18:54 ..                                                                  
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpabJZ2u
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpAqAaQZ
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpBDZD3U
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpdI7OzM
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpFE1NhU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpIVTLfK
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjC1uks
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjWGSEw
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpLT6IgR
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpMC92vU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpP8rpzF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmprPeh8E
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpVVZlZy
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpw0uDZF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpxbqg8i
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpy3Eibf

# One of them wins, and renames their .tmp directory to the digest here...
🏠 /tmp/crdb-base 
 $ la -la
total 288
drwxrwxr-x  17 smklein smklein   4096 Sep  7 18:54 .
drwxrwxrwt 895 root    root    225280 Sep  7 18:54 ..
drwxrwxr-x   4 smklein smklein   4096 Sep  7 18:54 586b14361321b0e0e48755d42040c5e5ed6b18302bdf2b2e356dec5cf466a294
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpabJZ2u
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpAqAaQZ
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpBDZD3U
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpFE1NhU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpIVTLfK
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjC1uks
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpjWGSEw
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpMC92vU
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpP8rpzF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmprPeh8E
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpVVZlZy
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpw0uDZF
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpxbqg8i
drwxrwxr-x   5 smklein smklein   4096 Sep  7 18:54 .tmpy3Eibf

# All others finish, throw away their .tmp directories, and use the same seed.
# All subsequent tests don't bother creating .tmp directories, they just use the seed.
🏠 /tmp/crdb-base 
 $ la -la
total 288
drwxrwxr-x  17 smklein smklein   4096 Sep  7 18:54 .
drwxrwxrwt 895 root    root    225280 Sep  7 18:54 ..
drwxrwxr-x   4 smklein smklein   4096 Sep  7 18:54 586b14361321b0e0e48755d42040c5e5ed6b18302bdf2b2e356dec5cf466a294

@smklein smklein requested a review from davepacheco September 8, 2023 02:42

# We expect the seed CRDB to be placed here, so we explicitly remove it. There
# is no "final test", so cargo test wouldn't really know when to remove it.
rm -rf "$TEST_TMPDIR/crdb-base"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This data used to be placed in:

OUT_DIR — the folder in which all output and intermediate artifacts should be placed. This folder is inside the build directory for the package being built, and it is unique for the package in question.

(Which is an environment variable cargo sets for build scripts)

Well, this isn't run by a build script any longer, so now it's going in $TEST_TMPDIR, which is why I'm removing it explicitly now.

@jordanhendricks
Copy link
Contributor

jordanhendricks commented Sep 8, 2023

I have in the past found requiring a cockroach binary as a part of the omicron build to be a bit surprising (though it makes sense to me why things ended up that way). I also think an important (and perhaps also surprising) consequence of relying on PATH as a part of the build environment for omicron could have unexpected impacts with IDE-like tools that rely on building source (namely, rust-analyzer).

I am still ramping up on omicron tests (and don't have any experience with cargo-nextest at all yet), but I am curious if there's a way the creation of the directory could be done before the invocation of any tests. It seems to me that introducing an intentional race to create a shared resource for the tests could introduce bugs in the tests themselves. I also wonder about how much work the creation of each of these seed directories is doing -- is it going to consume more time / CPU to have an arbitrary number of tests attempt to create the directory?

Comment on lines +29 to +34
// However, the process for creating these seed directories is not atomic.
// We create a temporary directory within:
//
// - /tmp/crdb-base/...
//
// And rename it to the final "digest" location once it has been fully created.
Copy link
Contributor

@jordanhendricks jordanhendricks Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more (and assuming it's hard to serialize the creation of this directory ahead of test invocations, which I assumed, but as I said, don't have a lot of experience with nextest yet and am less sure of what features it has), one way to make things more explicit could be to use a file as a lock to take ownership of setting up the directory.

So the winning test path would:

  • create the directory
  • create a lockfile in the directory
  • lock the file
  • create the database

To protect against races, I think an approach that might work is:

  • mkdir the directory; if the test gets EEXIST, then skip to trying to open the lockfile (without O_CREAT | O_EXCL and try to lock the lockfile with flock()
  • open O_CREAT | O_EXCL of the lockfile; if the test gets EEXIST, skip to flock
  • flock the lockfile, then check if the database exists
  • if the database doesn't exist, create it

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this approach, but it falls over if any test crashes (or is ^C-ed, or fails for any other reason) during DB setup.

I do understand that the flock portion of this could help mitigate that concern, but that's also pretty filesystem / platform specific (see: https://apenwarr.ca/log/20101213 - doesn't work on NFS, silently fails in some situations, and even https://docs.rs/fs2/latest/fs2/trait.FileExt.html , which appears to be a fairly popular filesystem locking crate, says "File locks are a cross-platform hazard since the file lock APIs exposed by operating system kernels vary in subtle and not-so-subtle ways.", and is not regularly tested on illumos).

Also, ultimately, if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".

I thought the approach of "do an atomic rename" seemed simpler -- the directory can only (atomically) exist after it has been successfully created, and if any test failed during setup, their output would be isolated. Besides, the work of "spin up a CRDB node and populate it" was the default behavior for all tests using the DB before I tried adding this "seed" optimization. At worst, it's only a small performance hit for the first couple tests that are run (and if the schema changes, that seed directory still exists, so it can be re-used for subsequent test runs).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to second @smklein's concern about using file locks anywhere for tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, a few things about file locking:

  • The guidance in the article you link is not unfair from a historical perspective, but it is also more than a decade old, and describes issues with versions of Linux and Mac OS X that were old at that time
  • We would use whole-file OFD-style exclusive locks; this avoids any of the issues one might bump into with shared locks and upgrades
  • flock(3C) is robust and well-tested on illumos, FWIW. We built it to work the way it does on Linux (at the time, for our LX brand work) where it is also well-tested enough at this stage as far as I can tell. Our flock(3C) creates OFD-style locks, which we also support (again for compatibility with Linux) with F_OFD_* commands via fcntl(2).
  • We're talking about files in /tmp, which should never be NFS-backed; the era of NFS root for diskless workstations has since passed
  • As I recall the fs2 crate is long-since abandoned; we forked it for a while as fs3, but I believe the community has since moved on to fs4

if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".

I think this logic is relatively simple: put a marker file in the directory, or write marker data into the lock file, once the setup is complete. If the marker is not present and correct when you get to lock the file, remove the database files and start again. Otherwise, release the lock and use the database seed.

You could also use a combination of the atomic directory rename you're talking about, and the lock file: lock the lock file, and:

  • if you see the final expected directory name, say, database, you know you can use it; drop the lock and do that
  • otherwise:
    • if you see a temporary name, like, say, notyet, remove it now
    • create the database, rename the directory to database

Either way, once you drop the lock, there will be a correctly provisioned seed in database.

Here is a simple example of using flock(3C) through the fs4 crate: https://github.com/jclulow/junk-locker

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's okay with ya'll, I'd like to punt on this. Not necessarily saying we shouldn't do it, but maybe not within this PR?

  • This locking proposal is an optimization to avoid a minor amount of wasted work in tests
  • If we introduce this mechanism, I do think it's more complicated, and would want to spend more time validating and testing it...
  • ... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement Support for setup and teardown scripts nextest-rs/nextest#683 anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement Support for setup and teardown scripts nextest-rs/nextest#683 anyway?

If we're going to toss it out anyway, could we just wait until we've properly assessed and prioritized whether we should do the nextest feature?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option I've heard people use is a Mutex<()>, almost like a barrier. See rust-lang/rust#43155 for some discussion on that. Generally this is for running tests themselves serially, but I could imagine adapting this purely for the setup instead.

The cleanest solution would be the upstream feature. If that were coming soon, then that would be ideal.

If not though, basically this is a game of "which way of emulating this feature sucks the least." Build scripts have their own annoying downsides. But I also agree with @smklein that this discussion feels like premature optimization. Is that actually the case? Does this patch slow down the test suite, and if so, by how much?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW @steveklabnik I was more worried about intentionally racing the tests against each other. Maybe that's not a fair concern. I also agree that the upstream feature is cleanest, and hence was curious about that.

@davepacheco
Copy link
Collaborator

davepacheco commented Sep 8, 2023

It seems potentially counterproductive to the goal of reducing iteration time to do a bunch of work in parallel many times over and throw most of it away. (Maybe that's the wrong way to look at it though -- it's just a cache, and we were doing this even more before the build script.)

But I don't think I understand the problem this is trying to solve. It's true that if you're just building with no intent to run tests, this is extra work. How long does it actually take? Would it make sense to use cargo check? Or avoid building nexus-test-utils in that case? (That may be impractical -- I'm not sure.)

Is this just about rust-analyzer? Is it about the PATH issue involving cockroach or also that it takes a long time?

@smklein
Copy link
Collaborator Author

smklein commented Sep 8, 2023

It seems potentially counterproductive to the goal of reducing iteration time to do a bunch of work in parallel many times over and throw most of it away. (Maybe that's the wrong way to look at it though -- it's just a cache, and we were doing this even more before the build script.)

Yes, it's just a fraction of "what we used to do before" from the testing point-of-view.

But I don't think I understand the problem this is trying to solve. It's true that if you're just building with no intent to run tests, this is extra work. How long does it actually take?

In a build, the old build script added a few seconds, depending on your machine.

Would it make sense to use cargo check? Or avoid building nexus-test-utils in that case? (That may be impractical -- I'm not sure.)

This impacts building the Nexus integration tests, or just cargo build, and gets re-run if any of the following change, as they are all dependencies of nexus/test-utils:

  • nexus-db-queries
  • nexus-types
  • omicron-common
  • omicron-sled-agent
  • oximeter
  • etc

Is this just about rust-analyzer? Is it about the PATH issue involving cockroach or also that it takes a long time?

This was a major part of it too. When I added the code to the build.rs script, I mentioned "this doesn't need to happen at build time" in a comment, and build.rs scripts are their own bag of complexity. Yes, this issue can be mitigated, but it's "just another pain point" building Nexus.

@smklein
Copy link
Collaborator Author

smklein commented Sep 27, 2023

Closing in favor of #4150

@smklein smklein closed this Sep 27, 2023
@smklein smklein deleted the less-buildrs branch September 27, 2023 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants