Remove nexus-test-utils build.rs #4056

Closed
wants to merge 2 commits
5 changes: 5 additions & 0 deletions .github/buildomat/build-and-test.sh
@@ -71,6 +71,11 @@ ptime -m timeout 1h cargo test --doc --locked --verbose --no-fail-fast
# to check is to try to remove it with `rmdir`.
#
unset TMPDIR

# We expect the seed CRDB to be placed here, so we explicitly remove it. There
# is no "final test", so cargo test wouldn't really know when to remove it.
rm -rf "$TEST_TMPDIR/crdb-base"
Collaborator Author:

This data used to be placed in:

> OUT_DIR — the folder in which all output and intermediate artifacts should be placed. This folder is inside the build directory for the package being built, and it is unique for the package in question.

(Which is an environment variable cargo sets for build scripts)

Well, this isn't run by a build script any longer, so now it's going in $TEST_TMPDIR, which is why I'm removing it explicitly now.
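To make the contrast concrete, here is a minimal sketch (the helper names are illustrative, not code from this PR):

```rust
use std::path::PathBuf;

// Old world: the seed lived under OUT_DIR, which cargo only defines for
// packages with a build script (the deleted build.rs-era code read it at
// compile time via env!("OUT_DIR")).
fn build_script_seed_dir() -> PathBuf {
    PathBuf::from(
        std::env::var("OUT_DIR").expect("cargo sets OUT_DIR for build scripts"),
    )
    .join("crdb-base")
}

// New world: the seed is created at test runtime under the system temp dir
// (std::env::temp_dir() honors TMPDIR on Unix), which the CI script expects
// to land under $TEST_TMPDIR -- hence the explicit rm -rf above.
fn runtime_seed_dir() -> PathBuf {
    std::env::temp_dir().join("crdb-base")
}
```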


echo "files in $TEST_TMPDIR (none expected on success):" >&2
find "$TEST_TMPDIR" -ls
rmdir "$TEST_TMPDIR"
3 changes: 2 additions & 1 deletion Cargo.lock

Some generated files are not rendered by default.

7 changes: 2 additions & 5 deletions nexus/test-utils/Cargo.toml
@@ -15,6 +15,7 @@ dns-server.workspace = true
dns-service-client.workspace = true
dropshot.workspace = true
headers.workspace = true
hex.workspace = true
http.workspace = true
hyper.workspace = true
internal-dns.workspace = true
@@ -30,6 +31,7 @@ oximeter-client.workspace = true
oximeter-collector.workspace = true
oximeter-producer.workspace = true
parse-display.workspace = true
ring.workspace = true
serde.workspace = true
serde_json.workspace = true
serde_urlencoded.workspace = true
@@ -38,8 +40,3 @@ tempfile.workspace = true
trust-dns-proto.workspace = true
trust-dns-resolver.workspace = true
uuid.workspace = true

[build-dependencies]
dropshot.workspace = true
omicron-test-utils.workspace = true
tokio.workspace = true
37 changes: 0 additions & 37 deletions nexus/test-utils/build.rs

This file was deleted.

75 changes: 61 additions & 14 deletions nexus/test-utils/src/db.rs
@@ -4,29 +4,76 @@

//! Database testing facilities.

use camino::Utf8PathBuf;
use omicron_test_utils::dev;
use slog::Logger;
use std::path::PathBuf;

/// Path to the "seed" CockroachDB directory.
///
/// Populating CockroachDB unfortunately isn't free - creation of
/// tables, indices, and users takes several seconds to complete.
///
/// By creating a "seed" version of the database, we can cut down
/// on the time spent performing this operation. Instead, we opt
/// to copy the database from this seed location.
fn seed_dir() -> PathBuf {
    PathBuf::from(concat!(env!("OUT_DIR"), "/crdb-base"))
// Creates a string identifier for the current DB schema and version.
//
// The goal here is to allow to create different "seed" directories
// for each revision of the DB.
fn digest_unique_to_schema() -> String {
    let schema = include_str!("../../../schema/crdb/dbinit.sql");
    let crdb_version = include_str!("../../../tools/cockroachdb_version");
    let mut ctx = ring::digest::Context::new(&ring::digest::SHA256);
    ctx.update(&schema.as_bytes());
    ctx.update(&crdb_version.as_bytes());
    let digest = ctx.finish();
    hex::encode(digest.as_ref())
}

// Seed directories will be created within:
//
// - /tmp/crdb-base/<digest unique to schema>
//
// However, the process for creating these seed directories is not atomic.
// We create a temporary directory within:
//
// - /tmp/crdb-base/...
//
// And rename it to the final "digest" location once it has been fully created.
Comment on lines +29 to +34
Contributor @jordanhendricks commented Sep 8, 2023:

Thinking about this a bit more (and assuming it's hard to serialize the creation of this directory ahead of test invocations, which I assumed, but as I said, don't have a lot of experience with nextest yet and am less sure of what features it has), one way to make things more explicit could be to use a file as a lock to take ownership of setting up the directory.

So the winning test path would:

  • create the directory
  • create a lockfile in the directory
  • lock the file
  • create the database

To protect against races, I think an approach that might work is:

  • mkdir the directory; if the test gets EEXIST, then skip to trying to open the lockfile (without O_CREAT | O_EXCL) and try to lock the lockfile with flock()
  • open O_CREAT | O_EXCL of the lockfile; if the test gets EEXIST, skip to flock
  • flock the lockfile, then check if the database exists
  • if the database doesn't exist, create it

What do you think?
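A rough sketch of that EEXIST-tolerant sequence, just to illustrate the proposal (the directory layout and helper name are hypothetical, and the actual locking would come from flock via something like the fs2/fs4 `FileExt` trait):

```rust
use std::fs::{self, File, OpenOptions};
use std::io::ErrorKind;
use std::path::Path;

// Whoever loses any of the races simply falls through to opening the same
// lockfile; the winner is decided by whoever gets the exclusive flock.
fn open_setup_lock(dir: &Path) -> std::io::Result<File> {
    // mkdir; tolerate EEXIST from a concurrent test.
    if let Err(e) = fs::create_dir(dir) {
        if e.kind() != ErrorKind::AlreadyExists {
            return Err(e);
        }
    }
    // O_CREAT | O_EXCL; on EEXIST, just open the existing lockfile.
    let lock_path = dir.join("setup.lock");
    match OpenOptions::new().write(true).create_new(true).open(&lock_path) {
        Ok(f) => Ok(f),
        Err(e) if e.kind() == ErrorKind::AlreadyExists => {
            OpenOptions::new().write(true).open(&lock_path)
        }
        Err(e) => Err(e),
    }
    // The caller would then flock() this file exclusively, check whether the
    // database already exists, and create it only if it does not.
}
```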

Collaborator Author:

I thought about this approach, but it falls over if any test crashes (or is ^C-ed, or fails for any other reason) during DB setup.

I do understand that the flock portion of this could help mitigate that concern, but that's also pretty filesystem / platform specific (see: https://apenwarr.ca/log/20101213 - doesn't work on NFS, silently fails in some situations, and even https://docs.rs/fs2/latest/fs2/trait.FileExt.html , which appears to be a fairly popular filesystem locking crate, says "File locks are a cross-platform hazard since the file lock APIs exposed by operating system kernels vary in subtle and not-so-subtle ways.", and is not regularly tested on illumos).

Also, ultimately, if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".

I thought the approach of "do an atomic rename" seemed simpler -- the directory can only (atomically) exist after it has been successfully created, and if any test failed during setup, their output would be isolated. Besides, the work of "spin up a CRDB node and populate it" was the default behavior for all tests using the DB before I tried adding this "seed" optimization. At worst, it's only a small performance hit for the first couple tests that are run (and if the schema changes, that seed directory still exists, so it can be re-used for subsequent test runs).

Contributor:

I'd have to second @smklein's concern about using file locks anywhere for tests.

Collaborator:

First, a few things about file locking:

  • The guidance in the article you link is not unfair from a historical perspective, but it is also more than a decade old, and describes issues with versions of Linux and Mac OS X that were old at that time
  • We would use whole-file OFD-style exclusive locks; this avoids any of the issues one might bump into with shared locks and upgrades
  • flock(3C) is robust and well-tested on illumos, FWIW. We built it to work the way it does on Linux (at the time, for our LX brand work) where it is also well-tested enough at this stage as far as I can tell. Our flock(3C) creates OFD-style locks, which we also support (again for compatibility with Linux) with F_OFD_* commands via fcntl(2).
  • We're talking about files in /tmp, which should never be NFS-backed; the era of NFS root for diskless workstations has since passed
  • As I recall the fs2 crate is long-since abandoned; we forked it for a while as fs3, but I believe the community has since moved on to fs4

> if some process failed partway through creation, and we needed to hand-off creation of the "seed directory" to a different test process, we'd need additional logic to distinguish between "this directory exists because I should use it" vs "this directory exists and I should try to salvage it, or destroy it and start over".

I think this logic is relatively simple: put a marker file in the directory, or write marker data into the lock file, once the setup is complete. If the marker is not present and correct when you get to lock the file, remove the database files and start again. Otherwise, release the lock and use the database seed.

You could also use a combination of the atomic directory rename you're talking about, and the lock file: lock the lock file, and:

  • if you see the final expected directory name, say, database, you know you can use it; drop the lock and do that
  • otherwise:
    • if you see a temporary name, like, say, notyet, remove it now
    • create the database, rename the directory to database

Either way, once you drop the lock, there will be a correctly provisioned seed in database.

Here is a simple example of using flock(3C) through the fs4 crate: https://github.com/jclulow/junk-locker
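For illustration, a sketch of that combination (exclusive lock plus the atomic rename), assuming the fs2-style `FileExt::lock_exclusive` API that fs4 exposes; the paths and names here are hypothetical, not from this PR:

```rust
use fs4::FileExt; // assumed fs2-compatible whole-file lock API
use std::fs::{self, File};
use std::path::{Path, PathBuf};

fn ensure_seed(base: &Path) -> std::io::Result<PathBuf> {
    fs::create_dir_all(base)?;
    let lockfile = File::create(base.join(".seed.lock"))?;
    lockfile.lock_exclusive()?; // blocks until we own seed setup

    let done = base.join("database");
    if !done.exists() {
        // Any leftover partial build can be discarded while holding the lock.
        let staging = base.join("notyet");
        let _ = fs::remove_dir_all(&staging);
        fs::create_dir_all(&staging)?;
        // ... populate the CockroachDB seed in `staging` ...
        fs::rename(&staging, &done)?; // only a complete seed gets the final name
    }

    // The lock is released when `lockfile` is closed on drop.
    Ok(done)
}
```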

Collaborator Author:

If it's okay with y'all, I'd like to punt on this. Not necessarily saying we shouldn't do it, but maybe not within this PR?

  • This locking proposal is an optimization to avoid a minor amount of wasted work in tests
  • If we introduce this mechanism, I do think it's more complicated, and would want to spend more time validating and testing it...
  • ... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement "Support for setup and teardown scripts" (nextest-rs/nextest#683) anyway?

Contributor:

> ... and the benefits of that testing and optimization seem like they'd both be tossed out the window if we implement "Support for setup and teardown scripts" (nextest-rs/nextest#683) anyway?

If we're going to toss it out anyway, could we just wait until we've properly assessed and prioritized whether we should do the nextest feature?

Contributor:

Another option I've heard people use is a Mutex<()>, almost like a barrier. See rust-lang/rust#43155 for some discussion on that. Generally this is for running tests themselves serially, but I could imagine adapting this purely for the setup instead.

The cleanest solution would be the upstream feature. If that were coming soon, then that would be ideal.

If not though, basically this is a game of "which way of emulating this feature sucks the least." Build scripts have their own annoying downsides. But I also agree with @smklein that this discussion feels like premature optimization. Is that actually the case? Does this patch slow down the test suite, and if so, by how much?
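As a sketch, the in-process version of that idea might look like the following; note it only serializes setup among tests sharing a process, so it would not help across the separate processes nextest spawns per test:

```rust
use std::sync::{Mutex, OnceLock};

// Process-wide lock guarding one-time seed setup.
static SEED_SETUP: OnceLock<Mutex<()>> = OnceLock::new();

fn with_seed_setup_lock<T>(f: impl FnOnce() -> T) -> T {
    let _guard = SEED_SETUP
        .get_or_init(|| Mutex::new(()))
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    f() // only one test in this process runs setup at a time
}
```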

Contributor:

FWIW @steveklabnik I was more worried about intentionally racing the tests against each other. Maybe that's not a fair concern. I also agree that the upstream feature is cleanest, and hence was curious about that.

async fn ensure_seed_directory_exists(log: &Logger) -> Utf8PathBuf {
    let base_seed_dir = Utf8PathBuf::from_path_buf(std::env::temp_dir())
        .expect("Not a UTF-8 path")
        .join("crdb-base");
    std::fs::create_dir_all(&base_seed_dir).unwrap();
    let desired_seed_dir = base_seed_dir.join(digest_unique_to_schema());

    // If the directory didn't exist when we started, try to create it.
    //
    // Note that this may be executed concurrently by many tests, so
    // we should consider it possible for another caller to create this
    // seed directory before we finish setting it up ourselves.
    if !desired_seed_dir.exists() {
        let tmp_seed_dir =
            camino_tempfile::Utf8TempDir::new_in(base_seed_dir).unwrap();
        dev::test_setup_database_seed(log, tmp_seed_dir.path()).await;

        // If we can successfully perform the rename, we made the seed directory
        // faster than other tests.
        //
        // If we couldn't perform the rename, the directory might already exist.
        // Check that this is the error we encountered -- otherwise, we're
        // struggling.
        if let Err(err) =
            std::fs::rename(tmp_seed_dir.path(), &desired_seed_dir)
        {
            if !desired_seed_dir.exists() {
                panic!("Cannot rename seed directory for CockroachDB: {err}");
            }
        }
    }

    desired_seed_dir
}

/// Wrapper around [`dev::test_setup_database`] which uses a
/// seed directory provided at build-time.
/// seed directory that we construct if it does not already exist.
pub async fn test_setup_database(log: &Logger) -> dev::db::CockroachInstance {
    let dir = seed_dir();
    let dir = ensure_seed_directory_exists(log).await;
    dev::test_setup_database(
        log,
        dev::StorageSource::CopyFromSeed { input_dir: dir },
        dev::StorageSource::CopyFromSeed { input_dir: dir.into() },
    )
    .await
}
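A hypothetical caller, to show how the wrapper is used (the `cleanup()` call assumes the `CockroachInstance` API from omicron-test-utils):

```rust
// Illustrative test, not part of this diff.
#[tokio::test]
async fn uses_seeded_database() {
    let log = slog::Logger::root(slog::Discard, slog::o!());
    // The first caller pays for seeding; later callers copy the cached seed.
    let mut db = nexus_test_utils::db::test_setup_database(&log).await;
    // ... exercise the database ...
    db.cleanup().await.unwrap();
}
```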
3 changes: 2 additions & 1 deletion test-utils/src/dev/mod.rs
@@ -87,7 +87,8 @@ pub enum StorageSource {
///
/// This is intended to optimize subsequent calls to [`test_setup_database`]
/// by reducing the latency of populating the storage directory.
pub async fn test_setup_database_seed(log: &Logger, dir: &Path) {
pub async fn test_setup_database_seed<P: AsRef<Path>>(log: &Logger, dir: P) {
    let dir = dir.as_ref();
    let _ = std::fs::remove_dir_all(&dir);
    std::fs::create_dir_all(&dir).unwrap();
    let mut db = setup_database(