Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make crdb-seed more robust to missing files #4193

Closed
david-crespo opened this issue Oct 3, 2023 · 8 comments
Closed

Make crdb-seed more robust to missing files #4193

david-crespo opened this issue Oct 3, 2023 · 8 comments
Assignees
Labels
development Bugs, paper cuts, feature requests, or other thoughts on making omicron development better Testing & Analysis Tests & Analyzers

Comments

@david-crespo
Copy link
Contributor

Trying to run tests on 6bc5e60 after previously successfully running them on (95% sure) the last commit from #4154, I got the following error:

--- STDERR:              omicron-nexus::test_all integration_tests::basic::test_basic_failures ---
log file: /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/test_all-e3c1289ea1f963ce-test_basic_failures.32561.0.log
note: configured to log to "/var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/test_all-e3c1289ea1f963ce-test_basic_failures.32561.0.log"
thread 'integration_tests::basic::test_basic_failures' panicked at 'Cannot copy storage from seed directory: Failed to copy subdirectory /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/crdb-base/c57f4ced1447ac7b5e0722b644563fcf94b4998591ad01a8c2cabf9ec1af380e/logs to /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/.tmp2DCWRn/data/logs

Caused by:
    0: Failed to copy file at /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/crdb-base/c57f4ced1447ac7b5e0722b644563fcf94b4998591ad01a8c2cabf9ec1af380e/logs/cockroach-pebble.log to /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/.tmp2DCWRn/data/logs/cockroach-pebble.log
    1: No such file or directory (os error 2)', /Users/david/oxide/omicron/test-utils/src/dev/mod.rs:156:18

The file it's mad about is a symlink that points to a nonexistent file.

~/oxide/omicron $ l /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/crdb-base/c57f4ced1447ac7b5e0722b644563fcf94b4998591ad01a8c2cabf9ec1af380e/logs
drwxr-x---  - david 27 Sep 20:51 goroutine_dump
drwxr-x---  - david 27 Sep 20:51 heap_profiler
drwxr-x---  - david 27 Sep 20:51 inflight_trace_dump
lrwxr-x--- 68 david 27 Sep 20:51 cockroach-health.log -> cockroach-health.Davids-M1-MBP.david.2023-09-28T01_51_31Z.069657.log
lrwxr-x--- 68 david 27 Sep 20:51 cockroach-pebble.log -> cockroach-pebble.Davids-M1-MBP.david.2023-09-28T01_51_29Z.069657.log
lrwxr-x--- 70 david 27 Sep 20:51 cockroach-security.log -> cockroach-security.Davids-M1-MBP.david.2023-09-28T01_51_33Z.069657.log
lrwxr-x--- 72 david 27 Sep 20:51 cockroach-sql-schema.log -> cockroach-sql-schema.Davids-M1-MBP.david.2023-09-28T01_51_30Z.069657.log
lrwxr-x--- 68 david 27 Sep 20:51 cockroach-stderr.log -> cockroach-stderr.Davids-M1-MBP.david.2023-09-28T01_51_29Z.069657.log
lrwxr-x--- 61 david 27 Sep 20:51 cockroach.log -> cockroach.Davids-M1-MBP.david.2023-09-28T01_51_29Z.069657.log

Manually deleting /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/crdb-base/c57f4ced1447ac7b5e0722b644563fcf94b4998591ad01a8c2cabf9ec1af380e allowed it to recreate the directory from scratch, fixing the issue.

~/oxide/omicron $ l /var/folders/lk/vsx86g0545g3g2tsmvs50py40000gn/T/crdb-base/c57f4ced1447ac7b5e0722b644563fcf94b4998591ad01a8c2cabf9ec1af380e/logs
drwxr-x---    - david  3 Oct 13:52 goroutine_dump
drwxr-x---    - david  3 Oct 13:52 heap_profiler
drwxr-x---    - david  3 Oct 13:52 inflight_trace_dump
.rw-r----- 1.4k david  3 Oct 13:52 cockroach-health.Davids-M1-MBP.david.2023-10-03T18_52_21Z.036414.log
lrwxr-x--- 1.4k david  3 Oct 13:52 cockroach-health.log -> cockroach-health.Davids-M1-MBP.david.2023-10-03T18_52_21Z.036414.log
.rw-r----- 3.6k david  3 Oct 13:52 cockroach-pebble.Davids-M1-MBP.david.2023-10-03T18_52_19Z.036414.log
lrwxr-x--- 3.6k david  3 Oct 13:52 cockroach-pebble.log -> cockroach-pebble.Davids-M1-MBP.david.2023-10-03T18_52_19Z.036414.log
.rw-r----- 1.8k david  3 Oct 13:52 cockroach-security.Davids-M1-MBP.david.2023-10-03T18_52_23Z.036414.log
lrwxr-x--- 1.8k david  3 Oct 13:52 cockroach-security.log -> cockroach-security.Davids-M1-MBP.david.2023-10-03T18_52_23Z.036414.log
.rw-r----- 106k david  3 Oct 13:52 cockroach-sql-schema.Davids-M1-MBP.david.2023-10-03T18_52_20Z.036414.log
lrwxr-x--- 106k david  3 Oct 13:52 cockroach-sql-schema.log -> cockroach-sql-schema.Davids-M1-MBP.david.2023-10-03T18_52_20Z.036414.log
.rw-r----- 1.2k david  3 Oct 13:52 cockroach-stderr.Davids-M1-MBP.david.2023-10-03T18_52_19Z.036414.log
lrwxr-x--- 1.2k david  3 Oct 13:52 cockroach-stderr.log -> cockroach-stderr.Davids-M1-MBP.david.2023-10-03T18_52_19Z.036414.log
.rw-r----- 103k david  3 Oct 13:52 cockroach.Davids-M1-MBP.david.2023-10-03T18_52_19Z.036414.log
lrwxr-x--- 103k david  3 Oct 13:52 cockroach.log -> cockroach.Davids-M1-MBP.david.2023-10-03T18_52_19Z.036414.log
@jordanhendricks jordanhendricks added Testing & Analysis Tests & Analyzers development Bugs, paper cuts, feature requests, or other thoughts on making omicron development better labels Oct 4, 2023
@sunshowers
Copy link
Contributor

It looks like this is due to macOS having a /tmp cleaner: https://apple.stackexchange.com/a/438931

So we need to probably move the crdb-seed dir out of /tmp, into target/ most likely.

@davepacheco
Copy link
Collaborator

I see that the tmp cleaner only seems to run daily and cleans stuff that hasn't been accessed in 3 days. Is the assumption that the seed was built by a test run several days earlier, the tmp cleaner ran in the meantime, then when cargo nextest was run again it skipped regenerating the seed because it assumed it was intact? I think there should be a smoking gun if that's what happened. @david-crespo do you see an entry removing that symlink in /var/log/daily.out?

@sunshowers
Copy link
Contributor

sunshowers commented Oct 4, 2023

Is the assumption that the seed was built by a test run several days earlier, the tmp cleaner ran in the meantime, then when cargo nextest was run again it skipped regenerating the seed because it assumed it was intact?

I think that's the case, yeah. Would definitely be curious about the logs. (I think it's not the symlink, it's the target that got removed if I understood the error correctly)

@david-crespo
Copy link
Contributor Author

david-crespo commented Oct 4, 2023

I don't see anything in there. Worth noting that the seed in question was in fact built 5 days before I ran into this.

~ $ cat /var/log/daily.out

Wed Sep 27 11:54:39 CDT 2023

Thu Sep 28 11:54:17 CDT 2023

Fri Sep 29 11:54:27 CDT 2023

Mon Oct  2 09:41:05 CDT 2023

Tue Oct  3 09:14:47 CDT 2023

Tue Oct  3 21:14:57 CDT 2023

Wed Oct  4 09:21:44 CDT 2023

~ $ cat /var/log/weekly.out

Wed Sep 27 07:51:56 CDT 2023

Wed Oct  4 09:22:00 CDT 2023

@davepacheco
Copy link
Collaborator

Dang. I see much more verbose stuff there:

Wed Mar 22 18:38:02 PDT 2023

Removing old temporary files:
  /tmp/MozillaUpdateLock-2656FF1E876E9973
  /tmp/powerlog

Given how sparse your output is, I'm not sure we can say this wasn't the cause, even though the file isn't listed there.

Assuming this is the explanation: what if we instead store the seed as a tarball instead of a directory tree? Then the tmp cleaner will either wipe it all or none. Either way, when we unpack it, it should be exactly what we built. If we have to stick with a tree, it'd be nice if the script could somehow validate that the tree is intact. This does seem a little hard since we don't know what the structure of it should be. We could store a manifest into it and check that when we go to use it...which is how I got to the tarball idea.

My worry about moving it to "target" is just that it could still become corrupted there, plus this feels more like a tmpfile than a build artifact. But it would probably address the immediate problem.

@sunshowers
Copy link
Contributor

Hmm the tarball idea is definitely interesting and adds atomicity.

@sunshowers
Copy link
Contributor

Part of the issue here also was that if you access a file via a symlink, the atime of the symlink changes (assuming not-noatime) but the atime of the underlying file does change.

% touch real-file
% ln -s real-file link   
  
% stat real-file 
  File: real-file
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: 28h/40d	Inode: 39176300    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    rain)   Gid: ( 1000/    rain)
Access: 2023-10-06 12:11:20.153970486 -0700
Modify: 2023-10-06 12:11:20.153970486 -0700
Change: 2023-10-06 12:11:20.153970486 -0700
 Birth: 2023-10-06 12:11:20.153970486 -0700

% stat link     
  File: link -> real-file
  Size: 9         	Blocks: 8          IO Block: 4096   symbolic link
Device: 28h/40d	Inode: 39176304    Links: 1
Access: (0777/lrwxrwxrwx)  Uid: ( 1000/    rain)   Gid: ( 1000/    rain)
Access: 2023-10-06 12:11:45.438232814 -0700
Modify: 2023-10-06 12:11:28.742059589 -0700
Change: 2023-10-06 12:11:28.742059589 -0700
 Birth: 2023-10-06 12:11:28.742059589 -0700

% echo x > link

% stat real-file
  File: real-file
  Size: 2         	Blocks: 8          IO Block: 4096   regular file
Device: 28h/40d	Inode: 39176300    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    rain)   Gid: ( 1000/    rain)
Access: 2023-10-06 12:11:20.153970486 -0700
Modify: 2023-10-06 12:11:56.770350386 -0700
Change: 2023-10-06 12:11:56.770350386 -0700
 Birth: 2023-10-06 12:11:20.153970486 -0700

% stat link     
  File: link -> real-file
  Size: 9         	Blocks: 8          IO Block: 4096   symbolic link
Device: 28h/40d	Inode: 39176304    Links: 1
Access: (0777/lrwxrwxrwx)  Uid: ( 1000/    rain)   Gid: ( 1000/    rain)
Access: 2023-10-06 12:11:45.438232814 -0700
Modify: 2023-10-06 12:11:28.742059589 -0700
Change: 2023-10-06 12:11:28.742059589 -0700
 Birth: 2023-10-06 12:11:28.742059589 -0700

The Apple file cleaner works based on atimes, not mtimes, and note that:

  • the atime of link changed
  • the mtime of real-file changed but not the atime.

sunshowers added a commit that referenced this issue Oct 7, 2023
Several changes:

1. In #4193, @david-crespo observed some missing files in the crdb-seed generated directory. My suspicion is that that is due to the `/tmp` cleaner that runs on macOS. @davepacheco suggested using a tarball to get atomicity (either the file exists or it doesn't), and it ended up being pretty simple to do that at the end.
2. Teach nexus-test-utils to ensure that the seed tarball exists, fixing `omicron-dev run-all` and anything else that uses nexus-test-utils (and avoiding a dependency on the environment).
3. Move `crdb-seed` to `dev-tools` (thanks Dave for pointing it out!)
4. Add a way to invalidate the cache with `CRDB_SEED_INVALIDATE=1` in the environment.
5. Add a readme for `crdb-seed`.

Fixes #4206.

Hopefully addresses #4193.
@sunshowers
Copy link
Contributor

#4208 should hopefully address this. I'm going to close this for now -- let's reopen if we run into this again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Bugs, paper cuts, feature requests, or other thoughts on making omicron development better Testing & Analysis Tests & Analyzers
Projects
None yet
Development

No branches or pull requests

4 participants