test failed in CI: deploy: Failed to reach switch zone after 30 seconds #6802

Open
iliana opened this issue Oct 8, 2024 · 2 comments
Labels
Test Flake: Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@iliana
Contributor

iliana commented Oct 8, 2024

This test failed on a CI run on #6764 (893980e): https://github.com/oxidecomputer/omicron/runs/31106728412

Log showing the specific test failure: https://buildomat.eng.oxide.computer/wg/0/details/01J9CX1DW3PF08TSAP6C5BR05B/Zs83GkZ6JiGbx3t2CTXBHpCoUgzG5CrV4qZnLi8qriO1jf5U/01J9CX240452EBWJPNW49B9RZK#S359

Excerpt from the log showing the failure:

buskin console login: Oct  5 00:12:08 buskin ufs: NOTICE: alloc: /: file system full

[...]

+ retry=31
+ curl --head --silent -o /dev/null 'http://[fd00:1122:3344:101::2]:12224/'
+ [[ 31 -gt 30 ]]
+ echo 'Failed to reach switch zone after 30 seconds'
+ exit 1
Failed to reach switch zone after 30 seconds
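
For reference, the trace above is consistent with a bounded retry loop around the `curl` probe. The following is only a sketch reconstructed from the xtrace output, not the actual deploy script: the function name `wait_until` and its parameters are hypothetical, and the real script may order the increment and the check differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the omicron repo): run a probe command
# until it succeeds, giving up once more than MAX_RETRIES attempts have failed.
#
# usage: wait_until MAX_RETRIES DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    local max_retries=$1 delay=$2
    shift 2
    local retry=0
    until "$@"; do
        retry=$((retry + 1))
        if [[ $retry -gt $max_retries ]]; then
            # Mirrors the "[[ 31 -gt 30 ]]" check seen in the trace.
            printf 'Failed after %d retries: %s\n' "$max_retries" "$*" >&2
            return 1
        fi
        sleep "$delay"
    done
}

# Example invocation matching the probe in the log (left commented out so
# that sourcing this file has no side effects):
# wait_until 30 1 curl --head --silent -o /dev/null 'http://[fd00:1122:3344:101::2]:12224/'
```

With a loop of this shape, the message only says that the probe never succeeded within the retry budget. It does not distinguish between possible causes, so the surrounding logs are needed to find the underlying failure.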
iliana added the Test Flake label on Oct 8, 2024
@davepacheco
Collaborator

I think this usually reflects something having tried to write to the root filesystem (the in-memory ramdisk).

@jgallagher
Contributor

I saw "Failed to reach switch zone after 30 seconds" on #7307, but I don't see a "file system full" notice on it:

https://buildomat.eng.oxide.computer/wg/0/details/01JH3SY7T368DHWXZ8WJRZ99ZF/nmKjLk1wrPoQOe9OEtdwH1SCYwqv0u2ZkLxGG5aKmS1eUGxl/01JH3SYSESAZF95V3PR48AAF49

It looks like the switch zone never got an underlay address, maybe? The oxide-zone-network-setup-log is full of:

7	2025-01-08T21:45:04.778Z	INFO	zone-setup: Ensuring there is a default route
    gateway = Ipv6(fd00:1122:3344:101::1)
8	2025-01-08T21:45:04.788Z	INFO	zone-setup: Cannot ensure there is a default route yet (retrying in 1.449710539s)
    error = failed to ensure default route via gateway fd00:1122:3344:101::1: Command [/usr/sbin/route add -inet6 default -inet6 fd00:1122:3344:101::1] executed and failed with status: exit status: 128  stdout: add net default: gateway fd00:1122:3344:101::1: Network is unreachable\n  stderr: 

and the last logs from oxide-mgd are:

36	2025-01-08T21:33:08.005Z	INFO	slog-rs: handling smf refresh
37	 	[ Jan  8 21:33:08 Method "refresh" exited with status 0. ]
38	2025-01-08T21:33:08.006Z	INFO	slog-rs: starting stats server on smf refresh
39	2025-01-08T21:33:08.007Z	ERRO	slog-rs: failed to start stats server on refresh: underlay address not found
40	2025-01-08T21:33:08.064Z	INFO	slog-rs: [tfportrear0_0] sm initialized with addr fe80::48a8:38ff:feb9:ef81 on if tfportrear0_0 index 3
41	2025-01-08T21:33:08.064Z	DEBG	slog-rs: [tfportrear0_0] starting discovery handler

However, it does look like sled-agent thinks it configured an underlay address in the switch zone prior to all of that?

242	2025-01-08T21:32:58.983Z	INFO	SledAgent (ServiceManager): Re-enabling running switch zone (new address)
    file = sled-agent/src/services.rs:4294
    new = [fd00:1122:3344:101::2, ::1]
    old = [::1]
