New instances in a rebooted sled are unable to reach existing instances in other sleds on their private IPs #5214
Comments
The firewall entries of the opte port for instance1 look a lot more normal.
|
So looking into the firewall stats on both sides using
It doesn't look like a firewalling issue, which is supported by the default (
Taking a look at V2P mappings
kyle@KyleOxide scraps % git diff sled8.log sled16.log
diff --git a/sled8.log b/sled16.log
index 0eddfba..e41d136 100644
--- a/sled8.log
+++ b/sled16.log
@@ -4,15 +4,9 @@ VPC 1508093
IPv4 mappings
----------------------------------------------------------------------
VPC IP VPC MAC ADDR UNDERLAY IP
-172.30.0.6 A8:40:25:FA:A2:20 fd00:1122:3344:105::1
-172.30.0.8 A8:40:25:FD:E4:2F fd00:1122:3344:105::1
172.30.0.9 A8:40:25:FB:4B:4C fd00:1122:3344:106::1
172.30.0.10 A8:40:25:FC:D7:FF fd00:1122:3344:106::1
-172.30.0.11 A8:40:25:F0:F6:95 fd00:1122:3344:105::1
172.30.0.12 A8:40:25:F7:09:D1 fd00:1122:3344:103::1
-172.30.0.13 A8:40:25:F2:DD:C6 fd00:1122:3344:10a::1
-172.30.0.14 A8:40:25:FA:A8:4D fd00:1122:3344:101::1
-172.30.0.15 A8:40:25:F8:1C:AC fd00:1122:3344:106::1
172.30.0.16 A8:40:25:F3:7D:A8 fd00:1122:3344:106::1
172.30.0.17 A8:40:25:FB:E7:50 fd00:1122:3344:10a::1
172.30.0.18 A8:40:25:F1:A9:EA fd00:1122:3344:105::1
@@ -21,10 +15,6 @@ VPC IP VPC MAC ADDR UNDERLAY IP
172.30.0.21 A8:40:25:F9:F8:1B fd00:1122:3344:108::1
172.30.0.22 A8:40:25:F0:1C:50 fd00:1122:3344:105::1
172.30.0.23 A8:40:25:F4:C8:59 fd00:1122:3344:109::1
-172.30.0.24 A8:40:25:FA:29:E1 fd00:1122:3344:103::1
-172.30.0.25 A8:40:25:F9:D1:DB fd00:1122:3344:105::1
-172.30.0.26 A8:40:25:F1:A3:87 fd00:1122:3344:105::1
-172.30.0.27 A8:40:25:FB:14:E4 fd00:1122:3344:10a::1
192.168.32.5 A8:40:25:F7:90:73 fd00:1122:3344:101::1
192.168.32.6 A8:40:25:FD:AD:A7 fd00:1122:3344:106::1
192.168.32.7 A8:40:25:FC:E0:AE fd00:1122:3344:106::1
@@ -39,11 +29,11 @@ VPC IP VPC MAC ADDR UNDERLAY IP
192.168.32.16 A8:40:25:F3:F3:4C fd00:1122:3344:10b::1
192.168.32.17 A8:40:25:F9:B2:26 fd00:1122:3344:109::1
192.168.32.18 A8:40:25:F8:46:69 fd00:1122:3344:106::1
+192.168.32.19 A8:40:25:F8:55:40 fd00:1122:3344:103::1
192.168.32.20 A8:40:25:F0:B4:99 fd00:1122:3344:106::1
192.168.32.21 A8:40:25:F8:3D:31 fd00:1122:3344:10a::1
192.168.32.22 A8:40:25:F0:B0:86 fd00:1122:3344:101::1
192.168.32.23 A8:40:25:F2:F5:1C fd00:1122:3344:109::1
-192.168.32.24 A8:40:25:FA:D3:B7 fd00:1122:3344:108::1
IPv6 mappings
----------------------------------------------------------------------
Specifically, |
Here are the most recent start times of the sled-agent and dendrite services on BRM42220014 (sled 16):
BRM44220011 has not been rebooted and its sled-agent has been running since the last rack update:
Instance 2 was created after the scrimlet/service restarts:
Instance 1 was created before the scrimlet reboots and remained up and running during the scrimlet/service restarts:
|
I've checked that the v2p entries highlighted as missing on BRM42220014 correspond to instances created prior to the reboot. So apparently, the issue is more broadly a failure to backfill v2p entries that exist prior to the sled reboot. This seems to be an area for rpw so I'm reassigning the ticket to @internet-diglett. |
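The comparison behind that conclusion can be sketched as a set difference over two V2P dumps. This is only a rough illustration: the helper names `parse_v2p` and `missing_on` are made up here, and the parser assumes each mapping row is three whitespace-separated fields (VPC IP, MAC, underlay IP) as in the diff above. The sample rows are copied from that diff.

```python
import ipaddress


def parse_v2p(dump):
    """Map VPC IP -> (MAC, underlay IP) for every mapping row in a dump.

    Assumption: a mapping row is exactly three whitespace-separated fields
    whose first and last fields parse as IP addresses. Headers, separators
    and blank lines are skipped.
    """
    mappings = {}
    for line in dump.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        vpc_ip, mac, underlay = fields
        try:
            ipaddress.ip_address(vpc_ip)
            ipaddress.ip_address(underlay)
        except ValueError:
            continue
        mappings[vpc_ip] = (mac, underlay)
    return mappings


def _sort_key(ip):
    addr = ipaddress.ip_address(ip)
    return (addr.version, int(addr))


def missing_on(reference_dump, suspect_dump):
    """VPC IPs present in the reference dump but absent from the suspect one."""
    reference = parse_v2p(reference_dump)
    suspect = parse_v2p(suspect_dump)
    return sorted(set(reference) - set(suspect), key=_sort_key)


# Excerpt of the two dumps, taken from the diff in this thread.
SLED8 = """\
IPv4 mappings
----------------------------------------------------------------------
VPC IP VPC MAC ADDR UNDERLAY IP
172.30.0.6 A8:40:25:FA:A2:20 fd00:1122:3344:105::1
172.30.0.8 A8:40:25:FD:E4:2F fd00:1122:3344:105::1
172.30.0.9 A8:40:25:FB:4B:4C fd00:1122:3344:106::1
172.30.0.10 A8:40:25:FC:D7:FF fd00:1122:3344:106::1
"""

SLED16 = """\
IPv4 mappings
----------------------------------------------------------------------
VPC IP VPC MAC ADDR UNDERLAY IP
172.30.0.9 A8:40:25:FB:4B:4C fd00:1122:3344:106::1
172.30.0.10 A8:40:25:FC:D7:FF fd00:1122:3344:106::1
"""

print(missing_on(SLED8, SLED16))
```

The resulting list of VPC IPs is what would then be checked against instance creation times.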
To be clear, this issue is not a regression and has always been there because v2p mappings are created only during an instance start event. The saga/push approach is a linear way of broadcasting information and doesn't account for exceptions such as sled reboot/panic and sled outage (#4259). The issue is masked to some extent because we usually stop all running instances prior to planned sled reboots or let them fail (and eventually get destroyed) otherwise. We/customers could have run into it in the past during random sled panics but worked around it unknowingly by stopping/starting the unreachable instances. |
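A toy simulation of the push model described above may make the failure mode easier to see. Everything here is hypothetical (the `Sled` class, `start_instance`, and the instance names are illustrative, not Omicron code), and it assumes a sled's V2P table is lost on reboot, which is what the missing entries on sled 16 suggest.

```python
class Sled:
    """Hypothetical stand-in for a sled and its V2P table."""

    def __init__(self, name):
        self.name = name
        self.v2p = set()

    def reboot(self):
        # Assumption: the mappings do not survive a reboot.
        self.v2p.clear()


def start_instance(sleds, instance):
    """Push model: broadcast one mapping at instance start, and never again."""
    for sled in sleds:
        sled.v2p.add(instance)


sleds = [Sled("sled8"), Sled("sled16")]

start_instance(sleds, "instance1")  # created before the reboot
sleds[1].reboot()                   # sled16 loses everything it had
start_instance(sleds, "instance2")  # created after the reboot

for sled in sleds:
    print(sled.name, sorted(sled.v2p))
```

In this sketch sled16 ends up knowing only about instance2, so anything running on it cannot resolve instance1. Calling `start_instance(sleds, "instance1")` again (the stop/start workaround) re-pushes the mapping and hides the problem.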
For the record: I asked whether this was a blocker for R8 (for delivery of "add sled"). We determined that it's likely not. That's because in R8 we'd be doing "add sled" during the upgrade maintenance window. Because of the way updates work today, all instances would be started after that point (even those that had been running prior to the window). So we shouldn't run into this just because of "add sled". |
@davepacheco that seems correct. I don't see this causing any issues in that scenario, |
TODO
---
- [x] Extend db view to include probe v2p mappings
- [x] Update sagas to trigger rpw activation instead of directly configuring v2p mappings
- [x] Test that the `delete` functionality cleans up v2p mappings

Related
---
Resolves #5214
Resolves #4259
Resolves #3107
- [x] Depends on oxidecomputer/opte#494
- [x] Depends on oxidecomputer/meta#409
- [x] Depends on oxidecomputer/maghemite#244

---------

Co-authored-by: Levon Tarver <[email protected]>
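The shift from "sagas configure mappings directly" to "sagas trigger an activation" can be sketched roughly as below. This is not the code from the PR: `V2pReconciler`, `activate`, `run_once` and `db_view` are hypothetical names, and the sleds are modelled as plain sets. The point is only that each run compares the desired state (from the database view) with what each sled has, then adds what is missing and removes what is stale.

```python
def reconcile(desired, actual):
    """Return (to_add, to_remove) so that `actual` converges on `desired`."""
    return desired - actual, actual - desired


class V2pReconciler:
    """Hypothetical background task; names are illustrative only."""

    def __init__(self, db_view, sleds):
        self.db_view = db_view  # callable returning the desired set of mappings
        self.sleds = sleds      # dict: sled name -> set of installed mappings
        self.pending = False

    def activate(self):
        # A saga would call this instead of configuring mappings itself.
        self.pending = True

    def run_once(self):
        if not self.pending:
            return
        self.pending = False
        desired = self.db_view()
        for installed in self.sleds.values():
            to_add, to_remove = reconcile(desired, installed)
            installed |= to_add     # backfill entries the sled is missing
            installed -= to_remove  # clean up entries for deleted instances


# Mapping tuples are (VPC IP, MAC, underlay IP), taken from the diff above.
M6 = ("172.30.0.6", "A8:40:25:FA:A2:20", "fd00:1122:3344:105::1")
M9 = ("172.30.0.9", "A8:40:25:FB:4B:4C", "fd00:1122:3344:106::1")

db = {M6, M9}
sleds = {"sled8": {M6, M9}, "sled16": set()}  # sled16 was just rebooted
reconciler = V2pReconciler(lambda: set(db), sleds)

reconciler.activate()
reconciler.run_once()
print(sleds["sled16"] == db)
```

Because every run starts from the full desired state, a rebooted sled is backfilled on the next activation regardless of when its instances were started. What triggers an activation after a reboot (a timer, a sled-boot event, or something else) is left open in this sketch.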
@morlandi7 this should be resolved, but I left it open until someone verifies the work done in #5568 has actually resolved this issue on dogfood. |
Confirmed that the issue can be closed. |
I noticed this issue after running a bunch of scrimlet reboot tests on rack2. One of the instances in question happens to be on a scrimlet I rebooted at the tail end of the testing. It was however created at least an hour after the reboot happened so it's unclear how it could be related.
Here are the instance details:
Instance 1 is able to reach all other instances on their private IPs within the subnet except for instance 2
But it can reach instance 2 on its external IP
The same goes with instance 2 against other instances in the subnet vs instance 1:
Instance 2 was rebooted (stopped/started) once after it was created. I didn't check the private IP connectivity between the two events so it's unclear if the connectivity was there prior to the instance reboot.
Here are the firewall opte entries from `opteadm`. The list is VERY long and I haven't been able to interpret what it means... but I'm dumping it here in case it helps. opte3-dump-layer-firewall.log
(Note: I was using the instances for some netperf and iperf3 tests. This is why there are a gazillion ports in use.)