2.5.0 reloads network when running userdata scripts #108
Comments
Testing shows that this issue is intermittent. The symptoms from the logs show that However repeated testing does not exhibit this behaviour, leading me to think that the issue is timing related, or related to the removal of cleanup from
I know that cloud-init is configured differently on EL vs AL, but we encountered a version of this issue on EL (in v2.4.0). The solution I came up with was to make cloud-init configure its network in fallback mode (i.e., without using networkd/networkmanager/etc. and only using dhcpcd). Obviously, this is not something that you can implement here, but I will paste the relevant lines from our build script:

    sed -i "s/renderers: .*/renderers: []\n activators: ['networkd']/" /etc/cloud/cloud.cfg
    echo "network: {config: disabled}" >> /etc/cloud/cloud.cfg

This works because the network stack doesn't meaningfully change configuration once amazon-ec2-net-utils takes over, and cloud-init does nothing special to try to persist its "bootleg" IP configuration when configured this way. You will DHCP twice, however, and I don't think this is resolvable with this solution.

The correct solution is either for amazon-ec2-net-utils to use the same files/naming conventions as cloud-init (these are not configurable in cloud-init, sadly), or to change cloud-init to make these configurable enough that this package could drop in a configuration file that switched to and configured cloud-init to use the same network units that amazon-ec2-net-utils manages. If you used the same unit names and were careful that the amazon-ec2-net-utils units were placed in a directory with higher precedence than cloud-init's, then cloud-init's initial unit file would simply be superseded by the amazon-ec2-net-utils one, and networkd wouldn't end up dropping any packets (theoretically) because the fields that change wouldn't require down/up'ing the interface.

Symlinks unfortunately don't seem to work because cloud-init writes the unit files, and as the unit files differ slightly in naming and configuration, the system ends up in an undefined state (as you see with this issue), except now networkd knows about the undefined state (and that's worse because now it is conflicting).

I also couldn't get a combination of Wants/After that I was happy with. If you make amazon-ec2-net-utils entirely dependent on cloud-init-local.service, you still have the timing problem with cloud-init.service, and that's arguably worse because cloud-init-local only needs a network to fetch userdata but cloud-init needs it for any user-defined behavior.
So you could then make amazon-ec2-net-utils depend on cloud-init and cloud-init-final, but then the userdata configuration cannot benefit from the network stack being brought up to the standard that amazon-ec2-net-utils provides. If you make cloud-init-local dependent on amazon-ec2-net-utils, you really need it to depend on a target because of the dynamic unit instances, and now you need a target that can somehow represent "amazon-ec2-net-utils is done", which forces late-attached ENIs (potentially attached in user data) to work differently than designed, and that sucks too.
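For anyone wanting to see what the two build-script lines quoted above do before touching a real image, here is a minimal sketch that applies them to a scratch file. The function name `apply_fallback_config` and the sample file contents are made up for illustration, GNU sed is assumed (`-i` without a suffix, `\n` in the replacement), and the whitespace before `activators` is copied verbatim from the comment, so it may need adjusting to match the indentation of a real `cloud.cfg`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: applies the two commands quoted in the comment above
# to whichever file is passed in, instead of /etc/cloud/cloud.cfg.
apply_fallback_config() {
    local cfg="$1"
    # Empty the renderer list and add an activators line right after it.
    sed -i "s/renderers: .*/renderers: []\n activators: ['networkd']/" "$cfg"
    # Tell cloud-init not to manage network configuration itself.
    echo "network: {config: disabled}" >> "$cfg"
}

# Scratch copy with made-up sample content; a real cloud.cfg will differ.
scratch="$(mktemp)"
cat > "$scratch" <<'EOF'
system_info:
  network:
    renderers: ['sysconfig', 'eni']
EOF

apply_fallback_config "$scratch"
cat "$scratch"
rm -f "$scratch"
```

Running it on a copy first makes it easy to check that the `renderers:` line was actually matched, since `sed` silently does nothing when the pattern is absent.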
First pass at adding an e2e reboot test. This test is specifically trying to find the failure signature from #108. I observed that successful boots only reload the network once, so the test searches for that.
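The check described above (a healthy boot reloads the network exactly once) could look roughly like the sketch below. This is not the test from the pull request: the function names are invented, and `RELOAD_MARKER` is a placeholder string, because the actual log line that marks a reload is not quoted anywhere in this thread.

```shell
#!/usr/bin/env bash

# Placeholder marker; replace with the real log line that indicates a
# networkd reload (not given in this thread).
RELOAD_MARKER="${RELOAD_MARKER:-Reloading systemd-networkd}"

# Print how many lines of the given log file contain the marker.
count_reloads() {
    # grep -c exits non-zero when there are no matches, so mask that status.
    grep -c -F -- "$RELOAD_MARKER" "$1" || true
}

# Succeed only if the log shows exactly one reload.
check_single_reload() {
    local n
    n="$(count_reloads "$1")"
    if [ "$n" = "1" ]; then
        echo "ok: 1 reload"
    else
        echo "suspect: $n reloads"
        return 1
    fi
}
```

On an instance, the log file could be produced with something like `journalctl -b > boot.log` before calling `check_single_reload boot.log`.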
That is interesting to know. I've been trying to set up some sort of reproducer so that I can catch this bug in the wild. Still investigating. Some code is parked in the Definitely worth investigating orchestration between net-utils and cloud-init, since we have seen similar timing issues between cloud-init and IMDS (which net-utils depends on).
When an EC2 instance boots with userdata that downloads a file from S3, amazon-ec2-net-utils is started, along with the credential refresher loop. cloud-init runs aws s3 cp to get the file, and while the download is running, amazon-ec2-net-utils reloads networkd and the download fails.
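To reproduce the scenario described above, a userdata script could wrap the download so that its start time, end time, and exit status are recorded and can be lined up against networkd reload times in the journal. The sketch below is only an illustration: `run_and_log` is an invented helper, and the bucket and key in the usage comment are placeholders.

```shell
#!/usr/bin/env bash

# Run the given command, then print UTC start/end timestamps and its exit
# status on one line. The command's own exit status is passed through.
run_and_log() {
    local start end rc
    start="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    "$@"
    rc=$?
    end="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "start=$start end=$end rc=$rc cmd=$*"
    return "$rc"
}

# Hypothetical use in userdata (bucket and key are placeholders):
# run_and_log aws s3 cp s3://example-bucket/large-file /tmp/large-file
```

A larger file keeps the transfer running longer, which should widen the window in which a reload can interrupt it.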