Improve spot HA by utilising ASG capacity rebalance #11

Open
wants to merge 4 commits into
base: main

kieranbrown (Contributor) commented Jan 23, 2024

Capacity rebalance proactively tries to replace Spot Instances before they are interrupted. Full docs: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-capacity-rebalancing.html


Current Behaviour
When a Spot Instance receives its 2-minute interruption warning, nothing happens. The instance is terminated after 2 minutes, and a replacement is started only after the original has been terminated.

New Behaviour
When a Spot Instance receives its 2-minute interruption warning, the ASG immediately provisions a new instance, which will hopefully boot and move the floating EIP before the original is terminated. With this approach, downtime is minimal when Spot Instances are terminated.
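
For readers unfamiliar with the setting, here is a minimal sketch of an ASG with capacity rebalance enabled. It is not the diff from this PR: the resource name, the two input variables, the allocation strategy, and the instance types are placeholders chosen for illustration.

```hcl
# Hypothetical inputs for this sketch; they are not taken from the module.
variable "launch_template_id" {
  type = string
}

variable "subnet_id" {
  type = string
}

resource "aws_autoscaling_group" "nat_example" {
  name                = "fck-nat-example"
  min_size            = 1
  max_size            = 1
  desired_capacity    = 1
  vpc_zone_identifier = [var.subnet_id]

  # The setting this PR is about: let the ASG start a replacement
  # before the Spot Instance is actually reclaimed.
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      # Run everything on Spot, with no On-Demand base capacity.
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = var.launch_template_id
        version            = "$Latest"
      }

      # Placeholder instance types: a primary and one failover.
      override {
        instance_type = "t4g.micro"
      }

      override {
        instance_type = "t4g.small"
      }
    }
  }
}
```

Listing more than one instance type gives the ASG an alternative Spot pool to draw from when the primary type has no capacity.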


You can test this with AWS Fault Injection. In the EC2 management console, click Spot Requests in the sidebar, select a Spot request, then choose Actions > Initiate interruption.
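
If a repeatable test is preferable to clicking through the console, the same interruption could probably be described as a Fault Injection experiment template. This is only a sketch under assumptions: the IAM role variable, the target name, and the `Name` tag value are hypothetical, and the role would need permission to send Spot interruptions.

```hcl
# Hypothetical input: an IAM role that Fault Injection can assume.
variable "fis_role_arn" {
  type = string
}

resource "aws_fis_experiment_template" "spot_interruption" {
  description = "Interrupt the NAT Spot Instance to exercise capacity rebalance"
  role_arn    = var.fis_role_arn

  # No alarm-based stop condition for this short experiment.
  stop_condition {
    source = "none"
  }

  action {
    name      = "interrupt-spot"
    action_id = "aws:ec2:send-spot-instance-interruptions"

    # Mirror the 2-minute warning discussed above.
    parameter {
      key   = "durationBeforeInterruption"
      value = "PT2M"
    }

    target {
      key   = "SpotInstances"
      value = "nat-spot-instances"
    }
  }

  target {
    name           = "nat-spot-instances"
    resource_type  = "aws:ec2:spot-instance"
    selection_mode = "ALL"

    # Assumed tag; adjust to however the NAT instances are tagged.
    resource_tag {
      key   = "Name"
      value = "fck-nat-example"
    }
  }
}
```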

@gabrieleolmi

The only thing I would change is the ha_additional_instance_types variable: would it be better to default it to an empty array? Otherwise, after an update, users would end up with an ASG behavior and an instance type that they did not choose.

This pull request should fix the problems I'm having with fck-nat. My Spot Instance often fails to start because no Spot capacity is available. For example, these are the errors I often see when using the fck-nat instance:

[Screenshot: 2024-05-21 13 20 52]

kieranbrown (Contributor, Author) commented May 22, 2024

@gabrieleolmi

> The only thing I would change is the ha_additional_instance_types variable, would it be better to use an empty array by default? Otherwise, after an update, users would have an ASG with a behavior and an instance that they did not choose.

It's been a while since I last looked at this, but IIRC capacity rebalance requires a mixed instances policy, and within a mixed instances policy you need to define a minimum of 2 instance types. Adding ha_additional_instance_types and defaulting it to the next cheapest instance type was the only sensible approach I could think of.

Defaulting ha_additional_instance_types to an empty array would cause an error if end users set use_spot_instances = true without explicitly setting ha_additional_instance_types to their preferred failover instance type.
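
If an empty default were adopted anyway, one option might be to fail at plan time with a clear message. The sketch below is an assumption, not code from this PR: the variable names follow the thread, but their types and defaults, the `instance_type` variable, and the carrier resource are illustrative, and the two-type minimum is taken from the recollection above rather than verified.

```hcl
# Variable names follow the thread; types and defaults are assumptions.
variable "use_spot_instances" {
  type    = bool
  default = false
}

variable "instance_type" {
  type    = string
  default = "t4g.micro" # placeholder primary type
}

variable "ha_additional_instance_types" {
  type    = list(string)
  default = [] # the empty default proposed in the review
}

locals {
  # Primary type first, then any failover types, de-duplicated.
  spot_instance_types = distinct(concat([var.instance_type], var.ha_additional_instance_types))
}

# terraform_data (Terraform >= 1.4) is used here only to carry the check.
resource "terraform_data" "spot_instance_types_check" {
  lifecycle {
    precondition {
      condition     = !var.use_spot_instances || length(local.spot_instance_types) >= 2
      error_message = "With use_spot_instances = true, set ha_additional_instance_types to at least one type that differs from instance_type."
    }
  }
}
```

With this in place, `use_spot_instances = true` together with an empty list would stop the plan with the message above instead of surfacing as an API error during apply.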

Perhaps just some documentation to clear up this behaviour in the README would be enough.

@RaJiska it would be good to hear your thoughts on this.

kieranbrown closed this Aug 6, 2024
kieranbrown deleted the capacity-rebalance branch August 6, 2024 16:12
kieranbrown restored the capacity-rebalance branch August 6, 2024 16:16
kieranbrown reopened this Aug 6, 2024