Convergence Specification

Christopher Armstrong edited this page May 18, 2015 · 9 revisions

User Model

In our view, what users want most of all out of an auto-scaling system is to always have the correct capacity without manual intervention.

There are a number of edge cases around configuration: which image servers use, which load balancers they are attached to, and so on. While it might be useful for autoscale to help manage the group's configuration and keep it up to date, users will have differing expectations about what is supposed to happen, and may carry over assumptions from the first version of autoscale.

Convergence To Capacity

Otter will always try to make sure that the current state of the scaling group, in terms of capacity, is accurately reflected by the current state of the real world. However, it will only adjust other aspects of the world (load balancer associations, server images, and so on) when it already needs to make changes to meet capacity requirements.

Desired Behavior

When a user requests a scaling group of a particular size, Otter will provision servers until the group meets that size.

Servers will always be provisioned using the newest configuration. "Provisioning" includes both creating the server itself and creating the load balancer associations specified by the scaling group's current state.

When servers disappear for any reason - hardware failures, Compute API failures, users manually deleting them - Otter should notice and re-provision servers to meet the capacity requirements specified by the scaling group.

Group Min/Max Expected Behavior

If a policy execution would push the number of servers over or under the group limits, the policy will only be partially executed. A 200 response code will be returned, and the scaling operation will take place up to the group limit. For example, scaling up by 5 in a group with 8 servers and a limit of 10 would result in 2 added servers. If no action is possible (i.e. the group is already at max or min), the execution will return a 403:CannotExecutePolicyError. Even if no change in desired capacity was made, the attempted execution is expected to trigger convergence to fix any existing deviations in the group that are unrelated to the failed scaling policy.
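The partial-execution rule can be sketched in Python as follows. This is an illustration only, not Otter's actual implementation: `apply_policy` and its parameters are hypothetical names, and the exception class merely stands in for the 403 response described above.

```python
class CannotExecutePolicyError(Exception):
    """Stands in for the 403 response when a policy cannot change capacity."""


def apply_policy(current, change, min_size, max_size):
    """Return the new desired capacity, clamped to the group limits.

    All names here are hypothetical; this only illustrates the rule.
    """
    # Clamp the requested capacity into [min_size, max_size].
    target = max(min_size, min(max_size, current + change))
    if target == current:
        # Already at the relevant limit: nothing can be executed.
        raise CannotExecutePolicyError("group is already at its limit")
    return target
```

With the example from the text, `apply_policy(8, 5, 0, 10)` returns 10, so 2 servers are added rather than 5.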

Eventually Eventual Consistency

Otter will not immediately and automatically delete and reprovision old servers when a launch configuration changes. Its goal is to arrive at the desired capacity as quickly as possible, and provisioning and de-provisioning servers to adjust configuration will make it take longer to get to a quiescent state with that capacity.

However, all things being equal, Otter should prefer to move towards a more accurate and current configuration whenever it otherwise needs to make changes.

Therefore, during a scale down, it should always preferentially delete:

  1. building servers, so as to arrive at the scaled-down state as quickly as possible (since it's not clear how long a given server may remain building)
  2. the oldest servers, so as to eventually cycle through any outdated configurations.
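The deletion preference above amounts to a two-level sort. The following is a minimal sketch, assuming each server carries a status and a creation timestamp; the `Server` tuple, the `"BUILD"` status string, and `choose_servers_to_delete` are hypothetical names used for illustration, not Otter's own.

```python
from collections import namedtuple

# Hypothetical record: 'state' is assumed to be "BUILD" while a server is building.
Server = namedtuple("Server", ["id", "state", "created"])


def choose_servers_to_delete(servers, count):
    """Pick `count` servers to delete: building servers first, then oldest."""
    # False sorts before True, so building servers lead the ranking;
    # ties are broken by creation time, oldest first.
    ranked = sorted(servers, key=lambda s: (s.state != "BUILD", s.created))
    return ranked[:count]
```

Sorting on a single composite key keeps the two preferences in one place, so adding a further tie-breaker later only means extending the tuple.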

If a launch configuration changes, and the user wishes to use autoscale to execute a "rolling update", they can simply scale up to their desired capacity and start deleting any servers with an undesirable configuration. They may execute this as a scale-up/scale-down via Otter's API or simply delete servers directly from Compute.

Out-of-band misconfigurations

If a server has an out-of-date configuration, and an out-of-band change is made to the server's load balancer configuration, Otter will do its best to correct that configuration without re-provisioning the server entirely, while maintaining the server's original load balancer configuration.

Deleting Load Balancer Nodes

Otter will only delete load balancer nodes when a server that it manages is deleted. Put another way, users can add their own, non-Otter-managed servers to a load balancer and Otter will leave them be. Otter does this by detecting when its own managed servers are deleted and deleting their associated nodes (matching nodes to servers by IP address).
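A minimal sketch of the IP-based matching, assuming load balancer nodes are available as dictionaries with `id` and `address` keys. That shape and the function name `nodes_to_remove` are assumptions for illustration, not Otter's actual data model.

```python
def nodes_to_remove(lb_nodes, deleted_server_ips):
    """Return IDs of nodes whose address belongs to a deleted managed server.

    Nodes for servers that Otter does not manage never match and are left alone.
    """
    deleted = set(deleted_server_ips)
    return [node["id"] for node in lb_nodes if node["address"] in deleted]
```

Because only addresses of deleted managed servers are passed in, nodes that users added by hand are never candidates for removal.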

Examples

LBs     = {'lb1': ['server1', 'server2']}
Servers = 'server1', 'server2'
Desired = 2, lb=lb1

Convergence does nothing.

LBs     = {'lb1': ['server1', 'server2'], 'lb2': []}
Servers = 'server1', 'server2'
Desired = 2, lb=lb2

Convergence does nothing. server1 and server2 are left associated with lb1.

LBs     = {'lb1': ['server1', 'server2'], 'lb2': []}
Servers = 'server1', 'server2'
Desired = 3, lb=lb2

Convergence adds a server and associates it with lb2. server1 and server2 are left associated with lb1.
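The three examples above can be reproduced with a small planning function. This is a sketch under stated assumptions: `converge`, the step tuples, and the `new-server-N` names are hypothetical, and only the scale-up direction is modeled.

```python
def converge(lbs, servers, desired, lb):
    """Plan the steps needed to reach `desired` capacity.

    `lbs` (the existing associations) is deliberately not consulted:
    existing servers keep whatever load balancers they already have.
    """
    steps = []
    missing = desired - len(servers)
    # range() of a non-positive number is empty, so no steps are planned
    # when the group is already at (or above) capacity.
    for i in range(missing):
        name = "new-server-%d" % (i + 1)
        steps.append(("create_server", name))
        steps.append(("add_to_lb", name, lb))
    return steps
```

For the first two examples the function returns an empty plan; for the third it plans one new server and one association with lb2, leaving server1 and server2 on lb1.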

Autoscaling Group ERROR status

An autoscaling group now has a status parameter, which may be ERROR. A group goes into the error state when it can no longer converge to the desired state.

Conditions that can currently cause an autoscaling group to go into the error state:

  1. Invalid launch config
    • invalid Nova server args - this can happen if an image or network or ssh key or flavor is deleted after the launch config has been validated (it gets validated when the user creates the group or updates the launch config).
      • invalid image
      • invalid flavor
      • invalid networks
      • invalid ssh key
    • invalid CLB (doesn't exist, or has been deleted)
    • invalid RCv3 load balancer pool (doesn't exist or not active - we don't know if they can be deleted)

Servers from an old config

Note that since Autoscale does not do rolling updates, if there are servers from a previous launch config on the group, attempting to converge those old servers to their desired state may also result in the group going into error.

For instance, if there is a server from launch_config_v1 on the group, but the current launch config is launch_config_v2:

launch_config_v1: <server args>, <CLB1 config>, <CLB2 config>
launch_config_v2: <server args>, <CLB3 config>
group: server1_v1, server2_v2

The group can go into the error state if one or both of the load balancers from launch_config_v1 (CLB1, CLB2) have been deleted. This is because Autoscale still attempts to ensure that the old server (server1_v1) is on all the load balancers specified in launch_config_v1 (CLB1 and CLB2), as well as that the new server (server2_v2) is on all the load balancers specified in launch_config_v2 (CLB3).
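The scenario can be sketched as a check of each server against the load balancers of the launch config it was built from. Everything here is hypothetical and illustrative: the function name, the dictionary shapes, and the `"ACTIVE"` value used for the non-error status (the text above only names ERROR).

```python
def group_status(servers, launch_configs, existing_lbs):
    """Return "ERROR" if any server's own launch config names a missing LB.

    servers:        {server name: launch config version it was built from}
    launch_configs: {version: list of load balancer IDs}
    existing_lbs:   set of load balancer IDs that still exist
    """
    for version in servers.values():
        for lb in launch_configs[version]:
            if lb not in existing_lbs:
                # An old server can no longer be converged to its own config.
                return "ERROR"
    return "ACTIVE"  # assumed name for the healthy status
```

With server1_v1 built from launch_config_v1 and CLB1 deleted, the check returns ERROR even though launch_config_v2 itself is valid.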