User facing convergence changes

Manish Tomar edited this page Feb 29, 2016 · 25 revisions

For a slightly more formal specification of Otter's behavior under convergence, see Convergence Specification.

With an upcoming release of Otter, we will be changing Otter's behavior to something we call "convergence". This means Otter will continuously keep track of all servers in a scaling group, notice when a server is deleted out-of-band or leaves the ACTIVE state, and automatically replace it. We call this "self-healing". Otter will also revert any out-of-band CLB manipulations.

Behavioral changes:

  • If an autoscaled server is manually removed from a load balancer that it is supposed to be on as per the scaling group config, Otter will revert that change and put the server back in the configured CLB. Note that Otter does not care whether the server is added to any other CLB. It only ensures that the server is always present in the configured CLB.
  • If a CLB configured in the group is not found or has been deleted, the group will be put in ERROR and the server that was supposed to be added to the CLB will remain. This differs from the current behavior, where a server that couldn't be added to the CLB gets deleted.
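
The first rule above amounts to a one-way set difference: the server is re-added to any configured CLB it is missing from, and any extra CLBs are ignored. A minimal sketch, assuming CLBs are identified by integer IDs; the function name `clbs_to_restore` is hypothetical and not part of Otter:

```python
def clbs_to_restore(configured_clbs, current_clbs):
    """Return the configured CLB IDs that the server is currently missing from.

    CLBs the server was added to out-of-band are ignored: they are never
    returned, so nothing here would remove the server from them.
    """
    return sorted(set(configured_clbs) - set(current_clbs))


# The server was manually removed from configured CLB 1 and added to CLB 9.
print(clbs_to_restore({1, 2}, {2, 9}))  # [1]
```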

Non-exhaustive examples of how convergence improves failure modes:

  • If a server fails to start, Auto Scale will no longer eventually "forget" about the server. Instead, it will continually try to attain the desired number of ACTIVE servers. This can happen when users specify an invalid launch configuration, or when they exceed their quota in Cloud Servers, or due to transient errors in Cloud Servers.
  • If a server becomes inactive (for example, it goes into an ERROR state, or it is manually deleted), it will be replaced. For example, consider a group with a capacity of 5 servers. One server goes away, and then a policy is executed that adds 5 servers. Previously, Otter would have created 5 servers. Under convergence, Otter will realize that there were supposed to be 5 servers to begin with, so it will create 6 servers to meet the desired total of 10 servers.
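
The arithmetic in the last example can be sketched as follows. This is an illustration only: `servers_to_create` is a hypothetical helper, and it ignores servers that are still building.

```python
def servers_to_create(desired_capacity, active_servers):
    """Number of servers to launch so that active servers reach the desired count."""
    return max(desired_capacity - active_servers, 0)


desired = 5 + 5  # a capacity of 5, then a policy that adds 5
active = 5 - 1   # one server went away out-of-band
print(servers_to_create(desired, active))  # 6
```

Without convergence, only the policy's own delta (5) would have been created, leaving the group at 9 servers.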

Same behavior (no changes w.r.t. current code):

  • Load Balancer (LB) change: If the group's LB configuration in the launch configuration is changed, only newly created servers will be configured to use the new LB. For example, say a group has 5 servers (s1 to s5) configured with CLB 1. Changing the CLB from 1 to 2 and executing a "scale up by 1" policy will create 1 new server (s6), which will be put in CLB 2, while the existing servers (s1 to s5) will be left on CLB 1.
  • Server config change: When the server launch config is changed, any servers created from then on will be based on the new launch config. Existing servers will not be touched. However, the oldest servers will still be deleted when scaling down, as is the case now. Please note that this is not a typical rolling update feature that keeps creating new servers and deleting old ones until all servers match the new config. That is a feature we are considering and may implement in the future.
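
The "oldest servers are deleted when scaling down" rule can be pictured as a sort by creation time. A minimal sketch, assuming each server is represented as a (server ID, ISO 8601 UTC creation timestamp) pair; the helper name and the sample timestamps are made up for illustration:

```python
def oldest_servers(servers, count):
    """Return the IDs of the `count` oldest servers.

    `servers` is a list of (server_id, created_at) pairs, where created_at is
    an ISO 8601 UTC timestamp string; such strings sort chronologically.
    """
    by_age = sorted(servers, key=lambda server: server[1])
    return [server_id for server_id, _ in by_age[:count]]


servers = [
    ("s6", "2016-02-20T10:00:00Z"),  # created after the launch config change
    ("s1", "2016-01-05T09:00:00Z"),
    ("s2", "2016-01-05T09:05:00Z"),
]
print(oldest_servers(servers, 2))  # ['s1', 's2']
```

Because servers built from the old launch config are also the oldest, repeated scale-downs remove them first, but nothing replaces them automatically.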

REST API changes:

New converge endpoint:

A new API endpoint, POST ../groups/groupId/converge, is provided to trigger convergence. This means autoscale will fetch the latest list of servers and create or delete servers to bring the group to the desired capacity. This is useful when, for some reason, the group capacity reported by autoscale does not match what is shown by Nova/CLB.
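
As a rough sketch, a client could build the request with Python's standard library as shown below. The base URL, group ID and token values are placeholders, and the `X-Auth-Token` header is an assumption about how the request is authenticated. The request is only constructed here, not sent.

```python
import urllib.request


def build_converge_request(base_url, group_id, token):
    """Build (but do not send) a POST request for the converge endpoint."""
    url = "{}/groups/{}/converge".format(base_url.rstrip("/"), group_id)
    return urllib.request.Request(
        url,
        method="POST",
        headers={"X-Auth-Token": token},  # assumed auth header
    )


request = build_converge_request(
    "https://autoscale.example.com/v1.0/123456",  # hypothetical base URL
    "my-group-id",                                # hypothetical group ID
    "my-token",                                   # placeholder token
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it.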

Group state changes:

The state of a group can be retrieved by GETing ../groups/groupId/state, ../groups/groupId or ../groups. It is also returned when creating a group using POST ../groups. The state is returned as "group" in ../groups/groupId/state and as "state" in the other APIs. It will contain a new field called "status" that will be either "ACTIVE" or "ERROR". ACTIVE status means the group is converging and everything is fine. ERROR status means autoscale has stopped converging due to some irrecoverable error that requires user attention. When this happens, another field, "errors", is provided that contains the list of errors autoscale encountered. Each error is given as a JSON object that contains "message" and possibly other fields. For example, below is an active group:

{
   "group":{
      "paused":false,
      "pendingCapacity":0,
      "name":"testscalinggroup198547",
      "active":[],
      "activeCapacity":0,
      "desiredCapacity":0,
      "status": "ACTIVE"
   }
}

and below is a group in ERROR:

{
   "group":{
      "paused":false,
      "pendingCapacity":0,
      "name":"testscalinggroup198547",
      "active":[],
      "activeCapacity":0,
      "desiredCapacity":0,
      "status": "ERROR",
      "errors": [
         {"message": "Cloud load balancer 85621 is being deleted"},
         {"message": "Server launch configuration is invalid: Invalid SSH key"}
      ]
   }
}
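
A client can use these fields to detect a group that needs attention. The sketch below handles only the ../groups/groupId/state response shape shown above (state under the "group" key); the function name `group_status` is hypothetical.

```python
import json


def group_status(state_json):
    """Return (status, error_messages) from a ../groups/groupId/state body."""
    group = json.loads(state_json)["group"]
    # "errors" is only expected when the status is ERROR, so default to [].
    messages = [error["message"] for error in group.get("errors", [])]
    return group["status"], messages


body = """
{"group": {"paused": false, "pendingCapacity": 0, "active": [],
           "activeCapacity": 0, "desiredCapacity": 0, "status": "ERROR",
           "errors": [{"message": "Cloud load balancer 85621 is being deleted"}]}}
"""
status, messages = group_status(body)
if status == "ERROR":
    for message in messages:
        print("needs attention:", message)
```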

Pause implemented:

Scaling group pause has been implemented. Pausing a group will disallow any executions on the group. This includes policy executions, whether triggered via the API or by a schedule, and triggering convergence (explained above). Any convergence that is already running will be stopped. However, group configuration changes are still allowed, i.e. changes to the group config (cooldown, min/max and so on) and to the launch config (server image, flavor and so on). The group state will return "paused": true.

Resume implemented:

Scaling group resume has been implemented. Resuming a group does the opposite of pausing it: policy executions and convergence triggering are allowed again. The group state will return "paused": false.
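
The pause and resume rules can be summarized with a small in-memory model. This is a toy illustration of the semantics only, not Otter's implementation; every class and method name in it is hypothetical.

```python
class GroupPausedError(Exception):
    """Raised when an execution is attempted on a paused group."""


class ScalingGroup:
    def __init__(self, cooldown=60):
        self.paused = False
        self.cooldown = cooldown
        self.desired = 0

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def _require_not_paused(self):
        if self.paused:
            raise GroupPausedError("group is paused")

    def execute_policy(self, change):
        self._require_not_paused()
        self.desired += change

    def trigger_convergence(self):
        self._require_not_paused()
        return "converging"

    def update_config(self, cooldown):
        # Configuration changes are allowed even while paused.
        self.cooldown = cooldown


group = ScalingGroup()
group.pause()
group.update_config(cooldown=120)  # allowed while paused
try:
    group.execute_policy(+1)       # rejected while paused
except GroupPausedError as error:
    print("rejected:", error)
group.resume()
group.execute_policy(+1)           # allowed again after resume
print(group.paused, group.desired, group.cooldown)  # False 1 120
```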

Cloud feeds integration:

All actions taken by autoscale on behalf of the user will be pushed to the Cloud Feeds "autoscale" product. The user can get all the events by GETing https://region.feeds.api.rackspacecloud.com/autoscale/events/tenantId as described in the Cloud Feeds docs. The list of messages that are pushed to CF is tracked here
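
For illustration, the events URL above can be filled in from a region and tenant ID as follows. The helper name and the sample values are placeholders; authentication and paging are not covered.

```python
def autoscale_feed_url(region, tenant_id):
    """Fill in the region and tenant ID placeholders of the events URL."""
    return "https://{}.feeds.api.rackspacecloud.com/autoscale/events/{}".format(
        region, tenant_id
    )


# "ord" and "123456" are sample values, not real account details.
print(autoscale_feed_url("ord", "123456"))
```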