Health Check (Preview)

Overview

The Health Check feature prevents unhealthy instances from serving requests, which improves availability. You specify the path in your application that reports the health of your web app. The feature pings the health check path on every instance every 2 minutes. If an instance does not respond within 10 minutes (5 pings), the instance is determined to be "unhealthy" and our service stops routing requests to it.

This feature is currently in preview. You are welcome to use the feature and provide feedback to Jason<dot>Freeberg<at>Microsoft<dot>com.

Behavior

When Health Check is enabled, App Service pings the provided path on each instance. If a successful response code is not received after 5 pings, the instance is considered "unhealthy". All "unhealthy" instances are removed from the App Service load balancer. App Service will remove instances from the load balancer until there is only 1 instance remaining (see the roadmap below). If all instances are found to be "unhealthy", none of them are removed from rotation.

When an instance is removed from the load balancer rotation, App Service continues to ping the health check endpoint on it. The instance is returned to the load balancer if a successful response code is received in a future ping.
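The rules above can be sketched as a small model. This is only an illustration of the behavior described in this section, not App Service's implementation; the `Instance` class, the `in_rotation` function, and the assumption that the 5 failed pings are consecutive are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# From the text: an instance is "unhealthy" after 5 failed pings.
# Assumption: the failures are counted consecutively.
FAILURES_BEFORE_UNHEALTHY = 5


@dataclass
class Instance:
    name: str
    consecutive_failures: int = 0

    @property
    def unhealthy(self) -> bool:
        return self.consecutive_failures >= FAILURES_BEFORE_UNHEALTHY

    def record_ping(self, success: bool) -> None:
        # A successful ping resets the count, so a removed instance
        # becomes eligible for rotation again.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def in_rotation(instances: List[Instance]) -> List[str]:
    """Return the names of the instances that still receive traffic."""
    healthy = [i.name for i in instances if not i.unhealthy]
    if not healthy:
        # Every instance is unhealthy: nothing is removed from rotation.
        return [i.name for i in instances]
    return healthy
```

In this model, an instance that fails 5 pings drops out of `in_rotation`, returns after its next successful ping, and the rotation is never left empty.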

Usage

A bug has been identified and fixed for Linux apps. The fix will be released to all Linux apps by the end of October. Until then, the HealthCheck will not report metrics for Linux apps.

Enabling

To enable the feature, open Resource Explorer from the App Service blade in the portal, under Development Tools. Resource Explorer opens to the top-level view of your App Service. Expand the config section and click the web tab. Add an element named "healthCheckPath" whose value is the application path that our service should ping.

You must have 2 or more instances for the feature to take effect.
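As a rough sketch of the edit made in Resource Explorer, the helper below adds the "healthCheckPath" element to a copy of the web config document. It assumes the element belongs under the document's `properties` object; the `/api/health` path and the `alwaysOn` entry in the usage are placeholders for illustration only.

```python
import json


def with_health_check_path(web_config: dict, path: str) -> dict:
    """Return a copy of the web config with healthCheckPath set."""
    updated = dict(web_config)
    # Assumption: settings live under "properties" in the config/web document.
    properties = dict(updated.get("properties", {}))
    properties["healthCheckPath"] = path
    updated["properties"] = properties
    return updated


def to_json(web_config: dict) -> str:
    # Pretty-print so the result can be compared with what the portal shows.
    return json.dumps(web_config, indent=2)
```

For example, `to_json(with_health_check_path({"properties": {"alwaysOn": True}}, "/api/health"))` yields a document whose `properties` contain both the existing setting and the new path.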

The Health Check Path

The health check path should check the critical components of your application. For example, if your application depends on a database and a messaging system, the health check endpoint should perform a (minimal) database query and send a quick message.

The path must respond within two minutes with a status code between 200 and 299 (inclusive). If the path does not respond within two minutes, or returns a status code outside that range, the instance is considered "unhealthy". Health Check integrates with Easy Auth, so our service can ping the endpoint when Easy Auth is enabled. However, if you are using your own authentication system, the health check path must be unauthenticated.
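A minimal sketch of such an endpoint, using only the Python standard library, might look like the following. The `/api/health` path, the port, and the two check functions are placeholders; a real application would run its own minimal database query and send its own test message. Returning 503 for a failed check is a choice made here, since any status outside 200-299 counts as unhealthy.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict

HEALTH_PATH = "/api/health"  # hypothetical path; use the value of healthCheckPath


def check_database() -> bool:
    # Placeholder: run a minimal query (for example SELECT 1) here.
    return True


def check_messaging() -> bool:
    # Placeholder: send a quick test message here.
    return True


CHECKS: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "messaging": check_messaging,
}


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    return results


def status_code_for(results: Dict[str, bool]) -> int:
    # 200 is inside the accepted 200-299 range; 503 is outside it.
    return 200 if all(results.values()) else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != HEALTH_PATH:
            self.send_error(404)
            return
        results = run_checks(CHECKS)
        body = json.dumps(results).encode("utf-8")
        self.send_response(status_code_for(results))
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server; the checks themselves should stay cheap so the response arrives well inside the two-minute limit.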

Improving performance

Unhealthy instances can be removed from the load balancer rotation sooner if your health check path responds to the ping immediately. To do this, your health check path should store a "last known status" that it can return right away. Once the response is sent, the endpoint should check the status of the core components (database, messaging system) and prepare the new status for the following health check ping.
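One way to sketch the "last known status" pattern is a small cache that answers from memory and refreshes afterwards. The `CachedHealth` class and its method names are hypothetical, and the `probe` callable stands in for whatever component checks your application performs.

```python
import threading
from typing import Callable


class CachedHealth:
    """Answer pings from a stored status and refresh it after responding."""

    def __init__(self, probe: Callable[[], bool], initial: bool = True) -> None:
        self._probe = probe
        self._healthy = initial
        self._lock = threading.Lock()

    def status_code(self) -> int:
        # Immediate answer: no dependency is contacted on this path.
        with self._lock:
            return 200 if self._healthy else 503

    def refresh(self) -> None:
        # Slow path: check the core components and store the result.
        try:
            result = bool(self._probe())
        except Exception:
            result = False
        with self._lock:
            self._healthy = result

    def refresh_in_background(self) -> threading.Thread:
        # Call this once the response has been sent.
        worker = threading.Thread(target=self.refresh, daemon=True)
        worker.start()
        return worker
```

A request handler would send `status_code()` first and then call `refresh_in_background()`. The trade-off is that each response reflects the state observed after the previous ping, so the reported status can lag by one ping interval.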

Monitoring and Alerts

Once the feature has been turned on, you can visualize the Health Check status of each instance in the Portal by going to Monitoring > Metrics. In the metrics dropdown list, select "Health Check Status". This will show your healthy instances as a percentage.

[Image: Health Check Visualization]

Show Status by Instance

If your web app has been scaled out, you can show the statuses for each instance. Click Apply Splitting > Instance.

[Image: Split health check statuses]

This will show the statuses for each instance on the graph. The instance IDs are shown on the bottom left.

[Image: Instance IDs on bottom left]

Create an Alert

You can create an alert based on this metric, such as sending an SMS message or an email.

1. While viewing the graph, select New Alert Rule.
2. Under Condition, click the link "Whenever the Health check status is ".
3. In the new tab, change Operator to "Less than or equal to" or "Less than".
4. Set a Threshold value.
5. Finally, configure your action. You will need to create an Action Group. For more information on Action Groups, see this article.

Metric APIs

The data is also accessible through the monitoring APIs below. You can use ARMClient to query the information.

Recent overall status

ARMClient.exe GET "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{site}/providers/microsoft.Insights/metrics?api-version=2018-01-01&metricnames=HealthCheckStatus&interval=FULL"

Hourly status during given period

The timespan query parameter is a string with two datetimes separated by a /.

ARMClient.exe GET "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{site}/providers/microsoft.Insights/metrics?api-version=2018-01-01&metricnames=HealthCheckStatus&timespan=2019-08-01T00:00:00.000Z/2019-08-02T00:00:00.000Z&interval=PT1H"
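If you generate the timespan value programmatically, a helper along these lines produces the same format as the command above. The function names are hypothetical, and the helper insists on timezone-aware datetimes so the trailing Z (UTC) is always accurate.

```python
from datetime import datetime, timezone


def to_utc_timestamp(moment: datetime) -> str:
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    # Millisecond precision, matching 2019-08-01T00:00:00.000Z
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def timespan(start: datetime, end: datetime) -> str:
    if end <= start:
        raise ValueError("end must be later than start")
    # Two datetimes separated by a "/"
    return f"{to_utc_timestamp(start)}/{to_utc_timestamp(end)}"
```

For example, `timespan(datetime(2019, 8, 1, tzinfo=timezone.utc), datetime(2019, 8, 2, tzinfo=timezone.utc))` returns the value used in the command above.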

Detail for all worker instances

ARMClient.exe GET "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{site}/providers/microsoft.Insights/metrics?api-version=2018-01-01&metricnames=HealthCheckStatus&$filter=Instance eq '*'"

Detail of a specific worker

ARMClient.exe GET "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{site}/providers/microsoft.Insights/metrics?api-version=2018-01-01&metricnames=HealthCheckStatus&$filter=Instance eq 'RD00155D82AC9D'"
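The four request paths above differ only in their query parameters, so they can be assembled by one helper. This is a sketch: the `metrics_path` function is hypothetical, and it percent-encodes the spaces in the filter expression on the assumption that your client accepts standard URL encoding.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def metrics_path(
    sub: str,
    rg: str,
    site: str,
    instance: Optional[str] = None,
    timespan: Optional[str] = None,
    interval: Optional[str] = None,
) -> str:
    base = (
        f"/subscriptions/{sub}/resourceGroups/{rg}"
        f"/providers/Microsoft.Web/sites/{site}"
        "/providers/microsoft.Insights/metrics"
    )
    params = {"api-version": "2018-01-01", "metricnames": "HealthCheckStatus"}
    if timespan:
        params["timespan"] = timespan
    if interval:
        params["interval"] = interval
    if instance:
        # "*" selects every worker; an instance ID selects a single worker.
        params["$filter"] = f"Instance eq '{instance}'"
    # Keep "$", "/", ":", "*" and "'" readable; encode spaces as %20.
    return base + "?" + urlencode(params, quote_via=quote, safe="$/:*'")
```

Passing `instance="*"` reproduces the "all worker instances" query, and passing `interval="FULL"` without an instance reproduces the "recent overall status" query.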

Roadmap

This feature is currently in preview. We plan to support the following scenarios in future milestones.

Allow users to specify a lower limit of instances to be kept in rotation
Alerts and monitoring when instances are added to or removed from the rotation
Unhealthy instances will be replaced with new ones. (There will be a limit on the number of worker replacements per hour and per day.)
- In the meantime, users are advised to configure auto-heal to try to recover the instance.
