Separate alerts when cluster is in yellow state (#506)
When a cluster is in the yellow state, some of its shards are not
active. This can happen temporarily under heavy load, while shards are
still relocating or initializing. However, if the cluster is yellow and
there are unassigned shards, some replicas are not allocated, and it
might be necessary to add new nodes in order to host all shards.

---------

Co-authored-by: Mehdi Bendriss <[email protected]>
gabrielcocenza and Mehdi-Bendriss authored Dec 4, 2024
1 parent 85c7868 commit 453fb0e
Showing 1 changed file with 14 additions and 4 deletions.
18 changes: 14 additions & 4 deletions src/alert_rules/prometheus/prometheus_alerts.yaml
@@ -35,16 +35,26 @@
     "labels":
       "severity": "critical"

-  - "alert": "OpenSearchClusterYellow"
+  - "alert": "OpenSearchClusterYellowTemp"
     "annotations":
-      "message": "Cluster {{ $labels.cluster }} health status has been YELLOW for at least 20m. Some cluster replicas shards are not allocated."
-      "summary": "Cluster health status is YELLOW"
+      "message": "Cluster {{ $labels.cluster }} health status has been YELLOW for at least 20m. Shards are still relocating or initializing. The cluster might be under heavy load."
+      "summary": "Cluster health status is temporarily YELLOW"
     "expr": |
-      sum by (cluster) (opensearch_cluster_status == 1)
+      sum by (cluster) (opensearch_cluster_shards_number{type=~"relocating|initializing"}) > 0 and on(cluster) opensearch_cluster_status == 1
     "for": "20m"
     "labels":
       "severity": "warning"

+  - "alert": "OpenSearchClusterYellow"
+    "annotations":
+      "message": "Cluster {{ $labels.cluster }} health status has been YELLOW. Some replica shards are unassigned."
+      "summary": "Number of nodes in the cluster might be too low. Consider scaling the application to ensure that it has enough nodes to host all shards."
+    "expr": |
+      sum by (cluster) (opensearch_cluster_shards_number{type="unassigned"}) > 0 and on(cluster) opensearch_cluster_status == 1
+    "for": "10m"
+    "labels":
+      "severity": "warning"
+
   - "alert": "OpenSearchWriteRequestsRejectionJumps"
     "annotations":
       "message": "High Write Rejection Ratio at {{ $labels.node }} node in {{ $labels.cluster }} cluster. This node may not be keeping up with the indexing speed."
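
One way to check that the two yellow-state alerts separate as the commit message describes is a `promtool test rules` unit test. The sketch below is not part of this commit. It assumes that `prometheus_alerts.yaml` is a standard Prometheus rule file (rules nested under `groups`) sitting next to the test file, and that these rules carry no labels beyond the ones shown in the diff. The cluster name `demo` and the sample values are made up for illustration.

```yaml
# yellow_alerts_test.yaml (hypothetical file name)
# Run with: promtool test rules yellow_alerts_test.yaml
rule_files:
  - prometheus_alerts.yaml   # assumed to sit in the same directory

evaluation_interval: 1m

tests:
  # Case 1: yellow cluster with unassigned replicas and nothing in flight.
  # Only OpenSearchClusterYellow should fire once its 10m "for" has elapsed.
  - interval: 1m
    input_series:
      - series: 'opensearch_cluster_status{cluster="demo"}'
        values: '1+0x30'
      - series: 'opensearch_cluster_shards_number{cluster="demo", type="unassigned"}'
        values: '2+0x30'
      - series: 'opensearch_cluster_shards_number{cluster="demo", type="relocating"}'
        values: '0+0x30'
      - series: 'opensearch_cluster_shards_number{cluster="demo", type="initializing"}'
        values: '0+0x30'
    alert_rule_test:
      - eval_time: 15m
        alertname: OpenSearchClusterYellow
        exp_alerts:
          - exp_labels:
              cluster: demo
              severity: warning
            exp_annotations:
              message: "Cluster demo health status has been YELLOW. Some replica shards are unassigned."
              summary: "Number of nodes in the cluster might be too low. Consider scaling the application to ensure that it has enough nodes to host all shards."
      - eval_time: 15m
        alertname: OpenSearchClusterYellowTemp
        exp_alerts: []   # no relocating or initializing shards, so no alert

  # Case 2: yellow cluster with shards still relocating and none unassigned.
  # Only OpenSearchClusterYellowTemp should fire once its 20m "for" has elapsed.
  - interval: 1m
    input_series:
      - series: 'opensearch_cluster_status{cluster="demo"}'
        values: '1+0x30'
      - series: 'opensearch_cluster_shards_number{cluster="demo", type="unassigned"}'
        values: '0+0x30'
      - series: 'opensearch_cluster_shards_number{cluster="demo", type="relocating"}'
        values: '3+0x30'
    alert_rule_test:
      - eval_time: 25m
        alertname: OpenSearchClusterYellowTemp
        exp_alerts:
          - exp_labels:
              cluster: demo
              severity: warning
            exp_annotations:
              message: "Cluster demo health status has been YELLOW for at least 20m. Shards are still relocating or initializing. The cluster might be under heavy load."
              summary: "Cluster health status is temporarily YELLOW"
      - eval_time: 25m
        alertname: OpenSearchClusterYellow
        exp_alerts: []   # unassigned count is 0, so no alert
```

The two rules are not mutually exclusive: a yellow cluster that has both unassigned shards and shards in flight would satisfy both expressions, so both alerts could be active at once.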