add PostgreSQL replication delay monitor (#49)
* add PostgreSQL replication delay monitor

* remove pgbouncer from another pr

* use host/port
Aohzan authored Aug 23, 2023
1 parent e46293f commit cf8c7cf
Showing 4 changed files with 87 additions and 0 deletions.
10 changes: 10 additions & 0 deletions database/postgresql/README.md
@@ -19,6 +19,7 @@ Creates DataDog monitors with the following checks:

- PostgreSQL Connections
- PostgreSQL disk queue depth
- PostgreSQL replication delay on {{host}}:{{port}}
- PostgreSQL server does not respond
- PostgreSQL too many locks

@@ -49,6 +50,7 @@ Creates DataDog monitors with the following checks:
| [datadog_monitor.postgresql_availability](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.postgresql_connection_too_high](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.postgresql_disk_queue_depth](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.postgresql_replication_delay](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
| [datadog_monitor.postgresql_too_many_locks](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |

## Inputs
@@ -90,6 +92,13 @@ Creates DataDog monitors with the following checks:
| <a name="input_postgresql_lock_threshold_warning"></a> [postgresql\_lock\_threshold\_warning](#input\_postgresql\_lock\_threshold\_warning) | Maximum warning acceptable number of locks | `number` | `70` | no |
| <a name="input_postgresql_lock_time_aggregator"></a> [postgresql\_lock\_time\_aggregator](#input\_postgresql\_lock\_time\_aggregator) | Monitor time aggregator for PostgreSQL lock monitor [available values: min, max or avg] | `string` | `"min"` | no |
| <a name="input_postgresql_lock_timeframe"></a> [postgresql\_lock\_timeframe](#input\_postgresql\_lock\_timeframe) | Monitor timeframe for PostgreSQL lock monitor [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_5m"` | no |
| <a name="input_postgresql_replication_delay_aggregator"></a> [postgresql\_replication\_delay\_aggregator](#input\_postgresql\_replication\_delay\_aggregator) | Monitor time aggregator for PostgreSQL replication delay [available values: min, max or avg] | `string` | `"avg"` | no |
| <a name="input_postgresql_replication_delay_enabled"></a> [postgresql\_replication\_delay\_enabled](#input\_postgresql\_replication\_delay\_enabled) | Flag to enable PostgreSQL replication delay monitor | `bool` | `true` | no |
| <a name="input_postgresql_replication_delay_extra_tags"></a> [postgresql\_replication\_delay\_extra\_tags](#input\_postgresql\_replication\_delay\_extra\_tags) | Extra tags for PostgreSQL replication delay monitor | `list(string)` | `[]` | no |
| <a name="input_postgresql_replication_delay_message"></a> [postgresql\_replication\_delay\_message](#input\_postgresql\_replication\_delay\_message) | Custom message for PostgreSQL replication delay monitor | `string` | `""` | no |
| <a name="input_postgresql_replication_delay_threshold_critical"></a> [postgresql\_replication\_delay\_threshold\_critical](#input\_postgresql\_replication\_delay\_threshold\_critical) | Critical threshold in seconds | `number` | `200` | no |
| <a name="input_postgresql_replication_delay_threshold_warning"></a> [postgresql\_replication\_delay\_threshold\_warning](#input\_postgresql\_replication\_delay\_threshold\_warning) | Warning threshold in seconds | `number` | `100` | no |
| <a name="input_postgresql_replication_delay_timeframe"></a> [postgresql\_replication\_delay\_timeframe](#input\_postgresql\_replication\_delay\_timeframe) | Monitor timeframe for PostgreSQL replication delay [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`] | `string` | `"last_15m"` | no |
| <a name="input_prefix_slug"></a> [prefix\_slug](#input\_prefix\_slug) | Prefix string to prepend between brackets on every monitors names | `string` | `""` | no |
| <a name="input_timeout_h"></a> [timeout\_h](#input\_timeout\_h) | Default auto-resolving state (in hours) | `number` | `0` | no |

@@ -100,6 +109,7 @@ Creates DataDog monitors with the following checks:
| <a name="output_postgresql_availability_id"></a> [postgresql\_availability\_id](#output\_postgresql\_availability\_id) | id for monitor postgresql\_availability |
| <a name="output_postgresql_connection_too_high_id"></a> [postgresql\_connection\_too\_high\_id](#output\_postgresql\_connection\_too\_high\_id) | id for monitor postgresql\_connection\_too\_high |
| <a name="output_postgresql_disk_queue_depth_id"></a> [postgresql\_disk\_queue\_depth\_id](#output\_postgresql\_disk\_queue\_depth\_id) | id for monitor postgresql\_disk\_queue\_depth |
| <a name="output_postgresql_replication_delay_id"></a> [postgresql\_replication\_delay\_id](#output\_postgresql\_replication\_delay\_id) | id for monitor postgresql\_replication\_delay |
| <a name="output_postgresql_too_many_locks_id"></a> [postgresql\_too\_many\_locks\_id](#output\_postgresql\_too\_many\_locks\_id) | id for monitor postgresql\_too\_many\_locks |
<!-- END_TF_DOCS -->
## Related documentation
44 changes: 44 additions & 0 deletions database/postgresql/inputs.tf
@@ -219,3 +219,47 @@ variable "postgresql_disk_queue_message" {
  type        = string
  default     = ""
}

########################################
### PostgreSQL replication delay ###
########################################

variable "postgresql_replication_delay_aggregator" {
  description = "Monitor time aggregator for PostgreSQL replication delay [available values: min, max or avg]"
  type        = string
  default     = "avg"
}

variable "postgresql_replication_delay_timeframe" {
  description = "Monitor timeframe for PostgreSQL replication delay [available values: `last_#m` (1, 5, 10, 15, or 30), `last_#h` (1, 2, or 4), or `last_1d`]"
  type        = string
  default     = "last_15m"
}

variable "postgresql_replication_delay_threshold_critical" {
  default     = 200
  description = "Critical threshold in seconds"
}

variable "postgresql_replication_delay_threshold_warning" {
  default     = 100
  description = "Warning threshold in seconds"
}

variable "postgresql_replication_delay_enabled" {
  description = "Flag to enable PostgreSQL replication delay monitor"
  type        = bool
  default     = true
}

variable "postgresql_replication_delay_extra_tags" {
  description = "Extra tags for PostgreSQL replication delay monitor"
  type        = list(string)
  default     = []
}

variable "postgresql_replication_delay_message" {
  description = "Custom message for PostgreSQL replication delay monitor"
  type        = string
  default     = ""
}
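
A caller could override the new inputs roughly as follows. This is a minimal sketch, not taken from the commit: the module label `postgresql_monitors`, the local `source` path, and the values given for `environment` and `message` are assumptions for illustration, and the module may require other inputs that this diff does not show.

```hcl
# Hypothetical caller configuration (not part of this commit).
module "postgresql_monitors" {
  source = "./database/postgresql" # assumed path to a local checkout

  environment = "prod"             # assumed value; referenced as var.environment
  message     = "@example-handle"  # placeholder notification target

  # Inputs added by this commit; the defaults are avg / last_15m / 100 / 200.
  postgresql_replication_delay_enabled            = true
  postgresql_replication_delay_aggregator         = "max"
  postgresql_replication_delay_timeframe          = "last_5m"
  postgresql_replication_delay_threshold_warning  = 30
  postgresql_replication_delay_threshold_critical = 60
  postgresql_replication_delay_extra_tags         = ["team:dba"]

  # Leaving the message empty falls back to the module-wide `message`.
  postgresql_replication_delay_message = ""
}
```

Setting `postgresql_replication_delay_enabled = false` drops the monitor entirely, since the resource is gated with `count`.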
28 changes: 28 additions & 0 deletions database/postgresql/monitors-postgresql.tf
@@ -111,3 +111,31 @@ EOQ

  tags = concat(["env:${var.environment}", "type:database", "provider:postgres", "resource:postgresql", "team:claranet", "created-by:terraform"], var.postgresql_disk_queue_extra_tags)
}

resource "datadog_monitor" "postgresql_replication_delay" {
  count   = var.postgresql_replication_delay_enabled ? 1 : 0
  name    = "${var.prefix_slug == "" ? "" : "[${var.prefix_slug}]"}[${var.environment}] PostgreSQL replication delay on {{host}}:{{port}}"
  message = coalesce(var.postgresql_replication_delay_message, var.message)
  type    = "query alert"

  query = <<EOQ
${var.postgresql_replication_delay_aggregator}(${var.postgresql_replication_delay_timeframe}):
avg:postgresql.replication_delay${module.filter-tags.query_alert} by {host,port}
> ${var.postgresql_replication_delay_threshold_critical}
EOQ

  monitor_thresholds {
    warning  = var.postgresql_replication_delay_threshold_warning
    critical = var.postgresql_replication_delay_threshold_critical
  }

  evaluation_delay    = var.evaluation_delay
  new_host_delay      = var.new_host_delay
  notify_no_data      = false
  renotify_interval   = 0
  require_full_window = true
  timeout_h           = 0
  include_tags        = true

  tags = concat(["env:${var.environment}", "type:database", "provider:postgres", "resource:postgresql", "team:claranet", "created-by:terraform"], var.postgresql_replication_delay_extra_tags)
}
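
To see what the interpolated query looks like, the same template can be reproduced with plain locals. This is only a sketch using the default input values; the `filter` local stands in for `module.filter-tags.query_alert`, whose real output is not shown in this diff, and `{*}` is an assumed value. The query is written on one line here, whereas the heredoc in the resource keeps its line breaks.

```hcl
# Illustrative only: mirrors the monitor's query template with default inputs.
locals {
  aggregator = "avg"      # default of postgresql_replication_delay_aggregator
  timeframe  = "last_15m" # default of postgresql_replication_delay_timeframe
  critical   = 200        # default of postgresql_replication_delay_threshold_critical
  filter     = "{*}"      # ASSUMPTION: stand-in for module.filter-tags.query_alert

  rendered_query = "${local.aggregator}(${local.timeframe}): avg:postgresql.replication_delay${local.filter} by {host,port} > ${local.critical}"
}

output "rendered_query" {
  # With the values above this evaluates to:
  # avg(last_15m): avg:postgresql.replication_delay{*} by {host,port} > 200
  value = local.rendered_query
}
```

Because the query groups `by {host,port}`, the monitor evaluates each host/port pair separately, which is what lets the `{{host}}:{{port}}` template variables in the monitor name resolve per group.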
5 changes: 5 additions & 0 deletions database/postgresql/outputs.tf
@@ -13,6 +13,11 @@ output "postgresql_disk_queue_depth_id" {
  value       = datadog_monitor.postgresql_disk_queue_depth.*.id
}

output "postgresql_replication_delay_id" {
  description = "id for monitor postgresql_replication_delay"
  value       = datadog_monitor.postgresql_replication_delay.*.id
}

output "postgresql_too_many_locks_id" {
  description = "id for monitor postgresql_too_many_locks"
  value       = datadog_monitor.postgresql_too_many_locks.*.id
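
Because the monitor is created with `count`, the `postgresql_replication_delay_id` output is a list holding zero or one id. A caller might unwrap it as sketched below; the module label `postgresql_monitors` and the output name are hypothetical, and `one()` requires a Terraform version that provides it (0.15 or later).

```hcl
# Hypothetical consumer of the new output (not part of this commit).
output "replication_delay_monitor_id" {
  # one() returns the single element, or null when the monitor is disabled
  # and the list is empty.
  value = one(module.postgresql_monitors.postgresql_replication_delay_id)
}
```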