Tool to manage AxonOps settings
To run the Ansible AxonOps Playbook you will need:
- Ansible >= 2.10
- Python 3.8 or above
Clone this repo: https://github.com/axonops/axonops-alerts

The Ansible playbook takes its input from environment variables. The standard variables for a SaaS connection are:
```sh
# Export your org.
# This is the only mandatory variable.
export AXONOPS_ORG='example'

# Create your API token within your AxonOps console. You will need DBA-level access
# or above to the clusters you will be configuring.
export AXONOPS_TOKEN='aaaabbbbccccddddeeee'
```
To simplify the process, the `export_tokens.sh` file has been created with all the accepted variables. Modify this file with your specific details, and then export the variables:

```sh
source ./export_tokens.sh
```
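If you want to confirm the variables actually reached your shell, a quick check along these lines can help (a minimal sketch; `example` is a placeholder org name):

```shell
# Export the mandatory variable (placeholder value) and fail fast if it is empty;
# the playbook reads it from the environment of the invoking shell.
export AXONOPS_ORG='example'
: "${AXONOPS_ORG:?AXONOPS_ORG must be set}"
echo "AxonOps org: ${AXONOPS_ORG}"
```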
The `AXONOPS_TOKEN` parameter is used only for AxonOps SaaS. For AxonOps on-premises, you can use a username and password or configure it to allow anonymous login. Refer to `export_tokens.sh` for more information on configuring the Ansible playbook for AxonOps on-premises and on the accepted environment variables.
The playbooks have been tested on Ansible 2.10, but they should work on most versions.
The system `ansible` package should work fine for RedHat and RockyLinux >= 8:

```sh
sudo dnf -y install epel-release
sudo dnf -y install ansible make
```
It has been tested on Ubuntu 22.04 and Debian Bookworm:

```sh
sudo apt update
sudo apt -y install ansible make
```
If you're using virtualenv, simply create a Python 3 environment and install Ansible into it:

```sh
virtualenv ~/py-axonops
source ~/py-axonops/bin/activate
pip3 install -r requirements.txt
```
We recommend using `pipenv` to manage the environment. After installing `pipenv`, simply run:

```sh
pipenv install
```

and then export the variable:

```sh
export PIPENV=true
```
The configuration is structured in folders within the `config` directory.

The `config/REPLACE_WITH_ORG_NAME` folder contains `alert_endpoints.yml`, which defines alert endpoints at the org level: alert endpoints are defined for the entire org and shared across all your clusters, while all other configurations are defined per cluster. This folder also contains the following files:
- `metric_alert_rules.yml`
- `log_alert_rules.yml`
- `service_checks.yml`
The configurations defined in this folder will be applied to all of your clusters, so define your common alert rules and service checks here.
To define cluster-specific configurations, overriding the rules and configurations defined at the org level, example files are provided under `config/REPLACE_WITH_ORG_NAME/REPLACE_WITH_CLUSTER_NAME`. The rules and configurations in this folder append to and override the settings provided in the org folder.
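Putting the two levels together, the layout looks roughly like this (folder names such as `example` and `prodcluster` are placeholders for your own org and cluster names):

```
config/
└── example/                      # org-level settings, shared by all clusters
    ├── alert_endpoints.yml
    ├── metric_alert_rules.yml
    ├── log_alert_rules.yml
    ├── service_checks.yml
    └── prodcluster/              # cluster-level overrides, one folder per cluster
        ├── metric_alert_rules.yml
        ├── log_alert_rules.yml
        └── service_checks.yml
```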
Alert endpoints such as Slack, Teams, PagerDuty, and OpsGenie can be configured using this Ansible playbook. Since alert endpoint configurations are an AxonOps org-level setting, the configuration file is placed at `./config/<org_name>/alert_endpoints.yml`.
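For illustration only, an org-level endpoint entry might look something like the sketch below. The field names here are assumptions, not the confirmed schema, so check the example `alert_endpoints.yml` shipped in the repo for the actual fields:

```yaml
# Hypothetical sketch - field names are illustrative, not the confirmed schema.
- name: ops-slack
  type: slack
  webhook_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
  present: true
```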
The metric alert rules are configured against the charts that exist in the AxonOps dashboards for each cluster. Multiple alert rules can be configured against each chart.
An example configuration for a metric alert is shown below:

```yaml
- name: CPU usage per host
  dashboard: System
  chart: CPU usage per host
  operator: '>='
  critical_value: 99
  warning_value: 90
  duration: 1h
  description: Detected High CPU usage
  present: true
```
- `name` is the name of the alert.
- `dashboard` must correspond to the dashboard name in the AxonOps right-hand menu.
- `chart` must correspond to the name of the chart within the above dashboard, in this case `CPU usage per host`. The metric query is automatically detected by specifying the chart name.
- `operator` options are: `==`, `>=`, `>`, `<=`, `<`, `!=`.
- `critical_value` is the critical value threshold.
- `warning_value` is the warning value threshold.
- `duration` is the duration the warning or critical values must violate the operator rule before the alert is triggered.
- `description` sets the description of the alert. You may want to add a description of an action to take when this alert is raised.
- `present` is `true` or `false`; setting it to `false` will remove the alert.
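The same structure also removes rules. For example, to delete the CPU alert defined above, keep the identifying fields and flip `present` to `false`:

```yaml
- name: CPU usage per host
  dashboard: System
  chart: CPU usage per host
  operator: '>='
  critical_value: 99
  warning_value: 90
  duration: 1h
  present: false   # removes the alert if it exists
```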
Log alerts can be defined using this Ansible playbook.
An example configuration for a log alert is shown below:

```yaml
- name: TLS failed to handshake with peer
  warning_value: 50
  critical_value: 100
  duration: 5m
  content: \"Failed to handshake with peer\"
  source: "/var/log/cassandra/system.log"
  description: "Detected TLS handshake error with peer"
  level: warning,error,info
  present: true
```
- `name` is the name of the alert.
- `warning_value` is the warning value threshold based on the count of matched logs.
- `critical_value` is the critical value threshold based on the count of matched logs.
- `duration` is the duration the warning or critical values must violate the operator rule before the alert is triggered.
- `content` is the text search. Double quotes must be escaped.
Currently the following matching syntax is supported:

- `hello` matches `hello`
- `hello world` matches `hello` or `world`
- `"hello world"` matches exact `hello world`
- `-hello` matches excluding `hello`
- `-"hello world"` matches excluding `hello world`
- `-hello -world` matches excluding `hello` or `world`
- `source` specifies the log source. This must match one of the options available in the `Source` filter found in the Logs & Events view.
- `description` sets the description of the alert. You may want to add a description of an action to take when this alert is raised.
- `level` sets the event level filter, a comma-separated list with the following values: `debug`, `error`, `warning`, `info`.
- `present` is `true` or `false`; setting it to `false` will remove the alert.
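Combining the fields above with the exclusion syntax, a hypothetical rule that counts repair errors while filtering out a known-noisy message could look like this (the name, thresholds, and search text are placeholders):

```yaml
- name: Repair errors excluding validation noise
  warning_value: 10
  critical_value: 50
  duration: 10m
  content: repair -\"validation complete\"
  source: "/var/log/cassandra/system.log"
  description: "Repair errors detected in the Cassandra system log"
  level: warning,error
  present: true
```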
Service checks in AxonOps can be configured using this playbook. Example service check configurations can be found in `./config/REPLACE_WITH_ORG_NAME/REPLACE_WITH_CLUSTER_NAME/service_checks.yml`.
Backup schedules can be created and backup snapshots taken.
Supported backup locations are:
- local
- s3
- sftp
- azure
Remote backup paths take the form of `$remote_path/cassandra/$cluster_name/$node_id`.
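As a concrete illustration of that expansion (all values below are placeholders), a shell one-liner shows the resulting path:

```shell
# Placeholder values - substitute your own bucket path, cluster name, and node ID.
remote_path='mybucket/backups'
cluster_name='testcluster'
node_id='3b6c2d1e'
echo "${remote_path}/cassandra/${cluster_name}/${node_id}"
# prints: mybucket/backups/cassandra/testcluster/3b6c2d1e
```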
The following options apply to all backup configurations:
Option | Required | Type | Description |
---|---|---|---|
present | No | Bool | Whether a backup schedule should exist. Setting to False will remove an existing schedule. Defaults to True |
local_retention | No | Str | How long to keep a snapshot on the local node. Defaults to 10d (10 days) |
remote | No | Bool | Whether the backup is sent to a remote location. Defaults to False |
remote_retention | No | Str | How long to keep a snapshot on the remote location. Defaults to 60d (60 days) |
remote_type | Only if remote is True | Str | Where to send backups. One of 'local', 's3', 'sftp', 'azure'. Defaults to local |
timeout | No | Str | Time before the backup times out. Defaults to 10h (10 hours) |
transfers | No | Int | Number of parallel file transfers |
tps_limit | No | Int | Throttle transfer amount |
bw_limit | No | Str | Apply bandwidth throttling. Use a suffix b |
tag | No | Str | Tag to apply to the backup |
datacenters | Yes | List(Str) | Datacenters to include in the backup |
nodes | No | List(Str) | Nodes to include in the backup |
tables_keyspace | No | List(Str) | Mutually exclusive with tables |
tables | No | List(Str) | Tables to include in the backup. Mutually exclusive with tables_keyspace |
keyspaces | No | List(Str) | Keyspaces to include in the backup |
schedule | No | Bool | Whether to schedule a future backup. If False, an immediate snapshot will be taken |
schedule_expr | No | Str | Crontab expression for the backup schedule. Defaults to '0 1 * * *' |
Backs up to the local filesystem of the node.
Sends backups to an S3 bucket
Option | Required | Type | Description |
---|---|---|---|
remote_path | Yes | Str | Path to store the backups. Must include the bucket name, e.g. mybucket/path/to/backups |
s3_region | Yes | Str | S3 region the bucket is in |
s3_access_key_id | No | Str | S3 access key ID if not using IAM authentication |
s3_secret_access_key | No | Str | S3 secret access key if not using IAM authentication |
s3_storage_class | No | Str | Storage class of the bucket. Defaults to STANDARD. One of 'default', 'STANDARD', 'reduced_redundancy', 'standard_ia', 'onezone_ia', 'glacier', 'deep_archive', 'intelligent_tiering' |
s3_acl | No | Str | ACL type of the bucket. Defaults to private. One of 'private', 'public-read', 'public-read-write', 'authenticated-read', 'bucket-owner-read' |
s3_encryption | No | Str | Encryption to apply. Defaults to AES256. One of 'none', 'AES256' |
s3_no_check_bucket | No | Bool | |
s3_disable_checksum | No | Bool | |
Sends backups to an SFTP/SSH server.
Option | Required | Type | Description |
---|---|---|---|
remote_path | Yes | Str | Path to store the backups on the remote server |
host | Yes | Str | Host to connect to |
ssh_user | Yes | Str | Username to connect as |
ssh_pass | No | Str | Password to connect with. Either ssh_pass or key_file needs to be set |
key_file | No | Str | Location of key file on the host. Either ssh_pass or key_file needs to be set |
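Combining the common options with the SFTP fields above, a schedule to an SFTP server might look like the sketch below (host names, paths, and credentials are placeholders):

```yaml
- name: Schedule a backup to an SFTP server
  remote_type: sftp
  cluster: testcluster
  datacenters: dc1
  remote: True
  remote_path: /backups/cassandra
  host: backup.example.com
  ssh_user: backupuser
  key_file: /home/backupuser/.ssh/id_rsa
  local_retention: 10d
  remote_retention: 60d
  schedule: True
  schedule_expr: 0 2 * * *
```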
Sends backups to an Azure Storage blob container.

Option | Required | Type | Description |
---|---|---|---|
remote_path | Yes | Str | Path to store the backups. Must include the container name, e.g. mycontainer/path/to/backups |
azure_account | Yes | Str | The name of the Azure storage account |
azure_endpoint | No | Str | Overrides the endpoint destination for the Azure storage account. Generally not required |
azure_key | No | Str | Storage account key. Only required if not using Azure MSI authentication |
azure_msi | No | Bool | Whether to use Azure MSI authentication to connect to the storage account |
azure_msi_object_id | No | Str | Only required if there are multiple user-assigned identities. Mutually exclusive with azure_msi_client_id and azure_msi_mi_res_id |
azure_msi_client_id | No | Str | Only required if there are multiple user-assigned identities. Mutually exclusive with azure_msi_object_id and azure_msi_mi_res_id |
azure_msi_mi_res_id | No | Str | Only required if there are multiple user-assigned identities. Mutually exclusive with azure_msi_object_id and azure_msi_client_id |
```yaml
- name: Schedule a backup to S3 bucket
  remote_type: s3
  cluster: testcluster
  datacenters: dc1
  remote_path: bucketname/path
  local_retention: 10d
  remote_retention: 60d
  tag: "scheduled backup"
  timeout: 10h
  remote: True
  schedule: True
  schedule_expr: 0 1 * * *
  s3_region: eu-west-2
  s3_acl: private
```
```yaml
- name: Snapshot a table to an Azure Blob
  remote_type: azure
  cluster: testcluster
  datacenters: dc1
  remote_path: foo
  local_retention: 10d
  remote_retention: 30d
  tag: "Snapshot appTable"
  timeout: 10h
  remote: True
  tables: ['appKeyspace.appTable']
  keyspaces: ['appKeyspace']
  schedule: False
  azure_account: azure_storage_account_name
  azure_use_msi: True
```
The playbooks are designed to run in a predefined order as some of them depend on the others. For example, you'll need to create the alert endpoints before you can set up alert routing.
1. Set up alert endpoints
2. Set up routes
3. Set up metric alerts
4. Set up log alerts
5. Set up service checks
6. Set up backup schedules
The provided Makefile is the easiest way to run the playbooks:

```
❯ make help
metrics-alerts   Create alerts based on metrics
check            run pre-commit tests
endpoints        Create alert endpoints and integrations
log-alerts       Create alerts based on logs
routes           Create alert routes
service-checks   Create alerts for TCP and shell connections
backups          Create backup schedules
```
You can decide either to configure all the parameters as explained above using the `export_tokens.sh` file, or to set them on the command line, overriding the environment configuration:
```sh
make endpoints AXONOPS_ORG=ORG_NAME
make routes AXONOPS_ORG=ORG_NAME AXONOPS_CLUSTER=CLUSTER_NAME
make metrics-alerts AXONOPS_ORG=ORG_NAME AXONOPS_CLUSTER=CLUSTER_NAME
make log-alerts AXONOPS_ORG=ORG_NAME AXONOPS_CLUSTER=CLUSTER_NAME
make service-checks AXONOPS_ORG=ORG_NAME AXONOPS_CLUSTER=CLUSTER_NAME
make backups AXONOPS_ORG=ORG_NAME AXONOPS_CLUSTER=CLUSTER_NAME
```
The provided playbooks are only examples. Adapt the rules and configurations to suit your enterprise requirements.