
How to Evaluate Statistics for Generic "Lambda Error" Alerts Posted to Slack

Tim Ellis edited this page Aug 27, 2024 · 14 revisions

Prerequisites

  • Access to CMS VPN
  • Access to BFD/CMS AWS account(s)
  • Installation of AWS CLI, properly configured for access to BFD/CMS AWS account
  • Installation of jq
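
Before going further, it can help to confirm that the tools listed above are actually on the PATH. The following is a minimal sketch and not part of the original runbook: require_cmds is a hypothetical helper name, and it only checks that the named commands can be found.

```shell
#!/bin/bash
# Hypothetical preflight helper (not from the original page):
# succeeds only if every named command is found on the PATH.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing prerequisite: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_cmds aws jq && aws sts get-caller-identity` would confirm that both tools are installed and that the configured credentials resolve to an identity; the second command needs network access to AWS.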

Instructions

Determine which alarms have been tripped within the past 24h:

After connecting to the CMS VPN, save the following bash script to your local environment and make it executable:

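#!/bin/bash

## Set the CloudWatch metrics query: sum of Errors per Lambda function in 5-minute periods
LAMBDAERRORMETRICSRCH=$(
cat <<EOF
[ { "Id": "e1", "Expression": "SEARCH('{AWS/Lambda,FunctionName} (\"Errors\")', 'Sum', 300)", "Label": "Expression1", "ReturnData": true } ]
EOF
)

## Set the start and end of the query window (UTC, past 24 hours).
## "date -v-1d" is BSD/macOS syntax; fall back to the GNU form if it is rejected.
STARTTS=$(date -u -v-1d "+%Y-%m-%dT%H:%M:%S" 2>/dev/null || date -u -d "1 day ago" "+%Y-%m-%dT%H:%M:%S")
ENDTS=$(date -u "+%Y-%m-%dT%H:%M:%S")

## Create a temp file for the jq filter and remove it whenever the script exits.
## The "|| exit 1" sits outside the command substitution so that a mktemp failure stops the script.
JQRYEX=$(mktemp -q /tmp/jqry.XXXXXX) || exit 1
trap 'rm -f -- "${JQRYEX}"' EXIT

## Construct the jq filter (quoted heredoc, so "$" needs no escaping)
cat > "${JQRYEX}" <<'EOF2'
# For each Lambda, keep only the periods with a non-zero error sum, keyed by
# timestamp, then drop Lambdas that had no errors in the window.
.MetricDataResults
| map({
    Lambda: (.Label | ltrimstr("Expression1 ")),
    HitsInUTC: [
      .Timestamps as $ts | .Values as $vals
      | [ range(0; $ts | length) | select($vals[.] > 0) | { ($ts[.]): $vals[.] } ]
      | add
    ]
  })
| .[]
| select(.HitsInUTC[])
EOF2

## Query CloudWatch metrics and summarize the non-zero error counts per Lambda
aws cloudwatch get-metric-data --metric-data-queries "${LAMBDAERRORMETRICSRCH}" --start-time "${STARTTS}" --end-time "${ENDTS}" --output json | jq -f "${JQRYEX}" | jq -s .

## Temp file cleanup
rm -f -- "${JQRYEX}"
trap - EXIT
exit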

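Note: The above script queries the AWS/Lambda CloudWatch Errors metric, summed over 5-minute periods, for the past 24 hours (date -u -v-1d "+%Y-%m-%dT%H:%M:%S"). The -v-1d flag is BSD/macOS date syntax; GNU date expresses the same offset as -d "1 day ago". The jq filter (JQRYEX) then reduces the result set to the Lambdas that reported at least one error, listing each non-zero sum under its UTC timestamp, for the engineer to evaluate.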
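
To see what the jq step produces, the same filter can be run against a small hand-written payload. This is a sketch only: the function names fn-a and fn-b, the timestamps, and the values are made up, summarize_errors is a hypothetical wrapper name, and the payload assumes each result's Label has the form "Expression1 <FunctionName>", as the script's ltrimstr call implies.

```shell
#!/bin/bash
# Hypothetical sample shaped like get-metric-data output; all values are made up.
SAMPLE='{"MetricDataResults":[
  {"Label":"Expression1 fn-a",
   "Timestamps":["2024-08-27T10:05:00+00:00","2024-08-27T10:00:00+00:00"],
   "Values":[2,0]},
  {"Label":"Expression1 fn-b",
   "Timestamps":["2024-08-27T10:00:00+00:00"],
   "Values":[0]}
]}'

# Hypothetical wrapper: applies the script's filter to JSON read from stdin.
summarize_errors() {
  jq '.MetricDataResults
      | map({ Lambda: (.Label | ltrimstr("Expression1 ")),
              HitsInUTC: [ .Timestamps as $ts | .Values as $vals
                           | [ range(0; $ts | length) | select($vals[.] > 0)
                               | { ($ts[.]): $vals[.] } ]
                           | add ] })
      | .[] | select(.HitsInUTC[])' | jq -s .
}
```

Running `printf '%s\n' "$SAMPLE" | summarize_errors` should list only fn-a, with a single entry mapping the 10:05 timestamp to 2; fn-b is dropped because none of its values are greater than zero.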

Investigation

Execute the saved script and proceed with further detailed investigation as warranted; feel free to customize as needed.
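
One likely customization is the size of the look-back window. The sketch below is an illustration under stated assumptions, not part of the original script: start_ts_hours_ago is a hypothetical helper that tries the BSD/macOS date -v syntax first and falls back to the GNU date -d syntax.

```shell
#!/bin/bash
# Hypothetical helper: print a UTC timestamp N hours in the past,
# in the same format the script uses for STARTTS.
start_ts_hours_ago() {
  local hours="$1" fmt="+%Y-%m-%dT%H:%M:%S"
  # BSD/macOS date understands -v; GNU date understands -d.
  date -u -v-"${hours}"H "$fmt" 2>/dev/null ||
    date -u -d "${hours} hours ago" "$fmt"
}

# Example: look back 6 hours instead of 24.
STARTTS=$(start_ts_hours_ago 6)
echo "${STARTTS}"
```

The resulting STARTTS can be passed to --start-time in place of the 24-hour value.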
