Add callback failure warning email #2190
Conversation
- Alphabetized the list of Notify's templates in config.py, because readability is nice.
app/celery/service_callback_tasks.py
Outdated
@@ -86,5 +86,5 @@ def _send_data_to_service_callback_api(self, data, service_callback_url, token,
             self.retry(queue=QueueNames.CALLBACKS_RETRY)
         except self.MaxRetriesExceededError:
             current_app.logger.warning(
-                "Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
+                f"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
Turns out we have not been properly capturing callback failure logs for a while now.
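As an illustration of the bug (not code from the PR): without the f prefix, Python never interpolates the braces, so the placeholders are logged literally.

    function_name = "process_service_callback"  # hypothetical value for illustration
    current_app.logger.warning("Retry: {function_name} has retried the max num of times")
    # logs the literal text: Retry: {function_name} has retried the max num of times
    current_app.logger.warning(f"Retry: {function_name} has retried the max num of times")
    # logs: Retry: process_service_callback has retried the max num of times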
- Added CALLBACK_FAILURE_THRESHOLD_PERCENTAGE env var
- Updated the callback email migration script with the latest changes from content
- Added a method to send the callback failure email to service owners
- Stubbed a method to query CloudWatch so we can determine if callbacks for the service have failed at least 5 times in a 30 minute time period before we send the email to the service owner (see the sketch below)
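A rough sketch of what the stubbed CloudWatch query might look like, assuming boto3 and a hypothetical custom metric; the namespace, metric name, and dimension are assumptions, not Notify's actual configuration.

    from datetime import datetime, timedelta, timezone
    import boto3

    CALLBACK_FAILURE_THRESHOLD = 5          # failures
    CALLBACK_FAILURE_WINDOW_MINUTES = 30    # look-back window

    def callback_failures_exceed_threshold(service_id: str) -> bool:
        # Sum a hypothetical "CallbackMaxRetriesExceeded" metric over the last 30 minutes.
        cloudwatch = boto3.client("cloudwatch")
        now = datetime.now(timezone.utc)
        response = cloudwatch.get_metric_statistics(
            Namespace="Notify/Callbacks",             # assumed namespace
            MetricName="CallbackMaxRetriesExceeded",  # assumed metric name
            Dimensions=[{"Name": "ServiceId", "Value": service_id}],
            StartTime=now - timedelta(minutes=CALLBACK_FAILURE_WINDOW_MINUTES),
            EndTime=now,
            Period=CALLBACK_FAILURE_WINDOW_MINUTES * 60,
            Statistics=["Sum"],
        )
        total_failures = sum(point["Sum"] for point in response["Datapoints"])
        return total_failures >= CALLBACK_FAILURE_THRESHOLD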
ALREADY_REGISTERED_EMAIL_TEMPLATE_ID = "0880fbb1-a0c6-46f0-9a8e-36c986381ceb"
APIKEY_REVOKE_TEMPLATE_ID = "a0a4e7b8-8a6a-4eaa-9f4e-9c3a5b2dbcf3"
BRANDING_REQUEST_TEMPLATE_ID = "7d423d9e-e94e-4118-879d-d52f383206ae"
CALLBACK_FAILURE_TEMPLATE_ID = "d8d580f4-86b3-4ba4-9d7c-263a630af354"
Amongst the rearranging, this env var is the one I added.
Easier to read, thank you 🔤
- fix circular dependency
- formatting
Looks good. If our threshold is simply whenever MaxRetriesExceededError
is thrown, I think we can use a redis key to ensure we only send one email per day. See comment below.
            # Retry if the response status code is server-side or 429 (too many requests).
            if not isinstance(e, HTTPError) or e.response.status_code >= 500 or e.response.status_code == 429:
                try:
                    self.retry(queue=QueueNames.CALLBACKS_RETRY)
                except self.MaxRetriesExceededError:
                    current_app.logger.warning(
-                       "Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
+                       f"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id} for service: {current_app.service.id}"
For the limits feature, we use an expiring redis key to ensure only one email is sent within a particular time period. For example, when a service first reaches the threshold, we send the email and set the key in redis. If the threshold is hit again within that period, the key is still present, so we check it and skip sending another email.
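A minimal sketch of that expiring-key guard, assuming redis-py; the key format, TTL, and send helper are placeholders rather than the actual Notify implementation.

    import redis

    redis_client = redis.Redis()  # assumes a reachable redis instance
    EMAIL_SUPPRESSION_SECONDS = 24 * 60 * 60  # hypothetical one-day window

    def maybe_send_callback_failure_email(service_id, send_email):
        # nx=True only sets the key if it does not already exist, so the first
        # caller in the window sends the email and later callers are skipped.
        cache_key = f"callback-failure-email-sent:{service_id}"
        if redis_client.set(cache_key, "true", ex=EMAIL_SUPPRESSION_SECONDS, nx=True):
            send_email(service_id)

The expiry handles resetting the flag on its own, so no scheduled cleanup job is needed.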
HEARTBEAT_TEMPLATE_EMAIL_LOW = "73079cb9-c169-44ea-8cf4-8d397711cc9d"
HEARTBEAT_TEMPLATE_EMAIL_MEDIUM = "c75c4539-3014-4c4c-96b5-94d326758a74"
HEARTBEAT_TEMPLATE_EMAIL_HIGH = "276da251-3103-49f3-9054-cbf6b5d74411"
HEARTBEAT_TEMPLATE_SMS_HIGH = "4969a9e9-ddfd-476e-8b93-6231e6f1be4a"
Check warning (Code scanning / CodeQL): Variable defined multiple times (redefined)
We are not doing this; this is already in the PRD GC Notify service.
Summary | Résumé
This PR adds the groundwork for sending a warning email to service owners when their callback service is not working correctly.
The actual email sending based on threshold is not yet implemented. It is non-trivial to ensure that we don't spam users with emails every time a retried callback reaches the max retries of 5. For the time being we will simply monitor how frequently this occurs while the remaining work is completed.
Related Issues | Cartes liées
Test instructions | Instructions pour tester la modification
TODO: Fill in test instructions for the reviewer.
Release Instructions | Instructions pour le déploiement
None.
Reviewer checklist | Liste de vérification du réviseur