
Dashboard reporting incorrect notification counts #1369

Closed
whabanks opened this issue Oct 18, 2023 · 19 comments
Labels
Bug | Bogue Dev Task for implementation of a technical solution Medium Priority | Priorité moyenne

Comments

whabanks commented Oct 18, 2023

Describe the bug

Notification counts, under the "Sent in the last week" section of a user's dashboard, are being reported incorrectly. The counts are much larger than the actual count of sent notifications found in the DB and in Redis under the total_notifications key for that service.

Bug Severity

See examples in the documentation

SEV-2 Major

To Reproduce

Steps to reproduce are currently unknown

Expected behavior

Notification counts under the "Sent in the last week" section should be reported accurately.

Impact

Users are unable to accurately evaluate their sent notification count against their failure and bounce rates, making it very confusing to understand the current state of their service. Users may feel deterred from addressing their problem email addresses due to this confusion, which in turn affects our overall bounce rate in AWS. Users who send frequently in the morning are more affected.

Impact on Notify users:

Confusion around how many notifications a user's service has sent. This makes it difficult for a user to understand their send limits, failure rates, and bounce rates. It also raises concern about the trust level of their service, namely that recipients may have received duplicate emails.

Impact on Recipients:

None that we are aware of at this time. There is no evidence to suggest that any notification sends have been duplicated.

Impact on Notify team:

Increased load on the support team, which needs to answer tickets clarifying to users that the counts are incorrect.

Screenshots

This user's dashboard is reporting 28,712 sent notifications.
[Private Zenhub Image]
(https://api.zenhub.com/attachedFiles/eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaHBBMVpNQVE9PSIsImV4cCI6bnVsbCwicHVyIjoiYmxvYl9pZCJ9fQ==--89aed9ef8ef19b7dc7778fd2145ab85009a7b16a/image.png)
However in actuality over the last week, the user has only sent a total of

Additional context

This issue was brought to our attention by two support tickets asking about the discrepancy:
https://cds-snc.freshdesk.com/a/tickets/15832
https://cds-snc.freshdesk.com/a/tickets/15825

Discussion threads:
https://gcdigital.slack.com/archives/C03FA4DJCCU/p1697628973581769
https://gcdigital.slack.com/archives/C03FA4DJCCU/p1697554077616559

@whabanks whabanks added High Priority | Haute priorité Bug | Bogue Dev Task for implementation of a technical solution labels Oct 18, 2023
whabanks commented Oct 18, 2023

Navigate to an affected service and click on the "Emails sent in the last week" link and view the sent notifications. You can verify, without using the database or Redis keys, that the counts are incorrect by:

  1. Download the report and see that the counts do not line up.
  2. Take the count displayed on the dashboard (in this case 28,712) and divide it by 50 (the number of notifications per page) to get the expected last page. Then edit the page number in the URL, attempt to navigate to that last page, and see that it is not possible.

This implies we have the information we need to display the correct counts somewhere in the data we pass to the dashboard. We can use that data temporarily as a stop-gap measure while we determine where things are going wrong in our current process for fetching and displaying the notification counts.
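
For reference, a minimal sketch of that pagination check using the figures from the screenshot above (the 50-per-page size comes from this comment; the ?page= query-parameter name is an assumption, the rest is just arithmetic):

```python
import math

# Figures from the comment above: the dashboard claims 28,712 notifications
# and the notifications page shows 50 rows per page.
dashboard_count = 28_712
page_size = 50

# If the dashboard count were accurate, the last reachable page would be:
expected_last_page = math.ceil(dashboard_count / page_size)  # 575

print(f"Expected last page if the dashboard count is right: {expected_last_page}")
# Editing ?page=575 into the notifications URL and finding it unreachable shows
# the true count is lower than what the dashboard reports.
```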

whabanks commented Nov 6, 2023

Possibly related: #1378

@yaelberger-commits

Hey team! Please add your planning poker estimate with Zenhub @andrewleith @jzbahrai @whabanks

@whabanks whabanks self-assigned this Nov 20, 2023
@whabanks

Started work on a Jupyter notebook that compares Redis notification counts against notification counts in the DB, to identify services with discrepancies and make the investigation simpler.
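
A minimal sketch of the kind of comparison the notebook makes, assuming direct access to Redis and the database. The total_notifications key comes from the issue description; the exact key format, connection setup, and table/column names here are assumptions for illustration only:

```python
import os

import psycopg2
import redis

# Assumed connection setup; adjust to the environment being investigated.
r = redis.Redis.from_url(os.environ["REDIS_URL"])
conn = psycopg2.connect(os.environ["DATABASE_URL"])

def counts_for_service(service_id: str) -> tuple[int, int]:
    """Return (redis_count, db_count) for one service over the last 7 days."""
    # Key format is assumed; the issue only says the count lives under a
    # per-service total_notifications key.
    raw = r.get(f"{service_id}-total_notifications")
    redis_count = int(raw) if raw is not None else 0

    with conn.cursor() as cur:
        # Table/column names are assumptions based on the Notify schema.
        cur.execute(
            """
            SELECT count(*)
            FROM notifications
            WHERE service_id = %s
              AND created_at >= now() - interval '7 days'
            """,
            (service_id,),
        )
        db_count = cur.fetchone()[0]
    return redis_count, db_count

redis_count, db_count = counts_for_service("<service-uuid>")
print(f"Redis: {redis_count}, DB: {db_count}, discrepancy: {redis_count - db_count}")
```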

whabanks commented Nov 23, 2023

Uncovered some leads from investigating affected services, identified via the Jupyter notebook mentioned above.

  1. Affected services all have jobs associated with them. (At least the ones investigated thus far)
  2. For some services the sum(notifications across jobs) = discrepancy count

As the sum(notifications across all jobs) does not always equal the discrepancy count, our job reprocessing code seems like a reasonable place to start.

Next steps / plan moving forward:

  • Identify whether any of the jobs in affected services were retried, and if so, whether the sum(notifications in retried jobs) equals the discrepancy count.
  • Add a query to the Jupyter notebook, examining each affected service, to determine if this is an issue with jobs specifically or if one-off sends are also contributing to the discrepancies (a rough sketch of such a query follows this list).
  • If the aforementioned theory around retried jobs can be validated, then try reproducing the issue by delaying job processing long enough to trigger a retry.
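
A rough shape of that query, splitting last week's notifications per service into job-based and one-off sends so the discrepancy can be attributed to one path or the other (table and column names are assumptions based on the Notify schema, not confirmed against the actual database):

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection setup and schema; intended to be dropped into the
# investigation notebook alongside the other queries.
engine = create_engine(os.environ["DATABASE_URL"])

PER_SERVICE_SPLIT = """
SELECT
    service_id,
    count(*) FILTER (WHERE job_id IS NOT NULL) AS job_notifications,
    count(*) FILTER (WHERE job_id IS NULL)     AS one_off_notifications
FROM notifications
WHERE created_at >= now() - interval '7 days'
GROUP BY service_id
"""

# Comparing job_notifications against each service's discrepancy count shows
# whether jobs alone explain it, or whether one-off sends also contribute.
df = pd.read_sql(PER_SERVICE_SPLIT, engine)
print(df.head())
```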

whabanks commented Dec 1, 2023

After further investigation with @andrewleith and @jzbahrai, we've narrowed the cause down to a couple of issues in how notifications are fetched from the ft_notification_status table. These are believed to be the root cause of the discrepancies we see between counts on the dashboard, the notifications page, and the downloadable reports.

  1. We fetch notifications starting from the service's retention period - 1 day.

For provincial services with a retention period of 3 days, this means we are not fetching notifications for the entire previous week.

  2. The retention period - 1 day date is converted to midnight UTC before it is used in the query.

This leaves a 5-hour window between 00:30 and 05:30 where not all notifications for the week are being collected.
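
A simplified illustration of both problems, assuming an Eastern-time service with a 3-day retention period; the function names are hypothetical and only sketch the behaviour described above, not the actual Notify code:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("America/Toronto")

def buggy_window_start(retention_days: int, now_utc: datetime) -> datetime:
    # Problem 1: starting from "retention period - 1 day" means a service with
    # 3-day retention only looks back 2 days, not the full week the dashboard
    # claims to cover.
    start = now_utc - timedelta(days=retention_days - 1)
    # Problem 2: truncating that date to midnight UTC rather than local
    # midnight shifts the boundary by roughly 5 hours for Eastern services,
    # so some notifications fall outside the window.
    return start.replace(hour=0, minute=0, second=0, microsecond=0)

def intended_window_start(now_utc: datetime) -> datetime:
    # What the dashboard advertises: a full 7 days, measured from local midnight.
    local_week_ago = now_utc.astimezone(LOCAL_TZ) - timedelta(days=7)
    local_midnight = datetime.combine(
        local_week_ago.date(), datetime.min.time(), tzinfo=LOCAL_TZ
    )
    return local_midnight.astimezone(timezone.utc)

now = datetime(2023, 12, 1, 14, 0, tzinfo=timezone.utc)
print(buggy_window_start(3, now))   # only ~2 days back, pinned to 00:00 UTC
print(intended_window_start(now))   # 7 days back, local midnight expressed in UTC
```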

whabanks commented Dec 6, 2023

Further investigation/analysis confirms the issue with UTC times affecting counts.
Next steps are to confirm if the retention period is also affecting these counts, and implement the fix.

@whabanks

Ready for review!

@andrewleith

@whabanks

The PR was reviewed; some small refactors to improve testability are to come before merging.

@adriannelee

Good to go for QA once code freeze lifts

@whabanks

Code has been merged and is ready for QA in staging.

@whabanks

Current state:

  • For some services, counts are now correct
  • For others, they're slightly off, likely a side effect of the nightly task being off by 5 hours (discussed below).

Notes for when this is picked up later

  • Notification counts are pulled from the notification table for the current day, and from ft_notification_status for previous days
  • The nightly task that populates counts in the ft_notification_status table collects notifications from 5AM the day before to 5AM of the current day, so we need to match that timeframe when we fetch notifications for the current day.
    • Timezone and daylight saving time aware dates were introduced in an attempt to match the timeframes used in the nightly task when populating ft_notification_status.
  • We need to revisit and adjust the timeframe that the nightly task uses when aggregating notifications for ft_notification_status. 5AM to 5AM doesn't really make a lot of sense and is likely the cause of some services still having discrepancies: if they sent anything between midnight and 5AM UTC, those notifications would not be counted in the aggregate count for that day. (A sketch of a timezone-aware day boundary follows this list.)
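
A sketch of the kind of timezone- and DST-aware boundary described above, assuming Eastern local time; the helper name and exact convention are illustrative rather than the functions actually added in the PR:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("America/Toronto")

def utc_bounds_for_local_day(now_utc: datetime) -> tuple[datetime, datetime]:
    """Return the UTC start and end of the current local calendar day.

    Local midnight in Eastern time is 05:00 UTC during standard time and
    04:00 UTC during daylight saving time, which is roughly the "5AM to 5AM"
    window the nightly task uses. Deriving the bounds from the local date
    keeps the dashboard's current-day query aligned with the nightly
    aggregation.
    """
    local_now = now_utc.astimezone(LOCAL_TZ)
    start_local = datetime.combine(
        local_now.date(), datetime.min.time(), tzinfo=LOCAL_TZ
    )
    end_local = datetime.combine(
        local_now.date() + timedelta(days=1), datetime.min.time(), tzinfo=LOCAL_TZ
    )
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)

start_utc, end_utc = utc_bounds_for_local_day(datetime.now(timezone.utc))
print(start_utc, end_utc)  # e.g. 05:00 UTC today through 05:00 UTC tomorrow during EST
```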

@andrewleith andrewleith assigned andrewleith and unassigned whabanks Feb 7, 2024
@andrewleith

Made some headway on this. Looks like there are 2 areas causing issues:

  1. When we copy aggregate data to ft_notification_status we are not using midnight UTC
  2. When we calculate statistics for the dashboard, we are in one case not using midnight UTC
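
Both fixes come down to using one consistent day boundary in both places; a minimal sketch of what that could look like (the helper name and usage are illustrative, not the actual code that was changed):

```python
from datetime import datetime, timezone

def midnight_utc(dt: datetime) -> datetime:
    """Truncate a datetime to 00:00 UTC of its UTC calendar day.

    Using the same boundary both when copying aggregate data into
    ft_notification_status and when computing dashboard statistics keeps the
    two sources of counts comparable.
    """
    return dt.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)

print(midnight_utc(datetime(2024, 2, 7, 3, 45, tzinfo=timezone.utc)))
# 2024-02-07 00:00:00+00:00
```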

@andrewleith andrewleith added the Ready for release to prod Work that is done and ready to release label Feb 26, 2024
@jzbahrai jzbahrai assigned jzbahrai and unassigned andrewleith Mar 4, 2024
whabanks commented Mar 6, 2024

@jzbahrai to review today.

@mtoutloff

@jzbahrai QA'd and will move this back to product backlog for now

@yaelberger-commits

Counts involve both the notification table and the notification history table. When we store data into the facts table, it covers a certain timeframe, and our timeframes are not actually adding up, so when we download a report with aggregate data it doesn't add up either. We had teamed up with Core; one of the Celery tasks doing the aggregation was using a different time. Many different parts are involved.

@yaelberger-commits

Revisit priority in Q2 (early July)

@yaelberger-commits yaelberger-commits self-assigned this May 23, 2024
@yaelberger-commits yaelberger-commits removed their assignment Aug 21, 2024
@yaelberger-commits yaelberger-commits removed the Ready for release to prod Work that is done and ready to release label Aug 21, 2024
@andrewleith

This is now complete. Stats are matching on the dashboard, the notifications report page, as well as the notification reports download. 🎉
