
Dashboard reporting incorrect notification counts #1369

Closed
whabanks opened this issue Oct 18, 2023 · 19 comments
Labels
Bug | Bogue Dev Task for implementation of a technical solution Medium Priority | Priorité moyenne

Comments

whabanks commented Oct 18, 2023

Describe the bug

Notification counts, under the "Sent in the last week" section of a user's dashboard, are being reported incorrectly. The counts are much larger than the actual count of sent notifications found in the DB and in Redis under the total_notifications key for that service.

Bug Severity

See examples in the documentation

SEV-2 Major

To Reproduce

Steps to reproduce are currently unknown

Expected behavior

Notification counts under the "Sent in the last week" section should be reported accurately.

Impact

Users are unable to accurately evaluate their sent notification count against their failure and bounce rates, making it very confusing to understand the current state of their service. Users may feel deterred from addressing their problem email addresses due to this confusion, which in turn affects our overall bounce rate in AWS. Users who send frequently in the morning are more affected.

Impact on Notify users:

Confusion around how many notifications a user's service has sent. This makes it difficult for a user to understand their send limits, failure rates, and bounce rates. It also raises concern about the trust level of their service, namely that recipients may have received duplicate emails.

Impact on Recipients:

None that we are aware of at this time. There is no evidence to suggest that any notification sends have been duplicated.

Impact on Notify team:

Increased load on the support team, which needs to answer tickets clarifying to users that the counts are incorrect.

Screenshots

This user's dashboard is reporting 28,712 sent notifications.
[Private Zenhub Image]
(https://api.zenhub.com/attachedFiles/eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaHBBMVpNQVE9PSIsImV4cCI6bnVsbCwicHVyIjoiYmxvYl9pZCJ9fQ==--89aed9ef8ef19b7dc7778fd2145ab85009a7b16a/image.png)
However in actuality over the last week, the user has only sent a total of

Additional context

This issue was brought to our attention by two support tickets asking about the discrepancy:
https://cds-snc.freshdesk.com/a/tickets/15832
https://cds-snc.freshdesk.com/a/tickets/15825

Discussion threads:
https://gcdigital.slack.com/archives/C03FA4DJCCU/p1697628973581769
https://gcdigital.slack.com/archives/C03FA4DJCCU/p1697554077616559

@whabanks whabanks added High Priority | Haute priorité Bug | Bogue Dev Task for implementation of a technical solution labels Oct 18, 2023
whabanks commented Oct 18, 2023

Navigate to an affected service and click on the "Emails sent in the last week" link and view the sent notifications. You can verify, without using the database or Redis keys, that the counts are incorrect by:

  1. Download the report and see that the counts do not line up.
  2. Take the count displayed on the dashboard (in this case 28,712) and divide it by 50 (the number of notifications per page) to get the expected last page. Then edit the page number in the URL, attempt to navigate to that last page, and see that it is not possible.

This implies we have the information we need to display the correct counts somewhere in the data we pass to the dashboard. We can use that data temporarily as a stop-gap measure while we determine where things are going wrong in our current process for fetching and displaying the notification counts.
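
For reference, a minimal sketch of that pagination check using the figures from the screenshot above (the 50-per-page size comes from this comment; the ?page= query-parameter name is an assumption, the rest is just arithmetic):

```python
import math

# Figures from the comment above: the dashboard claims 28,712 notifications
# and the notifications page shows 50 rows per page.
dashboard_count = 28_712
page_size = 50

# If the dashboard count were accurate, the last reachable page would be:
expected_last_page = math.ceil(dashboard_count / page_size)  # 575

print(f"Expected last page if the dashboard count is right: {expected_last_page}")
# Editing ?page=575 into the notifications URL and finding it unreachable shows
# the true count is lower than what the dashboard reports.
```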

whabanks commented Nov 6, 2023

Possibly related: #1378

@yaelberger-commits

Hey team! Please add your planning poker estimate with Zenhub @andrewleith @jzbahrai @whabanks

@whabanks whabanks self-assigned this Nov 20, 2023
@whabanks

Started work on a Jupyter notebook that compares Redis notification counts against notification counts in the DB, to identify services with discrepancies and make the investigation simpler.
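
A minimal sketch of the kind of comparison the notebook makes, assuming direct access to Redis and the database. The total_notifications key comes from the issue description; the exact key format, connection setup, and table/column names here are assumptions for illustration only:

```python
import os

import psycopg2
import redis

# Assumed connection setup; adjust to the environment being investigated.
r = redis.Redis.from_url(os.environ["REDIS_URL"])
conn = psycopg2.connect(os.environ["DATABASE_URL"])

def counts_for_service(service_id: str) -> tuple[int, int]:
    """Return (redis_count, db_count) for one service over the last 7 days."""
    # Key format is assumed; the issue only says the count lives under a
    # per-service total_notifications key.
    raw = r.get(f"{service_id}-total_notifications")
    redis_count = int(raw) if raw is not None else 0

    with conn.cursor() as cur:
        # Table/column names are assumptions based on the Notify schema.
        cur.execute(
            """
            SELECT count(*)
            FROM notifications
            WHERE service_id = %s
              AND created_at >= now() - interval '7 days'
            """,
            (service_id,),
        )
        db_count = cur.fetchone()[0]
    return redis_count, db_count

redis_count, db_count = counts_for_service("<service-uuid>")
print(f"Redis: {redis_count}, DB: {db_count}, discrepancy: {redis_count - db_count}")
```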

whabanks commented Nov 23, 2023

Uncovered some leads from investigating affected services, identified via the Jupyter notebook mentioned above.

  1. Affected services all have jobs associated with them. (At least the ones investigated thus far)
  2. For some services the sum(notifications across jobs) = discrepancy count

As the sum(notifications across all jobs) does not always equal the discrepancy count, our job reprocessing code seems like a reasonable place to start.

Next steps / plan moving forward:

  • Identify whether any of the jobs in affected services were retried, and if so, whether the sum(notifications in retried jobs) equals the discrepancy count.
  • Add a query to the Jupyter notebook, examining each affected service, to determine if this is an issue with jobs specifically or if one-off sends are also contributing to the discrepancies (a rough sketch of such a query follows this list).
  • If the aforementioned theory around retried jobs can be validated, then try reproducing the issue by delaying job processing long enough to trigger a retry.
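
A rough shape of that query, splitting last week's notifications per service into job-based and one-off sends so the discrepancy can be attributed to one path or the other (table and column names are assumptions based on the Notify schema, not confirmed against the actual database):

```python
import os

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection setup and schema; intended to be dropped into the
# investigation notebook alongside the other queries.
engine = create_engine(os.environ["DATABASE_URL"])

PER_SERVICE_SPLIT = """
SELECT
    service_id,
    count(*) FILTER (WHERE job_id IS NOT NULL) AS job_notifications,
    count(*) FILTER (WHERE job_id IS NULL)     AS one_off_notifications
FROM notifications
WHERE created_at >= now() - interval '7 days'
GROUP BY service_id
"""

# Comparing job_notifications against each service's discrepancy count shows
# whether jobs alone explain it, or whether one-off sends also contribute.
df = pd.read_sql(PER_SERVICE_SPLIT, engine)
print(df.head())
```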

whabanks commented Dec 1, 2023

After further investigation with @andrewleith and @jzbahrai, we've narrowed the cause down to a couple of issues in how notifications are fetched from the ft_notification_status table. These are believed to be the root cause of the discrepancies we see between counts on the dashboard, the notifications page, and the downloadable reports.

  1. We fetch notifications starting from the service's retention period - 1 day.

For provincial services with a retention period of 3 days, this means we are not fetching notifications for the entire previous week.

  2. The retention period - 1 day date is converted to midnight UTC before it is used in the query.

This leaves a 5-hour window between 00:30 and 05:30 where not all notifications for the week are being collected.
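
A simplified illustration of both problems, assuming an Eastern-time service with a 3-day retention period; the function names are hypothetical and only sketch the behaviour described above, not the actual Notify code:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("America/Toronto")

def buggy_window_start(retention_days: int, now_utc: datetime) -> datetime:
    # Problem 1: starting from "retention period - 1 day" means a service with
    # 3-day retention only looks back 2 days, not the full week the dashboard
    # claims to cover.
    start = now_utc - timedelta(days=retention_days - 1)
    # Problem 2: truncating that date to midnight UTC rather than local
    # midnight shifts the boundary by roughly 5 hours for Eastern services,
    # so some notifications fall outside the window.
    return start.replace(hour=0, minute=0, second=0, microsecond=0)

def intended_window_start(now_utc: datetime) -> datetime:
    # What the dashboard advertises: a full 7 days, measured from local midnight.
    local_week_ago = now_utc.astimezone(LOCAL_TZ) - timedelta(days=7)
    local_midnight = datetime.combine(
        local_week_ago.date(), datetime.min.time(), tzinfo=LOCAL_TZ
    )
    return local_midnight.astimezone(timezone.utc)

now = datetime(2023, 12, 1, 14, 0, tzinfo=timezone.utc)
print(buggy_window_start(3, now))   # only ~2 days back, pinned to 00:00 UTC
print(intended_window_start(now))   # 7 days back, local midnight expressed in UTC
```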

whabanks commented Dec 6, 2023

Further investigation/analysis confirms the issue with UTC times affecting counts.
Next steps are to confirm if the retention period is also affecting these counts, and implement the fix.

@whabanks

Ready for review!

@andrewleith

@whabanks

The PR was reviewed; some small refactors to improve testability are to come before merging.

@adriannelee

Good to go for QA once code freeze lifts

@whabanks

Code has been merged and is ready for QA in staging.

@whabanks

Current state:

  • For some services, counts are now correct
  • For others, they're slightly off, likely a side effect of the nightly task being off by 5 hours (discussed below).

Notes for when this is picked up later

  • Notification counts are pulled from the notification table for the current day, and from ft_notification_status for previous days
  • The nightly task that populates counts in the ft_notification_status table collects notifications from 5AM the day before to 5AM of the current day, so we need to match that timeframe when we fetch notifications for the current day.
    • Timezone and daylight saving time aware dates were introduced in an attempt to match the timeframes used in the nightly task when populating ft_notification_status.
  • We need to revisit and adjust the timeframe that the nightly task uses when aggregating notifications for ft_notification_status. 5AM to 5AM doesn't really make a lot of sense and is likely the cause of some services still having discrepancies: if they sent anything between midnight and 5AM UTC, those notifications would not be counted in the aggregate count for that day. (A sketch of a timezone-aware day boundary follows this list.)
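
A sketch of the kind of timezone- and DST-aware boundary described above, assuming Eastern local time; the helper name and exact convention are illustrative rather than the functions actually added in the PR:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("America/Toronto")

def utc_bounds_for_local_day(now_utc: datetime) -> tuple[datetime, datetime]:
    """Return the UTC start and end of the current local calendar day.

    Local midnight in Eastern time is 05:00 UTC during standard time and
    04:00 UTC during daylight saving time, which is roughly the "5AM to 5AM"
    window the nightly task uses. Deriving the bounds from the local date
    keeps the dashboard's current-day query aligned with the nightly
    aggregation.
    """
    local_now = now_utc.astimezone(LOCAL_TZ)
    start_local = datetime.combine(
        local_now.date(), datetime.min.time(), tzinfo=LOCAL_TZ
    )
    end_local = datetime.combine(
        local_now.date() + timedelta(days=1), datetime.min.time(), tzinfo=LOCAL_TZ
    )
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)

start_utc, end_utc = utc_bounds_for_local_day(datetime.now(timezone.utc))
print(start_utc, end_utc)  # e.g. 05:00 UTC today through 05:00 UTC tomorrow during EST
```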

@andrewleith andrewleith assigned andrewleith and unassigned whabanks Feb 7, 2024
@andrewleith

Made some headway on this. Looks like there are 2 areas causing issues:

  1. When we copy aggregate data to ft_notification_status we are not using midnight UTC
  2. When we calculate statistics for the dashboard, we are in one case not using midnight UTC
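
Both fixes come down to using one consistent day boundary in both places; a minimal sketch of what that could look like (the helper name and usage are illustrative, not the actual code that was changed):

```python
from datetime import datetime, timezone

def midnight_utc(dt: datetime) -> datetime:
    """Truncate a datetime to 00:00 UTC of its UTC calendar day.

    Using the same boundary both when copying aggregate data into
    ft_notification_status and when computing dashboard statistics keeps the
    two sources of counts comparable.
    """
    return dt.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)

print(midnight_utc(datetime(2024, 2, 7, 3, 45, tzinfo=timezone.utc)))
# 2024-02-07 00:00:00+00:00
```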

@andrewleith andrewleith added the Ready for release to prod Work that is done and ready to release label Feb 26, 2024
@jzbahrai jzbahrai assigned jzbahrai and unassigned andrewleith Mar 4, 2024
whabanks commented Mar 6, 2024

@jzbahrai to review today.

@mtoutloff

@jzbahrai QA'd and will move this back to product backlog for now

@yaelberger-commits

Counts involve both the notification table and the notification history table. When we store data into the facts table, it covers a certain timeframe, and our timeframes are not actually adding up, so when we download a report with aggregate data it doesn't add up either. We had teamed up with Core; one of the Celery tasks doing the aggregation was using a different time. Many different parts are involved.

@yaelberger-commits

Revisit priority in Q2 (early July)

@yaelberger-commits yaelberger-commits self-assigned this May 23, 2024
@yaelberger-commits yaelberger-commits removed their assignment Aug 21, 2024
@yaelberger-commits yaelberger-commits removed the Ready for release to prod Work that is done and ready to release label Aug 21, 2024
@andrewleith

This is now complete. Stats are matching on the dashboard, the notifications report page, as well as the notification reports download. 🎉
