Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API gateway timing out on spreadsheet upload #308

Closed
1 task
jimleroyer opened this issue Mar 5, 2024 · 13 comments
Closed
1 task

API gateway timing out on spreadsheet upload #308

jimleroyer opened this issue Mar 5, 2024 · 13 comments
Assignees

Comments

@jimleroyer
Copy link
Member

jimleroyer commented Mar 5, 2024

Describe the bug

The upload of a spreadsheet by IRCC triggered a 502 error on the API gateway, s it seems that the API it taking too long to process the request.

Bug Severity

See examples in the documentation

SEV-2 Major - This impedes user experience, indicates a potential failure upon spreadsheet upload and leads to confusion. The notifications to be sent are successful, hence this is not a critical issue.

To Reproduce

Steps to reproduce the behavior:

  1. Upload a CSV spreadsheet containing 50K notifications, using simulator addresses is sufficient to trigger this.
  2. Click the Send All button!
  3. Get the timeout error.

Expected behavior

No errors

Impact

We get repeated 502 errors on these spreadsheet uploads. This can involve the support team to answer on warning or critical alerts.

QA

  • upload csvs of various sizes (up to 50K) - the "send all now" button should quickly return and not cause a timeout

Screenshots

No screenshot available for now. The whole page said to retry the operation later, hence the error is not even properly surfaced as the website is blanking.

Additional context

This error was seen repeatedly during the mass send of 1M emails over 2 days by one of our user.
The service and template involved.

@sastels
Copy link

sastels commented Mar 5, 2024

Looking at the files uploaded, they actually had an extra named column that wasn't in the template. I verified with a small send that this all works (ie the proper columns are used to construct the notification. In this case, they have a Date column that isn't used in the template:

email address,Date,Date_EN,Date_FR,Date_Spanish
[email protected],2021-12-10,"December 10, 2021",10 décembre 2021,10 de diciembre de 2021
[email protected],2021-12-11,"December 11, 2021",11 décembre 2021,11 de diciembre de 2021

Note that the original files were .xlsx files, Notify read and stored them as csvs in the above format.

@sastels
Copy link

sastels commented Mar 5, 2024

recreated template on staging.

Send 50K row file. upload was fine

image.png

After clicking "Send all now" button, eventually got an error

image.png

When went back to dashboard, job was processing normally

@sastels
Copy link

sastels commented Mar 5, 2024

Add some logging to help figure out in staging where the time is being spent:
cds-snc/notification-api#2130

@sastels
Copy link

sastels commented Mar 5, 2024

In Athena with the query

select time, cast(elb_status_code as integer) as code, request_url, target_processing_time as api_lambda_time
from alb_logs
where request_url like '%start-job%'
order by target_processing_time desc
limit 100

We can see the IRCC requests timing out after 29 seconds, but all other recent jobs have taken less than 19 seconds. Don't know how big those jobs were though.

@sastels
Copy link

sastels commented Mar 6, 2024

Logging is in staging! Running the big send again we see from this query:
image.png

So checking that the service will not go over its limit is taking about a minute!

Since we are sending emails, this is the code that is taking a minute to run:

   elif template.template_type == EMAIL_TYPE:
        check_email_daily_limit(service, len(list(recipient_csv.get_rows())))
        scheduled_for = datetime.fromisoformat(data.get("scheduled_for")) if data.get("scheduled_for") else None

        if scheduled_for is None or not scheduled_for.date() > datetime.today().date():
            increment_email_daily_count_send_warnings_if_needed(service, len(list(recipient_csv.get_rows())))

There are three things here that could be the issue:

  • recipient_csv.get_rows() might take a while, and we are calling it twice to get the length of the file. There might be a better way. In particular, recipient_csv has a __len__ defined, so we should be able to just do len(recipient_csv).
  • check_email_daily_limit()
  • increment_email_daily_count_send_warnings_if_needed()

@sastels
Copy link

sastels commented Mar 6, 2024

With a bit more logging we get
image.png

So it's taking half the time now. I believe what is happening is that recipient_csv.get_rows() is taking 30 seconds to run. Before we were calling it twice, but in the recent logging PR we used properties of recipient_csv that cache the results, so now it's only being caused once.

@sastels
Copy link

sastels commented Mar 7, 2024

Can get the notification count from the s3 object's metadata instead. preparing a PR...

@sastels
Copy link

sastels commented Mar 7, 2024

This should fix it!
cds-snc/notification-api#2133

@sastels
Copy link

sastels commented Mar 7, 2024

This should fix it!
cds-snc/notification-api#2135

@sastels
Copy link

sastels commented Mar 7, 2024

testing on staging with the big IRCC file now takes about one second with no timeouts.

@jimleroyer
Copy link
Member Author

I will QA this morning.

@jimleroyer jimleroyer assigned jimleroyer and unassigned sastels Mar 13, 2024
@sastels
Copy link

sastels commented Mar 18, 2024

Jimmy will QA for sure this time!

@jimleroyer
Copy link
Member Author

Tested and working with the same template that IRCC used and with a 50K entries spreadsheet. The upload took a few seconds (~5s) but there were no timing out. The period of time after clicking Send all was almost instantaneous. This can be considered fixed and working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants