
Jap/simple forms/s3 spike #18338

Closed

wants to merge 38 commits into from

Conversation

@pennja (Contributor) commented Sep 5, 2024

Note: Delete the description statements and complete each step. None are optional, but you may justify why a step cannot be completed as written. Provide known gaps in testing that may raise the risk of merging to production.

Summary

  • SubmissionArchiver
    • Handles archiving a single submission's PDFs and all related data/attachments/remediation docs within S3
  • SubmissionArchiveBuilder
    • Builds the archive bundle, complete with all remediation documents
    • If the submitted form/attachments are no longer present locally, they are rebuilt from saved form submission data
  • SubmissionArchiveHandler
    • Takes in a collection of benefits_intake_uuids and calls SubmissionArchiver on each, wrapping each call in logging and error handling (see the sketch after this list)
  • SubmissionArchiveHandlerJob
    • An asynchronous Sidekiq job which calls the handler. (Our primary process doesn't call for this to happen; it exists mainly for remediation.)
  • ArchiveUploader
    • A SharePoint client which takes a file path and uploads the file found there to SharePoint (largely unfinished until Tricia/Matthew have their OFO meeting)
  • SharePoint::Client
    • Contains all the base logic to connect to and interact with SharePoint (hopefully)
  • ArchiveUploaderJob
    • An asynchronous job which handles downloading the submission folder from S3, zipping it up, and calling the ArchiveUploader to drop it into SharePoint
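
As a rough illustration of how the pieces above fit together, here is a minimal sketch of the handler/archiver relationship. The class names follow the summary; the constructor and method signatures are assumptions for illustration, not the actual implementation:

module SimpleFormsApi
  module S3
    class SubmissionArchiveHandler
      def initialize(benefits_intake_uuids:)
        @benefits_intake_uuids = benefits_intake_uuids
      end

      # Archive each submission, wrapping every call in logging and error handling
      def run
        @benefits_intake_uuids.each do |uuid|
          SubmissionArchiver.new(benefits_intake_uuid: uuid).upload
          Rails.logger.info("SubmissionArchiveHandler: archived #{uuid}")
        rescue => e
          Rails.logger.error("SubmissionArchiveHandler: failed for #{uuid}: #{e.message}")
        end
      end
    end
  end
end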

Related issue(s)

  • Link to ticket created in va.gov-team repo OR screenshot of Jira ticket if your team uses Jira
  • Link to previous change of the code/bug (if applicable)
  • Link to epic if not included in ticket

Testing done

  • New code is covered by unit tests
  • Describe what the old behavior was prior to the change
  • Describe the steps required to verify your changes are working as expected. Exclusively stating 'Specs run' is NOT acceptable as appropriate testing
  • If this work is behind a flipper:
    • Tests need to be written for both the flipper on and flipper off scenarios. Docs.
    • What is the testing plan for rolling out the feature?

Screenshots

Note: Optional

What areas of the site does it impact?

(Describe what parts of the site are impacted and if code touched other areas)

Acceptance criteria

  • I fixed|updated|added unit tests and integration tests for each feature (if applicable).
  • No errors or warnings in the console.
  • Events are being sent to the appropriate logging solution
  • Documentation has been updated (link to documentation)
  • No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • Feature/bug has a monitor built into Datadog (if applicable)
  • If the impacted app requires authentication, did you log in to a local build and verify all authenticated routes work as expected?
  • I added a screenshot of the developed feature

Requested Feedback

(OPTIONAL) What should the reviewers know in addition to the above? Is there anything specific you wish the reviewer to assist with? Do you have any concerns with this PR, and why?

github-actions bot commented Sep 5, 2024

1 Error
🚫 This PR changes 632 LoC (not counting whitespace/newlines).

In order to ensure each PR receives the proper attention it deserves, those exceeding 500 will not be reviewed, nor will they be allowed to merge. Please break this PR up into smaller ones.

If you have reason to believe that this PR should be granted an exception, please see the Submitting pull requests for approval - FAQ.

File Summary

Files

  • modules/simple_forms_api/app/controllers/simple_forms_api/v1/uploads_controller.rb (+8/-3)

  • modules/simple_forms_api/app/services/simple_forms_api/s3/jobs/archive_uploader_job.rb (+35/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/s3/jobs/submission_archive_handler_job.rb (+27/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/s3/submission_archive_builder.rb (+125/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/s3/submission_archive_handler.rb (+45/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/s3/submission_archiver.rb (+92/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/s3/utils.rb (+29/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/share_point/archive_uploader.rb (+60/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/share_point/client.rb (+87/-0)

  • modules/simple_forms_api/app/services/simple_forms_api/share_point/jobs/build_archive_and_upload_job.rb (+35/-0)

  • modules/simple_forms_api/spec/services/s3/submission_archive_builder_spec.rb (+35/-0)

  • modules/simple_forms_api/spec/services/s3/submission_archiver_spec.rb (+51/-0)

    Note: We exclude files matching the following when considering PR size:

    *.csv, *.json, *.tsv, *.txt, *.md, Gemfile.lock, app/swagger, modules/mobile/docs, spec/fixtures/, spec/support/vcr_cassettes/, modules/mobile/spec/support/vcr_cassettes/, db/seeds, modules/vaos/app/docs, modules/meb_api/app/docs, modules/appeals_api/app/swagger/, *.bru, *.pdf
    

Big PRs are difficult to review, often become stale, and cause delays.

Generated by 🚫 Danger

@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 5, 2024 21:56
@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 5, 2024 23:39
@pennja force-pushed the jap/simple-forms/s3-spike branch from ef6ec33 to fa3ab67 September 6, 2024 15:00
@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 6, 2024 15:04
@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 6, 2024 15:24
@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 6, 2024 15:32
@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 6, 2024 16:33
@pennja force-pushed the jap/simple-forms/s3-spike branch from 983445a to 9543280 September 6, 2024 21:17
@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 6, 2024 21:26
@va-vfs-bot temporarily deployed to jap/simple-forms/s3-spike/main/main September 6, 2024 21:37
end

def sign_s3_file_url(pdf)
signed_url = pdf.presigned_url(:get, expires_in: 1.year.to_i)
@kylesoskin (Contributor) commented Sep 9, 2024:

Using pre-signed URLs for long durations is, from what I think I have heard, not recommended for security reasons.

In fact, I think 7 days is the longest S3 supports, with a recommended length of 1 minute to 12 hours.

Curious what you are doing with the pre-signed URLs, to see if there is something else that could be done instead.

Usually, I think, for S3 objects we would store the path to the file(s) in the DB and assume that whatever needs to use it is itself authenticated to pull it down.

@ndsprinkle (Contributor) commented:

Yeah I think the plan is to generate a signed URL that lasts for like 30 minutes so that the user can download their submitted PDF on the confirmation page. Then we can regenerate signed URLs in other locations as necessary. @kylesoskin Do you know what we could store in our database that lets us reference and generate signed URLs for the item in the S3 bucket later? Some sort of S3 ID? Or maybe a path?

@pennja (Contributor, author) commented:

Ahhh, yeah that all makes sense guys. I think I was misunderstanding how they work. I can do some more research and change the supporting logic accordingly.

@pennja (Contributor, author) commented:

@ndsprinkle I'll bet all we'd need is the path to the file in S3, I'll look into that.

A contributor replied, quoting the above:

@ndsprinkle I'll bet all we'd need is the path to the file in S3, I'll look into that.

Yea, same way you're plopping stuff in there: using that same S3 obj you should be able to pull stuff down. There are examples in vets-api somewhere if you grep for the S3 class.
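
For reference, pulling a file back down with the same resource used for the upload is a one-liner in the aws-sdk-s3 gem (bucket_name, s3_key, and local_path here are illustrative names, not values from this PR):

s3_resource.bucket(bucket_name).object(s3_key).get(response_target: local_path)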

@kylesoskin (Contributor) commented Sep 9, 2024:

But... you gotta make sure that is super locked down if it is passing back to the user. If that is what you are doing, providing the doc back to an end user, I almost like the signed URL better... That is scoped only to that one document. Allowing a user to use our backend to make calls to pull arbitrary files down from S3 makes me nervous that it could be exploited, i.e. user 1 changes some parameters and pulls down someone else's medical records. Since our backend can pull anything, a presigned, short-lived, document-scoped URL MAY actually be best here... something to discuss for sure.

@kylesoskin (Contributor) commented Sep 9, 2024:

Maybe some combo of both, us storing the S3 details but then generating a presigned URL whenever the user wants to download it, gets the best of both worlds? But if this is a one-time-only thing, then what you have here seems good. If it were something that the user could come back to at any time, and you want to be able to provide it, then storing the S3 path and generating a short-lived URL each time they want it might be better. That way you aren't storing things that are going to expire, you can still use short-lived URLs, and we can auth the user and confirm ownership of the file before generating that presigned URL. Idk.

edit: nvm, I see this is what @ndsprinkle said in his initial comment
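
A minimal sketch of that combo approach: persist only the S3 key, then mint a short-lived presigned URL on each authorized request. The settings namespace and bucket key are assumptions for illustration (the real namespace depends on which secrets the team lands on):

# Assumes the aws-sdk-s3 gem; Settings.form526_export.aws.* keys are hypothetical
def presigned_download_url(s3_key)
  resource = Aws::S3::Resource.new(region: Settings.form526_export.aws.region)
  object = resource.bucket(Settings.form526_export.aws.bucket).object(s3_key)
  # Short-lived, document-scoped URL; well under the 7-day presigned cap
  object.presigned_url(:get, expires_in: 30.minutes.to_i)
end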

end

def write_as_text_archive
save_file_to_s3("#{output_directory_path}/form_text_archive.txt", form_text_archive.to_json)
A contributor commented:

Previously, I think, when we did this we would pull everything down, zip it, and then upload only 1 zip file, with the name of the file being the va.gov internal submission ID, so that you had 1 thing to pass along and knew 1 file essentially represented 1 single claim. So less work to pass/move everything around, and also less chance of passing around only partial claims. It also eliminated partial uploads: if the zip was there, it worked in its entirety; if the zip wasn't there, something didn't work. With this, it is possible for some portions of a claim to upload and someone to think it is complete just looking from the S3 side, when in fact components of it may have failed. This is probably for @SamStuckey, as I think he changed this from zip to not-zip.

@pennja (Contributor, author) commented:

Ahh, good callout. I'll look at the original implementation again and determine the best approach to zipping up the files.

A contributor commented:

Well, @SamStuckey moved away from that. There might have been a good reason, I'm just not sure.

A contributor commented:

Do we want to zip anything? I thought we'd want a .pdf file so we can supply it directly to the end user. Or are we expecting ourselves (or the user) to unzip it upon download?

@pennja (Contributor, author) commented:

Hmm, it might depend on whether there are attachments. If there are attachments, then zip; otherwise maybe just provide the PDF?
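
If the zip-then-upload approach from the earlier implementation wins out, the zipping step might look roughly like this sketch using the rubyzip gem (directory layout, method name, and naming convention are assumptions, not the PR's actual code):

require 'zip'

# Bundle everything under dir into a single <submission_id>.zip so the archive
# is all-or-nothing: the zip's presence means the whole claim made it to S3.
def zip_directory(dir, submission_id)
  zip_path = File.join(dir, "#{submission_id}.zip")
  entries = Dir.glob(File.join(dir, '**', '*')).reject { |f| File.directory?(f) || f == zip_path }
  Zip::File.open(zip_path, Zip::File::CREATE) do |zipfile|
    # Store each file under its path relative to dir
    entries.each { |file| zipfile.add(file.delete_prefix("#{dir}/"), file) }
  end
  zip_path
end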

Comment on lines 60 to 62
quiet_pdf_failures: true, # skip PDF generation silently
quiet_upload_failures: true, # skip problematic uploads silently
run_quiet: true, # silence but record errors, logged at the end
A contributor commented:

With all the stuff over the last year with the code yellow and silent failures, I think VA people would blow a gasket if they saw these lol, especially defaulted to true.

@pennja (Contributor, author) commented:

Do you think it'd be better to just get rid of the option to silence failures, or just change the defaults? I could also add a "dry_run" option if that seems more appropriate.

A contributor commented:

I'm not a huge fan of hard-coding stuff to support testing into places where it could potentially get used in prod, but at the very least I would default these to false, so that anyone who uses this without knowing about them is not silently failing by default.

If I needed to do things to test, I would mock out the calls in the rspec, or just let it do the thing. There are mocked S3 responses in the mockdata repo that allow the code to be written as if it were actually uploading to S3; the testing framework/mock-data repo just intercepts the request and knows to return something that looks like an S3 response.

For example:

VCR.use_cassette("s3/object/put/#{form_attachment_guid}/doctors-note.jpg", vcr_options) do
  # exercise the upload here; the cassette intercepts the HTTP call to S3
end

end

def s3_resource
@s3_resource ||= Reports::Uploader.new_s3_resource
A contributor commented:

@kylesoskin @pennja It looks like we're using Reports::Uploader here (which relies on Settings.reports.aws) rather than something that relies on Settings.form526_backup.aws or Settings.form526_export.aws. Do we think it matters which secrets we use? I guess we're using Reports::Uploader because that is what form526.rake uses? Is that right?
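
If the team did want to point this at a different settings namespace instead of Reports::Uploader, a minimal sketch might look like the following (the exact keys under Settings.form526_export.aws are assumptions inferred from the namespaces named above, not confirmed config):

def s3_resource
  # Build the resource directly from a specific settings namespace (hypothetical keys)
  @s3_resource ||= Aws::S3::Resource.new(
    region: Settings.form526_export.aws.region,
    access_key_id: Settings.form526_export.aws.access_key_id,
    secret_access_key: Settings.form526_export.aws.secret_access_key
  )
end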
