-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CC-2305: create script for auditing S3 files and their permissions #180
base: master
Are you sure you want to change the base?
CC-2305: create script for auditing S3 files and their permissions #180
Conversation
Turns out setting up to use either path based or native is very un-simple. Try if necessary another time. This reverts commit c4ecdcb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good - given how much things have changed, could you redo your PR description to reflect the final changeset? notably, I want to see an overview of both what each of these scripts do and examples of their usage especially as I believe the idea is they're now used together.
I can provide some pull requests from close-sourced codebases if you'd like some examples of what I'm looking for
(ideally we'd get this documented in say the readme, but right now I think its most important we get it documented somewhere and having it in the PR description is a good first step as we can copy from that into other places later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a few more suggestions and ideas. I don't want to become a blocker myself now that Gareth is able to approve/request-changes as needed. So I have only added them as comments.
I also have a couple of more general overall suggestions that probably are not worth the effort now - but I think you should keep in mind next time.
A couple of the script deal with lists containing heterogenous data. i.e. [True, False, a bunch of other values]. Each of the separate positions has a specific purpose. You have added comments when setting them as to their purpose. However, as soon as you pass that list to another function the context is no longer available to the reader. This makes it a little harder to follow what the code is doing where. I'd consider using dataclasses
or at a minimum dictionaries. The use of these types instead of straight lists means that the purpose of each value is carried wit it. This will of course have a small overhead for creating the objects but unless you are dealing with a very large dataset or performance is critical - the added readability will be a net win.
The other piece of advice is - if you are working with a Django project, every little script to fix up data - not matter how short lived should probably start as a Django management command. It provides access to configuration, the database and models without having to jump through hoops. It also allows you to re-use business logic that already exists within the application and easily move shared functionality out of the management commands into well organised modules within the Django app. e.g. the dict building code you have in a couple of the scripts.
This all said - these are effectively one-shot scripts to fix up a specific issue. So I also understand that there is little value in applying too much spit and polish :-)
…Django upgrade in progress anyway that will address this vuln)
Agree with all your points above, Jon M. I will make these scripts into Management commands with some instructions, as that will make using them much cleaner. |
I have made all 3 scripts into Django Management Commands. |
Adds scripts to resolve discrepancies between NZSL Signbank's Postgres database and Amazon S3 file storage.
Jira ticket
CC-2305 NZSL Signbank: UAT database backport
Changes
docs
in this repo for further expansion.This PR creates 3 scripts.
These are used to report on relationships between Signbank's Postgres database and Amazon's S3 file storage, and then assist with effecting some types of repair where discrepancies exist.
Only some actions are performed by these scripts, other operations have to be manually scripted using the AWS
cli
or other means, using the output from these scripts as data.The scripts use the
Boto3
python library to talk to AWS S3.They use an external client to talk to Postgres.
They output diagnostic and progress information on STDERR.
All data output is on STDOUT and may be safely redirected.
The scripts require, usually in the environment:
AWS_PROFILE
environment variable set to a pre-configured profile.DATABASE_URL
environment variable with target and credentials.The scripts have common arguments:
--help
or-h
- emit a Help message showing the available arguments.--env
- specifies the target environment, eg.dev
,uat
,production
. This is used to contruct the name of the AWS S3 bucket name, eg.nzsl-signbank-media-uat
. The default isuat
.--pgcli
- allows the user to specify a different path for the Postgres command-line client. The default is/usr/bin/psql
.get-video-s3-acls.py
This script has extra arguments:
--dumpnzsl
- just get the NZSL Signbank database contents, output it, then exit. This is mainly for debugging.--dumps3
- just get the AWS S3 contents, output it, then exit. This is mainly for debugging.This script produces a full report on NZSL vs S3.
It outputs as CSV, with headers.
The columns are as follows:
Action
is a fix suggested by the script.Action
is one of:"Delete S3 Object" - the S3 object is "orphaned", that is, it has no corresponding NZSL Signbank database record. Some of these are fixable, see the
find-fixable-s3-orphans.py
script. But any that are not should be deleted as they are taking up space without being visible to the NZSL Signbank application."Update ACL" - make sure that the S3 object's ACL matches the expected value in the column next to it, and fix it if not. This uses AWS "Canned ACLs", which in our case means the two values
private
andpublic-read
."Review" - usually means there is a Signbank NZSL database entry with no corresponding S3 object. These are out of scope of these scripts, and are expected to be fixed by other means (eg. functionality within the NZSL Signbank app).
Example usage:
find-fixable-s3-orphans.py
This script accesses the database and S3 in a similar way to
get-video-s3-acls.py
.(Dev note: It contains a lot of duplicated code with that script, which should be libratised at some point.)
It finds S3 objects that have no corresponding NZSL Signbank database record. These are 'orphaned' S3 objects.
It then parses the name string of the object and attempts to find an NZSL Signbank record that matches it. This is not guaranteed to be correct, so the output needs human review.
It outputs what it finds as CSV with header, in a format that can be digested by the 3rd script
repair-fixable-s3-orphans.py
.repair-fixable-s3-orphans.py
This attempts to unify NZSL Signbank records with S3 orphans, by digesting a CSV input of the same format as output by
find-fixable-orphans.py
. It does this by generatingGlossVideo
Django objects where necessary, and associating them with the correctGloss
Django objects. This operation changes the database contents and so must be used with caution.