Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CC-2305: create script for auditing S3 files and their permissions #180

Draft
wants to merge 223 commits into
base: master
Choose a base branch
from

Conversation

jonholdsworth
Copy link

@jonholdsworth jonholdsworth commented Sep 2, 2024

Adds scripts to resolve discrepancies between NZSL Signbank's Postgres database and Amazon S3 file storage.

Jira ticket

CC-2305 NZSL Signbank: UAT database backport

Changes

  • This documentation is no longer canonical. It has been 'capped off' and will be copied to docs in this repo for further expansion.

This PR creates 3 scripts.
These are used to report on relationships between Signbank's Postgres database and Amazon's S3 file storage, and then assist with effecting some types of repair where discrepancies exist.
Only some actions are performed by these scripts, other operations have to be manually scripted using the AWS cli or other means, using the output from these scripts as data.

The scripts use the Boto3 python library to talk to AWS S3.
They use an external client to talk to Postgres.

They output diagnostic and progress information on STDERR.
All data output is on STDOUT and may be safely redirected.

The scripts require, usually in the environment:

  • An AWS profile - eg. AWS_PROFILE environment variable set to a pre-configured profile.
  • A Postgres context - eg. DATABASE_URL environment variable with target and credentials.

The scripts have common arguments:

  • --help or -h - emit a Help message showing the available arguments.
  • --env - specifies the target environment, eg. dev, uat, production. This is used to contruct the name of the AWS S3 bucket name, eg. nzsl-signbank-media-uat. The default is uat.
  • --pgcli - allows the user to specify a different path for the Postgres command-line client. The default is /usr/bin/psql.

get-video-s3-acls.py

This script has extra arguments:

  • --dumpnzsl - just get the NZSL Signbank database contents, output it, then exit. This is mainly for debugging.
  • --dumps3 - just get the AWS S3 contents, output it, then exit. This is mainly for debugging.

This script produces a full report on NZSL vs S3.
It outputs as CSV, with headers.
The columns are as follows:

Action
S3 Video key
S3 LastModified
S3 Expected Canned ACL
S3 Actual Canned ACL
Sbank Gloss ID
Sbank Video ID
Sbank Gloss public
Sbank Video public
Sbank Gloss
Sbank Gloss created at

Action is a fix suggested by the script.

Action is one of:

  • "Delete S3 Object" - the S3 object is "orphaned", that is, it has no corresponding NZSL Signbank database record. Some of these are fixable, see the find-fixable-s3-orphans.py script. But any that are not should be deleted as they are taking up space without being visible to the NZSL Signbank application.

  • "Update ACL" - make sure that the S3 object's ACL matches the expected value in the column next to it, and fix it if not. This uses AWS "Canned ACLs", which in our case means the two values private and public-read.

  • "Review" - usually means there is a Signbank NZSL database entry with no corresponding S3 object. These are out of scope of these scripts, and are expected to be fixed by other means (eg. functionality within the NZSL Signbank app).

Example usage:

This example will access a local postgres port, an AWS account specified by the AWS profile 'nzsl',   
and an AWS S3 bucket called 'nzsl-signbank-media-dev' and output the resulting CSV to a text file 'dev.csv'.

export DATABASE_URL="postgres://postgres:postgres@localhost:5432/postgres"
export AWS_PROFILE=nzsl

get-video-s3-acls.py --env dev > dev.csv

find-fixable-s3-orphans.py

This script accesses the database and S3 in a similar way to get-video-s3-acls.py.
(Dev note: It contains a lot of duplicated code with that script, which should be libratised at some point.)

It finds S3 objects that have no corresponding NZSL Signbank database record. These are 'orphaned' S3 objects.
It then parses the name string of the object and attempts to find an NZSL Signbank record that matches it. This is not guaranteed to be correct, so the output needs human review.
It outputs what it finds as CSV with header, in a format that can be digested by the 3rd script repair-fixable-s3-orphans.py.


repair-fixable-s3-orphans.py

This attempts to unify NZSL Signbank records with S3 orphans, by digesting a CSV input of the same format as output by find-fixable-orphans.py. It does this by generating GlossVideo Django objects where necessary, and associating them with the correct Gloss Django objects. This operation changes the database contents and so must be used with caution.

bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
Copy link

@G-Rath G-Rath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good - given how much things have changed, could you redo your PR description to reflect the final changeset? notably, I want to see an overview of both what each of these scripts do and examples of their usage especially as I believe the idea is they're now used together.

I can provide some pull requests from close-sourced codebases if you'd like some examples of what I'm looking for

(ideally we'd get this documented in say the readme, but right now I think its most important we get it documented somewhere and having it in the PR description is a good first step as we can copy from that into other places later)

bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
bin/find-fixable-orphans.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@a-musing-moose a-musing-moose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few more suggestions and ideas. I don't want to become a blocker myself now that Gareth is able to approve/request-changes as needed. So I have only added them as comments.

I also have a couple of more general overall suggestions that probably are not worth the effort now - but I think you should keep in mind next time.

A couple of the script deal with lists containing heterogenous data. i.e. [True, False, a bunch of other values]. Each of the separate positions has a specific purpose. You have added comments when setting them as to their purpose. However, as soon as you pass that list to another function the context is no longer available to the reader. This makes it a little harder to follow what the code is doing where. I'd consider using dataclasses or at a minimum dictionaries. The use of these types instead of straight lists means that the purpose of each value is carried wit it. This will of course have a small overhead for creating the objects but unless you are dealing with a very large dataset or performance is critical - the added readability will be a net win.

The other piece of advice is - if you are working with a Django project, every little script to fix up data - not matter how short lived should probably start as a Django management command. It provides access to configuration, the database and models without having to jump through hoops. It also allows you to re-use business logic that already exists within the application and easily move shared functionality out of the management commands into well organised modules within the Django app. e.g. the dict building code you have in a couple of the scripts.

This all said - these are effectively one-shot scripts to fix up a specific issue. So I also understand that there is little value in applying too much spit and polish :-)

bin/get-video-s3-acls.py Outdated Show resolved Hide resolved
bin/get-video-s3-acls.py Outdated Show resolved Hide resolved
bin/get-video-s3-acls.py Outdated Show resolved Hide resolved
bin/get-video-s3-acls.py Outdated Show resolved Hide resolved
bin/repair-fixable-orphans.py Outdated Show resolved Hide resolved
@jonholdsworth
Copy link
Author

This all said - these are effectively one-shot scripts to fix up a specific issue. So I also understand that there is little value in applying too much spit and polish :-)

Agree with all your points above, Jon M.
And no it's probably not worth making dramatic changes. If these scripts prove very useful and get tricky to maintain then it will be informative to revisit this PR and its comments.

I will make these scripts into Management commands with some instructions, as that will make using them much cleaner.

@jonholdsworth
Copy link
Author

I have made all 3 scripts into Django Management Commands.
They have improved help text as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants