CC-2305: create script for auditing S3 files and their permissions #180

jonholdsworth · 2024-09-02T02:17:07Z

Adds scripts to resolve discrepancies between NZSL Signbank's Postgres database and Amazon S3 file storage.

Jira ticket

CC-2305 NZSL Signbank: UAT database backport

Changes

This documentation is no longer canonical. It has been 'capped off' and will be copied to docs in this repo for further expansion.

This PR creates 3 scripts.
These are used to report on relationships between Signbank's Postgres database and Amazon's S3 file storage, and then assist with effecting some types of repair where discrepancies exist.
The scripts perform only some of the repairs. Other operations have to be scripted manually, using the AWS CLI or other means, with the output from these scripts as data.

The scripts use the Boto3 Python library to talk to AWS S3.
They use an external client to talk to Postgres.
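A minimal sketch of how the two data sources might be read, assuming boto3 is installed and DATABASE_URL is set. The function names and the SQL passed in are hypothetical, and the scripts in this PR may structure this differently.

```python
import os
import subprocess


def build_psql_command(pgcli, database_url, sql):
    """Build the argv for one query, printing unaligned rows with no header or footer."""
    # -t: tuples only, -A: unaligned output, -F: field separator, -c: run one command.
    # The connection URI is passed last, in the dbname position.
    return [pgcli, "-t", "-A", "-F", ",", "-c", sql, database_url]


def run_query(sql, pgcli="/usr/bin/psql"):
    """Run a query through the external psql client and return its output lines."""
    cmd = build_psql_command(pgcli, os.environ["DATABASE_URL"], sql)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def list_bucket_keys(bucket):
    """Yield (key, last_modified) for every object in the bucket."""
    import boto3  # imported lazily so the helpers above work without boto3

    s3 = boto3.client("s3")  # credentials are resolved from AWS_PROFILE
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["LastModified"]
```

Paginating matters here because a single list_objects_v2 call returns at most 1000 keys.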

They output diagnostic and progress information on STDERR.
All data output is on STDOUT and may be safely redirected.
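The split between the two streams can be sketched as follows. This is an illustration only: the `report` function and its parameters are hypothetical, and in-memory streams stand in for STDOUT and STDERR so the separation is easy to see.

```python
import csv
import io
import sys


def report(rows, header, out=None, err=None):
    """Write CSV data to `out` and progress messages to `err`."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    writer = csv.writer(out)
    writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    # Diagnostics never touch the data stream, so redirecting STDOUT stays safe.
    print(f"Wrote {count} rows", file=err)
    return count


# Demonstration with in-memory streams standing in for STDOUT and STDERR.
data, diagnostics = io.StringIO(), io.StringIO()
report([["Review", ""]], ["Action", "S3 Video key"], out=data, err=diagnostics)
```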

The scripts require, usually in the environment:

An AWS profile - eg. AWS_PROFILE environment variable set to a pre-configured profile.
A Postgres context - eg. DATABASE_URL environment variable with target and credentials.

The scripts have common arguments:

--help or -h - emit a Help message showing the available arguments.
--env - specifies the target environment, eg. dev, uat, production. This is used to construct the name of the AWS S3 bucket, eg. nzsl-signbank-media-uat. The default is uat.
--pgcli - allows the user to specify a different path for the Postgres command-line client. The default is /usr/bin/psql.
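As a rough sketch, the common arguments could be declared with argparse like this. The helper names are hypothetical; the defaults and the bucket naming pattern are taken from the description above.

```python
import argparse


def build_parser():
    """Declare the arguments shared by the scripts (argparse adds --help/-h itself)."""
    parser = argparse.ArgumentParser(description="Audit Signbank videos against S3")
    parser.add_argument(
        "--env", default="uat", help="target environment, eg. dev, uat, production"
    )
    parser.add_argument(
        "--pgcli", default="/usr/bin/psql", help="path to the Postgres command-line client"
    )
    return parser


def bucket_name(env):
    """Construct the S3 bucket name from the target environment."""
    return f"nzsl-signbank-media-{env}"


args = build_parser().parse_args(["--env", "dev"])
bucket = bucket_name(args.env)
```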

get-video-s3-acls.py

This script has extra arguments:

--dumpnzsl - just get the NZSL Signbank database contents, output it, then exit. This is mainly for debugging.
--dumps3 - just get the AWS S3 contents, output it, then exit. This is mainly for debugging.

This script produces a full report on NZSL vs S3.
It outputs as CSV, with headers.
The columns are as follows:

Action
S3 Video key
S3 LastModified
S3 Expected Canned ACL
S3 Actual Canned ACL
Sbank Gloss ID
Sbank Video ID
Sbank Gloss public
Sbank Video public
Sbank Gloss
Sbank Gloss created at

Action is a fix suggested by the script.

Action is one of:

"Delete S3 Object" - the S3 object is "orphaned", that is, it has no corresponding NZSL Signbank database record. Some of these are fixable, see the find-fixable-s3-orphans.py script. But any that are not should be deleted as they are taking up space without being visible to the NZSL Signbank application.
"Update ACL" - make sure that the S3 object's ACL matches the expected value in the column next to it, and fix it if not. This uses AWS "Canned ACLs", which in our case means the two values private and public-read.
"Review" - usually means there is a Signbank NZSL database entry with no corresponding S3 object. These are out of scope of these scripts, and are expected to be fixed by other means (eg. functionality within the NZSL Signbank app).
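The three actions can be sketched as a small decision function. This is an illustration under assumptions, not the script's actual logic: the function names are hypothetical, and which of the two "public" columns drives the expected ACL is not stated here, so a single flag is used.

```python
# Grantee URI that AWS uses for the "everyone" group in object ACL grants.
ALL_USERS_URI = "http://acs.amazonaws.com/groups/global/AllUsers"


def canned_acl_from_grants(grants):
    """Map the Grants list returned by get_object_acl() to 'public-read' or 'private'."""
    for grant in grants:
        grantee = grant.get("Grantee", {})
        if grantee.get("URI") == ALL_USERS_URI and grant.get("Permission") == "READ":
            return "public-read"
    return "private"


def expected_acl(is_public):
    """Assumption: a public record should be public-read, anything else private."""
    return "public-read" if is_public else "private"


def suggest_action(s3_key, video_id):
    """Pick an action from whether each side of the comparison has a record."""
    if s3_key and video_id is None:
        return "Delete S3 Object"  # orphaned S3 object
    if s3_key:
        return "Update ACL"  # both sides exist, so check the ACL
    return "Review"  # database entry with no S3 object
```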

Example usage:

This example accesses a local Postgres port, an AWS account specified by the AWS profile 'nzsl', and an AWS S3 bucket called 'nzsl-signbank-media-dev', and writes the resulting CSV to a text file 'dev.csv'.

export DATABASE_URL="postgres://postgres:postgres@localhost:5432/postgres"
export AWS_PROFILE=nzsl

get-video-s3-acls.py --env dev > dev.csv
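The resulting CSV can then feed manual follow-up work. The sketch below assumes the header row uses the column names listed earlier verbatim ("Action", "S3 Video key"). The function name and the sample rows are made up for illustration.

```python
import csv
import io


def keys_for_action(csv_file, action):
    """Return the S3 keys of all report rows whose Action column matches."""
    reader = csv.DictReader(csv_file)
    return [row["S3 Video key"] for row in reader if row["Action"] == action]


# Hypothetical two-column sample; a real report has all eleven columns.
sample = io.StringIO(
    "Action,S3 Video key\n"
    "Delete S3 Object,example/orphan.mp4\n"
    "Update ACL,example/kept.mp4\n"
)
orphans = keys_for_action(sample, "Delete S3 Object")
```

With a real report, the same function could be called on `open("dev.csv", newline="")`.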

find-fixable-s3-orphans.py

This script accesses the database and S3 in a similar way to get-video-s3-acls.py.
(Dev note: It duplicates a lot of code from that script, which should be moved into a shared library at some point.)

It finds S3 objects that have no corresponding NZSL Signbank database record. These are 'orphaned' S3 objects.
It then parses the name string of the object and attempts to find an NZSL Signbank record that matches it. This is not guaranteed to be correct, so the output needs human review.
It outputs what it finds as CSV with header, in a format that can be digested by the 3rd script repair-fixable-s3-orphans.py.
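A sketch of the matching idea, under a loud assumption: the real key layout is not described here, so this example invents one in which the object name starts with a numeric gloss ID followed by a hyphen. The pattern and function names are hypothetical.

```python
import re

# Hypothetical key layout: "<gloss_id>-<anything>", optionally under a prefix.
KEY_PATTERN = re.compile(r"^(?P<gloss_id>\d+)-")


def guess_gloss_id(s3_key):
    """Extract a candidate gloss ID from the object name, or None if it does not fit."""
    filename = s3_key.rsplit("/", 1)[-1]
    match = KEY_PATTERN.match(filename)
    return int(match.group("gloss_id")) if match else None


def find_fixable(orphan_keys, known_gloss_ids):
    """Pair each orphaned key with a gloss ID that exists in the database."""
    fixable = []
    for key in orphan_keys:
        gloss_id = guess_gloss_id(key)
        if gloss_id is not None and gloss_id in known_gloss_ids:
            fixable.append((gloss_id, key))
    return fixable
```

A match on the ID alone is only a guess, which is why the output still needs human review.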

repair-fixable-s3-orphans.py

This attempts to unify NZSL Signbank records with S3 orphans, by digesting a CSV input of the same format as output by find-fixable-s3-orphans.py. It does this by generating GlossVideo Django objects where necessary, and associating them with the correct Gloss Django objects. This operation changes the database contents and so must be used with caution.
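The commit history of this PR notes that dry-run mode was made the default, with a --commit flag to apply changes. A minimal sketch of that guard follows. The `repair` function and its `create_video` callable are hypothetical stand-ins for the real database write.

```python
import argparse
import sys


def build_parser():
    """Changes are only applied when --commit is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--commit", action="store_true", help="apply changes (default is a dry run)"
    )
    return parser


def repair(rows, commit, create_video):
    """Call create_video(row) for each row, but only when commit is True."""
    created = 0
    for row in rows:
        if commit:
            create_video(row)  # hypothetical: performs the database write
            created += 1
        else:
            print(f"Dry run: would create a video for {row!r}", file=sys.stderr)
    return created
```

Defaulting to a dry run means a mistyped command cannot modify the database by accident.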

G-Rath

This is looking pretty good - given how much things have changed, could you redo your PR description to reflect the final changeset? Notably, I want to see an overview of what each of these scripts does and examples of their usage, especially as I believe the idea is they're now used together.

I can provide some pull requests from closed-source codebases if you'd like some examples of what I'm looking for

(ideally we'd get this documented in, say, the readme, but right now I think it's most important we get it documented somewhere, and having it in the PR description is a good first step as we can copy from that into other places later)

a-musing-moose

I've added a few more suggestions and ideas. I don't want to become a blocker myself now that Gareth is able to approve/request-changes as needed. So I have only added them as comments.

I also have a couple of more general overall suggestions that probably are not worth the effort now - but I think you should keep in mind next time.

A couple of the scripts deal with lists containing heterogeneous data, i.e. [True, False, a bunch of other values]. Each of the separate positions has a specific purpose. You have added comments when setting them as to their purpose. However, as soon as you pass that list to another function the context is no longer available to the reader. This makes it a little harder to follow what the code is doing where. I'd consider using dataclasses or at a minimum dictionaries. The use of these types instead of straight lists means that the purpose of each value is carried with it. This will of course have a small overhead for creating the objects, but unless you are dealing with a very large dataset or performance is critical, the added readability will be a net win.
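To illustrate the suggestion, here is a small sketch of a dataclass replacing a positional list. The class and field names are hypothetical and not taken from the scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoRecord:
    """One database row, with each value named instead of positional."""

    gloss_id: int
    gloss_public: bool
    video_public: bool
    video_key: Optional[str] = None


def expected_acl(record):
    """The field name documents itself, even several calls away from construction."""
    return "public-read" if record.video_public else "private"


# Compare with the positional form, where record[2] means nothing to the reader.
record = VideoRecord(gloss_id=1, gloss_public=True, video_public=False)
acl = expected_acl(record)
```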

The other piece of advice is - if you are working with a Django project, every little script to fix up data, no matter how short-lived, should probably start as a Django management command. It provides access to configuration, the database and models without having to jump through hoops. It also allows you to re-use business logic that already exists within the application and easily move shared functionality out of the management commands into well organised modules within the Django app, e.g. the dict building code you have in a couple of the scripts.

This all said - these are effectively one-shot scripts to fix up a specific issue. So I also understand that there is little value in applying too much spit and polish :-)

jonholdsworth · 2024-12-11T06:06:03Z

This all said - these are effectively one-shot scripts to fix up a specific issue. So I also understand that there is little value in applying too much spit and polish :-)

Agree with all your points above, Jon M.
And no it's probably not worth making dramatic changes. If these scripts prove very useful and get tricky to maintain then it will be informative to revisit this PR and its comments.

I will make these scripts into Management commands with some instructions, as that will make using them much cleaner.

jonholdsworth · 2024-12-13T06:16:34Z

I have made all 3 scripts into Django Management Commands.
They have improved help text as well.

jonholdsworth added 30 commits August 19, 2024 14:48

Initial code inclusion, conversion from bash scripts

e549e57

S3 retrieval working, NZSL Signbank retrieval working

c07c66a

pg:psql header and footer removed

f2191b8

Sorting newlines

b5ed4b9

Differencing working

dfcba14

Comparing is_public with ACL return

4427cdf

First pass at native boto s3 client use (messy)

c4ecdcb

Revert "First pass at native boto s3 client use (messy)"

00e2d65

Turns out setting up to use either path based or native is very un-simple. Try if necessary another time. This reverts commit c4ecdcb.

Rudimentary command line parsing

a487f59

Rename

b58f6bf

Better command line arguments

58d2a9a

Command line arguments and external apps codified

da626ae

Comments

3fdb0b2

black

026c2fa

Better arguments

cc6acb2

Better args

f9019fd

Better arg help and ordering

32cf39d

Better cached handling

49ea762

Debug removed

e692cb9

Cut n pasted text fixed

ed40d18

Incremental improvements

29e0fda

Minor feedback text fix

516f6a7

set_public() functions renamed for clarity

dbb1a43

Basics of final output collection working

1c294a2

Basics of final output collection working

eab29ba

About to remove legacy files output

54595ff

Legacy output files removed

c8c5162

cleanups

d2ebec3

s3_bucket_raw_keys_list

ce51fa4

Output changed to JSON -> py dict for processing

fedf7e9

Unused imports removed

76e81b5

jonholdsworth requested review from G-Rath and nzlaura November 28, 2024 04:01

a-musing-moose reviewed Nov 28, 2024

G-Rath requested changes Nov 28, 2024

a-musing-moose reviewed Nov 28, 2024

jonholdsworth added 7 commits December 11, 2024 15:44

Initial review commits/black

1d2a86a

Script renamings

749bb20

OSV ignore GHSA-rrqc-c2jx-6jgv to suppress build warnings (We have a Django upgrade in progress anyway that will address this vuln)

48a3207

Do not orphan-test fake keys

760dd8e

Use csv.writer() for get_ script

781bedd

Other scripts now using csv.writerow() also

04c1cc9

Dry run mode made default, flag changed to --commit

87db6a2

jonholdsworth added 12 commits December 13, 2024 13:51

moved get script

556e709

rename get script for consistency

da0befb

changed permissions on get script for consistency

20f8bf4

get_video_s3_acls -> Management Command

695b398

Comments

4d79d32

Comments

4caa11a

Comments

75e82cf

Moved remaining commands

967daaf

Renamed

2ae11cd

black

cbe56c8

find_fixable_s3_orphans.py -> Management Command

1fb9978

black and cleanups

4f1934a

jonholdsworth requested review from G-Rath and a-musing-moose December 17, 2024 01:49

jonholdsworth mentioned this pull request Dec 17, 2024

CC-2390: Production to UAT database backport: Documentation for S3 scripts #184

Open
