CC-2305: create script for auditing S3 files and their permissions #180

Draft: wants to merge 223 commits into base: master
Showing changes from 204 commits
e549e57
Initial code inclusion, conversion from bash scripts
jonholdsworth Aug 19, 2024
c07c66a
S3 retrieval working, NZSL Signbank retrieval working
jonholdsworth Aug 19, 2024
f2191b8
pg:psql header and footer removed
jonholdsworth Aug 19, 2024
b5ed4b9
Sorting newlines
jonholdsworth Aug 19, 2024
dfcba14
Differencing working
jonholdsworth Aug 19, 2024
4427cdf
Comparing is_public with ACL return
jonholdsworth Aug 26, 2024
c4ecdcb
First pass at native boto s3 client use (messy)
jonholdsworth Aug 26, 2024
00e2d65
Revert "First pass at native boto s3 client use (messy)"
jonholdsworth Aug 26, 2024
a487f59
Rudimentary command line parsing
jonholdsworth Aug 28, 2024
b58f6bf
Rename
jonholdsworth Aug 28, 2024
58d2a9a
Better command line arguments
jonholdsworth Aug 28, 2024
da626ae
Command line arguments and external apps codified
jonholdsworth Aug 28, 2024
3fdb0b2
Comments
jonholdsworth Aug 28, 2024
026c2fa
black
jonholdsworth Aug 28, 2024
cc6acb2
Better arguments
jonholdsworth Aug 28, 2024
f9019fd
Better args
jonholdsworth Aug 28, 2024
32cf39d
Better arg help and ordering
jonholdsworth Aug 28, 2024
49ea762
Better cached handling
jonholdsworth Aug 28, 2024
e692cb9
Debug removed
jonholdsworth Aug 28, 2024
ed40d18
Cut n pasted text fixed
jonholdsworth Aug 28, 2024
29e0fda
Incremental improvements
jonholdsworth Aug 28, 2024
516f6a7
Minor feedback text fix
jonholdsworth Aug 29, 2024
dbb1a43
set_public() functions renamed for clarity
jonholdsworth Aug 29, 2024
1c294a2
Basics of final output collection working
jonholdsworth Aug 30, 2024
eab29ba
Basics of final output collection working
jonholdsworth Aug 30, 2024
54595ff
About to remove legacy files output
jonholdsworth Aug 30, 2024
c8c5162
Legacy output files removed
jonholdsworth Aug 30, 2024
d2ebec3
cleanups
jonholdsworth Aug 30, 2024
ce51fa4
s3_bucket_raw_keys_list
jonholdsworth Aug 30, 2024
fedf7e9
Output changed to JSON -> py dict for processing
jonholdsworth Aug 30, 2024
274ff5a
All fields represented and ACL logic working
jonholdsworth Sep 1, 2024
8fc2e41
black
jonholdsworth Sep 1, 2024
c143b2f
black
jonholdsworth Sep 1, 2024
1c267ff
remove pprint
jonholdsworth Sep 2, 2024
1937bf3
Header writes to stderr
jonholdsworth Sep 2, 2024
79e1361
black
jonholdsworth Sep 2, 2024
08afe68
Long line that black missed
jonholdsworth Sep 2, 2024
4b64642
AWS_PROFILE purely environment var
jonholdsworth Sep 2, 2024
e907e2e
Revert "AWS_PROFILE purely environment var"
jonholdsworth Sep 2, 2024
1f8f70b
Revert "Revert "AWS_PROFILE purely environment var""
jonholdsworth Sep 2, 2024
5100287
DATABASE_URL purely environment var. Missing stderrs added.
jonholdsworth Sep 2, 2024
114d5c4
Production/UAT mode changed to string
jonholdsworth Sep 2, 2024
dd94fb2
Better column names
jonholdsworth Sep 2, 2024
c477058
black
jonholdsworth Sep 2, 2024
e4ca16b
Output raw ACL data as well
jonholdsworth Sep 2, 2024
9e24507
Comment showing where canned ACLs are set in main app
jonholdsworth Sep 2, 2024
19909a1
Merge branch 'master' into CCSD-2305-nzsl-production-uat-database-bac…
jonholdsworth Sep 3, 2024
15af142
Raw ACL data and header removed again
jonholdsworth Sep 3, 2024
e85309a
Extraneous columns removed
jonholdsworth Sep 3, 2024
9eed309
NSZL_APP removed, no longer needed
jonholdsworth Sep 3, 2024
0e48d3a
AWS_PROFILE requirement removed
jonholdsworth Sep 3, 2024
4aa6add
Removed question mark
jonholdsworth Sep 3, 2024
d7af505
Header refactored
jonholdsworth Sep 3, 2024
32b2042
AWS_PROFILE printing conditional
jonholdsworth Sep 3, 2024
4bc617c
Tidy up
jonholdsworth Sep 3, 2024
ea689f4
duplicated line removed
jonholdsworth Sep 3, 2024
53c4154
Initial organisation into functions, and cleanup
jonholdsworth Sep 3, 2024
bb6a53e
Tidy ups and renaming
jonholdsworth Sep 3, 2024
1b93c81
DATABASE_URL warning message removed
jonholdsworth Sep 3, 2024
ea751e5
whitespace
jonholdsworth Sep 4, 2024
0b3ce8d
Adding OSV ignores just to silence warnings. Remove later.
jonholdsworth Sep 4, 2024
fbe33fe
More OSV ignores
jonholdsworth Sep 4, 2024
7659632
tidy ups
jonholdsworth Sep 4, 2024
cb2b3a0
File names hidden
jonholdsworth Sep 4, 2024
9b502dd
Bunch of things could be made global, starting here
jonholdsworth Sep 4, 2024
a4d978c
More tidying
jonholdsworth Sep 4, 2024
8364354
black
jonholdsworth Sep 4, 2024
15f5443
Simpler and cleaner
jonholdsworth Sep 4, 2024
e4189fe
Output text
jonholdsworth Sep 4, 2024
0d13787
DATABASE_URL output removed as security issue
jonholdsworth Sep 4, 2024
ec62dc6
os.environ used everywhere
jonholdsworth Sep 4, 2024
5c0fb65
Exception test removed
jonholdsworth Sep 4, 2024
1a3a612
PSQL client works smarter using COPY
jonholdsworth Sep 4, 2024
8ff16e3
Output canned ACL even if video_key absent from NZSL Signback postgres
jonholdsworth Sep 5, 2024
0561831
S3 intermediate file removed
jonholdsworth Sep 6, 2024
47494ca
Intermediate files gone. Cache file only. Tidy up.
jonholdsworth Sep 6, 2024
81d7789
Text, whitespace
jonholdsworth Sep 6, 2024
0fe589b
Cache file made global
jonholdsworth Sep 6, 2024
278cb49
Unbuffered python output
jonholdsworth Sep 6, 2024
666c8de
Text change: mode -> env
jonholdsworth Sep 10, 2024
a884e3b
Cache file written in same loop as keys dictionary
jonholdsworth Sep 10, 2024
e0e7a8e
Simplified conditional
jonholdsworth Sep 10, 2024
7249def
CSV construction deconstructed
jonholdsworth Sep 10, 2024
70d65fa
Superfluous variable removed
jonholdsworth Sep 10, 2024
1c558b4
First approximation of bidirectional matching
jonholdsworth Sep 11, 2024
1ad888c
Presence/Absence S3 vs NZSL now bi-directional
jonholdsworth Sep 11, 2024
2c7d375
NZSL Present S3 Absent case now outputs to CSV
jonholdsworth Sep 11, 2024
3438dd9
Added --tmpdir argument
jonholdsworth Sep 11, 2024
f13b703
Minor tidy-ups
jonholdsworth Sep 12, 2024
fb5c8b7
Initial code for new columns
jonholdsworth Sep 17, 2024
abc430f
Debug removed, tmpdir announced
jonholdsworth Sep 17, 2024
70027e3
typo
jonholdsworth Sep 18, 2024
f9727f2
Debug removed
jonholdsworth Sep 18, 2024
336241c
Tidy ups
jonholdsworth Sep 18, 2024
f6ffc18
Video key moved, functions reordered, gloss quoting hardened
jonholdsworth Sep 19, 2024
76966ab
Revert "Video key moved, functions reordered, gloss quoting hardened"
jonholdsworth Sep 19, 2024
afcfa81
Cache file removed
jonholdsworth Sep 19, 2024
a5ed632
Fields broken out for clarity
jonholdsworth Sep 19, 2024
b5bec41
Reordered function declarations
jonholdsworth Sep 19, 2024
7b7f111
Renamed created_at prior to S3 replacement
jonholdsworth Sep 19, 2024
6360bf6
Reformatted
jonholdsworth Sep 19, 2024
f985cd8
Fields rearranged (still gloss created_at)
jonholdsworth Sep 19, 2024
8387571
Comments
jonholdsworth Sep 19, 2024
51caea7
S3 Lastmodified datetime. Reordering. Column names updated.
jonholdsworth Sep 20, 2024
0e92d20
Message updated
jonholdsworth Sep 20, 2024
c26b9aa
Message
jonholdsworth Sep 20, 2024
86e4d34
Rearranged. Intermediate variables removed. Black.
jonholdsworth Sep 20, 2024
0d564a9
Comment
jonholdsworth Sep 20, 2024
45064ba
Missing empty LastModified for case not in S3
jonholdsworth Sep 20, 2024
4cff3da
TMPDIR removed
jonholdsworth Sep 20, 2024
7795405
More efficient dictionary lookup
jonholdsworth Sep 23, 2024
a68b89b
comments
jonholdsworth Sep 26, 2024
6a5514a
Basic 'action' recommendation working
jonholdsworth Oct 3, 2024
27b389d
Pass whole row to build_csv_header()
jonholdsworth Oct 7, 2024
777dac0
tweaks
jonholdsworth Oct 7, 2024
5b2715f
Actions in a function, rearranged
jonholdsworth Oct 7, 2024
57ef780
LastModified retrieved via query path
jonholdsworth Oct 7, 2024
35021d6
Lots of refactoring
jonholdsworth Oct 7, 2024
63979e1
Default shell=false redundancy
jonholdsworth Oct 7, 2024
f4a858a
subprocess.run wrapped for PGCLI
jonholdsworth Oct 9, 2024
31b41f2
subprocess.run wrapped for AWSCLI. Better argument handling.
jonholdsworth Oct 9, 2024
975540e
Another AWSCLI wrap.
jonholdsworth Oct 9, 2024
6370e21
get-object-attributes -> head-object for LastModified
jonholdsworth Oct 9, 2024
a773a63
Removed json module
jonholdsworth Oct 9, 2024
1adbe92
Moved header fn closer to row fn
jonholdsworth Oct 9, 2024
2dd442f
Reorder fields
jonholdsworth Oct 22, 2024
6f38ac0
Canned ACL variable removed
jonholdsworth Oct 22, 2024
92988c4
Renamed action function
jonholdsworth Oct 22, 2024
438a83e
Else's defaulted
jonholdsworth Oct 22, 2024
ea5d6bb
Guard code removed.
jonholdsworth Oct 22, 2024
c214dde
Debug removed.
jonholdsworth Oct 22, 2024
e342fac
Array argument fix
jonholdsworth Oct 22, 2024
b64e1ad
Swapped created_at and lastmodified columns
jonholdsworth Oct 22, 2024
ba137dc
FULL JOIN from INNER JOIN
jonholdsworth Oct 24, 2024
a504554
Internal changes
jonholdsworth Oct 24, 2024
75107c9
Comments
jonholdsworth Oct 24, 2024
66b5db1
Comments
jonholdsworth Oct 24, 2024
e166694
Comments
jonholdsworth Oct 24, 2024
0642a25
Added forced retries to AWS command
jonholdsworth Oct 24, 2024
97b559a
Superfluous arguments removed
jonholdsworth Oct 25, 2024
a588af2
Experimental Django code
jonholdsworth Oct 25, 2024
5d11500
Experimental Django code
jonholdsworth Oct 25, 2024
d430c59
black
jonholdsworth Oct 25, 2024
566ad2a
black
jonholdsworth Oct 25, 2024
4e62dbd
Django imports only occur under tests (requires a virtualenv).
jonholdsworth Oct 25, 2024
7c997a6
exit on postgres exception
jonholdsworth Oct 25, 2024
a1e10a7
Postgres tests safety guard
jonholdsworth Oct 25, 2024
a05eb00
Experimental refactor csv_import
jonholdsworth Oct 28, 2024
00a58b6
Revert "Experimental refactor csv_import"
jonholdsworth Oct 28, 2024
5c5fcc6
Revert "Revert "Experimental refactor csv_import""
jonholdsworth Oct 28, 2024
3de214c
Debug removed
jonholdsworth Oct 28, 2024
335f07d
Revert "Debug removed"
jonholdsworth Oct 28, 2024
65de3ca
Revert "Revert "Revert "Experimental refactor csv_import"""
jonholdsworth Oct 28, 2024
95c3179
More experimental client code
jonholdsworth Oct 28, 2024
11dffb6
Forking video tests away from ACL script
jonholdsworth Oct 30, 2024
3f63e42
Moved video tests out of this script
jonholdsworth Oct 30, 2024
6831061
Minimum functionality
jonholdsworth Oct 30, 2024
19a0881
Fake key to handle FULL JOIN absent video keys
jonholdsworth Oct 30, 2024
078b479
black
jonholdsworth Oct 30, 2024
fa4689d
Rearranging
jonholdsworth Oct 30, 2024
0aa68e0
S3 orphan resolution next pass
jonholdsworth Oct 31, 2024
09e124b
black
jonholdsworth Oct 31, 2024
12a309b
Comment
jonholdsworth Oct 31, 2024
fd62f81
CSV orphans
jonholdsworth Oct 31, 2024
83d1c82
black
jonholdsworth Oct 31, 2024
2418d96
Script splitting
jonholdsworth Nov 1, 2024
7c8d461
Orphan-detection code removed
jonholdsworth Nov 1, 2024
7ef56b8
Orphan detection script separated
jonholdsworth Nov 1, 2024
fcbde52
Removed --pyenv requirement, prior to management command
jonholdsworth Nov 1, 2024
bfe0848
Moved to management dir
jonholdsworth Nov 1, 2024
068244d
Revert "Moved to management dir"
jonholdsworth Nov 1, 2024
a49f9df
Revert "Removed --pyenv requirement, prior to management command"
jonholdsworth Nov 1, 2024
61158e9
Comments
jonholdsworth Nov 1, 2024
8f5b88a
Comment
jonholdsworth Nov 1, 2024
917e9ad
refactor
jonholdsworth Nov 1, 2024
f748523
Comment
jonholdsworth Nov 1, 2024
ad2733d
Cleanups
jonholdsworth Nov 1, 2024
12ab098
rename
jonholdsworth Nov 1, 2024
75f0a8f
initial commit of orphan video repair script
jonholdsworth Nov 1, 2024
f16ca20
pyenv whoops
jonholdsworth Nov 1, 2024
4951b52
Repair script stripped
jonholdsworth Nov 1, 2024
aae1bd7
Import cleanup
jonholdsworth Nov 1, 2024
44a1bd8
Syncing headers
jonholdsworth Nov 1, 2024
a23eadb
cleanups
jonholdsworth Nov 1, 2024
bc48b26
help message
jonholdsworth Nov 1, 2024
8b2aa15
refactor
jonholdsworth Nov 1, 2024
860235f
Basics working
jonholdsworth Nov 1, 2024
81ef0cc
More import cleanups
jonholdsworth Nov 1, 2024
39bb18f
Warnings
jonholdsworth Nov 1, 2024
9e6d570
Notes
jonholdsworth Nov 1, 2024
f94518b
Notes
jonholdsworth Nov 1, 2024
d7c3bc2
First success
jonholdsworth Nov 1, 2024
9684c69
Uses bulk_create() so that save() does not run
jonholdsworth Nov 4, 2024
cd0143a
Neatening and rename
jonholdsworth Nov 4, 2024
36d251e
Comments
jonholdsworth Nov 4, 2024
964321f
Added S3 dumper
jonholdsworth Nov 4, 2024
bd6f86d
Boto3 conversion of get-video-s3-acls
jonholdsworth Nov 4, 2024
4b37c93
black
jonholdsworth Nov 4, 2024
01ce2dd
Boto3 conversion: find-fixable-orphans
jonholdsworth Nov 4, 2024
ab294e5
Boto3 conversion: repair-fixable-orphans.py
jonholdsworth Nov 4, 2024
57e0b46
Added a public/published boolean column
jonholdsworth Nov 11, 2024
587e01c
message
jonholdsworth Nov 11, 2024
a240a3d
comments/black
jonholdsworth Nov 11, 2024
76e81b5
Unused imports removed
jonholdsworth Nov 11, 2024
1d2a86a
Initial review commits/black
jonholdsworth Dec 11, 2024
749bb20
Script renamings
jonholdsworth Dec 11, 2024
48a3207
OSV ignore GHSA-rrqc-c2jx-6jgv to suppress build warnings (We have a …
jonholdsworth Dec 11, 2024
760dd8e
Do not orphan-test fake keys
jonholdsworth Dec 11, 2024
781bedd
Use csv.writer() for get_ script
jonholdsworth Dec 11, 2024
04c1cc9
Other scripts now using csv.writerow() also
jonholdsworth Dec 11, 2024
87db6a2
Dry run mode made default, flag changed to --commit
jonholdsworth Dec 11, 2024
556e709
moved get script
jonholdsworth Dec 13, 2024
da0befb
rename get script for consistency
jonholdsworth Dec 13, 2024
20f8bf4
changed permissions on get script for consistency
jonholdsworth Dec 13, 2024
695b398
get_video_s3_acls -> Management Command
jonholdsworth Dec 13, 2024
4d79d32
Comments
jonholdsworth Dec 13, 2024
4caa11a
Comments
jonholdsworth Dec 13, 2024
75e82cf
Comments
jonholdsworth Dec 13, 2024
967daaf
Moved remaining commands
jonholdsworth Dec 13, 2024
2ae11cd
Renamed
jonholdsworth Dec 13, 2024
cbe56c8
black
jonholdsworth Dec 13, 2024
1fb9978
find_fixable_s3_orphans.py -> Management Command
jonholdsworth Dec 13, 2024
4f1934a
black and cleanups
jonholdsworth Dec 13, 2024
3 changes: 3 additions & 0 deletions .osv-detector.yml
@@ -6,3 +6,6 @@ ignore:
- GHSA-257q-pv89-v3xv # GHSA says affected versions are jQuery v.2.2.0 until v.3.5.0
- GHSA-vm8q-m57g-pff3
- GHSA-w3h3-4rj7-4ph4
- GHSA-248v-346w-9cwc # Certifi removes GLOBALTRUST root certificate (https://github.com/advisories/GHSA-248v-346w-9cwc)
- GHSA-g92j-qhmh-64v2 # Sentry's Python SDK unintentionally exposes environment variables to subprocesses (https://github.com/advisories/GHSA-g92j-qhmh-64v2)
- GHSA-9mvj-f7w8-pvh2 # Bootstrap Cross-Site Scripting (XSS) vulnerability (https://github.com/advisories/GHSA-9mvj-f7w8-pvh2)
293 changes: 293 additions & 0 deletions bin/find-fixable-orphans.py
@@ -0,0 +1,293 @@
#!/usr/bin/env -S python3 -u
#
# This script needs to be run in a pyenv virtualenv with the Django project installed.
#
# Finds orphaned S3 objects that can be matched back to NZSL entries that are missing S3 objects.
# Essentially finds one form of import error.
#
# Bang line above passes '-u' to python, for unbuffered output
# Permissions required:
# psql - access to heroku app's postgres
# aws s3 - NZSL IAM access
# s3:GetObjectAcl permissions or READ_ACP access to the object
# https://docs.aws.amazon.com/cli/latest/reference/s3api/get-object-acl.html
# For some commands you need to run this in a venv that has all the right Python site-packages.
# TODO Convert this script to a Django Management Command

import os
import sys
import subprocess
import argparse
from uuid import uuid4
import boto3

# Magic required to allow this script to use Signbank Django classes
# This goes away if this script becomes a Django Management Command
print("Importing site-packages environment", file=sys.stderr)
print(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), file=sys.stderr)
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "signbank.settings.development")
from django.core.wsgi import get_wsgi_application

get_wsgi_application()

from django.contrib.auth import get_user_model

User = get_user_model()

from signbank.dictionary.models import (
    Gloss,
)


parser = argparse.ArgumentParser(
    description="You must set up: An AWS auth means, eg. AWS_PROFILE env var. "
    "Postgres access details, eg. DATABASE_URL env var."
)
parser.add_argument(
    "--env",
    default="uat",
    required=False,
    help="Environment to run against, eg 'production', 'uat', etc (default: '%(default)s')",
)
parser.add_argument(
    "--pgcli",
    default="/usr/bin/psql",
    required=False,
    help="Postgres client path (default: %(default)s)",
)
args = parser.parse_args()

# Keep synced with other scripts
GLOSS_ID_COLUMN = "Gloss ID"
GLOSS_COLUMN = "Gloss"
GLOSS_PUBLIC_COLUMN = "Gloss public"
GLOSS_VIDEO_COLUMN = "Suggested Video key"
GLOBAL_COLUMN_HEADINGS = [
    GLOSS_ID_COLUMN,
    GLOSS_COLUMN,
    GLOSS_PUBLIC_COLUMN,
    GLOSS_VIDEO_COLUMN,
]

# Other globals
CSV_DELIMITER = ","
FAKEKEY_PREFIX = "this_is_not_a_key_"
DATABASE_URL = os.getenv("DATABASE_URL", "")
PGCLI = args.pgcli
AWS_S3_BUCKET = f"nzsl-signbank-media-{args.env}"


def pg_cli(args_list):
    try:
        return subprocess.run(
            [PGCLI, "-c"] + args_list + [f"{DATABASE_URL}"],
            env=os.environ,
            capture_output=True,
            check=True,
            text=True,
        )
    except subprocess.CalledProcessError as e:
        print(f"Error: subprocess.run returned code {e.returncode}", file=sys.stderr)
        print(e.cmd, file=sys.stderr)
        print(e.stdout, file=sys.stderr)
        print(e.stderr, file=sys.stderr)
        sys.exit(1)


# Fake key is a hack to handle FULL JOIN
def maybe_fakekey(instring):
    return instring if instring else FAKEKEY_PREFIX + str(uuid4())


def filter_fakekey(instring):
    return "" if instring.startswith(FAKEKEY_PREFIX) else instring


# Get the video files info from NZSL Signbank
def get_nzsl_raw_keys_dict():
    print(
        "Getting raw list of video file info from NZSL Signbank ...",
        file=sys.stderr,
    )
    this_nzsl_raw_keys_dict = {}
    # Column renaming is for readability
    # Special delimiter because columns might contain commas
    result = pg_cli(
        [
            "COPY ("
            "SELECT "
            "dg.id AS gloss_id, "
            "dg.idgloss AS gloss_idgloss, "
            "dg.created_at AS gloss_created_at, "
            "dg.published AS gloss_public, "
            "vg.is_public AS video_public, "
            "vg.id AS video_id, "
            "vg.videofile AS video_key "
            "FROM dictionary_gloss AS dg "
            "FULL JOIN video_glossvideo AS vg ON vg.gloss_id = dg.id"
            ") TO STDOUT WITH (FORMAT CSV, DELIMITER '|')",
        ]
    )

    # Separate the NZSL db columns
    # Write them to a dictionary, so we can do fast operations
    for rawl in result.stdout.split("\n"):
        rawl = rawl.strip()
        if not rawl:
            continue
        [
            gloss_id,
            gloss_idgloss,
            gloss_created_at,
            gloss_public,
            video_public,
            video_id,
            video_key,
        ] = rawl.split("|")

        # Hack to handle FULL JOIN
        video_key = maybe_fakekey(video_key.strip())

        # This sets the initial field ordering in the all_keys dictionary row
        this_nzsl_raw_keys_dict[video_key] = [
            gloss_idgloss.replace(CSV_DELIMITER, ""),
            gloss_created_at,
            gloss_id,
            video_id,
            gloss_public.lower() == "t",
            video_public.lower() == "t",
        ]

    print(
        f"{len(this_nzsl_raw_keys_dict)} rows retrieved",
        file=sys.stderr,
    )

    return this_nzsl_raw_keys_dict
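
Review note: PostgreSQL's `COPY ... WITH (FORMAT CSV, DELIMITER '|')` wraps a field in double quotes whenever it contains the delimiter, a quote character, or a newline, so splitting each output line on `|` could mis-split such a row. The following is a minimal sketch of reading the same output with Python's `csv` module instead. The helper name `parse_copy_output` and the sample rows are hypothetical and are not part of this PR.

```python
import csv
import io


def parse_copy_output(raw_stdout, delimiter="|"):
    """Parse delimiter-separated CSV text (as emitted by COPY ... FORMAT CSV) into rows."""
    # newline="" leaves line endings untouched so the csv module can handle
    # newlines embedded inside quoted fields.
    reader = csv.reader(io.StringIO(raw_stdout, newline=""), delimiter=delimiter)
    # Skip completely empty lines, mirroring the "if not rawl: continue" guard.
    return [row for row in reader if row]


# Hypothetical sample: the second row has a quoted field containing the delimiter.
sample = (
    "8001|plain:1234|2024-01-01|t|t|55|8001.plain.8001_video.mp4\n"
    '8002|"odd|name:5678"|2024-01-02|f|f|56|8002.odd.8002_video.mp4\n'
)
rows = parse_copy_output(sample)
```

Each element of `rows` is a list of seven strings, which could be unpacked the same way the loop above unpacks `rawl.split("|")`.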


# Get all keys from AWS S3
def get_s3_bucket_raw_keys_list(s3_bucket=AWS_S3_BUCKET):
    print(f"Getting raw AWS S3 keys recursively ({s3_bucket}) ...", file=sys.stderr)

    s3_resource = boto3.resource("s3")
    s3_resource_bucket = s3_resource.Bucket(s3_bucket)
    this_s3_bucket_raw_keys_list = [
        s3_object.key for s3_object in s3_resource_bucket.objects.all()
    ]

    print(
        f"{len(this_s3_bucket_raw_keys_list)} rows retrieved",
        file=sys.stderr,
    )

    return this_s3_bucket_raw_keys_list


# Get the keys present and absent across NZSL Signbank and S3, to dictionary
def create_all_keys_dict(this_nzsl_raw_keys_dict, this_s3_bucket_raw_keys_list):
    print(
        "Getting keys present and absent across NZSL Signbank and S3 ...",
        file=sys.stderr,
    )
    this_all_keys_dict = {}

    # Find S3 keys that are present in NZSL, or absent
    for video_key in this_s3_bucket_raw_keys_list:
        dict_row = this_nzsl_raw_keys_dict.get(video_key, None)
        if dict_row:
            # NZSL glossvideo record for this S3 key
            this_all_keys_dict[video_key] = [
                True,  # NZSL PRESENT
                True,  # S3 PRESENT
            ] + dict_row
        else:
            # S3 key with no corresponding NZSL glossvideo record
            this_all_keys_dict[video_key] = [
                False,  # NZSL Absent
                True,  # S3 PRESENT
            ] + [""] * 6

    # Find NZSL keys that are absent from S3 (present in both handled above)
    for video_key, dict_row in this_nzsl_raw_keys_dict.items():
        if video_key not in this_s3_bucket_raw_keys_list:
            # gloss/glossvideo record with no corresponding S3 key
            # Either:
            #   video_key is real, but the S3 object is missing
            #   video_key is fake (to handle the FULL JOIN) and this gloss/glossvideo never had an S3 object
            this_all_keys_dict[video_key] = [
                True,  # NZSL PRESENT
                False,  # S3 Absent
            ] + dict_row

    return this_all_keys_dict
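
Review note: the second loop above tests `video_key not in this_s3_bucket_raw_keys_list`, which scans a list for every NZSL row. A possible alternative, sketched here under the assumption that only presence flags are needed, is to build sets once and classify every key in a single pass. The function name `classify_keys` and the sample keys are hypothetical and are not part of this PR.

```python
def classify_keys(nzsl_rows, s3_keys):
    """Return {key: (in_nzsl, in_s3)} for every key seen on either side."""
    nzsl_keys = set(nzsl_rows)  # iterating a dict yields its keys
    s3_key_set = set(s3_keys)  # set membership is O(1); list membership is O(n)
    return {
        key: (key in nzsl_keys, key in s3_key_set)
        for key in nzsl_keys | s3_key_set
    }


# Hypothetical sample data: one key per presence/absence combination.
presence = classify_keys(
    {"a.mp4": ["gloss a"], "b.mp4": ["gloss b"]},
    ["b.mp4", "c.mp4"],
)
```

Here `a.mp4` is an NZSL record with no S3 object, `b.mp4` exists on both sides, and `c.mp4` is an S3 object with no NZSL record, matching the three cases the function above distinguishes.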


def find_orphans():
    all_keys_dict = create_all_keys_dict(
        get_nzsl_raw_keys_dict(), get_s3_bucket_raw_keys_list()
    )
    print("Finding fixable orphans", file=sys.stderr)

    print(CSV_DELIMITER.join(GLOBAL_COLUMN_HEADINGS))

    # Traverse all the NZSL Signbank glosses that are missing S3 objects
    for video_key, [
        key_in_nzsl,
        key_in_s3,
        gloss_idgloss,
        gloss_created_at,
        gloss_id,
        video_id,
        gloss_public,
        video_public,
    ] in all_keys_dict.items():

        if not key_in_nzsl:
            # This is an S3 object, not a Signbank record
            continue

        if key_in_s3:
            # This Signbank record already has an S3 object, all is well
            continue

        # Business rule
        if int(gloss_id) < 8000:
            continue

        # The gloss_id is the only reliable retrieval key at the Signbank end
        gloss = Gloss.objects.get(id=gloss_id)
        gloss_name = gloss.idgloss.split(":")[0].strip()
        video_path = gloss.get_video_path()

        # Skip any that already have a video path
        # These should have an S3 object but don't: For some reason the video never made it to S3
        # These will have to have their videos reinstated (separate operation)
        if len(video_path) > 0:
            continue

        # We try to find the orphaned S3 object, if it exists
        # TODO We could improve on brute-force by installing new libraries eg. rapidfuzz
        for test_key, [key_nzsl_yes, key_s3_yes, *_] in all_keys_dict.items():
            if gloss_name in test_key:
                if str(gloss_id) in test_key:
                    if key_nzsl_yes:
                        print(f"Anomaly (in NZSL): {gloss.idgloss}", file=sys.stderr)
                        continue
                    if not key_s3_yes:
                        print(f"Anomaly (not in S3): {gloss.idgloss}", file=sys.stderr)
                        continue
                    print(
                        CSV_DELIMITER.join(
                            [gloss_id, gloss.idgloss, str(gloss_public), test_key]
                        )
                    )


print(f"Env: {args.env}", file=sys.stderr)
print(f"S3 bucket: {AWS_S3_BUCKET}", file=sys.stderr)
print(f"PGCLI: {PGCLI}", file=sys.stderr)
print(f"AWS profile: {os.environ.get('AWS_PROFILE', '')}", file=sys.stderr)

find_orphans()
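
Review note: `find_orphans()` builds each output row with `CSV_DELIMITER.join(...)`, and `gloss.idgloss` comes straight from the Django model, so a gloss name containing a comma would shift the columns. Later commits in this PR mention moving to `csv.writer()`. A minimal sketch of that approach is below; the helper name `write_rows` and the sample row are hypothetical, and the column headings are copied from `GLOBAL_COLUMN_HEADINGS`.

```python
import csv
import io
import sys


def write_rows(rows, headings, out=sys.stdout):
    """Write a header and data rows as CSV, quoting fields only where needed."""
    writer = csv.writer(out)
    writer.writerow(headings)
    writer.writerows(rows)


# Hypothetical sample row whose gloss name contains a comma.
buffer = io.StringIO()
write_rows(
    [["8001", "shop, store:1234", "True", "8001-shop.mp4"]],
    ["Gloss ID", "Gloss", "Gloss public", "Suggested Video key"],
    out=buffer,
)
```

Reading `buffer.getvalue()` back with `csv.reader` returns the gloss name as a single field, which is the property the plain `join` does not guarantee.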