
Remove Redis JSON extension from MARC file generation (PP-1839) #2143

Merged 7 commits into main from bugfix/failing-marc-xml on Oct 31, 2024

Conversation

jonathangreen
Member

@jonathangreen jonathangreen commented Oct 29, 2024

Description

Updates to MARC generation Celery tasks:

  1. Remove Redis JSON Usage 🔥
    I decided to stop using Redis JSON because the AWS ElastiCache implementation differs significantly from the standard Redis setup, making local testing challenging. This update resolves an issue that only occurred in production, where AWS ElastiCache would throw exceptions if the JSON document exceeded 64 MB. Standard Redis JSON does not have this limitation, so the problem only surfaced in the production environment.

  2. Optimize Database Queries
    To reduce the duration of the MARC file generation tasks, I optimized the database queries:

    • I added an additional index to speed up queries, along with a database migration to apply this index to existing installations.
    • I adjusted the loader options in the queries to minimize the number of queries made.

    These changes reduced the task time to about 45 seconds in the worst case, with an average around 20 seconds. While it’s not ideal, it’s a significant improvement from the previous 5-minute duration. This optimization allows me to skip data buffering in Redis and push it directly to S3.

Motivation and Context

  • This update addresses MARC record generation failures seen in production, particularly when collections are added to multiple libraries.
  • Removing the reliance on Redis JSON simplifies future debugging and troubleshooting.

How Has This Been Tested?

  • Tested locally with production data
  • Running unit tests in CI

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@jonathangreen jonathangreen added the bug Something isn't working label Oct 29, 2024
@jonathangreen jonathangreen force-pushed the bugfix/failing-marc-xml branch from fbb4c93 to 2955774 Compare October 29, 2024 15:26
@jonathangreen jonathangreen added the DB migration This PR contains a DB migration label Oct 29, 2024

codecov bot commented Oct 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.77%. Comparing base (fefed3d) to head (2b86670).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   90.83%   90.77%   -0.06%     
==========================================
  Files         352      350       -2     
  Lines       40865    40607     -258     
  Branches     8852     8796      -56     
==========================================
- Hits        37120    36863     -257     
- Misses       2438     2439       +1     
+ Partials     1307     1305       -2     


@jonathangreen jonathangreen marked this pull request as ready for review October 29, 2024 16:56
@jonathangreen jonathangreen requested a review from a team October 29, 2024 16:56
Contributor

@tdilauro tdilauro left a comment

Looks good! 🎉

A few small comments below.

@@ -283,92 +239,113 @@ def query_works(
)
.limit(batch_size)
.order_by(Work.id.asc())
.options(
# We set loader options on all the collection properties
# needed to generate the MARC records, so that don't end
Contributor

Minor - missing word: "... so that [we] don't ..."

Comment on lines +259 to +263
# We set raiseload on all the other properties, so we quickly know if
# a change causes us to start having to issue queries to get a property.
# This will raise an InvalidRequestError, which should fail our tests, so
# we know to add the new required properties to this function.
raiseload("*"),
Contributor

!! 🎉

Comment on lines 284 to 302
isbn_query = (
select(RecursiveEquivalencyCache)
.join(
Identifier,
RecursiveEquivalencyCache.identifier_id == Identifier.id,
)
.where(
RecursiveEquivalencyCache.parent_identifier_id.in_(needs_lookup.keys()),
Identifier.type == Identifier.ISBN,
)
.order_by(
RecursiveEquivalencyCache.parent_identifier_id,
RecursiveEquivalencyCache.identifier_id,
)
.options(
selectinload(RecursiveEquivalencyCache.identifier),
)
)
isbn_equivalents = session.execute(isbn_query).scalars().all()
Contributor

We do something similar for looking up ISBN equivalent in the playback time accounting pipeline. Might be good to extract this as a utility query that can be used by both. But that doesn't need to happen in this PR.

Member Author

I saw that, and I was actually calling that function from playtime reporting in my first iteration of this. In the end, the query in playtime reporting was too slow for this case, as it only looks up a single equivalent identifier, and for efficiency I need to look up a whole batch of them.

I think we will probably still need two functions, but it's a good point: they both should be moved somewhere more accessible where other code can use them.
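The batched lookup being discussed can be sketched in plain Python (hypothetical row shapes, not the actual RecursiveEquivalencyCache model): fetch equivalents for all parent identifiers in one ordered query, then group the flat rows by parent, rather than issuing one equivalency query per identifier.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flat rows, as a single batched query would return them,
# already ordered by parent_identifier_id (mirroring the ORDER BY above).
rows = [
    {"parent_identifier_id": 1, "identifier_id": 10},
    {"parent_identifier_id": 1, "identifier_id": 11},
    {"parent_identifier_id": 2, "identifier_id": 12},
]

# One pass groups the batched result set by parent identifier, so the
# caller gets every equivalent per parent without extra round trips.
equivalents_by_parent = {
    parent: [row["identifier_id"] for row in group]
    for parent, group in groupby(rows, key=itemgetter("parent_identifier_id"))
}
print(equivalents_by_parent)  # → {1: [10, 11], 2: [12]}
```

The ORDER BY on the parent id is what makes the single-pass `groupby` grouping valid.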

Member Author

Okay, I moved this one out to be a staticmethod on RecursiveEquivalencyCache so it can be reused.

@@ -54,13 +52,13 @@ def work(self, collection: Collection | None = None) -> Work:
edition = self._db.edition()
self._db.licensepool(edition, collection=collection)
work = self._db.work(presentation_edition=edition)
work.last_update_time = utc_now()
work.last_update_time = utc_now() - datetime.timedelta(days=1)
Contributor

It's not clear why this is set to a day before. Is it so that a daily run won't skip works because less than a day has elapsed? An explanatory comment here would be helpful.

@jonathangreen jonathangreen force-pushed the bugfix/failing-marc-xml branch from da3f511 to c13d35e Compare October 30, 2024 21:40
@jonathangreen jonathangreen merged commit 2377294 into main Oct 31, 2024
21 checks passed
@jonathangreen jonathangreen deleted the bugfix/failing-marc-xml branch October 31, 2024 11:59