
Remove Redis JSON extension from MARC file generation (PP-1839) #2143

Merged 7 commits into main from bugfix/failing-marc-xml on Oct 31, 2024

Conversation

jonathangreen
Member

@jonathangreen jonathangreen commented Oct 29, 2024

Description

Updates to MARC generation Celery tasks:

  1. Remove Redis JSON Usage 🔥
    I decided to stop using Redis JSON because the AWS ElastiCache implementation differs significantly from the standard Redis setup, making local testing challenging. This update resolves an issue that only occurred in production, where AWS ElastiCache would throw exceptions if the JSON document exceeded 64 MB. Standard Redis JSON does not have this limitation, so the problem only surfaced in the production environment.

  2. Optimize Database Queries
    To reduce the duration of the MARC file generation tasks, I optimized the database queries:

    • I added an additional index to speed up queries, along with a database migration to apply this index to existing installations.
    • I adjusted the loader options in the queries to minimize the number of queries made.

    These changes reduced the task time to about 45 seconds in the worst case, with an average around 20 seconds. While it’s not ideal, it’s a significant improvement from the previous 5-minute duration. This optimization allows me to skip data buffering in Redis and push it directly to S3.

Motivation and Context

  • This update addresses MARC record generation failures seen in production, particularly when collections are added to multiple libraries.
  • Removing the reliance on Redis JSON simplifies future debugging and troubleshooting.

How Has This Been Tested?

  • Tested locally with production data
  • Running unit tests in CI

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@jonathangreen jonathangreen added the bug Something isn't working label Oct 29, 2024
@jonathangreen jonathangreen force-pushed the bugfix/failing-marc-xml branch from fbb4c93 to 2955774 Compare October 29, 2024 15:26
@jonathangreen jonathangreen added the DB migration This PR contains a DB migration label Oct 29, 2024

codecov bot commented Oct 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.77%. Comparing base (fefed3d) to head (2b86670).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   90.83%   90.77%   -0.06%     
==========================================
  Files         352      350       -2     
  Lines       40865    40607     -258     
  Branches     8852     8796      -56     
==========================================
- Hits        37120    36863     -257     
- Misses       2438     2439       +1     
+ Partials     1307     1305       -2     


@jonathangreen jonathangreen marked this pull request as ready for review October 29, 2024 16:56
@jonathangreen jonathangreen requested a review from a team October 29, 2024 16:56
Contributor

@tdilauro tdilauro left a comment

Looks good! 🎉

A few small comments below.

@@ -283,92 +239,113 @@ def query_works(
)
.limit(batch_size)
.order_by(Work.id.asc())
.options(
# We set loader options on all the collection properties
# needed to generate the MARC records, so that don't end
Contributor

Minor - missing word: "... so that [we] don't ..."

Comment on lines +259 to +263
# We set raiseload on all the other properties, so we quickly know if
# a change causes us to start having to issue queries to get a property.
# This will raise an InvalidRequestError, which should fail our tests, so
# we know to add the new required properties to this function.
raiseload("*"),
Contributor

!! 🎉

Comment on lines 284 to 302
isbn_query = (
select(RecursiveEquivalencyCache)
.join(
Identifier,
RecursiveEquivalencyCache.identifier_id == Identifier.id,
)
.where(
RecursiveEquivalencyCache.parent_identifier_id.in_(needs_lookup.keys()),
Identifier.type == Identifier.ISBN,
)
.order_by(
RecursiveEquivalencyCache.parent_identifier_id,
RecursiveEquivalencyCache.identifier_id,
)
.options(
selectinload(RecursiveEquivalencyCache.identifier),
)
)
isbn_equivalents = session.execute(isbn_query).scalars().all()
Contributor

We do something similar for looking up ISBN equivalent in the playback time accounting pipeline. Might be good to extract this as a utility query that can be used by both. But that doesn't need to happen in this PR.

Member Author

I saw that, and I was actually calling that function from playtime reporting in my first iteration of this. In the end, the query in playtime reporting was too slow for this case, as it only looks up a single equivalent identifier, and for efficiency I need to look up a whole batch of them.

I think we will probably still need two functions, but it's a good point: they both should be moved somewhere more accessible where other code can use them.
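The batched lookup being discussed can be sketched in plain Python (hypothetical row shapes, not the actual RecursiveEquivalencyCache model): fetch equivalents for all parent identifiers in one ordered query, then group the flat rows by parent, rather than issuing one equivalency query per identifier.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flat rows, as a single batched query would return them,
# already ordered by parent_identifier_id (mirroring the ORDER BY above).
rows = [
    {"parent_identifier_id": 1, "identifier_id": 10},
    {"parent_identifier_id": 1, "identifier_id": 11},
    {"parent_identifier_id": 2, "identifier_id": 12},
]

# One pass groups the batched result set by parent identifier, so the
# caller gets every equivalent per parent without extra round trips.
equivalents_by_parent = {
    parent: [row["identifier_id"] for row in group]
    for parent, group in groupby(rows, key=itemgetter("parent_identifier_id"))
}
print(equivalents_by_parent)  # → {1: [10, 11], 2: [12]}
```

The ORDER BY on the parent id is what makes the single-pass `groupby` grouping valid.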

Member Author

Okay, I moved this one out to be a staticmethod on RecursiveEquivalencyCache so it can be reused.

@@ -54,13 +52,13 @@ def work(self, collection: Collection | None = None) -> Work:
edition = self._db.edition()
self._db.licensepool(edition, collection=collection)
work = self._db.work(presentation_edition=edition)
work.last_update_time = utc_now()
work.last_update_time = utc_now() - datetime.timedelta(days=1)
Contributor

It's not clear why this is set to a day before. Is it so that a daily run won't skip works because less than a day has elapsed? An explanatory comment here would be helpful.

@jonathangreen jonathangreen force-pushed the bugfix/failing-marc-xml branch from da3f511 to c13d35e Compare October 30, 2024 21:40
@jonathangreen jonathangreen merged commit 2377294 into main Oct 31, 2024
21 checks passed
@jonathangreen jonathangreen deleted the bugfix/failing-marc-xml branch October 31, 2024 11:59