Remove Redis JSON extension from MARC file generation (PP-1839) #2143
Conversation
Force-pushed from fbb4c93 to 2955774
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   90.83%   90.77%   -0.06%
==========================================
  Files         352      350       -2
  Lines       40865    40607     -258
  Branches     8852     8796      -56
==========================================
- Hits        37120    36863     -257
- Misses       2438     2439       +1
+ Partials     1307     1305       -2
```

☔ View full report in Codecov by Sentry.
Looks good! 🎉
A few small comments below.
src/palace/manager/marc/exporter.py
Outdated

```diff
@@ -283,92 +239,113 @@ def query_works(
             )
             .limit(batch_size)
             .order_by(Work.id.asc())
             .options(
                 # We set loader options on all the collection properties
                 # needed to generate the MARC records, so that don't end
```
Minor - missing word: "... so that [we] don't ..."
```python
                # We set raiseload on all the other properties, so we quickly know if
                # a change causes us to start having to issue queries to get a property.
                # This will raise a InvalidRequestError, that should fail our tests, so
                # we know to add the new required properties to this function.
                raiseload("*"),
```
!! 🎉
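The fail-fast idea behind `raiseload("*")` can be illustrated outside SQLAlchemy with a minimal stand-in (the class and attribute names below are hypothetical, for illustration only): any attribute that was not eagerly populated raises on access instead of silently issuing a query, so a test touching a newly required property fails immediately.

```python
# Minimal analogy for SQLAlchemy's raiseload("*"): attributes that were
# not eagerly loaded raise on access rather than lazily querying.
# (Hypothetical class, not part of the PR.)
class RaiseOnLazyLoad:
    def __init__(self, **eagerly_loaded):
        self._loaded = eagerly_loaded

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails,
        # i.e. for properties that were never eagerly loaded.
        try:
            return self._loaded[name]
        except KeyError:
            raise RuntimeError(
                f"{name!r} was not eagerly loaded; add it to the query options"
            )

work = RaiseOnLazyLoad(title="Example", author="Someone")
print(work.title)  # eagerly loaded: fine
try:
    work.license_pools  # never loaded: fails fast, like raiseload("*")
except RuntimeError as exc:
    print(exc)
```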
src/palace/manager/marc/exporter.py
Outdated
```python
isbn_query = (
    select(RecursiveEquivalencyCache)
    .join(
        Identifier,
        RecursiveEquivalencyCache.identifier_id == Identifier.id,
    )
    .where(
        RecursiveEquivalencyCache.parent_identifier_id.in_(needs_lookup.keys()),
        Identifier.type == Identifier.ISBN,
    )
    .order_by(
        RecursiveEquivalencyCache.parent_identifier_id,
        RecursiveEquivalencyCache.identifier_id,
    )
    .options(
        selectinload(RecursiveEquivalencyCache.identifier),
    )
)
isbn_equivalents = session.execute(isbn_query).scalars().all()
```
We do something similar for looking up ISBN equivalents in the playback time accounting pipeline. Might be good to extract this as a utility query that can be used by both. But that doesn't need to happen in this PR.
I saw that, and I was actually calling that function from playtime reporting in my first iteration of this. In the end, the query in playtime reporting was too slow for this case, since it only looks up a single equivalent identifier, and for efficiency I need to look up a whole batch of them.
I think we will probably still need two functions, but it's a good point; they both should be moved somewhere more accessible where other code can use them.
Okay, I moved this one out to be a staticmethod on RecursiveEquivalencyCache so it can be reused.
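Since the batched query above is ordered by `parent_identifier_id`, its flat result rows typically get regrouped into a per-parent mapping before use. A self-contained sketch of that regrouping step, using plain tuples in place of the ORM rows (hypothetical data, for illustration only):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flat rows as a batched equivalence query might return them,
# already ordered by (parent_identifier_id, identifier_id) like the query
# in the diff above: (parent_identifier_id, isbn_identifier_id)
rows = [
    (1, 101),
    (1, 102),
    (2, 201),
    (4, 401),
]

# Regroup into {parent_id: [isbn_ids]} so each work in the batch can pick
# its ISBN without issuing a per-work query.
by_parent = {
    parent: [isbn for _, isbn in group]
    for parent, group in groupby(rows, key=itemgetter(0))
}
print(by_parent)  # {1: [101, 102], 2: [201], 4: [401]}
```

Because the SQL `ORDER BY` already clusters rows by parent, `groupby` can do this in a single pass without building intermediate sets.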
```diff
@@ -54,13 +52,13 @@ def work(self, collection: Collection | None = None) -> Work:
         edition = self._db.edition()
         self._db.licensepool(edition, collection=collection)
         work = self._db.work(presentation_edition=edition)
-        work.last_update_time = utc_now()
+        work.last_update_time = utc_now() - datetime.timedelta(days=1)
```
It's not clear why this is set to a day before. Is it so that running daily won't skip works because the interval might be slightly less than a day? Some explanation here would be helpful.
Force-pushed from da3f511 to c13d35e
Description
Updates to MARC generation Celery tasks:
Remove Redis JSON Usage 🔥
I decided to stop using Redis JSON because the AWS Elasticache implementation differs significantly from the standard Redis setup, making local testing challenging. This update resolves an issue that only occurred in production, where AWS Elasticache would throw exceptions if the JSON document exceeded 64 MB. The standard Redis JSON does not have this limitation, so the problem only surfaced in the production environment.
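The replacement approach described below (pushing data directly to S3 instead of accumulating a growing JSON document in Redis) amounts to buffering record bytes and flushing a part whenever the buffer reaches a size threshold, as S3 multipart uploads require. A sketch with no AWS dependency; `upload_part` here is a hypothetical stand-in for an S3 multipart-upload client call, and the class is illustrative, not the PR's actual implementation:

```python
class MultipartBuffer:
    """Accumulate record bytes and flush a part whenever the buffer reaches
    part_size, mimicking an S3 multipart upload (S3 requires every part
    except the last to be at least 5 MB). Hypothetical sketch, not the
    PR's actual code."""

    def __init__(self, upload_part, part_size=5 * 1024 * 1024):
        self._upload_part = upload_part  # callable(part_number, data)
        self._part_size = part_size
        self._buffer = bytearray()
        self._part_number = 0

    def write(self, record: bytes) -> None:
        self._buffer.extend(record)
        while len(self._buffer) >= self._part_size:
            self._flush(self._part_size)

    def close(self) -> None:
        # The final part may be smaller than part_size.
        if self._buffer:
            self._flush(len(self._buffer))

    def _flush(self, size: int) -> None:
        self._part_number += 1
        self._upload_part(self._part_number, bytes(self._buffer[:size]))
        del self._buffer[:size]


# Usage with a tiny part size so the example stays small:
parts = []
buf = MultipartBuffer(lambda n, data: parts.append((n, len(data))), part_size=10)
for record in [b"A" * 7, b"B" * 7, b"C" * 3]:
    buf.write(record)
buf.close()
print(parts)  # [(1, 10), (2, 7)]
```

Because parts are flushed as soon as they fill, memory stays bounded by `part_size` no matter how large the final MARC file grows, which is what removes the 64 MB document ceiling hit in Elasticache.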
Optimize Database Queries
To reduce the duration of the MARC file generation tasks, I optimized the database queries.
These changes reduced the task time to about 45 seconds in the worst case, with an average around 20 seconds. While it’s not ideal, it’s a significant improvement from the previous 5-minute duration. This optimization allows me to skip data buffering in Redis and push it directly to S3.
Motivation and Context
How Has This Been Tested?
Checklist