Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't fetch image metadata during Algolia sync #3496

Merged
merged 12 commits into from
Apr 17, 2024
Merged

Conversation

ikesau
Copy link
Member

@ikesau ikesau commented Apr 15, 2024

The reason this PR exists is because the algolia sync script was timing out during my local testing.

It was calling Gdoc.getAndLoadPublishedGdocPosts (a proxy for getAndLoadPublishedGdocPosts) which called loadGdocFromGdocBase on every post, which called loadImageMetadata on every post.

loadImageMetadata fetches from Gdrive and syncs to the DB+S3 if there are updates. It was written with the Gdocs preview pipeline in mind, and turned out to be an anti-pattern: when used on all 500 gdocs all at once, we get throttled.

So, at first I got a bit distracted by gdocs performance in general and added a cache to the imageStore singleton so that it would fetch all metadata during instantiation and then serve the cache on subsequent calls.

Fortunately I realized, for the specific issue at hand, it would be far simpler to just not load image metadata when we're indexing documents to Algolia.

Now, GdocBase.loadState only loads image data from the DB, and loadGdocFromGdocBase (its caller) will do the sync with Google beforehand if the contentSource is gdocs

e.g.

  1. GET /admin/gdocs/:id?contentSource=gdocs
  2. Select the Gdoc from the DB
  3. Parse the response
  4. Request fresh JSON from Google
  5. Parse the response
  6. Request image metadata from google, sync with the DB+S3 if updated
  7. Load attachments

This meant I could remove a bunch of the readwrite transactions we added due to the previous image flow, but once I started on that, I realized it would quickly blow this PR up into something massive, which seemed like a bad idea right before the offsite, especially because I would prefer to blow this whole system up with Cloudflare Images instead. 😅

@ikesau ikesau requested a review from danyx23 April 15, 2024 23:17
Copy link
Member

@marcelgerber marcelgerber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thank you for working on this ❤️

db/db.ts Outdated Show resolved Hide resolved
db/model/Gdoc/GdocBase.ts Outdated Show resolved Hide resolved
db/model/Gdoc/GdocFactory.ts Outdated Show resolved Hide resolved
db/db.ts Outdated
filenames: string[]
): Promise<Record<string, ImageMetadata>> => {
if (filenames.length === 0) return {}
const rows = (await knexRaw(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const rows = (await knexRaw(
const rows = (await knexRaw<Pick<DbEnrichedImage, "id" | "googleId" | "filename" | "defaultAlt" | "updatedAt" | "originalWidth" | "originalHeight">> (

I think this pattern is nicer where instead of casting to the full type in the end you pick at the point of the query

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True! I've also changed ImageMetadata to derive from DbEnrichedImage to hopefully keep things in sync a little better.

Comment on lines 319 to 324
return Promise.all(images.map((i) => Image.syncImage(knex, i)))
.then(excludeUndefined)
.catch((e) => {
console.error(`Error syncing images to S3`, e)
return []
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this a lot less often now should we then maybe take care of the errors a bit better? It might be worth capturing the errors so we can report them up the call chain and back to the client so we can warn the user if some images didn't sync. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I've added errors in the right places so that users will get a JSON 500 response if something bad happens on fetch from gdrive or sync.

Far from ideal, but programming is a process 🙂

Copy link
Contributor

@danyx23 danyx23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is very nice ☺️! Can you create a follow-up issue to do the cleanup and get rid of the unnecessary read-write transactions now?

I have one question about the error reporting if syncing fails, otherwise it's good to go I think!

@ikesau ikesau force-pushed the image-store-caching branch from 63240f4 to 66590a5 Compare April 16, 2024 22:49
@ikesau
Copy link
Member Author

ikesau commented Apr 16, 2024

Follow up issue here: #3504

@ikesau ikesau merged commit 041cdd2 into master Apr 17, 2024
19 of 22 checks passed
@ikesau ikesau deleted the image-store-caching branch April 17, 2024 20:02
marcelgerber added a commit that referenced this pull request Apr 18, 2024
This reverts commit 041cdd2, reversing
changes made to 1976d05.
ikesau added a commit that referenced this pull request Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants