Don't fetch image metadata during Algolia sync #3496

ikesau · 2024-04-15T23:17:05Z

This PR exists because the Algolia sync script was timing out during my local testing.

It was calling Gdoc.getAndLoadPublishedGdocPosts (a proxy for getAndLoadPublishedGdocPosts) which called loadGdocFromGdocBase on every post, which called loadImageMetadata on every post.

loadImageMetadata fetches from Gdrive and syncs to the DB+S3 if there are updates. It was written with the Gdocs preview pipeline in mind, and turned out to be an anti-pattern: when used on all 500 gdocs at once, we get throttled.

So, at first I got a bit distracted by gdocs performance in general and added a cache to the imageStore singleton so that it would fetch all metadata during instantiation and then serve the cache on subsequent calls.

Fortunately I realized, for the specific issue at hand, it would be far simpler to just not load image metadata when we're indexing documents to Algolia.

Now, GdocBase.loadState only loads image data from the DB, and loadGdocFromGdocBase (its caller) syncs with Google beforehand if the contentSource is gdocs.

e.g.

GET /admin/gdocs/:id?contentSource=gdocs
Select the Gdoc from the DB
Parse the response
Request fresh JSON from Google
Parse the response
Request image metadata from Google, sync with the DB+S3 if updated
Load attachments
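
The split described above can be sketched roughly as follows. This is a minimal illustration, not the code in db/model/Gdoc: the `GdocSketch` and `Deps` types, the `loadGdoc` wrapper, and the injected `fetchFromDb` / `syncWithGoogle` functions are all hypothetical stand-ins so the example runs on its own.

```typescript
// Hypothetical, simplified types; the real Gdoc classes carry far more state.
type ContentSource = "internal" | "gdocs"

interface ImageMetadata {
    filename: string
    updatedAt: number
}

interface GdocSketch {
    id: string
    filenames: string[]
    imageMetadata: Record<string, ImageMetadata>
}

// Stand-ins for the DB and Google Drive (assumptions, injected for testability).
interface Deps {
    fetchFromDb: (filenames: string[]) => Promise<Record<string, ImageMetadata>>
    syncWithGoogle: (filenames: string[]) => Promise<void>
}

// Mirrors the described loadState: image data comes from the DB only.
async function loadState(gdoc: GdocSketch, deps: Deps): Promise<void> {
    gdoc.imageMetadata = await deps.fetchFromDb(gdoc.filenames)
}

// Mirrors the described caller: sync with Google first, but only for previews.
async function loadGdoc(
    gdoc: GdocSketch,
    contentSource: ContentSource,
    deps: Deps
): Promise<GdocSketch> {
    if (contentSource === "gdocs") {
        await deps.syncWithGoogle(gdoc.filenames)
    }
    await loadState(gdoc, deps)
    return gdoc
}
```

With this shape, a bulk job such as the Algolia indexer would take the non-gdocs path and never touch Google, while the admin preview request pays for the sync once per document.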

This meant I could remove a bunch of the readwrite transactions we added due to the previous image flow. Once I started on that, though, I realized it would quickly blow this PR up into something massive, which seemed like a bad idea right before the offsite, especially because I would prefer to blow this whole system up with Cloudflare Images instead. 😅

marcelgerber

nice, thank you for working on this ❤️

db/db.ts

db/model/Gdoc/GdocBase.ts

db/model/Gdoc/GdocFactory.ts

danyx23 · 2024-04-16T20:20:17Z

db/db.ts

+    filenames: string[]
+): Promise<Record<string, ImageMetadata>> => {
+    if (filenames.length === 0) return {}
+    const rows = (await knexRaw(


Suggested change

const rows = (await knexRaw(

const rows = (await knexRaw<Pick<DbEnrichedImage, "id" | "googleId" | "filename" | "defaultAlt" | "updatedAt" | "originalWidth" | "originalHeight">> (

I think this pattern is nicer: instead of casting to the full type at the end, you pick the columns at the point of the query.

True! I've also changed ImageMetadata to derive from DbEnrichedImage to hopefully keep things in sync a little better.
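
To illustrate the pattern being discussed, here is a minimal sketch. The `DbEnrichedImageSketch` interface is a guess at the row shape (only the seven picked column names come from the suggestion above; the field types and the extra `internalNote` column are invented), and `rawQuery` is a hypothetical stand-in for a generic raw-query helper, not the project's actual `knexRaw` signature.

```typescript
// Hypothetical row type: column names from the suggestion, types assumed.
interface DbEnrichedImageSketch {
    id: number
    googleId: string
    filename: string
    defaultAlt: string
    updatedAt: number | null
    originalWidth: number | null
    originalHeight: number | null
    internalNote: string // invented column that the query does not select
}

// Deriving the metadata type from the row type keeps the two in sync.
type ImageMetadata = Pick<
    DbEnrichedImageSketch,
    | "id"
    | "googleId"
    | "filename"
    | "defaultAlt"
    | "updatedAt"
    | "originalWidth"
    | "originalHeight"
>

// Stand-in for a generic raw-query helper: the caller names the row type up front.
async function rawQuery<TRow>(fetchRows: () => Promise<unknown[]>): Promise<TRow[]> {
    return (await fetchRows()) as TRow[]
}

async function getImageMetadataByFilenames(
    filenames: string[],
    fetchRows: () => Promise<unknown[]>
): Promise<Record<string, ImageMetadata>> {
    if (filenames.length === 0) return {}
    // The Pick happens here, at the query, rather than as a cast on the result.
    const rows = await rawQuery<ImageMetadata>(fetchRows)
    const byFilename: Record<string, ImageMetadata> = {}
    for (const row of rows) byFilename[row.filename] = row
    return byFilename
}
```

Typing the query this way means that reading an unselected column such as `rows[0].internalNote` is a compile error, which a cast to the full row type would not catch.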

danyx23 · 2024-04-16T20:24:31Z

db/model/Image.ts

+    return Promise.all(images.map((i) => Image.syncImage(knex, i)))
+        .then(excludeUndefined)
+        .catch((e) => {
+            console.error(`Error syncing images to S3`, e)
+            return []
+        })


If we do this a lot less often now should we then maybe take care of the errors a bit better? It might be worth capturing the errors so we can report them up the call chain and back to the client so we can warn the user if some images didn't sync. What do you think?

Sounds good. I've added errors in the right places so that users will get a JSON 500 response if something bad happens on fetch from gdrive or sync.

Far from ideal, but programming is a process 🙂
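
For reference, one possible way to capture per-image failures and pass them up the call chain is sketched below. This is an alternative shape under stated assumptions, not what the PR implements (the PR surfaces failures as a JSON 500 response); `syncAll`, `SyncResult`, and the `syncOne` callback are hypothetical names.

```typescript
// Hypothetical result shape: successes and failures are reported side by side.
interface SyncResult<T> {
    synced: T[]
    errors: Error[]
}

type Outcome<T> =
    | { ok: true; value: T | undefined }
    | { ok: false; error: Error }

// Runs every sync, never rejects, and keeps the errors instead of swallowing them.
async function syncAll<T>(
    items: T[],
    syncOne: (item: T) => Promise<T | undefined>
): Promise<SyncResult<T>> {
    const outcomes = await Promise.all(
        items.map(
            (item): Promise<Outcome<T>> =>
                syncOne(item).then(
                    (value): Outcome<T> => ({ ok: true, value }),
                    (e: unknown): Outcome<T> => ({
                        ok: false,
                        error: e instanceof Error ? e : new Error(String(e)),
                    })
                )
        )
    )
    const result: SyncResult<T> = { synced: [], errors: [] }
    for (const outcome of outcomes) {
        if (!outcome.ok) result.errors.push(outcome.error)
        // undefined means "nothing to sync", matching the excludeUndefined step above
        else if (outcome.value !== undefined) result.synced.push(outcome.value)
    }
    return result
}
```

A caller could then return the synced images along with a warning listing `errors`, so the user learns which images failed without losing the ones that succeeded.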

danyx23

Oh this is very nice ☺️! Can you create a follow-up issue to do the cleanup and get rid of the unnecessary read-write transactions now?

I have one question about the error reporting if syncing fails, otherwise it's good to go I think!

ikesau · 2024-04-16T23:08:19Z

Follow up issue here: #3504


ikesau requested a review from danyx23 April 15, 2024 23:17

github-actions bot assigned ikesau Apr 15, 2024

marcelgerber reviewed Apr 16, 2024


danyx23 reviewed Apr 16, 2024


ikesau added 11 commits April 16, 2024 21:50

🎉 refactor imageStore, gdocs baking

8465b89

✨ chunk metadata fetching

60a1f79

✨ image store caching polish

0be6fcf

🐝 revert image caching idea

ea4321f

🐝 delete accidentally comitted benchmark file

255f0f7

✅ fix lint

f2ea1c5

🐝 remove some more unused benchmarking code

0b75004

🐝 correctly parse GdocPost before indexing

f0e00d8

🐝 remove unused knex transaction typeguard function

5503e19

✅ fix lint

b0991a1

✨ improve image sync error handling, typing, and documentation

66590a5

ikesau force-pushed the image-store-caching branch from 63240f4 to 66590a5 Compare April 16, 2024 22:49

✅ fix lint

d4cf8ed

ikesau mentioned this pull request Apr 16, 2024

Clear out all the ReadWrite transactions that are only necessary because of images #3504

Open

ikesau merged commit 041cdd2 into master Apr 17, 2024
19 of 22 checks passed

ikesau deleted the image-store-caching branch April 17, 2024 20:02

marcelgerber added a commit that referenced this pull request Apr 18, 2024

Revert "Merge pull request #3496 from owid/image-store-caching"

9c6abaa

This reverts commit 041cdd2, reversing changes made to 1976d05.

ikesau added a commit that referenced this pull request Apr 18, 2024

Reapply "Merge pull request #3496 from owid/image-store-caching"

36c2a55

This reverts commit 9c6abaa.

ikesau mentioned this pull request Apr 18, 2024

Don't fetch image metadata during Algolia sync (with tags) #3515

Merged
