Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Update algolia index when publishing/updating/unpublishing gdoc posts #3427

Merged
merged 10 commits into from
Apr 9, 2024

Conversation

ikesau
Copy link
Member

@ikesau ikesau commented Mar 28, 2024

Resolves #2852

Adds a new step to the dreaded gdocs PUT route ☠️

If publishing/updating:
Update the pages index with new records

If deleting/unpublishing:
Delete the records from the pages index

I had to move stuff out of indexToAlgolia.tsx so that I could import a function without the implicit call to indexToAlgola() being evaluated.

Also added some comments about Fragments and #3426 as I encountered that annoying little gotcha.

To test:

Set ALGOLIA_INDEXING=true, ALGOLIA_ID, ALGOLIA_SEARCH_KEY, ALGOLIA_SECRET_KEY with values from the owid-staging application on Algolia and set in your local .env file

Publish a test post with at least 1500 characters (for chunking) and verify that it shows up in the owid-staging pages index

Update the post to have a short amount of content and republish it. Confirm that the content is updated in the index and that records from the larger version of the post are deleted.

Change the slug of the post and republish it again. Confirm that the records are updated correctly.

Unpublish the post. Confirm that the records are deleted from the index.

Publish the post again, confirm that new records are created, and then delete the post (without unpublishing first). Confirm that the records are deleted.

Confirm that fragments don't get indexed but topic pages, linear topic pages, about pages, and articles do.

@ikesau ikesau requested a review from danyx23 March 28, 2024 22:24
@ikesau ikesau self-assigned this Mar 28, 2024
@ikesau ikesau requested review from marcelgerber and removed request for danyx23 April 2, 2024 15:45
Copy link
Member

@marcelgerber marcelgerber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!
It's a bummer that the gdocs PUT route is so full now; would be nice if we can factor some things out there at some point.

baker/algolia/algoliaUtils.tsx Outdated Show resolved Hide resolved
baker/algolia/algoliaUtils.tsx Outdated Show resolved Hide resolved
adminSiteClient/gdocsDeploy.ts Outdated Show resolved Hide resolved
@@ -2380,12 +2387,17 @@ putRouteWithRWTransaction(apiRouter, "/gdocs/:id", async (req, res, trx) => {
nextJson.slug
)
} else if (checkFullDeployFallback(prevJson, nextJson, hasChanges)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a bit weird is that lightning deploys don't trigger a GdocPublishingAction.Updating action.
I don't know the numbers, but they might be the more common case than fallback deploys, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True! I've fixed lightning deploys: a function property was missing from the properties to exclude list and so isEqual(prev, next) was always returning false - not great that it's so easy to miss this but I think it was mostly due to the large gdocs model refactor and should be easier to notice again now that it's working.

I also extracted functions from the PUT route (probably easier to review in an editor than the git diff) most notably indexAndBakeGdocIfNeccesary which handles all four possible publishing action cases more explicitly, now, and should make it easier to update.

@ikesau ikesau force-pushed the index-articles-on-publish branch from 48c18da to fc1e521 Compare April 3, 2024 17:57
Copy link
Member

@marcelgerber marcelgerber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor, that's making this much better. Thank you!

baker/algolia/algoliaUtils.tsx Outdated Show resolved Hide resolved
adminSiteClient/gdocsDeploy.ts Outdated Show resolved Hide resolved
* - Updating (index and bake (potentially via lightning deploy))
* - Unpublishing (remove from index and bake)
*/
async function indexAndBakeGdocIfNeccesary(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to move this one to gdocsDeploy, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeahhh I had the same desire but those functions are in the adminSiteClient folder, and I didn't want to make a new file just for one function so I kept it close to usage instead.

Copy link

coderabbitai bot commented Apr 8, 2024

Walkthrough

The recent updates focus on enhancing content management and search capabilities through Algolia indexing at deploy time. Key changes include the introduction of new functions and enums for handling Google Docs posts, improvements in image synchronization, and the addition of Algolia indexing utilities. These modifications aim to streamline content updates and improve search functionality, aligning with the objective to index content live as it is published.

Changes

File Path Changes
adminSiteClient/gdocsDeploy.ts Added functions and enums for Google Docs publishing actions; refactored existing functions.
adminSiteServer/apiRouter.ts Enhanced Gdoc handling and image synchronization; added new functions for content management.
baker/algolia/algoliaUtils.tsx & baker/algolia/indexToAlgolia.tsx Introduced Algolia indexing utilities; refactored content record generation and indexing logic.
db/model/Gdoc/GdocBase.ts & db/model/Gdoc/GdocFactory.ts Improved Google Docs management with new methods and functions for image syncing and content graph updates.
packages/@ourworldindata/utils/src/... Added utility functions for Google Docs post validation.

Assessment against linked issues

Objective Addressed Explanation
Index our content with Algolia at deploy time (#2852)
Update individual documents in Algolia index instead of full-index updates (#2852)
Improve search quality by allowing live indexing of new content (#2852)
Reduce the time gap between content publication and its availability in search results (#2852)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Comment on lines +601 to +635
export async function syncImagesAndAddToContentGraph(
trx: KnexReadWriteTransaction,
gdoc: GdocPost | GdocDataInsight | GdocHomepage | GdocAuthor
): Promise<void> {
const id = gdoc.id
// Deleting and recreating these is simpler than tracking orphans over the next code block
await trx.table(PostsGdocsXImagesTableName).where({ gdocId: id }).delete()
const filenames = gdoc.filenames

// Includes fragments so that images in data pages are
// synced to S3 and ultimately baked in bakeDriveImages().
if (filenames.length && gdoc.published) {
await imageStore.fetchImageMetadata(filenames)
const images = await imageStore.syncImagesToS3(trx)
const gdocXImagesToInsert: DbInsertPostGdocXImage[] = []
for (const image of images) {
if (image) {
gdocXImagesToInsert.push({
gdocId: gdoc.id,
imageId: image.id,
})
}
}
try {
await trx
.table(PostsGdocsXImagesTableName)
.insert(gdocXImagesToInsert)
} catch (e) {
console.error(
`Error tracking image references with Google ID ${gdoc.id}`,
e
)
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tip

Codebase Verification

The syncImagesAndAddToContentGraph function implements basic error handling by logging errors to the console and adopts a straightforward approach to syncing images by deleting and re-adding references. This method simplifies the logic but might not be the most efficient in scenarios where the image set does not change frequently. While the error handling captures errors during the database insertion process, it does not provide mechanisms for retrying failed operations or notifying external systems/services of the failure, which could be areas for improvement. The review comment is partially addressed in the implementation, suggesting that while the function includes error handling, there could be room for enhancement in terms of efficiency and more comprehensive error handling strategies.

Analysis chain

The implementation of syncImagesAndAddToContentGraph for syncing images to S3 and updating image references in the database is a crucial functionality. Ensure that the image syncing process is efficient and that error handling is robust to prevent data inconsistencies.


Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the efficiency and error handling of the image syncing process.
rg "syncImagesAndAddToContentGraph" db/model/Gdoc/GdocFactory.ts

Length of output: 118



Script:

#!/bin/bash
# Display the content of the syncImagesAndAddToContentGraph function for detailed examination
rg -A 30 "async function syncImagesAndAddToContentGraph" db/model/Gdoc/GdocFactory.ts

Length of output: 1344

@ikesau ikesau force-pushed the index-articles-on-publish branch from b2fca20 to 3854a91 Compare April 9, 2024 14:55
@ikesau ikesau merged commit 5fc6796 into master Apr 9, 2024
22 checks passed
@ikesau ikesau deleted the index-articles-on-publish branch April 9, 2024 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Index our content with Algolia at deploy time
2 participants