Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bake posts, pages and blocks from WP API snapshots #3166

Merged
merged 25 commits into from
Feb 19, 2024
Merged

Conversation

mlbrgl
Copy link
Member

@mlbrgl mlbrgl commented Feb 2, 2024

This PR bakes all remaining Wordpress posts, pages and blocks from API snapshots saved by the previous PR in the stack.

The full API response is saved in the database, from which we can extract the content of the post, page or block. We are currently evaluating a git-based workflow to edit, override and keep track of the content part of the snapshot, but this will be tackled separately.

This PR also deprecates the old Wordpress API functions, and moves them to a separate file for clarity.

It is important to note that these snapshots are not replacing or superseding the post.content column in the database. What was previously in the post.content column is still there, and is still the source of truth for the code paths that used it (e.g. the WP HTML source -> ArchieML migration). The snapshots are only used as a static drop-in replacement for dynamic Wordpress API calls, which were happening during the rendering of posts, pages and blocks.

Depending on the remaining scope of the overall migration effort, we might decide to set up programmatic guardrails to address this source of truth issue, or simply document the caveats of the approach for the migration team.

Testing

Summary by CodeRabbit

  • New Features
    • Introduced new post retrieval and processing functions across various components for enhanced performance and maintainability.
  • Refactor
    • Replaced deprecated WordPress database operation functions with new modular functions in the API, baker, and database layers.
    • Updated post retrieval methods to use snapshots and modular functions for better efficiency and clarity.
    • Deprecated several functions related to querying WordPress tables or APIs in favor of new, optimized methods.
  • Documentation
    • Updated interface DbEnrichedPost to include a new property for API snapshots and added related utility functions.
  • Bug Fixes
    • Adjusted function calls related to post-related charts and post citability checks to ensure accuracy and reliability.

Copy link

coderabbitai bot commented Feb 2, 2024

Walkthrough

The overarching change involves modernizing and decoupling the system from WordPress dependencies by introducing new functions for post retrieval and processing, deprecating old WordPress querying functions, and refining data types to accommodate these updates. This transition aims to streamline data handling, enhance modularity, and future-proof the platform by moving towards a more structured and efficient architecture for managing content and its metadata.

Changes

Files Change Summary
adminSiteServer/apiRouter.ts Replaced wpdb.getTopics() with DEPRECATEDgetTopics().
baker/GrapherBaker.tsx, baker/SiteBaker.tsx, baker/algolia/indexToAlgolia.tsx, baker/pageOverrides.tsx, baker/siteRenderers.tsx, baker/sitemap.ts Updated post retrieval and cache management with new functions.
db/DEPRECATEDwpdb.ts Deprecated WordPress querying functions.
db/migrateWpPostsToArchieMl.ts, db/syncPostsToGrapher.ts, db/wpdb.ts Adjusted and refactored functions related to post processing and API querying.
db/model/Post.ts, packages/@ourworldindata/types/src/dbTypes/Posts.ts, packages/@ourworldindata/types/src/index.ts Introduced new functions and updated types for post retrieval and processing.
packages/@ourworldindata/types/src/wordpressTypes/WordpressTypes.ts Added new BlockGraphQlApi interface.

🐰✨
In the code's vast meadow, under the sun's bright gleam,
We hopped away from WordPress, towards a coder's dream.
With every line refactored, and every function new,
We crafted a future, efficient and true.
🌟📜

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit-tests for this file.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository from git and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

@mlbrgl mlbrgl force-pushed the wp-api-snapshot-migration branch from bd5e309 to 8d1e7fb Compare February 2, 2024 10:04
@mlbrgl mlbrgl force-pushed the bake-from-snapshot branch from d7a3e3b to f8924c9 Compare February 2, 2024 10:04
@mlbrgl mlbrgl force-pushed the bake-from-snapshot branch 2 times, most recently from d808ba9 to 74cd8dd Compare February 2, 2024 19:10
@mlbrgl mlbrgl force-pushed the wp-api-snapshot-migration branch from d730a41 to efefec8 Compare February 5, 2024 15:53
@mlbrgl mlbrgl force-pushed the bake-from-snapshot branch 2 times, most recently from 0ef0e77 to da807c5 Compare February 5, 2024 17:25
@mlbrgl mlbrgl marked this pull request as ready for review February 5, 2024 20:39
@mlbrgl mlbrgl requested a review from danyx23 February 5, 2024 20:39
@mlbrgl mlbrgl force-pushed the wp-api-snapshot-migration branch from efefec8 to e7c0485 Compare February 7, 2024 13:45
@mlbrgl mlbrgl force-pushed the bake-from-snapshot branch from cb1d456 to 8265f67 Compare February 7, 2024 13:45
@danyx23
Copy link
Contributor

danyx23 commented Feb 8, 2024

@coderabbitai review

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between e7c0485 and 8265f67.
Files selected for processing (15)
  • adminSiteServer/apiRouter.ts (2 hunks)
  • baker/GrapherBaker.tsx (3 hunks)
  • baker/SiteBaker.tsx (5 hunks)
  • baker/algolia/indexToAlgolia.tsx (3 hunks)
  • baker/pageOverrides.tsx (2 hunks)
  • baker/siteRenderers.tsx (9 hunks)
  • baker/sitemap.ts (2 hunks)
  • db/DEPRECATEDwpdb.ts (1 hunks)
  • db/migrateWpPostsToArchieMl.ts (2 hunks)
  • db/model/Post.ts (3 hunks)
  • db/syncPostsToGrapher.ts (1 hunks)
  • db/wpdb.ts (13 hunks)
  • packages/@ourworldindata/types/src/dbTypes/Posts.ts (3 hunks)
  • packages/@ourworldindata/types/src/index.ts (1 hunks)
  • packages/@ourworldindata/types/src/wordpressTypes/WordpressTypes.ts (1 hunks)
Files not reviewed due to errors (1)
  • (no review received)
Additional comments: 37
packages/@ourworldindata/types/src/wordpressTypes/WordpressTypes.ts (1)
  • 54-60: LGTM!
baker/pageOverrides.tsx (2)
  • 7-10: LGTM!
  • 67-67: LGTM!
packages/@ourworldindata/types/src/dbTypes/Posts.ts (2)
  • 47-47: LGTM!
  • 99-111: LGTM!
baker/sitemap.ts (1)
  • 65-65: LGTM!
baker/algolia/indexToAlgolia.tsx (3)
  • 29-29: LGTM!
  • 96-96: LGTM!
  • 197-197: LGTM!
db/DEPRECATEDwpdb.ts (1)
  • 1-8: LGTM!
db/model/Post.ts (5)
  • 88-93: LGTM!
  • 111-122: LGTM!
  • 137-151: LGTM!
  • 153-183: LGTM!
  • 278-289: LGTM!
db/migrateWpPostsToArchieMl.ts (1)
  • 22-22: LGTM!
packages/@ourworldindata/types/src/index.ts (1)
  • 497-498: Ensure the newly exported functions snapshotIsPostRestApi and snapshotIsBlockGraphQlApi are correctly implemented and relevant to the consumers of this module.
db/syncPostsToGrapher.ts (1)
  • 353-354: Switching to wpdb.getBlockApiFromApi and wpdb.getPostApiBySlugFromApi for fetching post data based on type is a significant change. Verify that these methods provide improved functionality or performance compared to their predecessors.
db/wpdb.ts (4)
  • 107-109: Exporting WP_API_ENDPOINT, OWID_API_ENDPOINT, and WP_GRAPHQL_ENDPOINT constants improves modularity and reusability. Ensure these endpoints are correctly defined and used consistently across the project.
  • 117-123: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [120-145]

The graphqlQuery function is now exported, allowing for broader use. Verify that error handling and response parsing are adequately addressed to ensure robust interaction with the WordPress GraphQL API.

  • 285-287: The introduction of getPostApiBySlugFromApi as an exported function is a key change. Ensure that this function correctly handles errors and returns the expected data format from the WordPress REST API.
  • 558-573: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [528-572]

Exporting getBlockApiFromApi and refactoring the getTables function to improve data fetching from the WordPress API are significant changes. Confirm that these functions are implemented with proper error handling and efficient data processing.

baker/GrapherBaker.tsx (2)
  • 46-49: Ensure getPostRelatedCharts is correctly implemented and used as intended to replace getRelatedCharts. Verify its definition and ensure it aligns with the expected functionality.
  • 367-367: Confirm that the replacement of getRelatedCharts with getPostRelatedCharts correctly fetches related charts based on post ID and integrates seamlessly with the existing logic.
baker/siteRenderers.tsx (9)
  • 83-90: Verify the correct implementation and usage of snapshot-related functions (getBlockContentFromSnapshot, getBlogIndex, getFullPostByIdFromSnapshot, getFullPostBySlugFromSnapshot, isPostSlugCitable, postsTable). Ensure they properly fetch and handle post data from snapshots.
  • 197-197: Confirm that getFullPostBySlugFromSnapshot correctly retrieves full post data by slug from snapshots and integrates well with the rendering logic.
  • 202-202: Ensure getFullPostByIdFromSnapshot is used correctly to fetch full post data by ID from snapshots for preview rendering.
  • 236-237: Check that isPostSlugCitable correctly determines the citability of a post slug and integrates properly with page overrides logic.
  • 489-489: Verify that getFullPostBySlugFromSnapshot is correctly used to fetch generic country profile post data from snapshots.
  • 509-509: Confirm that getFullPostBySlugFromSnapshot correctly fetches the country profile landing post data from snapshots.
  • 568-568: Ensure that getFullPostBySlugFromSnapshot is correctly used to fetch post data by slug for rendering post thumbnails, and handle cases where the post might not exist.
  • 608-612: Check the usage of getFullPostBySlugFromSnapshot within renderProminentLinks to ensure it correctly fetches post titles for internal URLs and handles errors appropriately.
  • 722-722: Verify the correct usage of getBlockContentFromSnapshot for rendering reusable blocks within the explorer page context.
baker/SiteBaker.tsx (4)
  • 74-80: Replace direct imports from ../db/wpdb.js with modular functions from ../db/model/Post.js for getBlogIndex, getFullPost, getPostsFromSnapshots, postsFlushCache, and postsTable. This change enhances modularity and aligns with the PR's objective to transition from dynamic WordPress API calls to using static snapshots.
  • 391-404: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [394-427]

The removeDeletedPosts method now uses getPostsFromSnapshots and getFullPost to fetch posts and their details. Ensure that the logic for identifying and removing deleted posts is correctly implemented and that it efficiently handles the removal process without impacting performance.

  • 427-435: The bakePosts method now uses getPostsFromSnapshots with a filter condition to exclude posts already published via Gdocs. Verify that the filter condition is correctly implemented and that the method efficiently processes posts for baking.
  • 919-919: The flushCache method now uses postsFlushCache to clear post-related caches. This change aligns with the PR's objective to move towards a more modular approach and ensures efficient garbage collection.

Copy link
Contributor

@danyx23 danyx23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By and large this looks good, but I have two change requests:

  • the first one is to add a code path that fetches full posts in bulk (with one SQL query instead 1+N) and use that in most places where full posts are used
  • the second is to pass in the knex instance explicitly into all functions (i.e. not use db.knexTable or similar constructs). This is to make sure that transaction scopes work correctly when they are needed.

Grab me in slack if you want to discuss any of this or want me to help in rewriting the query!

db/model/Post.ts Outdated
Comment on lines 137 to 152
export const isPostSlugCitable = async (slug: string): Promise<boolean> => {
const entries = SiteNavigationStatic.categories
return entries.some((category) => {
return (
category.entries.some((entry) => entry.slug === slug) ||
(category.subcategories ?? []).some(
(subcategory: CategoryWithEntries) => {
return subcategory.entries.some(
(subCategoryEntry) => subCategoryEntry.slug === slug
)
}
)
)
})
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that in the place where we define SiteNavigationStatic we create a Set with all the leaves and then this one becomes a simple lookup in the set. The function should also not return a promise in any case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Promise removed. The issue for the refactor is here: #3193, and tracked here: #3097

db/model/Post.ts Outdated
SELECT * FROM ${postsTable}
WHERE status = "publish"
AND type IN (?)
ORDER BY JSON_UNQUOTE(JSON_EXTRACT(wpApiSnapshot, '$.date')) DESC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not wpApiSnapshot ->> '$.date' here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

who knows? 😅 Thanks!

db/model/Post.ts Outdated
const rawPosts: DbRawPost[] = (
await db.knexInstance().raw(
`
SELECT * FROM ${postsTable}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't select * here if we only use wpApiSnapshot below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it's a bit heavy handed to parse the whole post just to use the wpApiSnapshot, but it allows us to keep a single pipeline to transform raw DB posts (DbRawPost) into parsed posts (DbEnrichedPost), through parsePostRaw.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know but the idea is for the parse*Raw functions to be explicitly modular - i.e. if you only need need wpApiSnapshot then there should be a parsePostWpApiSnapshot function next to parsePostRaw that parses just this field. A post row member is now 100K on average without spApiSnapshot and it just seems a bit on the extreme side to read all that if you won't even use it, especially as this is going to be used in a batch context.

Copy link
Member Author

@mlbrgl mlbrgl Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, thanks. Fixed in the latest push.

imageId: postApi.featured_media,
relatedCharts:
postApi.type === "page"
? await getPostRelatedCharts(postApi.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not a good query pattern - it looks like quite a few code paths need the result of getFullPost for many posts, but by doing it this way we have to run one query against MySQL for every post. I think it would be much better to have two flavours of getPostsFromSnapshots, one like it is at the moment, and one that also queries for the related chart information (doing a join and then a group by). This second query would mean that getting full posts for all posts can be done with a single query against the DB. Admittedly, this one query would then become quite complex, but I think this is still worth it. I'm happy to help with writing the refined query, just let me know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, thanks for calling this out. Issue created here: #3192, and tracked here: #3097

@mlbrgl mlbrgl force-pushed the bake-from-snapshot branch from f15615f to e93d1ba Compare February 19, 2024 14:35
@mlbrgl mlbrgl closed this Feb 19, 2024
@mlbrgl mlbrgl reopened this Feb 19, 2024
@mlbrgl
Copy link
Member Author

mlbrgl commented Feb 19, 2024

I reset the whole staging server and ran a full bake: everything looks good, so I'm going ahead with the deploy.

@mlbrgl
Copy link
Member Author

mlbrgl commented Feb 19, 2024

Merge activity

  • Feb 19, 12:02 PM EST: @mlbrgl started a stack merge that includes this pull request via Graphite.
  • Feb 19, 12:03 PM EST: @mlbrgl merged this pull request with Graphite.

@mlbrgl mlbrgl merged commit e7eef1f into master Feb 19, 2024
42 checks passed
@mlbrgl mlbrgl deleted the bake-from-snapshot branch February 19, 2024 17:03
mlbrgl added a commit that referenced this pull request Feb 19, 2024
This PR replaces the fortunejs content graph with a DB-based alternative, using the links stored in the `posts_links` and `posts_gdoc_links` tables.

![Screenshot 2024-02-07 at 10.54.47.png](https://graphite-user-uploaded-assets-prod.s3.amazonaws.com/0SFFiIjKuUK6UPYHVe6u/8f7371fb-0e80-4d07-8a4e-4fc4c8002c59.png)

This PR also continues the deprecation work started in #3166.

- [x] fully rebake site on staging
- [x] fixed malformed prominent link in content (ourworldindata.org/ instead of https://ourworldindata.org/)
	- [x] https://owid.cloud/wp/wp-admin/post.php?post=34103&action=edit
	- [x] https://owid.cloud/wp/wp-admin/post.php?post=34066&action=edit

**Inconsistency on how grapher chart links are stored in the DB**: there are a handful of chart links stored with the https://ourworldindata.org/grapher prefix. These disappear upon saving the containing articles, which indicates that they were created with an older version of the codebase. I'm then opting for not supporting them in the content graph.
```sql
    SELECT pgl.*, pg.published from posts_gdocs_links pgl
    JOIN posts_gdocs pg on pg.id = pgl.sourceId
    WHERE pgl.target LIKE "https://ourworldindata.org/grapher%"
```
 ### Testing links

Below are some testing links to grapher pages, backlinking to posts in different configurations.

- gdoc: https://ourworldindata.org/grapher/agricultural-export-subsidies
	- [x] http://localhost:3030/grapher/agricultural-export-subsidies
	- [x] http://staging-site-db-content-graph/grapher/agricultural-export-subsidies
- gdocs: https://ourworldindata.org/grapher/pollution-deaths-from-fossil-fuels
	- [x] http://localhost:3030/grapher/pollution-deaths-from-fossil-fuels
	- [x] http://staging-site-db-content-graph/grapher/pollution-deaths-from-fossil-fuels
- wp: https://ourworldindata.org/grapher/dalys-rate-from-all-causes
	- [x] http://localhost:3030/grapher/dalys-rate-from-all-causes
	- [x] http://staging-site-db-content-graph/grapher/dalys-rate-from-all-causes
- wp (with chart redirect): https://ourworldindata.org/grapher/age-standardized-death-rate-from-pm25-pollution-per-100000-vs-gdp-per-capita-int-
	- [x] http://localhost:3030/grapher/age-standardized-death-rate-from-pm25-pollution-per-100000-vs-gdp-per-capita-int-
	- [x] http://staging-site-db-content-graph/grapher/age-standardized-death-rate-from-pm25-pollution-per-100000-vs-gdp-per-capita-int-
- gdoc (with chart redirect): https://ourworldindata.org/grapher/population-long-run-with-projections?time=earliest..2100&country=~OWID_WRL
	- [x] http://localhost:3030/grapher/population-long-run-with-projections?time=earliest..2100&country=~OWID_WRL
	- [x] http://staging-site-db-content-graph/grapher/population-long-run-with-projections?time=earliest..2100&country=~OWID_WRL
- none: https://ourworldindata.org/grapher/death-rates-alcohol-drug-overdoses-by-age-who
	- [x] http://localhost:3030/grapher/death-rates-alcohol-drug-overdoses-by-age-who
	- [x] http://staging-site-db-content-graph/grapher/death-rates-alcohol-drug-overdoses-by-age-who
mlbrgl added a commit that referenced this pull request Feb 19, 2024
- pass knex as a parameter to downstream functions
- wrap top API calls in a transaction
- replace knex.raw() with knexRaw()
- this PR doesn't add [{readOnly: true}](https://www.notion.so/owid/Database-access-in-Grapher-d9db343c2bfb4ae0b14b3dec72f686c6?pvs=4#fdb6c225d4134e47a78fe3436a19f72c).

@danyx23 [as you know](#3145), these refactors tend to start with a small surface area but end up requiring extensive surgery across the codebase.

This is partly why I hadn't dived into it as part of this project, and continued using the `db.knexInstance().raw()` paradigm in downstream functions. (Another reason had to do with the value of doing this refactor on partly obsolete code).

So in case this ends up needing more work and blocks the delivery of the core wordpress sunsetting project, I'll revert it to draft for later.

How does that sound?

Context: #3166 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants