🐛 filter non-editorial content out of sitemap #2846

Merged
3 commits merged on Nov 1, 2023

@ikesau ikesau commented Oct 24, 2023

Fixes #2726

Uses the wpdb.getPosts() method (and fixes up / simplifies some incorrect typing), which correctly filters out reusable blocks. There's a default filter in this method for posts whose slugs end in -country-profile, but AFAICT that's to filter out the templates, not the pages themselves - those still get added:

image

As well as default country pages:

image

Also uses the Gdoc.getPublishedGdocs() method which filters out fragments and unpublished documents.

On my local environment:
sitemap.xml length before: 6056
sitemap.xml length after: 6043

The filtered pages:

http://localhost:3030/co2-country-profile
http://localhost:3030/coronavirus-country-profile
http://localhost:3030/energy-country-profile
http://localhost:3030/untitled-reusable-block-128
http://localhost:3030/untitled-reusable-block-129
http://localhost:3030/untitled-reusable-block-130
http://localhost:3030/untitled-reusable-block-131
http://localhost:3030/untitled-reusable-block-145
http://localhost:3030/untitled-reusable-block-171
http://localhost:3030/untitled-reusable-block-180
http://localhost:3030/untitled-reusable-block-266
http://localhost:3030/untitled-reusable-block-275
http://localhost:3030/untitled-reusable-block-277
http://localhost:3030/untitled-reusable-block-295
http://localhost:3030/untitled-reusable-block-7
http://localhost:3030/untitled-reusable-block-77
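
A list like the one above can be produced by diffing the two sitemaps. This is only a minimal sketch, not the method used for this PR: `extractLocs` and `removedUrls` are hypothetical helpers, and the sample XML is made up for illustration.

```typescript
// Hypothetical helper: pull every <loc> URL out of a sitemap.xml string.
function extractLocs(xml: string): string[] {
    const locs: string[] = []
    const pattern = /<loc>([^<]+)<\/loc>/g
    let match: RegExpExecArray | null
    while ((match = pattern.exec(xml)) !== null) {
        locs.push(match[1].trim())
    }
    return locs
}

// URLs present in the "before" sitemap but missing from the "after" one, sorted.
function removedUrls(before: string, after: string): string[] {
    const kept = new Set(extractLocs(after))
    return extractLocs(before)
        .filter((url) => !kept.has(url))
        .sort()
}

// Made-up sample data standing in for the two generated sitemaps.
const before = `<urlset>
<url><loc>http://localhost:3030/energy</loc></url>
<url><loc>http://localhost:3030/untitled-reusable-block-7</loc></url>
<url><loc>http://localhost:3030/co2-country-profile</loc></url>
</urlset>`
const after = `<urlset>
<url><loc>http://localhost:3030/energy</loc></url>
</urlset>`

const removed = removedUrls(before, after)
console.log(removed)
```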

@ikesau ikesau requested a review from mlbrgl October 24, 2023 19:53
baker/sitemap.ts Outdated
where isGdocPublished = TRUE`)
const alreadyPublishedViaGdocsSlugsSet = new Set(
alreadyPublishedViaGdocsSlugs.map((row: any) => row.slug)
)
Member

Did you mean this?

Suggested change
)
alreadyPublishedViaGdocsSlugs[0].map((row: any) => row.slug)

I'm not too familiar with knex raw queries, but it seems we're getting an array out of it, so alreadyPublishedViaGdocsSlugs contains an array of rows and an array of fields. Without this change, no slugs make it into the set, and they end up as duplicates in the sitemap, e.g. diet-affordability (which on my local has been ported over (isGdocPublished = 1)).
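
To illustrate the point, here is a minimal sketch that assumes the raw query result is a `[rows, fields]` tuple, as described above. `fakeRawQuery` is a hypothetical stand-in for the real query, and its rows are made-up sample data.

```typescript
type SlugRow = { slug: string }
// Assumed result shape: a tuple of [rows, fields].
type RawResult = [SlugRow[], unknown[]]

// Hypothetical stand-in for the raw query, returning sample rows.
function fakeRawQuery(): RawResult {
    return [[{ slug: "diet-affordability" }, { slug: "some-other-article" }], []]
}

// Mapping over the tuple itself: neither element has a .slug property,
// so the set only ever contains undefined.
function buildSlugSetBuggy(): Set<unknown> {
    const result = fakeRawQuery()
    return new Set(result.map((row: any) => row.slug))
}

// Taking the first element (the rows) before mapping collects the slugs.
function buildSlugSet(): Set<string> {
    const [rows] = fakeRawQuery()
    return new Set(rows.map((row) => row.slug))
}

const buggySet = buildSlugSetBuggy()
const fixedSet = buildSlugSet()
console.log(buggySet.has("diet-affordability"), fixedSet.has("diet-affordability"))
```

With the buggy version, a slug lookup always misses, which is why an already-ported post would still be emitted from the WordPress side.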

Member Author

Damn! Good catch! I'd just copied the code (it was on the cusp of my DRY threshold) without checking it. My local didn't have any published Gdocs successors when I diffed the sitemaps so I didn't notice it wasn't working.

I've extracted the function and fixed it! Thanks 🙂

@mlbrgl mlbrgl left a comment

Nice reuse of the current APIs!

)
const postsApi = await wpdb.getPosts(
undefined,
(post) => !publishedGdocsBySlug[`/${post.slug}`]
Member

I noticed an existing bug related to this that you might want to address here: #2864
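
For context, the slug-keyed predicate in the snippet under review can be sketched in isolation as follows. This assumes only what the snippet shows, namely that the lookup is keyed by the slug with a leading slash; the `Post` type, the sample posts, and the lookup contents are hypothetical.

```typescript
// Hypothetical minimal post shape; the real type has more fields.
type Post = { slug: string; title: string }

// Hypothetical lookup of published Gdocs, keyed by "/slug".
const publishedGdocsBySlug: Record<string, boolean> = {
    "/diet-affordability": true,
}

// Made-up sample posts.
const posts: Post[] = [
    { slug: "diet-affordability", title: "Diet affordability" },
    { slug: "energy", title: "Energy" },
]

// Keep only posts that have no published Gdoc under the same slug.
const hasNoPublishedGdoc = (post: Post): boolean =>
    !publishedGdocsBySlug[`/${post.slug}`]

const sitemapPosts = posts.filter(hasNoPublishedGdoc)
console.log(sitemapPosts.map((post) => post.slug))
```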

@ikesau ikesau force-pushed the sitemap-filtering branch from 9bf2c09 to 5a16038 Compare October 30, 2023 14:19
@ikesau ikesau merged commit f3d12b0 into master Nov 1, 2023
13 checks passed
@ikesau ikesau deleted the sitemap-filtering branch November 1, 2023 14:39
Development

Successfully merging this pull request may close these issues.

Filter out non-editorial content from the sitemap