Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SitemapBridge] Add SitemapBridge #3602

Merged
merged 2 commits into from
Aug 8, 2023
Merged

[SitemapBridge] Add SitemapBridge #3602

merged 2 commits into from
Aug 8, 2023

Conversation

ORelio
Copy link
Contributor

@ORelio ORelio commented Aug 8, 2023

This bridge is a variant of CssSelectorBridge (Requires CssSelectorBridge to be installed).
Instead of retrieving article list from home page HTML, retrieves article list from SEO sitemap.xml.
The bridge expands articles with user-provided content selector using the same code from CssSelectorBridge.

This bridge is a variant of CssSelectorBridge.
Instead of retrieving article list from home page,
retrieves article list from SEO sitemap.xml.
Requires CssSelectorBridge to be installed.
@github-actions
Copy link

github-actions bot commented Aug 8, 2023

Pull request artifacts

file last change
SitemapBridge-pr-context1 2023-08-08, 12:57:34

@dvikan dvikan merged commit b86ee57 into RSS-Bridge:master Aug 8, 2023
7 checks passed
@dvikan
Copy link
Contributor

dvikan commented Aug 8, 2023

like it.

would prefer not using inheritance though. i'd prefer straight up duplicating entire blocks of code instead of using inheritance here

$sitemap_xml = $this->getSitemapXml($sitemap_url, !empty($site_map));
$links = $this->sitemapXmlToList($sitemap_xml, $url_pattern, empty($limit) ? 10 : $limit);

if (empty($links) && empty(sitemapXmlToList($sitemap_xml))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid using empty()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code treats the case where the set of URLs to expand is empty:

  • If the set of URLs is not empty when reprocessing without URL pattern, then assume the feed is valid (but empty)
  • Otherwise, treat as an error and show a message with sitemap URL to help troubleshooting

Is there anything wrong with that?
Maybe I should add a comment if that check was unclear?
Or should I avoid the built-in function empty() itself?

* @param string $is_site_map TRUE if the specified URL points directly to the sitemap XML
* @return object Sitemap DOM (from parsed XML)
*/
protected function getSitemapXml(&$url, $is_site_map = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont pass reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference here is for updating the $sitemap_url in the caller context.
Before the call, $sitemap_url is e.g. https://example.com/
After the call, $sitemap_url is e.g. https://example.com/seo/sitemap.xml
I did this to show a more meaningful error message here (line 75):

returnClientError('Could not retrieve URLs with Timestamps from Sitemap: ' . $sitemap_url);

I could remove the reference, and instead would either not update the variable (less meaningful error message), or return it as part of return value (less practical/less readable for other maintainers). What do you think?

* @param string $is_site_map TRUE if the specified URL points directly to the sitemap XML
* @return object Sitemap DOM (from parsed XML)
*/
protected function getSitemapXml(&$url, $is_site_map = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use type hint bool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will do in next PR.

{
if (!$is_site_map) {
$robots_txt = getSimpleHTMLDOM(urljoin($url, '/robots.txt'))->outertext;
preg_match('/Sitemap: ([^ ]+)/', $robots_txt, $matches);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u can check return value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will do.

@Bockiii
Copy link
Contributor

Bockiii commented Aug 8, 2023

You could have used an actual example page for the examplevalue. this way the check would fail.

@ORelio
Copy link
Contributor Author

ORelio commented Aug 8, 2023

@dvikan

like it.

Thanks for reviewing my pull request so fast 🙂

would prefer not using inheritance though. i'd prefer straight up duplicating entire blocks of code instead of using inheritance here

I was told the exact opposite for #1694. Besides, copying and pasting code makes it harder to maintain if needing to fix something since each occurrence of the copied code will need to be fixed/updated individually.

If you prefer to have fully independent bridges, I could look into moving the common code into libs such html.php so that both CssSelectorBridge and SitemapBridge reference the common code without referencing each other. Would that seems good to you?

(I'll respond to code review items individually, but will need to submit a new Pull Request to address them because this one was merged)

@Bockiii

Will look into providing a good example site for automated test.
(Of course if they remove or change their sitemap, this may break the automated test).

@dvikan
Copy link
Contributor

dvikan commented Aug 9, 2023

you dont have to do anything at this point. im only giving helpful tips

mruac pushed a commit to mruac/mruac-rss-bridge that referenced this pull request Aug 15, 2023
* [SitemapBridge] Add SitemapBridge

This bridge is a variant of CssSelectorBridge.
Instead of retrieving article list from home page,
retrieves article list from SEO sitemap.xml.
Requires CssSelectorBridge to be installed.

* [SitemapBridge] Code linting
@mdemoss
Copy link
Contributor

mdemoss commented Aug 24, 2023

Might be good to have a look at https://blog.feistel.party/2022/05/20/sitemaps-and-meta-tags-may-substitute-for-rss.html which has examples (from NHK news) of multiple sitemaps in a robots.txt, a sitemapindex, and sitemap-news. Is CssSelectorBridge flexible enough to grab article info from the meta tags?

Another possibly unusual and interesting example to have a look at is https://neocities.org/robots.txt which points to a gzip'ed sitemapindex which further points to a large number of gzip'ed sitemaps. I don't know whether that's common. I suspect not.

@ORelio
Copy link
Contributor Author

ORelio commented Aug 25, 2023

Extracting metadata from tags intented for social network embeds is a good idea!
I'll also look into shema.org properties like author or dateCreated.

Edit: SitemapBridge already supports nested sitemaps 🙂
Not sure about gziped ones, since gzip should be supported as a http compression method already.

@mdemoss
Copy link
Contributor

mdemoss commented Aug 26, 2023

you may compress your Sitemap files using gzip

Is the wording at https://www.sitemaps.org/protocol.html I do think they mean files served as application/gzip rather than the HTTP content encoding. Probably isn't too common. There are some other unusual things in the spec.

dvikan pushed a commit that referenced this pull request Sep 24, 2023
…3687) (#3706)

* [CssSelectorBridge] Metadata from social embed (#3602, #3687)

Implement the following metadata sources:
 - Facebook Open Graph
 - Twitter <meta> tags
 - Standard <meta> tags
 - JSON linked data (ld+json)

The following metadata is supported:
 - Canonical URL (may help removing garbage from URLs)
 - Article title
 - Truncated summary
 - Published/Updated timestamp
 - Enclosure/Thumbnail image
 - Author Name or Twitter handle

SitemapBridge will also automatically benefit from this commit.

* [php8backports] Add array_is_list()

Needed this function for ld+json implementation in CssSelectorBridge.

* [SitemapBridge] Add option to discard thumbnail

* [CssSelectorBridge] Fix linting issues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants