[SitemapBridge] Add SitemapBridge #3602

ORelio · 2023-08-08T12:49:04Z

This bridge is a variant of CssSelectorBridge (requires CssSelectorBridge to be installed).
Instead of retrieving the article list from the home page HTML, it retrieves the article list from the site's SEO sitemap.xml.
The bridge expands articles with a user-provided content selector, using the same code as CssSelectorBridge.


github-actions · 2023-08-08T12:50:27Z

Pull request artifacts

file	last change
`SitemapBridge-pr-context1`	2023-08-08, 12:57:34

dvikan · 2023-08-08T13:02:41Z

like it.

would prefer not using inheritance though. i'd prefer straight up duplicating entire blocks of code instead of using inheritance here

dvikan · 2023-08-08T13:03:17Z

bridges/SitemapBridge.php

+        $sitemap_xml = $this->getSitemapXml($sitemap_url, !empty($site_map));
+        $links = $this->sitemapXmlToList($sitemap_xml, $url_pattern, empty($limit) ? 10 : $limit);
+
+        if (empty($links) && empty($this->sitemapXmlToList($sitemap_xml))) {


avoid using empty()

This code handles the case where the set of URLs to expand is empty:

- If the set of URLs is not empty when reprocessing without the URL pattern, assume the feed is valid (but empty).
- Otherwise, treat it as an error and show a message with the sitemap URL to help troubleshooting.

Is there anything wrong with that?
Maybe I should add a comment if that check was unclear?
Or should I avoid the built-in function empty() itself?
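
For illustration, the two-step check described above could be written with explicit counts instead of empty(). This is a minimal sketch, not the code in the PR: `classifyLinks()` and its three return values are hypothetical names chosen for the example.

```php
<?php
// Hypothetical helper, not part of SitemapBridge: decide what an empty
// filtered list means by also looking at the unfiltered list.
function classifyLinks(array $filtered_links, array $all_links): string
{
    if (count($filtered_links) > 0) {
        return 'ok';         // there are articles to expand
    }
    if (count($all_links) > 0) {
        return 'empty-feed'; // sitemap is usable, the URL pattern matched nothing
    }
    return 'error';          // nothing usable: report the sitemap URL to the user
}
```

Comparing `count()` against zero makes the intent explicit and avoids the loose truthiness rules of empty(), which also treats values such as `'0'` or `null` as empty.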

dvikan · 2023-08-08T13:04:04Z

bridges/SitemapBridge.php

+     * @param string $is_site_map TRUE if the specified URL points directly to the sitemap XML
+     * @return object Sitemap DOM (from parsed XML)
+     */
+    protected function getSitemapXml(&$url, $is_site_map = false)


don't pass by reference

The reference here is for updating the $sitemap_url in the caller context.
Before the call, $sitemap_url is e.g. https://example.com/
After the call, $sitemap_url is e.g. https://example.com/seo/sitemap.xml
I did this to show a more meaningful error message here (line 75):

returnClientError('Could not retrieve URLs with Timestamps from Sitemap: ' . $sitemap_url);

I could remove the reference and instead either not update the variable (less meaningful error message) or return it as part of the return value (less practical/less readable for other maintainers). What do you think?
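
The "return it as part of the return value" option can be sketched as follows. This is an assumption-laden illustration rather than the bridge's code: `resolveSitemap()` and the `$fetch` callable are hypothetical, and the `/sitemap.xml` guess stands in for the robots.txt lookup the PR performs.

```php
<?php
// Hypothetical sketch: return the resolved sitemap URL together with the
// fetched body instead of writing it back through a reference.
// $fetch is an assumed callable standing in for the real HTTP request.
function resolveSitemap(string $url, bool $is_site_map, callable $fetch): array
{
    // Assumption for illustration only: guess the conventional location when
    // $url is the home page. The PR looks the location up in robots.txt.
    $sitemap_url = $is_site_map ? $url : rtrim($url, '/') . '/sitemap.xml';

    return [$sitemap_url, $fetch($sitemap_url)];
}

// The caller still receives the updated URL for its error message.
[$sitemap_url, $sitemap_xml] = resolveSitemap(
    'https://example.com/',
    false,
    fn (string $u): string => '<urlset/>' // stub fetcher for the example
);
```

With array destructuring at the call site, the caller reads almost the same as the by-reference version, and the `bool` type hint requested in the review fits naturally into the signature.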

dvikan · 2023-08-08T13:04:14Z

bridges/SitemapBridge.php

+     * @param string $is_site_map TRUE if the specified URL points directly to the sitemap XML
+     * @return object Sitemap DOM (from parsed XML)
+     */
+    protected function getSitemapXml(&$url, $is_site_map = false)


use type hint bool

OK, will do in next PR.

dvikan · 2023-08-08T13:04:35Z

bridges/SitemapBridge.php

+    {
+        if (!$is_site_map) {
+            $robots_txt = getSimpleHTMLDOM(urljoin($url, '/robots.txt'))->outertext;
+            preg_match('/Sitemap: ([^ ]+)/', $robots_txt, $matches);


u can check return value

OK, will do.
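
A minimal sketch of what checking the return value could look like, assuming the robots.txt body is already available as a string. `findSitemapUrl()` is a hypothetical helper, and the pattern is a variation written for this example, not the one from the PR.

```php
<?php
// Hypothetical helper: return the first Sitemap URL declared in a robots.txt
// body, or null when no Sitemap line is present.
function findSitemapUrl(string $robots_txt): ?string
{
    // preg_match() returns 1 on a match, 0 on no match and false on error,
    // so anything other than 1 means there is no usable capture group.
    $result = preg_match('/^Sitemap:[ \t]*(\S+)/mi', $robots_txt, $matches);
    if ($result !== 1) {
        return null;
    }
    return $matches[1];
}
```

Returning `null` lets the caller decide whether to fall back to another location or show an error, instead of reading `$matches[1]` when it was never set.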

Bockiii · 2023-08-08T15:19:24Z

You could have used an actual example page for the examplevalue. this way the check would fail.

ORelio · 2023-08-08T21:48:47Z

@dvikan

> like it.

Thanks for reviewing my pull request so fast 🙂

> would prefer not using inheritance though. i'd prefer straight up duplicating entire blocks of code instead of using inheritance here

I was told the exact opposite for #1694. Besides, copying and pasting code makes it harder to maintain if needing to fix something since each occurrence of the copied code will need to be fixed/updated individually.

If you prefer to have fully independent bridges, I could look into moving the common code into libs such as html.php so that both CssSelectorBridge and SitemapBridge reference the common code without referencing each other. Would that seem good to you?

(I'll respond to code review items individually, but will need to submit a new Pull Request to address them because this one was merged)

@Bockiii

Will look into providing a good example site for automated test.
(Of course if they remove or change their sitemap, this may break the automated test).

dvikan · 2023-08-09T06:11:10Z

you don't have to do anything at this point. I'm only giving helpful tips


mdemoss · 2023-08-24T00:21:12Z

Might be good to have a look at https://blog.feistel.party/2022/05/20/sitemaps-and-meta-tags-may-substitute-for-rss.html which has examples (from NHK news) of multiple sitemaps in a robots.txt, a sitemapindex, and sitemap-news. Is CssSelectorBridge flexible enough to grab article info from the meta tags?

Another possibly unusual and interesting example to have a look at is https://neocities.org/robots.txt which points to a gzip'ed sitemapindex which further points to a large number of gzip'ed sitemaps. I don't know whether that's common. I suspect not.
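
To make the sitemapindex case concrete, here is a small sketch that tells a sitemap index apart from a regular URL set and lists the `<loc>` entries of either. It uses PHP's DOMDocument for self-containment; `readSitemap()` is a hypothetical name and this is not how SitemapBridge is implemented.

```php
<?php
// Hypothetical helper: report the root element name ('sitemapindex' or
// 'urlset') and every <loc> value found in a sitemap document.
function readSitemap(string $xml): array
{
    $doc = new DOMDocument();
    // Suppress parser warnings; loadXML() returns false on malformed input.
    if (!@$doc->loadXML($xml)) {
        return ['type' => 'invalid', 'locs' => []];
    }

    $locs = [];
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $locs[] = trim($loc->textContent);
    }

    return ['type' => $doc->documentElement->localName, 'locs' => $locs];
}
```

When the type is `sitemapindex`, each entry points at another sitemap that has to be fetched and parsed in turn; when it is `urlset`, the entries are the article URLs themselves.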

ORelio · 2023-08-25T17:58:15Z

Extracting metadata from tags intended for social network embeds is a good idea!
I'll also look into schema.org properties like author or dateCreated.
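
As a rough illustration of the idea, the sketch below collects Open Graph properties from a page's `<meta>` tags. It is a minimal example under assumptions: it uses PHP's DOMDocument instead of the HTML parser the bridges use, and `extractOpenGraph()` is a hypothetical name.

```php
<?php
// Hypothetical helper: collect Open Graph properties (og:title, og:image, ...)
// from an HTML page, keyed by the property name without the "og:" prefix.
function extractOpenGraph(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; suppress the parser warnings.
    @$doc->loadHTML($html);

    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $property = $tag->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            $meta[substr($property, 3)] = $tag->getAttribute('content');
        }
    }
    return $meta;
}
```

The same loop could be extended to other sources, for example `<meta name="twitter:...">` tags, by checking the `name` attribute as well.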

Edit: SitemapBridge already supports nested sitemaps 🙂
Not sure about gzipped ones, since gzip should be supported as an HTTP compression method already.

mdemoss · 2023-08-26T04:47:53Z

> you may compress your Sitemap files using gzip

is the wording at https://www.sitemaps.org/protocol.html. I do think they mean files served as application/gzip rather than the HTTP content encoding. That probably isn't too common. There are some other unusual things in the spec.
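
If sitemaps served as gzip files do need handling, one possible approach is to look at the first bytes of the response body. This is only a sketch, assuming PHP's zlib extension is available; `maybeGunzip()` is a hypothetical helper and not part of RSS-Bridge.

```php
<?php
// Hypothetical helper: inflate the body when it is itself a gzip file,
// otherwise return it untouched. This is separate from HTTP
// Content-Encoding, which the HTTP client normally undoes on its own.
function maybeGunzip(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = @gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}
```

Checking the magic bytes rather than the URL suffix or the Content-Type header means a `.xml.gz` sitemap is handled the same way regardless of how the server labels it.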

…3687) (#3706)

* [CssSelectorBridge] Metadata from social embed (#3602, #3687)

Implement the following metadata sources:
- Facebook Open Graph
- Twitter <meta> tags
- Standard <meta> tags
- JSON linked data (ld+json)

The following metadata is supported:
- Canonical URL (may help removing garbage from URLs)
- Article title
- Truncated summary
- Published/Updated timestamp
- Enclosure/Thumbnail image
- Author Name or Twitter handle

SitemapBridge will also automatically benefit from this commit.

* [php8backports] Add array_is_list()

Needed this function for ld+json implementation in CssSelectorBridge.

* [SitemapBridge] Add option to discard thumbnail

* [CssSelectorBridge] Fix linting issues

[SitemapBridge] Add SitemapBridge

3960db5

This bridge is a variant of CssSelectorBridge. Instead of retrieving article list from home page, retrieves article list from SEO sitemap.xml. Requires CssSelectorBridge to be installed.

[SitemapBridge] Code linting

23c62b7

dvikan merged commit b86ee57 into RSS-Bridge:master Aug 8, 2023
7 checks passed

dvikan reviewed Aug 8, 2023


ORelio mentioned this pull request Sep 24, 2023

How to prevent duplicate adding posts using CSS Selector Bridge? #3687

Closed
