-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SitemapBridge] Add SitemapBridge #3602
Conversation
This bridge is a variant of CssSelectorBridge. Instead of retrieving article list from home page, retrieves article list from SEO sitemap.xml. Requires CssSelectorBridge to be installed.
Pull request artifacts
|
like it. would prefer not using inheritance though. i'd prefer straight up duplicating entire blocks of code instead of using inheritance here |
$sitemap_xml = $this->getSitemapXml($sitemap_url, !empty($site_map)); | ||
$links = $this->sitemapXmlToList($sitemap_xml, $url_pattern, empty($limit) ? 10 : $limit); | ||
|
||
if (empty($links) && empty(sitemapXmlToList($sitemap_xml))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid using empty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code treats the case where the set of URLs to expand is empty:
- If the set of URLs is not empty when reprocessing without URL pattern, then assume the feed is valid (but empty)
- Otherwise, treat as an error and show a message with sitemap URL to help troubleshooting
Is there anything wrong with that?
Maybe I should add a comment if that check was unclear?
Or should I avoid the built-in function empty()
itself?
* @param string $is_site_map TRUE if the specified URL points directly to the sitemap XML | ||
* @return object Sitemap DOM (from parsed XML) | ||
*/ | ||
protected function getSitemapXml(&$url, $is_site_map = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont pass reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference here is for updating the $sitemap_url
in the caller context.
Before the call, $sitemap_url
is e.g. https://example.com/
After the call, $sitemap_url
is e.g. https://example.com/seo/sitemap.xml
I did this to show a more meaningful error message here (line 75):
returnClientError('Could not retrieve URLs with Timestamps from Sitemap: ' . $sitemap_url);
I could remove the reference, and instead would either not update the variable (less meaningful error message), or return it as part of return value (less practical/less readable for other maintainers). What do you think?
* @param string $is_site_map TRUE if the specified URL points directly to the sitemap XML | ||
* @return object Sitemap DOM (from parsed XML) | ||
*/ | ||
protected function getSitemapXml(&$url, $is_site_map = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use type hint bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will do in next PR.
{ | ||
if (!$is_site_map) { | ||
$robots_txt = getSimpleHTMLDOM(urljoin($url, '/robots.txt'))->outertext; | ||
preg_match('/Sitemap: ([^ ]+)/', $robots_txt, $matches); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u can check return value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will do.
You could have used an actual example page for the examplevalue. this way the check would fail. |
Thanks for reviewing my pull request so fast 🙂
I was told the exact opposite for #1694. Besides, copying and pasting code makes it harder to maintain if needing to fix something since each occurrence of the copied code will need to be fixed/updated individually. If you prefer to have fully independent bridges, I could look into moving the common code into libs such html.php so that both (I'll respond to code review items individually, but will need to submit a new Pull Request to address them because this one was merged) Will look into providing a good example site for automated test. |
you dont have to do anything at this point. im only giving helpful tips |
* [SitemapBridge] Add SitemapBridge This bridge is a variant of CssSelectorBridge. Instead of retrieving article list from home page, retrieves article list from SEO sitemap.xml. Requires CssSelectorBridge to be installed. * [SitemapBridge] Code linting
Might be good to have a look at https://blog.feistel.party/2022/05/20/sitemaps-and-meta-tags-may-substitute-for-rss.html which has examples (from NHK news) of multiple sitemaps in a robots.txt, a sitemapindex, and sitemap-news. Is CssSelectorBridge flexible enough to grab article info from the meta tags? Another possibly unusual and interesting example to have a look at is https://neocities.org/robots.txt which points to a gzip'ed sitemapindex which further points to a large number of gzip'ed sitemaps. I don't know whether that's common. I suspect not. |
Extracting metadata from tags intented for social network embeds is a good idea! Edit: SitemapBridge already supports nested sitemaps 🙂 |
Is the wording at https://www.sitemaps.org/protocol.html I do think they mean files served as application/gzip rather than the HTTP content encoding. Probably isn't too common. There are some other unusual things in the spec. |
…3687) (#3706) * [CssSelectorBridge] Metadata from social embed (#3602, #3687) Implement the following metadata sources: - Facebook Open Graph - Twitter <meta> tags - Standard <meta> tags - JSON linked data (ld+json) The following metadata is supported: - Canonical URL (may help removing garbage from URLs) - Article title - Truncated summary - Published/Updated timestamp - Enclosure/Thumbnail image - Author Name or Twitter handle SitemapBridge will also automatically benefit from this commit. * [php8backports] Add array_is_list() Needed this function for ld+json implementation in CssSelectorBridge. * [SitemapBridge] Add option to discard thumbnail * [CssSelectorBridge] Fix linting issues
This bridge is a variant of CssSelectorBridge (Requires CssSelectorBridge to be installed).
Instead of retrieving article list from home page HTML, retrieves article list from SEO sitemap.xml.
The bridge expands articles with user-provided content selector using the same code from CssSelectorBridge.