-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML API: Add CSS selector support #7857
base: trunk
Are you sure you want to change the base?
HTML API: Add CSS selector support #7857
Conversation
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
184ad41
to
8d9d9e0
Compare
* - The following combinators: | ||
* - descendant (e.g. `.parent .descendant`) | ||
* - child (`.parent > .child`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will be supported initially. Maybe no combinators at all, only simple selectors.
These two combinators can probably be supported by the HTML API as long as they non-final selector is an element selector:
div > .className
✅ supported.className > div
⛔️ not supported
Tags can be "seen" via bookmarks, while things like IDs, classes, attributes etc. would require seeking or more advanced internal tracking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the problem that we'd need to keep track of all the attributes requested by the selector in the entire breadcrumbs trail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. Either breadcrumbs would need to store all of their attributes or we'd need to seek around the document to check them which would be very costly.
All of the supported selectors are things that can be checked from given tag in the document without any seeking as the HTML API is implemented now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we knew the selector upfront, we could always store the attributes needed to match it, e.g. store the class
attribute if we know we'll only ever look for div > .className
. Only that wouldn't be practical.
However, keeping track of classnames in all the breadcrumbs elements seems okay-ish resource-wise.
Perhaps we could even have a "CSS-enabled" mode where we'd keep track of all the breadcrumbs attributes shorter than, say, 100 bytes? It wouldn't be perfect but should suffice for most attribute-based matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be ways to support more selectors, one good idea is to set some state to track data for certain selectors as the document is traversed.
One tricky thing with that implementation is that it may be necessary to return to the start of the document depending on the selectors used and scan again from there so that the necessary data is collected during traversal. This probably involves analyzing the selectors, deciding what data needs to be stored, seeking to the beginning if necessary, seeking back to the current location, and then trying to find the selector.
The complexity here is significant. It can be overcome, but I think the set of selectors supported in this PR as it is now is a good starting place with very limited matching complexity. Iteration to improve support can be added in future PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Everything I said I meant for the future, no need to hold up this work :)
On seeking - sounds reasonable, although it won't combine well with streaming. What would an example of a selector that requires backtracking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would an example of a selector that requires backtracking?
This is mostly necessary to support selecting from arbitrary positions in the document. If we knew we were at the start of the document, we could determine what additional state needed to be tracked and start collecting it.
- The sibling combinators
+
and~
select based on preceding elements, so it would be necessary to back up to check for matches. - Support for subclass selectors (ID, class, and attribute) in non-final position, like
#my-id div
would require backing up to parse some attributes along with breadcrumbs. - If pseudo-class selectors were added, I'd expect many of them to require more understanding of the document structure.
370539b
to
9ecaab5
Compare
8659898
to
e5e4c7e
Compare
e5e4c7e
to
b7e032e
Compare
Experimental support for selectors in WordPress/wordpress-develop#7857
* @param string[] $breadcrumbs Breadcrumbs. | ||
* @return bool True if a match is found, otherwise false. | ||
*/ | ||
private function explore_matches( array $selectors, array $breadcrumbs ): bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a recursive function. It's bounded by the length of the breadcrumbs list and the depth of the complex selector, but it may be good to set some limits here.
I'm not sure how useful it would be, but I think it would be interesting to add namespaces to type selectors. We could declare 4 namespaces to support:
|
I've changed the class structures so that all the selectors extend an abstract base class that provides some constants and utilities for parsing selectors. Each selector class now defines how it is parsed to construct an instance of itself. Most of the classes expose properties that should be private and only used in their internal match methods. The properties are inspected in tests now, but this should be changed. |
dd6d8cd
to
33b8333
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some leftover thoughts we can get back to later, or in person
* | ||
* [attr|=value] | ||
*/ | ||
const MATCH_EXACT_OR_HYPHEN_PREFIXED = 'exact-or-hyphen-prefixed'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this reads like the hyphen comes first, but in the CSS selector, it specifically connotes that a hyphen follows the match. would HYPHEN_SUFFIXED
be more accurate?
|
||
if ( null === $this->value ) { | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here it reads as if null
is a sentinel value for “as long as the attribute exists” but without reading the code I would have assumed passing null
would mean “only match if the attribute doesn’t exist on the tag”
what do you think about this?
* @return bool True if the processor's current position matches the selector. | ||
*/ | ||
public function matches( WP_HTML_Tag_Processor $processor ): bool { | ||
$att_value = $processor->get_attribute( $this->name ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a critical point, but we have largely used attribute
or attr
in the HTML API and in Gutenberg code when dealing with attributes. Do we need to save bits here?
return true; | ||
} | ||
} | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while this seems fine for now, I suspect that at some point we will prefer to crawl through the attribute value comparing as we go with substr_compare()
rather than allocate each substring. we can remember that in some documents we see DoS attacks due to long attribute values.
$this->whitespace_delimited_list()
could also return a starting byte offset into the value instead of the substring, which would resolve this without changing much of the calling interface.
} | ||
|
||
$starts_with = "{$this->value}-"; | ||
return 0 === substr_compare( $att_value, $starts_with, 0, strlen( $starts_with ), $case_insensitive ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like this whole thing could be collapsed into a single call to substr_compare()
with a final test of the following character…
$exact_length = strlen( $this->value );
$matches_prefix = substr_compare( $att_value, $this->value, 0, $exact_length, $case_insensitive );
return (
0 === $matches_prefix &&
( strlen( $att_value ) === $exact_length || '-' === $att_value[ $exact_length ] )
);
* will be updated if the parse is successful. | ||
* @return static|null The selector instance, or null if the parse was unsuccessful. | ||
*/ | ||
public static function parse( string $input, int &$offset ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not immediately a fan of mutating the offset passed into the function. did you consider some other examples of passing in something like &$bytes_parsed
? for example, the HTML decoder does this which leaves the input variables untouched while still communicating &$match_byte_length
} | ||
++$updated_offset; | ||
|
||
self::parse_whitespace( $input, $updated_offset ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above: would be nice if instead of parse_whitespace
we had this function return the length of the whitespace bytes and then this line could use the same variable names as the HTML API with
$at += self::skip_whitespace( $input, $at );
$attr_matcher = WP_CSS_Attribute_Selector::MATCH_SUFFIXED_BY; | ||
$updated_offset += 2; | ||
break; | ||
case '*': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when I explored this long ago, I actually felt like the symbols in use in CSS provided reasonable literal values in the code vs. the use of const
s. it didn’t have the same explanatory power in the code, but it provided a more direct correspondence with the CSS, which was nice, plus involved fewer translations/CPU cycles.
* @param string $selector_string Selector string. | ||
* @return Generator<void> A generator pausing on each tag matching the selector. | ||
*/ | ||
public function select_all( $selector_string ): Generator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure this is necessary because the interface of select()
would provide the same interface as this, but without the dummy $_
variable, would it not?
while ( $processor->select( 'meta[property^="og:" i]' ) ) {
…
}
except here we also have a generator and non-true/false return values. is there a reason to add a second almost-identical interface to the main loop?
I ask because I think it’s important that developers not have too many ways to do the same thing, and we’ve already tried to communicate the main interface to the HTML API as being this new "loop" in a while ( $processor->next_token() )
or while ( $processor->next_tag() )
construct.
This is not ready for final review but is ready for early feedback, especially around the open questions listed below.
Introduce CSS selector based traversal of HTML documents in the HTML API. Add new
select_all
andselect
methods to the Tag Processor and HTML Processor.A subset of the CSS selector grammar is available as described here:
wordpress-develop/src/wp-includes/html-api/class-wp-css-compound-selector-list.php
Lines 24 to 39 in 355c9a2
Notable variations from selectors specification:
Pseudo-element selectors are not supported. Pseudo elements will not exist in the HTML and it's unclear what benefit they would bring.
Pseudo-class selectors are not supported. Pseudo classes could be useful, but the logic to parse and match pseudo-class selectors would add significant complexity. There's also a lot of variety in pseudo-selectors, and rather than supporting simpler selectors (e.g.
:empty
) and not supporting more complex selectors (e.g.:nth-child()
), pseudo-class selectors are completely unsupported. This is a clear and simple rule that greatly simplifies the implementation.Complex selectors have limited support. Complex selectors are combined selectors using one of the combinators:
(whitespace),
>
,+
, or~
. The Tag Processor does not support complex selectors at all, it has no concept of document structure, and all complex selectors are structural. The HTML Processor does not support the sibling combinators+
or~
, it only supports a parent/child or ancestor/descendant relationship. These selectors can be handled without tracking additional state in the document by analyzing breadcrumbs. Finally, only type selectors are allowed in non-final position, again because this allows matching against breadcrumbs without tracking additional state:body heading > h1.page-title[attribute]
#page > main
,.page main
ul li ~ li
,ul li + li
.Importantly, the selectors supported by the HTML Processor should be sufficient to support all core block attribute selectors according to this PR:
wordpress-develop/src/wp-includes/html-api/class-wp-html-attribute-sourcer.php
Lines 18 to 49 in 88e7d30
Most of the listed selectors are also supported by the Tag Processor with the exception of the complex selectors:
Open questions
Implementation
The implementation introduces a number of classes. The classes roughly correspond to different parts of the selector grammar. Parsing is handled in the
_list
classes that represent the top of the grammar. Matching logic is handled by each selector class. The selector classes implement amatches
interface.I'm not happy with how the implementation is distributed. I'd like to move in one of two directions:Move the parsing logic for selectors into the selector classes, so each selector class is responsible for parsing itself.
I've implemented this change.
ORMove matching logic into the top level*_list
classes to that the lower level selectors are only data containers.My preference at this time (and the original implementation) would be the former.Selector traversal APIs
select_all
is implemented as a generator and (except_doing_it_wrong
) has no way to differentiate between "nothing matched" and "the selector was invalid or unsupported". It simply yields nothing in both cases.select
usesselect_all
internally and has the same limitation.select_all
expects the document position to remain the same in order to visit all matching tags. Because none of the supported selectors rely on stateful logic, this is not an issue at this time.Todo:
sirreal/html-api-debugger#5 can be used for testing the parsing and matching behavior.
Trac ticket: https://core.trac.wordpress.org/ticket/62653
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.