HTML API: Add CSS selector support #7857

Open
wants to merge 134 commits into
base: trunk
134 commits
0e8c4fb
WIP class skeleton
sirreal Nov 19, 2024
2d3d283
Document class
sirreal Nov 20, 2024
40222d3
Do not support namespaced selectors
sirreal Nov 21, 2024
6092642
Flesh out stuff
sirreal Nov 22, 2024
3e3b2b2
Starting to actually parse
sirreal Nov 22, 2024
967557f
Add ident tests
sirreal Nov 22, 2024
2ec1db3
Fix ident non-ascii bug
sirreal Nov 22, 2024
ee2c7ce
Use class after defined
sirreal Nov 22, 2024
0f708ba
Fix some char stuff
sirreal Nov 22, 2024
3cb455d
Improve tests
sirreal Nov 22, 2024
5609e50
Housekeeping
sirreal Nov 22, 2024
4f25bc2
Require new file in WP
sirreal Nov 22, 2024
943293f
Fix offset type
sirreal Nov 22, 2024
24c9744
Add more tests and invalid tests
sirreal Nov 22, 2024
a7c10b9
Fix wrong offset var usage
sirreal Nov 22, 2024
dd718b7
comment tweak
sirreal Nov 22, 2024
5884aca
Implement codepoint escape with strspn
sirreal Nov 22, 2024
a9a077f
Test with UPPER HEX
sirreal Nov 22, 2024
5f53e0a
Add ID tests
sirreal Nov 25, 2024
effbbbe
Improve tests
sirreal Nov 25, 2024
62ec5bb
Add class selector tests
sirreal Nov 25, 2024
153f009
Add class selector
sirreal Nov 25, 2024
fcc6401
Simplify id selector parse
sirreal Nov 25, 2024
21c67e5
Improve ident tests
sirreal Nov 25, 2024
728d798
Add type selector tests
sirreal Nov 25, 2024
e1e8e09
Add docs and remove unreachable line
sirreal Nov 25, 2024
13ac3c1
Add type selector class
sirreal Nov 25, 2024
a3c25e8
Add attribute selector tests
sirreal Nov 25, 2024
ad5c600
improve attr tests
sirreal Nov 25, 2024
6758704
Fix expectation argument order
sirreal Nov 25, 2024
e97842c
Add test and fix is_ident
sirreal Nov 25, 2024
ef00856
Add parse_string stub
sirreal Nov 26, 2024
463e799
Add attribute selector parsing
sirreal Nov 26, 2024
0f5b28c
Fix test expectations
sirreal Nov 26, 2024
f4a491a
More and improved attribute tests
sirreal Nov 26, 2024
b680b1b
Implement parse_string
sirreal Nov 26, 2024
e7da05f
Add string parse tests
sirreal Nov 26, 2024
d5e7e60
Remove covers annotations
sirreal Nov 26, 2024
08187c6
Remove unused line
sirreal Nov 26, 2024
5a5066c
Improve tests for 100% coverage on parse methods
sirreal Nov 26, 2024
2f8bd19
Improve documentation
sirreal Nov 26, 2024
8b0ac55
Fix parse return type and return annotations
sirreal Nov 26, 2024
dffcac6
Update documentation links and grammar
sirreal Nov 27, 2024
9f81744
Update documentation and class name
sirreal Nov 27, 2024
d4c6f38
Add selector class
sirreal Nov 27, 2024
6432056
Implement complex selector
sirreal Nov 27, 2024
5c746cd
Working and tested
sirreal Nov 27, 2024
501102a
Selector parsing should allow cap I,S modifier
sirreal Nov 28, 2024
f98fbb3
CSS Add matches to selector classes
sirreal Nov 28, 2024
c8f16e1
Match is successful on _any_ match in selector list
sirreal Nov 28, 2024
c689c9c
PICKME: Add is_quirks_mode method to processor
sirreal Nov 28, 2024
1221efa
ID matches depend on quirks mode
sirreal Nov 28, 2024
e5e94b1
has_class may return null, coerce to bool
sirreal Nov 28, 2024
1e888ba
Update docs to only allow subclass selectors in final complex selecto…
sirreal Nov 28, 2024
dd4fcb0
Restrict complex selectors to only allow subclass selectors in final …
sirreal Nov 28, 2024
256c55a
Work on complex selector handling
sirreal Nov 28, 2024
465cc36
Implement descendent selector matching
sirreal Nov 28, 2024
467d45d
Add null check for subclass selectors
sirreal Nov 29, 2024
44bfc64
CSS selector reformat ternaries
sirreal Nov 29, 2024
ca4531c
Implement ~= attribute matching
sirreal Nov 29, 2024
489db93
CSS fix return type
sirreal Nov 29, 2024
e57a211
Fix static analysis problems
sirreal Nov 29, 2024
509e648
Fix and annotate things (static analysis)
sirreal Nov 29, 2024
58c1698
update tests
sirreal Nov 29, 2024
c9b9145
Id attribute must be a string to match id selector
sirreal Nov 29, 2024
e5cac63
Coerce boolean attributes to ""
sirreal Nov 29, 2024
2bafae9
Fix a few more static analysis things
sirreal Nov 29, 2024
8fe57e3
Add select method
sirreal Nov 28, 2024
ab2fe0d
Unify parsing under single class
sirreal Dec 3, 2024
6a6969f
Rename files to align with class name
sirreal Dec 3, 2024
27ca891
Add html processor select test suite
sirreal Dec 3, 2024
9ff2769
Fix select types
sirreal Dec 3, 2024
d1a276b
Update class doc
sirreal Dec 4, 2024
4909b56
Improve select_ method arguments, docs, implementation
sirreal Dec 4, 2024
1d45225
Split classes into their own files
sirreal Dec 4, 2024
0b277b4
Remove redundant see phpdoc annotations
sirreal Dec 4, 2024
0c53c42
Fix docs and return type on select_all
sirreal Dec 4, 2024
d966e9a
Improve html select test docs
sirreal Dec 4, 2024
5201ba9
Add select support to tag processor
sirreal Dec 4, 2024
2036a83
Simplify whitspace splitting function
sirreal Dec 4, 2024
3421a4e
Remove unreachable code
sirreal Dec 4, 2024
784b2d9
Add a lot of selector integration tests
sirreal Dec 4, 2024
4d4c5fe
Extract normalize input method
sirreal Dec 4, 2024
dbc37fc
tests
sirreal Dec 4, 2024
d241f31
Add nonfinal subclass selector test
sirreal Dec 4, 2024
663070b
Fix logic bug in child selector exploration
sirreal Dec 5, 2024
5478af9
Improve selector integration tests
sirreal Dec 5, 2024
4f6bf94
Try abstract class instead of interface
sirreal Dec 5, 2024
fe07dfd
Revert "Try abstract class instead of interface"
sirreal Dec 5, 2024
143e092
Clean up and document attribute selector
sirreal Dec 5, 2024
32ee2a7
Update ticket number in tests
sirreal Dec 5, 2024
5922494
Improve some types
sirreal Dec 5, 2024
e492aa6
Fix and improve string token parsing
sirreal Dec 5, 2024
81c6758
Update attribute selector tests
sirreal Dec 5, 2024
7bccf3e
Revert "Update attribute selector tests"
sirreal Dec 5, 2024
3949cc5
Improve some complex selector match tests
sirreal Dec 5, 2024
c696889
Add and use matches_tag type selector method
sirreal Dec 9, 2024
c193551
Improve complex selector structure
sirreal Dec 9, 2024
9dd8114
Rework structure of complex_selector class
sirreal Dec 9, 2024
b134308
Improve documentation
sirreal Dec 9, 2024
94c06ef
Document complex selector class
sirreal Dec 9, 2024
f46fced
Document matches functions
sirreal Dec 9, 2024
1bacfd7
Simplify condition in compound::matches
sirreal Dec 9, 2024
a274ea0
Change class require order
sirreal Dec 9, 2024
12a0a99
Annotate matches processor argument type
sirreal Dec 9, 2024
0e2b34a
Document class selector and update class_name property
sirreal Dec 9, 2024
dea1029
Document ID selector class, rename id property
sirreal Dec 9, 2024
d268f4c
Document type selector class and rename type property
sirreal Dec 9, 2024
d89fbd9
Document compound selector
sirreal Dec 9, 2024
8ced3aa
Improve attribute selector docs and types
sirreal Dec 9, 2024
ca1a129
Update matches docs
sirreal Dec 9, 2024
71fd62a
Document complex selector class
sirreal Dec 9, 2024
4a3e084
Merge branch 'trunk' into html-api/add-css-selector-parser
sirreal Dec 9, 2024
25dbb19
PHP < 7.4 does not like this annotation
sirreal Dec 9, 2024
70cf7f7
Update since annotations to 6.8.0
sirreal Dec 9, 2024
355c9a2
Update attr-modifier to match selectors grammar
sirreal Dec 9, 2024
7ef67c1
Merge branch 'trunk' into html-api/add-css-selector-parser
sirreal Dec 10, 2024
abb4d25
Merge branch 'trunk' into html-api/add-css-selector-parser
sirreal Dec 11, 2024
9ac05b4
Merge branch 'trunk' into html-api/add-css-selector-parser
sirreal Dec 11, 2024
3206e0b
Move parsing back to selector classes
sirreal Dec 11, 2024
46646b5
Update tests for class parsing
sirreal Dec 11, 2024
f217eb0
Use whitepsace chars constant
sirreal Dec 11, 2024
6154742
parse_whitespace should be protected
sirreal Dec 11, 2024
577b3a3
Update interface to abstract class require
sirreal Dec 11, 2024
5ea93ab
Document base class
sirreal Dec 11, 2024
adfebdf
Invert and comment confusing compound selector condition
sirreal Dec 11, 2024
db469e6
Use switch in compound selector parsing
sirreal Dec 11, 2024
400263a
Fix up some todo-s
sirreal Dec 11, 2024
483a819
Make most selector constructors private
sirreal Dec 11, 2024
1f64168
Fix test class implementation of abstract class
sirreal Dec 11, 2024
3bfb8a1
Remove php 8+ ?static return types
sirreal Dec 11, 2024
8d2aef2
Fix typo in Exception class name
sirreal Dec 11, 2024
33b8333
Remove ?static return type from test
sirreal Dec 11, 2024
d7e840c
Merge branch 'trunk' into html-api/add-css-selector-parser
sirreal Jan 2, 2025
367 changes: 367 additions & 0 deletions src/wp-includes/html-api/class-wp-css-attribute-selector.php
@@ -0,0 +1,367 @@
<?php
/**
* HTML API: WP_CSS_Attribute_Selector class
*
* @package WordPress
* @subpackage HTML-API
* @since 6.8.0
*/

/**
* CSS attribute selector.
*
* This class is used to test for matching HTML tags in a {@see WP_HTML_Tag_Processor}.
*
* @since 6.8.0
*
* @access private
*/
final class WP_CSS_Attribute_Selector extends WP_CSS_Selector_Parser_Matcher {
/**
* The attribute value is matched exactly.
*
* @example
*
* [att=val]
*/
const MATCH_EXACT = 'exact';

/**
* The attribute value matches any value in a whitespace separated list of words exactly.
*
* @example
*
* [attr~=value]
*/
const MATCH_ONE_OF_EXACT = 'one-of';

/**
* The attribute value is matched exactly or matches the beginning of the attribute
* immediately followed by a hyphen.
*
* @example
*
* [attr|=value]
*/
const MATCH_EXACT_OR_HYPHEN_PREFIXED = 'exact-or-hyphen-prefixed';
Member commented:

this reads like the hyphen comes first, but in the CSS selector, it specifically connotes that a hyphen follows the match. would HYPHEN_SUFFIXED be more accurate?


/**
* The attribute value matches the start of the attribute.
*
* @example
*
* [attr^=value]
*/
const MATCH_PREFIXED_BY = 'prefixed';

/**
* The attribute value matches the end of the attribute.
*
* @example
*
* [attr$=value]
*/
const MATCH_SUFFIXED_BY = 'suffixed';

/**
* The attribute value is contained in the attribute.
*
* @example
*
* [attr*=value]
*/
const MATCH_CONTAINS = 'contains';

/**
* Modifier for case sensitive matching.
*
* @example
*
* [attr=value s]
*/
const MODIFIER_CASE_SENSITIVE = 'case-sensitive';

/**
* Modifier for case insensitive matching.
*
* @example
*
* [attr=value i]
*/
const MODIFIER_CASE_INSENSITIVE = 'case-insensitive';

/**
* The name of the attribute to match.
*
* @var string
*/
public $name;

/**
* The attribute matcher.
*
* Allowed string values are the class constants:
* - {@see WP_CSS_Attribute_Selector::MATCH_EXACT}
* - {@see WP_CSS_Attribute_Selector::MATCH_ONE_OF_EXACT}
* - {@see WP_CSS_Attribute_Selector::MATCH_EXACT_OR_HYPHEN_PREFIXED}
* - {@see WP_CSS_Attribute_Selector::MATCH_PREFIXED_BY}
* - {@see WP_CSS_Attribute_Selector::MATCH_SUFFIXED_BY}
* - {@see WP_CSS_Attribute_Selector::MATCH_CONTAINS}
*
* @var string|null
*/
public $matcher;

/**
* The attribute value to match.
*
* @var string|null
*/
public $value;

/**
* The attribute modifier.
*
* Allowed string values are the class constants:
* - {@see WP_CSS_Attribute_Selector::MODIFIER_CASE_SENSITIVE}
* - {@see WP_CSS_Attribute_Selector::MODIFIER_CASE_INSENSITIVE}
*
* @var string|null
*/
public $modifier;

/**
* Constructor.
*
* @param string $name The attribute name.
* @param string|null $matcher The attribute matcher.
* Must be one of the class MATCH_* constants or null.
* @param string|null $value The attribute value to match.
* @param string|null $modifier The attribute case modifier.
* Must be one of the class MODIFIER_* constants or null.
*/
private function __construct( string $name, ?string $matcher = null, ?string $value = null, ?string $modifier = null ) {
$this->name = $name;
$this->matcher = $matcher;
$this->value = $value;
$this->modifier = $modifier;
}

/**
* Determines if the processor's current position matches the selector.
*
* @param WP_HTML_Tag_Processor $processor The processor.
* @return bool True if the processor's current position matches the selector.
*/
public function matches( WP_HTML_Tag_Processor $processor ): bool {
$att_value = $processor->get_attribute( $this->name );
Member commented:

not a critical point, but we have largely used attribute or attr in the HTML API and in Gutenberg code when dealing with attributes. Do we need to save bits here?

if ( null === $att_value ) {
return false;
}

if ( null === $this->value ) {
return true;
}
Member commented:

here it reads as if null is a sentinel value for “as long as the attribute exists” but without reading the code I would have assumed passing null would mean “only match if the attribute doesn’t exist on the tag”

what do you think about this?


if ( true === $att_value ) {
$att_value = '';
}

$case_insensitive = self::MODIFIER_CASE_INSENSITIVE === $this->modifier;

switch ( $this->matcher ) {
case self::MATCH_EXACT:
return $case_insensitive
? 0 === strcasecmp( $att_value, $this->value )
: $att_value === $this->value;

case self::MATCH_ONE_OF_EXACT:
foreach ( $this->whitespace_delimited_list( $att_value ) as $val ) {
if (
$case_insensitive
? 0 === strcasecmp( $val, $this->value )
: $val === $this->value
) {
return true;
}
}
return false;
Member commented:

while this seems fine for now, I suspect that at some point we will prefer to crawl through the attribute value comparing as we go with substr_compare() rather than allocate each substring. we can remember that in some documents we see DoS attacks due to long attribute values.

$this->whitespace_delimited_list() could also return a starting byte offset into the value instead of the substring, which would resolve this without changing much of the calling interface.
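
A minimal sketch of the offset-based idea in this comment, written as standalone functions rather than as the PR's code. The names `token_spans()` and `list_contains_token()` are hypothetical, and the whitespace set `" \t\r\n\f"` is an assumption about what `WHITESPACE_CHARACTERS` contains. The generator yields byte offsets and lengths, and `substr_compare()` checks each token in place, so no per-token substring is allocated.

```php
<?php
/**
 * Hypothetical helper: yields array( byte offset, byte length ) for each
 * whitespace-delimited token, without copying the token out of $input.
 */
function token_spans( string $input, string $whitespace = " \t\r\n\f" ): Generator {
	$end    = strlen( $input );
	$offset = strspn( $input, $whitespace ); // Skip leading whitespace.

	while ( $offset < $end ) {
		$length = strcspn( $input, $whitespace, $offset );
		yield array( $offset, $length );

		// Move past the token and any whitespace that follows it.
		$offset += $length;
		$offset += strspn( $input, $whitespace, $offset );
	}
}

/**
 * Hypothetical `~=` check: true if $needle equals one whole token in $list.
 */
function list_contains_token( string $list, string $needle, bool $case_insensitive = false ): bool {
	$needle_length = strlen( $needle );
	if ( 0 === $needle_length ) {
		return false; // An empty value never matches with `~=`.
	}

	foreach ( token_spans( $list ) as list( $offset, $length ) ) {
		// Lengths must agree, otherwise only a prefix of the token would be compared.
		if (
			$length === $needle_length &&
			0 === substr_compare( $list, $needle, $offset, $length, $case_insensitive )
		) {
			return true;
		}
	}

	return false;
}
```

Comparing lengths first keeps the check cheap for long attribute values, since most tokens are rejected without any byte comparison.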


case self::MATCH_EXACT_OR_HYPHEN_PREFIXED:
// Attempt the full match first
if (
$case_insensitive
? 0 === strcasecmp( $att_value, $this->value )
: $att_value === $this->value
) {
return true;
}

// Partial match
if ( strlen( $att_value ) < strlen( $this->value ) + 1 ) {
return false;
}

$starts_with = "{$this->value}-";
return 0 === substr_compare( $att_value, $starts_with, 0, strlen( $starts_with ), $case_insensitive );
Member commented:

seems like this whole thing could be collapsed into a single call to substr_compare() with a final test of the following character…

$exact_length   = strlen( $this->value );
$matches_prefix = substr_compare( $att_value, $this->value, 0, $exact_length, $case_insensitive );
return (
	0 === $matches_prefix &&
	( strlen( $att_value ) === $exact_length || '-' === $att_value[ $exact_length ] )
);
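
The suggestion above can be tried in isolation with a sketch like the following. The function name `matches_exact_or_hyphen_suffixed()` and its parameters are hypothetical and do not appear in the PR; the explicit length guard is an extra precaution added here, not part of the suggestion.

```php
<?php
/**
 * Hypothetical standalone version of the `|=` check: the attribute value
 * must equal $expected, or start with $expected followed by a hyphen.
 */
function matches_exact_or_hyphen_suffixed( string $attribute_value, string $expected, bool $case_insensitive = false ): bool {
	$exact_length = strlen( $expected );

	// A value shorter than the expected text cannot match.
	if ( strlen( $attribute_value ) < $exact_length ) {
		return false;
	}

	// Compare only the leading $exact_length bytes.
	if ( 0 !== substr_compare( $attribute_value, $expected, 0, $exact_length, $case_insensitive ) ) {
		return false;
	}

	// Either the whole value matched, or the next byte must be a hyphen.
	return (
		strlen( $attribute_value ) === $exact_length ||
		'-' === $attribute_value[ $exact_length ]
	);
}
```

With this shape, `en` and `en-US` both match an expected value of `en`, while `english` does not, because the byte after the prefix is not a hyphen.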


case self::MATCH_PREFIXED_BY:
return 0 === substr_compare( $att_value, $this->value, 0, strlen( $this->value ), $case_insensitive );

case self::MATCH_SUFFIXED_BY:
return 0 === substr_compare( $att_value, $this->value, -strlen( $this->value ), null, $case_insensitive );

case self::MATCH_CONTAINS:
return false !== (
$case_insensitive
? stripos( $att_value, $this->value )
: strpos( $att_value, $this->value )
);
}
}

/**
* Splits a string into a list of whitespace delimited values.
*
* This is useful for the {@see WP_CSS_Attribute_Selector::MATCH_ONE_OF_EXACT} matcher.
*
* @param string $input
*
* @return Generator<string>
*/
private function whitespace_delimited_list( string $input ): Generator {
// Start by skipping whitespace.
$offset = strspn( $input, self::WHITESPACE_CHARACTERS );

while ( $offset < strlen( $input ) ) {
// Find the byte length until the next boundary.
$length = strcspn( $input, self::WHITESPACE_CHARACTERS, $offset );
$value = substr( $input, $offset, $length );

// Move past trailing whitespace.
$offset += $length + strspn( $input, self::WHITESPACE_CHARACTERS, $offset + $length );

yield $value;
}
}

/**
* Parses a selector string to create a selector instance.
*
* To create an instance of this class, use the {@see WP_CSS_Compound_Selector_List::from_selectors()} method.
*
* @param string $input The selector string.
* @param int $offset The offset into the string. The offset is passed by reference and
* will be updated if the parse is successful.
* @return static|null The selector instance, or null if the parse was unsuccessful.
*/
public static function parse( string $input, int &$offset ) {
Member commented:

not immediately a fan of mutating the offset passed into the function. did you consider some other examples of passing in something like &$bytes_parsed? for example, the HTML decoder does this which leaves the input variables untouched while still communicating &$match_byte_length
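
One way to picture the alternative described in this comment is a parser that takes the starting position by value and reports the consumed length through a separate by-reference argument. The sketch below is a simplified stand-in, not the PR's parser: `parse_presence_selector()` is a hypothetical name, it only handles presence selectors such as `[href]`, and it accepts a much narrower set of name characters than real CSS identifiers allow.

```php
<?php
/**
 * Hypothetical, simplified parser for `[name]` selectors.
 *
 * $at is read-only from the caller's point of view. On success,
 * $bytes_parsed holds the number of bytes consumed starting at $at;
 * on failure it is 0 and null is returned.
 */
function parse_presence_selector( string $input, int $at = 0, ?int &$bytes_parsed = null ): ?string {
	$bytes_parsed = 0;
	$end          = strlen( $input );

	if ( $at >= $end || '[' !== $input[ $at ] ) {
		return null;
	}

	$whitespace = " \t\r\n\f"; // Assumed whitespace set.
	$name_chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_';

	$cursor  = $at + 1;
	$cursor += strspn( $input, $whitespace, $cursor );

	$name_length = strspn( $input, $name_chars, $cursor );
	if ( 0 === $name_length ) {
		return null;
	}
	$name    = substr( $input, $cursor, $name_length );
	$cursor += $name_length;
	$cursor += strspn( $input, $whitespace, $cursor );

	if ( $cursor >= $end || ']' !== $input[ $cursor ] ) {
		return null;
	}

	$bytes_parsed = $cursor + 1 - $at;
	return $name;
}
```

A caller would then advance its own position explicitly, for example `$at += $bytes_parsed;`, which keeps the mutation visible at the call site.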

// Need at least 3 bytes [x]
if ( $offset + 2 >= strlen( $input ) ) {
return null;
}

$updated_offset = $offset;

if ( '[' !== $input[ $updated_offset ] ) {
return null;
}
++$updated_offset;

self::parse_whitespace( $input, $updated_offset );
Member commented:

see above: would be nice if instead of parse_whitespace we had this function return the length of the whitespace bytes and then this line could use the same variable names as the HTML API with

$at += self::skip_whitespace( $input, $at );
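
A minimal sketch of the length-returning helper proposed here, written as a free function so it can run on its own. The name `skip_whitespace()` comes from the comment; the whitespace set is an assumption about the PR's `WHITESPACE_CHARACTERS` constant.

```php
<?php
/**
 * Returns the number of whitespace bytes starting at $at.
 * The input and the position are left untouched.
 */
function skip_whitespace( string $input, int $at ): int {
	return strspn( $input, " \t\r\n\f", $at );
}

// Usage: the caller owns the cursor and advances it explicitly.
$input = "[ \t href ]";
$at    = 1; // Just past the opening bracket.
$at   += skip_whitespace( $input, $at );
```

Because the helper is a thin wrapper around `strspn()`, it returns 0 when there is no whitespace at the position, so the `+=` form is safe to use unconditionally.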

$attr_name = self::parse_ident( $input, $updated_offset );
if ( null === $attr_name ) {
return null;
}
self::parse_whitespace( $input, $updated_offset );

if ( $updated_offset >= strlen( $input ) ) {
return null;
}

if ( ']' === $input[ $updated_offset ] ) {
$offset = $updated_offset + 1;
return new WP_CSS_Attribute_Selector( $attr_name );
}

// Need at least 3 more bytes to match `=x]` at this point.
if ( $updated_offset + 2 >= strlen( $input ) ) {
return null;
}

if ( '=' === $input[ $updated_offset ] ) {
++$updated_offset;
$attr_matcher = WP_CSS_Attribute_Selector::MATCH_EXACT;
} elseif ( '=' === $input[ $updated_offset + 1 ] ) {
switch ( $input[ $updated_offset ] ) {
case '~':
$attr_matcher = WP_CSS_Attribute_Selector::MATCH_ONE_OF_EXACT;
$updated_offset += 2;
break;
case '|':
$attr_matcher = WP_CSS_Attribute_Selector::MATCH_EXACT_OR_HYPHEN_PREFIXED;
$updated_offset += 2;
break;
case '^':
$attr_matcher = WP_CSS_Attribute_Selector::MATCH_PREFIXED_BY;
$updated_offset += 2;
break;
case '$':
$attr_matcher = WP_CSS_Attribute_Selector::MATCH_SUFFIXED_BY;
$updated_offset += 2;
break;
case '*':
Member commented:

when I explored this long ago, I actually felt like the symbols in use in CSS provided reasonable literal values in the code vs. the use of consts. it didn’t have the same explanatory power in the code, but it provided a more direct correspondence with the CSS, which was nice, plus involved fewer translations/CPU cycles.
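
To illustrate the trade-off raised in this comment, here is a sketch in which the matcher is stored as the CSS operator text itself instead of a named constant. `read_attribute_operator()` is a hypothetical function, not part of the PR.

```php
<?php
/**
 * Hypothetical reader that returns the attribute operator found at $at
 * as its literal CSS spelling, or null if none is present.
 */
function read_attribute_operator( string $input, int $at ): ?string {
	if ( $at >= strlen( $input ) ) {
		return null;
	}

	if ( '=' === $input[ $at ] ) {
		return '=';
	}

	// All remaining operators are exactly two bytes long.
	$operator = substr( $input, $at, 2 );

	return in_array( $operator, array( '~=', '|=', '^=', '$=', '*=' ), true )
		? $operator
		: null;
}
```

The returned string doubles as the length to advance by (`strlen( $operator )`), and a later `switch` on `'^='` or `'*='` reads the same as the selector it came from. The cost is that the code loses the descriptive names the constants provide.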

$attr_matcher = WP_CSS_Attribute_Selector::MATCH_CONTAINS;
$updated_offset += 2;
break;
default:
return null;
}
} else {
return null;
}

self::parse_whitespace( $input, $updated_offset );
$attr_val =
self::parse_string( $input, $updated_offset ) ??
self::parse_ident( $input, $updated_offset );

if ( null === $attr_val ) {
return null;
}

self::parse_whitespace( $input, $updated_offset );
if ( $updated_offset >= strlen( $input ) ) {
return null;
}

$attr_modifier = null;
switch ( $input[ $updated_offset ] ) {
case 'i':
case 'I':
$attr_modifier = WP_CSS_Attribute_Selector::MODIFIER_CASE_INSENSITIVE;
++$updated_offset;
break;

case 's':
case 'S':
$attr_modifier = WP_CSS_Attribute_Selector::MODIFIER_CASE_SENSITIVE;
++$updated_offset;
break;
}

if ( null !== $attr_modifier ) {
self::parse_whitespace( $input, $updated_offset );
if ( $updated_offset >= strlen( $input ) ) {
return null;
}
}

if ( ']' === $input[ $updated_offset ] ) {
$offset = $updated_offset + 1;
return new self( $attr_name, $attr_matcher, $attr_val, $attr_modifier );
}

return null;
}
}