Add HTML parsing features #11
Open: mallardduck wants to merge 35 commits into Astrotomic:main from mallardduck:html-parsing
Changes from 7 commits
Commits
67cd354 Apply 8.1 hotfixes from unmerged patch (mallardduck)
8c15b01 Initial HTML replacer code (mallardduck)
6a8cb4d remove unused property (mallardduck)
b8af7e5 generate new emoji bytes (mallardduck)
b8bedbf clean up code (mallardduck)
bb84284 Add test to cover image alt/title attributes (mallardduck)
eb121c0 refactor to use XPath to solve filtering text nodes problem (mallardduck)
92d9136 Remove try-guy now that it's unused (mallardduck)
4538750 refactor to ensure we allow HTML fragments too (mallardduck)
cd6e190 refactor tests to split up HTML pages and HTML fragments (mallardduck)
60c987f Use internal tag as means of warning? (mallardduck)
f7616c0 Refactor method name to slightly better option (mallardduck)
e818b5c fix code styles (mallardduck)
b1f83c7 make styleCI happy (mallardduck)
29f7d0a Refactor to fix missed fragments and expand tests (mallardduck)
eeb5f0a reorder code (mallardduck)
fcdd93d Refactor new tests and add failing tests for current issues. (mallardduck)
2d77cdc fix styles (mallardduck)
e0f2540 track the Pest helper file (mallardduck)
bc61a6a fix pest file styles (mallardduck)
b3a57be Add tests that cover the edge case I've been chasing (mallardduck)
62fdc48 Refactor how HTML fragments are handled (mallardduck)
3dd8fb5 Ensure extra spaces are not added (mallardduck)
b2bc8bb Update tests with fixed results (mallardduck)
2e3278d Manually correct snapshots to desired state (mallardduck)
55badec Skip HTML fragment tests that cause errors (mallardduck)
c446d6f Refactor exception (mallardduck)
1b5b3e0 remove dumper from composer file (mallardduck)
2ee59bf Always use static builder method instead of new (mallardduck)
891c25f Improve fragment parsing and enable more tests (mallardduck)
2b77947 Correct HTML pages without meta charset tag (mallardduck)
d5b6869 refactor UTF8 tag adding and enable test (mallardduck)
894b79f Add test to cover when incorrect content type is corrected (mallardduck)
590ac97 Add ext-dom to suggested (mallardduck)
4468c8e adjust styles (mallardduck)
@@ -0,0 +1,57 @@
<?php

namespace Astrotomic\Twemoji;

use Astrotomic\Twemoji\Concerns\Configurable;
use RuntimeException;
use Wa72\HtmlPageDom\HtmlPage;
use Wa72\HtmlPageDom\HtmlPageCrawler;

class HtmlReplacer
{
    use Configurable;

    public static string $shouldNotBeParsed = "/^(?:iframe|noframes|noscript|script|select|style|textarea)$/";

    public function __construct()
    {
        if (! class_exists(HtmlPageCrawler::class)) {
            throw new RuntimeException(
                sprintf('Cannot use %s method unless `wa72/htmlpagedom` is installed.', __METHOD__)
            );
        }
    }

    public function parse(string $html): string
    {
        // Parse the HTML
        $parsedHtml = new HtmlPage($html);
        $body = $parsedHtml->getBody();

        if ($body->children()->count() === 0) {
            return $html;
        }

        // Use XPath to select elements whose first direct text node is not blank
        $textNodes = $body->filterXPath('.//*[normalize-space(text())]');

        $textNodes->each(function (HtmlPageCrawler $node) {
            // Bail early if the attempt to get the inner text fails...
            try {
                $nodeInnerText = $node->innerText();
            } catch (\Throwable $throwable) {
                return $node;
            }

            $twemojiContent = (new EmojiText($nodeInnerText))
                ->base($this->base)
                ->type($this->type)
                ->toHtml();
            $node->makeEmpty()->setInnerHtml($twemojiContent);

            return $node;
        });

        return $parsedHtml->save();
    }
}
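The XPath in `parse()` selects elements rather than text nodes. As a point of comparison, here is a minimal sketch that selects the text nodes themselves, using PHP's built-in `DOMDocument` and `DOMXPath` instead of `wa72/htmlpagedom`. The function name `collectReplaceableText` and the list of skipped tags in the query are assumptions made for illustration, not code from this PR; the `<?xml encoding="UTF-8">` prefix is a commonly used workaround to get `loadHTML()` to treat the input as UTF-8.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the PR): returns the non-blank text found
 * under <body>, skipping a few tags whose content should stay untouched.
 *
 * @return string[]
 */
function collectReplaceableText(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings (e.g. about HTML5 tags) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: input is UTF-8; the prefix is a workaround, not an official API.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // text() nodes under <body> that are not whitespace-only and do not sit
    // inside script, style or textarea elements.
    $query = '//body//text()[normalize-space()]'
        . '[not(ancestor::script or ancestor::style or ancestor::textarea)]';

    $texts = [];
    foreach ($xpath->query($query) as $textNode) {
        $texts[] = trim($textNode->textContent);
    }

    return $texts;
}
```

Because the query starts at `//body`, text in `<head>` (such as the `<title>`) is never returned, which matches the intent of leaving emoji in the head alone.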
@@ -0,0 +1,87 @@
<?php

dataset('html', [
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head></head>
<body></body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5🚀 Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body></body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5🚀 Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<h1>Do a quick kickflip! 🛹</h1>
<p>This is HTML text that should be replaced, but the emoji in the head should not.</p>
<h2>Time for a CRAB RAVE!</h2>
<p>🦀🦀🦀🦀🦀</p>
<p>🦀🦀🦀</p>
<p>🦀🦀🦀🦀🦀</p>
<h2>🙏🐘</h2>
</body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<h1>Hello Friends 👋</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray 🍻" title="maybe an image of bill murry with a raised glass 🍺">
<h2>Time for a ElePHPant RAVE!</h2>
<p>🐘🐘🐘🐘</p>
<p>🐘🐘🐘</p>
<p>🐘🐘🐘🐘🐘</p>
<p>🐘🐘</p>
</body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<header>
<h1>Hello Friends 👋</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray 🍻" title="maybe an image of bill murry with a raised glass 🍺">
</header>
<main>
<section>
<h2>Time for a ElePHPant RAVE!</h2>
<p>🐘🐘🐘🐘</p>
<p>🐘🐘🐘</p>
<p>🐘🐘🐘🐘🐘</p>
<p>🐘🐘</p>
</section>
</main>
</body>
</html>
HTML,
]);
@@ -0,0 +1,9 @@
<?php

use Astrotomic\Twemoji\HtmlReplacer;
use function Spatie\Snapshots\assertMatchesHtmlSnapshot;

it('can parse HTML content', function (string $html) {
    $htmlReplacer = (new HtmlReplacer())->png();
    assertMatchesHtmlSnapshot($htmlReplacer->parse($html));
})->with('html');
5 changes: 5 additions & 0 deletions
...apshots__/HtmlTest__it_can_parse_HTML_content_with_(DOCTYPE_htmlnhtml_langhtml)_1__1.html
@@ -0,0 +1,5 @@
<!DOCTYPE html>
<html lang="en">
<head></head>
<body></body>
</html>
11 changes: 11 additions & 0 deletions
...apshots__/HtmlTest__it_can_parse_HTML_content_with_(DOCTYPE_htmlnhtml_langhtml)_2__1.html
@@ -0,0 +1,11 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5🚀 Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body></body>
</html>
22 changes: 22 additions & 0 deletions
...apshots__/HtmlTest__it_can_parse_HTML_content_with_(DOCTYPE_htmlnhtml_langhtml)_3__1.html
@@ -0,0 +1,22 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5🚀 Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<h1>Do a quick kickflip! <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f6f9.png" alt="🛹" width="72" height="72" loading="lazy" class="twemoji">
</h1>
<p>This is HTML text that should be replaced, but the emoji in the head should not.</p>
<h2>Time for a CRAB RAVE!</h2>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"></p>
<h2>
<img src="https://twemoji.maxcdn.com/v/latest/72x72/1f64f.png" alt="🙏" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji">
</h2>
</body>
</html>
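Judging from this snapshot, each emoji in the body becomes an `<img>` whose file name is the emoji's Unicode code point in lowercase hex (🛹 is U+1F6F9, hence `1f6f9.png`). The sketch below reproduces that mapping for single-code-point emoji only. The function names `twemojiPngUrl` and `twemojiImgTag` are hypothetical, and this is not the library's `EmojiText` implementation; it assumes `ext-mbstring` is available.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: builds the PNG URL pattern seen in the snapshots.
 * Only the first code point is used, so multi-code-point sequences
 * (ZWJ sequences, skin tones, flags) are NOT handled here.
 */
function twemojiPngUrl(string $emoji): string
{
    if ($emoji === '') {
        throw new InvalidArgumentException('Expected a non-empty emoji string.');
    }

    $codePoint = mb_ord($emoji, 'UTF-8');
    if ($codePoint === false) {
        throw new InvalidArgumentException('Could not read a UTF-8 code point.');
    }

    // dechex() yields lowercase hex, e.g. 0x1F6F9 becomes "1f6f9".
    return sprintf('https://twemoji.maxcdn.com/v/latest/72x72/%s.png', dechex($codePoint));
}

/** Hypothetical helper: wraps the URL in the <img> markup used by the snapshots. */
function twemojiImgTag(string $emoji): string
{
    return sprintf(
        '<img src="%s" alt="%s" width="72" height="72" loading="lazy" class="twemoji">',
        twemojiPngUrl($emoji),
        $emoji
    );
}
```

For example, `twemojiPngUrl("\u{1F980}")` gives the crab URL that appears in the paragraphs above.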
17 changes: 17 additions & 0 deletions
...apshots__/HtmlTest__it_can_parse_HTML_content_with_(DOCTYPE_htmlnhtml_langhtml)_4__1.html
@@ -0,0 +1,17 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<h1>Hello Friends <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f44b.png" alt="👋" width="72" height="72" loading="lazy" class="twemoji">
</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray 🍻" title="maybe an image of bill murry with a raised glass 🍺">
<h2>Time for a ElePHPant RAVE!</h2>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
</body>
</html>
23 changes: 23 additions & 0 deletions
...apshots__/HtmlTest__it_can_parse_HTML_content_with_(DOCTYPE_htmlnhtml_langhtml)_5__1.html
@@ -0,0 +1,23 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<header>
<h1>Hello Friends <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f44b.png" alt="👋" width="72" height="72" loading="lazy" class="twemoji">
</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray 🍻" title="maybe an image of bill murry with a raised glass 🍺">
</header>
<main>
<section>
<h2>Time for a ElePHPant RAVE!</h2>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="🐘" width="72" height="72" loading="lazy" class="twemoji"></p>
</section>
</main>
</body>
</html>
If we need to support full HTML docs and HTML fragments, then this method should:
$html is a full DOM page, then
I think that in PHP, partial HTML is more common than a full document, unless you are implementing it as some kind of middleware to parse the whole HTML response.
But in general it should support both if possible.
Addressed this by using the more general HTML parser, then adding a step where we check if the input HTML is a Page/Doc and selecting the body from that. As I was already replacing based on an HTML fragment (the body), supporting fragments as input was rather simple.
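The "is it a page or a fragment, then work on the body" step described above could look roughly like the sketch below. It uses PHP's built-in `DOMDocument` rather than the parser from the PR, and the names `looksLikeFullDocument` and `bodyMarkup`, as well as the doctype/`<html>` heuristic, are assumptions for illustration only. The `<?xml encoding="UTF-8">` prefix is a commonly used workaround for UTF-8 input, not an official API.

```php
<?php

declare(strict_types=1);

/** Heuristic (an assumption, not the PR's check): pages start with a doctype or <html>. */
function looksLikeFullDocument(string $html): bool
{
    return preg_match('/^\s*(?:<!DOCTYPE\b|<html\b)/i', $html) === 1;
}

/** Returns the markup inside <body>, whether $html is a full page or a fragment. */
function bodyMarkup(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Without extra flags, loadHTML() wraps a fragment in <html><body> itself,
    // so pages and fragments both end up with a <body> element.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> and concatenate the results.
    $markup = '';
    foreach ($body->childNodes as $child) {
        $markup .= $dom->saveHTML($child);
    }

    return $markup;
}
```

A caller could use `looksLikeFullDocument()` to decide whether to return the whole serialized document or only the body markup, so that a fragment passed in comes back as a fragment.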