Add HTML parsing features #11

Open
wants to merge 35 commits into
base: main
Changes from 7 commits
Commits
67cd354
Apply 8.1 hotfixes from unmerged patch
mallardduck Oct 2, 2022
8c15b01
Initial HTML replacer code
mallardduck Oct 2, 2022
6a8cb4d
remove unused property
mallardduck Oct 2, 2022
b8af7e5
generate new emoji bytes
mallardduck Oct 2, 2022
b8bedbf
clean up code
mallardduck Oct 2, 2022
bb84284
Add test to cover image alt/title attributes
mallardduck Oct 2, 2022
eb121c0
refactor to use XPath to solve filtering text nodes problem
mallardduck Oct 2, 2022
92d9136
Remove try-guy now that it's unused
mallardduck Oct 2, 2022
4538750
refactor to ensure we allow HTML fragments too
mallardduck Oct 4, 2022
cd6e190
refactor tests to split up HTML pages and HTML fragments
mallardduck Oct 4, 2022
60c987f
Use internal tag as means of warning?
mallardduck Oct 4, 2022
f7616c0
Refactor method name to slightly better option
mallardduck Oct 4, 2022
e818b5c
fix code styles
mallardduck Oct 4, 2022
b1f83c7
make styleCI happy
mallardduck Oct 4, 2022
29f7d0a
Refactor to fix missed fragments and expand tests
mallardduck Oct 4, 2022
eeb5f0a
reorder code
mallardduck Oct 4, 2022
fcdd93d
Refactor new tests and add failing tests for current issues.
mallardduck Oct 17, 2022
2d77cdc
fix styles
mallardduck Oct 17, 2022
e0f2540
track the Pest helper file
mallardduck Oct 17, 2022
bc61a6a
fix pest file styles
mallardduck Oct 17, 2022
b3a57be
Add tests that cover the edge case I've been chasing
mallardduck Oct 17, 2022
62fdc48
Refactor how HTML fragments are handled
mallardduck Oct 17, 2022
3dd8fb5
Ensure extra spaces are not added
mallardduck Oct 17, 2022
b2bc8bb
Update tests with fixed results
mallardduck Oct 17, 2022
2e3278d
Manually correct snapshots to desired state
mallardduck Oct 17, 2022
55badec
Skip HTML fragment tests that cause errors
mallardduck Oct 17, 2022
c446d6f
Refactor exception
mallardduck Oct 17, 2022
1b5b3e0
remove dumper from composer file
mallardduck Oct 17, 2022
2ee59bf
Always use static builder method instead of new
mallardduck Oct 17, 2022
891c25f
Improve fragment parsing and enable more tests
mallardduck Oct 17, 2022
2b77947
Correct HTML pages without meta charset tag
mallardduck Oct 17, 2022
d5b6869
refactor UTF8 tag adding and enable test
mallardduck Oct 17, 2022
894b79f
Add test to cover when incorrect content type is corrected
mallardduck Oct 17, 2022
590ac97
Add ext-dom to suggested
mallardduck Oct 17, 2022
4468c8e
adjust styles
mallardduck Oct 17, 2022
13 changes: 9 additions & 4 deletions composer.json
@@ -22,13 +22,15 @@
         "ext-mbstring": "*"
     },
     "require-dev": {
-        "pestphp/pest": "^0.3.0",
+        "pestphp/pest": "^1.21",
         "s9e/regexp-builder": "^1.4",
         "spatie/emoji": "^2.3.0",
-        "spatie/pest-plugin-snapshots": "^1.0"
+        "spatie/pest-plugin-snapshots": "^1.0",
+        "wa72/htmlpagedom": "^2.0 || ^3.0"
     },
     "suggest": {
-        "spatie/emoji": "*"
+        "spatie/emoji": "*",
+        "wa72/htmlpagedom": "*"
     },
     "minimum-stability": "dev",
     "prefer-stable": true,
@@ -38,7 +40,10 @@
         }
     },
     "config": {
-        "sort-packages": true
+        "sort-packages": true,
+        "allow-plugins": {
+            "pestphp/pest-plugin": true
+        }
     },
     "scripts": {
         "generate": "php ./generate.php",
57 changes: 57 additions & 0 deletions src/HtmlReplacer.php
@@ -0,0 +1,57 @@
<?php

namespace Astrotomic\Twemoji;

use Astrotomic\Twemoji\Concerns\Configurable;
use RuntimeException;
use Wa72\HtmlPageDom\HtmlPage;
use Wa72\HtmlPageDom\HtmlPageCrawler;

class HtmlReplacer
{
    use Configurable;

    public static string $shouldNotBeParsed = "/^(?:iframe|noframes|noscript|script|select|style|textarea)$/";

    public function __construct()
    {
        if (! class_exists(HtmlPageCrawler::class)) {
            throw new RuntimeException(
                sprintf('Cannot use %s method unless `wa72/htmlpagedom` is installed.', __METHOD__)
            );
        }
    }

    public function parse(string $html): string
Author

If we need to support full HTML docs and HTML fragments, then this method should:

  1. Immediately determine if the input $html is a full DOM page, then
  2. either use HtmlPage (used here) and work based on the Body, or
  3. use the HtmlPageCrawler to parse the fragment.

Member

I think that in PHP, partial HTML is more common than a full document, unless you are implementing it as some kind of middleware that parses the whole HTML response.
In general, though, it should support both if possible.

Author

Addressed this by using the more general HTML parser, then adding a step that checks whether the input HTML is a full page/document and, if so, selects the body from it. As I was already replacing based on an HTML fragment (the body), supporting fragments as input was rather simple.
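
The page-or-fragment check discussed in this thread could be sketched roughly as follows. This is a minimal illustration in plain PHP and not the code from this PR: the function name `looksLikeFullDocument` and the regular expression are assumptions, and the check only looks for a leading doctype or `<html>` tag.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of this PR): guess whether the input is a
 * full HTML document or a fragment by checking for a leading doctype or
 * <html> tag. Leading whitespace is ignored; anything else counts as a fragment.
 */
function looksLikeFullDocument(string $html): bool
{
    return preg_match('/^\s*(?:<!DOCTYPE\s+html\b|<html[\s>])/i', $html) === 1;
}

// A full page starts with a doctype or <html>, a fragment does not.
var_dump(looksLikeFullDocument("<!DOCTYPE html>\n<html lang=\"en\"><body></body></html>"));
var_dump(looksLikeFullDocument('<p>🦀🦀🦀</p>'));
```

A caller might select the body when this returns true and treat the input as a fragment otherwise.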

    {
        // Parse the html
        $parsedHtml = new HtmlPage($html);
        $body = $parsedHtml->getBody();

        if ($body->children()->count() === 0) {
            return $html;
        }

        // Use xpath to filter only the "TextNodes" within each "Element"
        $textNodes = $body->filterXPath('.//*[normalize-space(text())]');
mallardduck marked this conversation as resolved.

        $textNodes->each(function (HtmlPageCrawler $node) {
            // Bail early if attempt to get inner text fails...
            try {
                $nodeInnerText = $node->innerText();
            } catch (\Throwable $throwable) {
                return $node;
            }

            $twemojiContent = (new EmojiText($nodeInnerText))
                ->base($this->base)
                ->type($this->type)
                ->toHtml();
            $node->makeEmpty()->setInnerHtml($twemojiContent);

            return $node;
        });

        return $parsedHtml->save();
    }
}
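
To see what the XPath expression in `parse()` selects, here is a small sketch that uses PHP's bundled `DOMDocument` and `DOMXPath` (ext-dom) instead of `wa72/htmlpagedom`. The helper name `elementsWithText` and the sample markup are assumptions for illustration only. Note that the expression matches elements rather than text nodes, and that in XPath 1.0 `normalize-space(text())` only considers an element's first text node child.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of this PR): return the tag names of all
 * elements below <body> whose first text node child is non-blank.
 *
 * @return string[]
 */
function elementsWithText(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    // The XML encoding hint is a common workaround so the input is read as UTF-8.
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $names = [];

    // Same predicate as in parse(), but anchored at <body> from the document root.
    foreach ($xpath->query('//body//*[normalize-space(text())]') as $element) {
        $names[] = $element->nodeName;
    }

    return $names;
}

// The <div> and <img> have no text of their own, so only h1 and p are selected.
print_r(elementsWithText('<body><h1>Hello 👋</h1><div><p>🐘🐘</p></div><img src="x.png" alt="🍻"></body>'));
```

This is also why the emoji in the `alt` and `title` attributes stay untouched in the snapshots: attribute values are never part of the selected elements' text.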
3 changes: 2 additions & 1 deletion src/Twemoji.php
@@ -25,7 +25,7 @@ public function __construct(array $codepoints)
 
     public static function emoji(string $emoji): self
     {
-        $chars = preg_split('//u', $emoji, null, PREG_SPLIT_NO_EMPTY);
+        $chars = preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY);
 
         $codepoints = array_map(
             fn (string $code): string => dechex(mb_ord($code)),
@@ -58,6 +58,7 @@ public function url(): string
         );
     }
 
+    #[\ReturnTypeWillChange]
     public function jsonSerialize()
     {
         return $this->url();
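
The `preg_split` change appears to be one of the PHP 8.1 fixes named in the first commit: passing `null` to a non-nullable parameter of a built-in function is deprecated as of PHP 8.1, and `-1` is the documented "no limit" value. A minimal standalone sketch of the same splitting logic follows. The function name `emojiCodepoints` is an assumption, and the code requires ext-mbstring.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone version of the splitting step in Twemoji::emoji():
 * split a UTF-8 string into characters and return each codepoint as hex.
 *
 * @return string[]
 */
function emojiCodepoints(string $emoji): array
{
    // -1 means "no limit"; null here triggers a deprecation notice on PHP 8.1+.
    $chars = preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY);

    return array_map(
        static fn (string $char): string => dechex(mb_ord($char, 'UTF-8')),
        $chars
    );
}

print_r(emojiCodepoints('🙏🐘')); // ['1f64f', '1f418']
```

The hex values match the image file names in the snapshots (`1f64f.png`, `1f418.png`). Similarly, `#[\ReturnTypeWillChange]` suppresses the PHP 8.1 deprecation notice for a `jsonSerialize()` implementation that does not declare the `mixed` return type.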
2 changes: 1 addition & 1 deletion src/emoji_bytes.regexp

Large diffs are not rendered by default.

87 changes: 87 additions & 0 deletions tests/Datasets/HtmlContent.php
@@ -0,0 +1,87 @@
<?php

dataset('html', [
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head></head>
<body></body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5🚀 Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body></body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5🚀 Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<h1>Do a quick kickflip! 🛹</h1>
<p>This is HTML text that should be replaced, but the emoji in the head should not.</p>
<h2>Time for a CRAB RAVE!</h2>
<p>🦀🦀🦀🦀🦀</p>
<p>🦀🦀🦀</p>
<p>🦀🦀🦀🦀🦀</p>
<h2>🙏🐘</h2>
</body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<h1>Hello Friends 👋</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray 🍻" title="maybe an image of bill murry with a raised glass 🍺">
<h2>Time for a ElePHPant RAVE!</h2>
<p>🐘🐘🐘🐘</p>
<p>🐘🐘🐘</p>
<p>🐘🐘🐘🐘🐘</p>
<p>🐘🐘</p>
</body>
</html>
HTML,
<<<'HTML'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<header>
<h1>Hello Friends 👋</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray 🍻" title="maybe an image of bill murry with a raised glass 🍺">
</header>
<main>
<section>
<h2>Time for a ElePHPant RAVE!</h2>
<p>🐘🐘🐘🐘</p>
<p>🐘🐘🐘</p>
<p>🐘🐘🐘🐘🐘</p>
<p>🐘🐘</p>
</section>
</main>
</body>
</html>
HTML,
]);
9 changes: 9 additions & 0 deletions tests/Unit/HtmlTest.php
@@ -0,0 +1,9 @@
<?php

use Astrotomic\Twemoji\HtmlReplacer;
use function Spatie\Snapshots\assertMatchesHtmlSnapshot;

it('can parse HTML content', function (string $html) {
    $htmlReplacer = (new HtmlReplacer())->png();
    assertMatchesHtmlSnapshot($htmlReplacer->parse($html));
})->with('html');
@@ -0,0 +1,5 @@
<!DOCTYPE html>
<html lang="en">
<head></head>
<body></body>
</html>
@@ -0,0 +1,11 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5&#128640; Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body></body>
</html>
@@ -0,0 +1,22 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>HTML 5&#128640; Boilerplate</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<h1>Do a quick kickflip! <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f6f9.png" alt="&#128761;" width="72" height="72" loading="lazy" class="twemoji">
</h1>
<p>This is HTML text that should be replaced, but the emoji in the head should not.</p>
<h2>Time for a CRAB RAVE!</h2>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="&#129408;" width="72" height="72" loading="lazy" class="twemoji"></p>
<h2>
<img src="https://twemoji.maxcdn.com/v/latest/72x72/1f64f.png" alt="&#128591;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji">
</h2>
</body>
</html>
@@ -0,0 +1,17 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<h1>Hello Friends <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f44b.png" alt="&#128075;" width="72" height="72" loading="lazy" class="twemoji">
</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray &#127867;" title="maybe an image of bill murry with a raised glass &#127866;">
<h2>Time for a ElePHPant RAVE!</h2>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
</body>
</html>
@@ -0,0 +1,23 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Test with Emoji in ALT text</title>
</head>
<body>
<header>
<h1>Hello Friends <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f44b.png" alt="&#128075;" width="72" height="72" loading="lazy" class="twemoji">
</h1>
<img src="http://fillmurray.lucidinternets.com/200/300" alt="A random image of Bill Murray &#127867;" title="maybe an image of bill murry with a raised glass &#127866;">
</header>
<main>
<section>
<h2>Time for a ElePHPant RAVE!</h2>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f418.png" alt="&#128024;" width="72" height="72" loading="lazy" class="twemoji"></p>
</section>
</main>
</body>
</html>