-
-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HTML parsing features #11
base: main
Are you sure you want to change the base?
Conversation
} | ||
} | ||
|
||
public function parse(string $html): string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need to support full HTML docs and HTML fragments, then this method should:
- Immediately determine if the input
$html
is a full DOM page, then - either use HtmlPage (used here) and work based on the Body, or
- use the HtmlPageCrawler to parse the fragment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that in PHP partial HTML is more common than a full document. Except you are implementing it as some kind of middleware to parse the whole HTML response.
But in general it should support both if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed this by using the more general HTML parser, then adding a step where we check if the input HTML is a Page/Doc and selecting the body
from that. As I was already replacing based on a HTML fragment (the body), supporting fragments as input was rather simple.
tests/Datasets/HtmlContent.php
Outdated
HTML, | ||
]); | ||
|
||
dataset('html-fragments', [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add the most simple fragments here first? And add the more complex/combined ones to the end?
<p>🚀</p>
<img src="" alt="🎉"/>
<a href="" title="🎈">Link ⛓️</a>
So some of these single-element examples.
I can also imagine that example which shouldn'T be replaced:
<script>document.innerHTML = '🤷♂️';</script>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added those more simple fragments to the top and even moved the fragment dataset to be the "primary" source.
This feedback clued me off to a flaw in my "early return" logic. So now I'm making sure I check for existence of "Text" nodes rather than children in general. Since your first example has no actual children but does have Text nodes I can work on that was being skipped with my original logic.
Only issue this highlighted is that currently the code does affect the script tag contents. I'll circle back later today/tomorrow to look into that aspect of things. Seems like I'll need to get creative about creating some sort of exclude list to skip Text nodes inside elements like script
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for <style>
I think. But like said, this will be experimental - so It's also fine to add a ->wip()
todo()
or whatever Pest test for such things. To highlight what has to be fixed/fulfilled to get it out of experimental mode.
How about passing Text only without any nodes?
This is some fancy-💃 Markdown/WYSIWYG text with surrounding <p> tags disabled. 🎉
Or without surrounding elements but one somewhere in the middle?
This is some fancy-💃 Markdown/WYSIWYG text with surrounding <code><p></code> tags disabled. 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I haven't gotten back around to creating an exclude list concept, however I'm going to start taking that on next. Just wanted to report back on my findings related to passing those "plain text" examples. Unfortunately it seems that passing text not wrapped in a node is interpreted in a variety of ambiguous ways.
For instance, the first example technically should have those braces escaped as <
and >
- otherwise browsers (and PHP DOM) will automatically parse the P tag and add a close. So I've opted to test with the escaped codes and this line is interpreted as XML. So internally it's wrapped with a _root
and as such the XPath to find the inner text nodes fails. Ultimately the result is the text gets returned untouched and Emoji's stay emojis. The second example is basically the same too - just without needing to escape the HTML tag since the code
tag has a close.
Long story short, it got complicated really fast lol. So I think I need to take a step back and focus on the HTML parsing situation and get a handle on how various inputs are being parsed. The DOM library I picked for us uses Symfony/dom-crawler as it's core which in turn uses Mastermind\HTML5 library to compensate for PHP's core dom having some failings.
All that said, my next step will be to create my own test repo to assess the overall situation for PHP's HTML parsing abilities. Since I think we will want a more clear picture about what behaviour I'm seeing from each mechanic. For instance, how is PHP core ext-dom
treating things, vs how Mastermind\HTML5
parser, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in general I can say that the symfony Dom crawler is an outstanding package and one of if not the best solution for HTML in PHP.
Like said before: don't escalate this too far. With tests that are meant to fail we at least have a good list of things needed for a final release. And even if you fix some of those, take the chance for more PRs. 😉🎃
Regarding the plain text thing: how about a detector with some cases/match arms?
- full HTML document
- partial HTML but starting/ending with a tag or even surrounded by one
- text with HTML sprinkles - solution could be to surround it with
<div>
, do the job and afterwards take$div->innerHTML
Again, these are just ideas - I'm also personally more happy to merge several smaller and focused PRs than one beast of a PR. As it's also easier to review these smaller ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for that perspective! Always appreciate the sage advice and I think you nailed it. I'll start to refocus this PR to cover less of the edge cases and focus on the most common uses. Then I can adjust the tests a bit to give us a guiding star for future enhancements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And one last thing: personally I would prefer datasets only for the same thing. Meaning really the same test case.
it does not replace emojis in HTML attributes
<a title="🚀">hello</a>
<img alt="🖼️"/>
it replaces emojis in text nodes
<p>🎉</p>
<div>💃</div>
So I'm more happy with a lot of copy'n'paste test cases with better/easier descriptions than one test case running through a massive dataset and no one really knows what's tested with that data-point and what is still missing or not covered. So every expectation to the lib should be one test case - and we can bullet proof that expectation by adding multiple variants of it. Like <a>
and <img/>
for the attributes. And later someone adds <button>
to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored the tests now and followed your advice of individual tests. Great idea as this allows me to more obviously target edge cases. So I've taken advantage of that by adding some explicit expectations to tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good news!
I've caught the root of the issue I've been chasing that I thought was something I did wrong. Specifically the way HTML attributes were being cast/converted to their HTML entities equivalent instead of staying unicode.
It turns out that this is caused by DOMDocument itself. I thought it was being caused by either: a) the xpath I found to use, or b) the HTML5 transforming library I opted to choose. In the end, reading the comments on PHP's docs for DOMDocument was my salvation! Specifically the 15 year old comment from xuanbn
that said:
I have discovered that, to help loadHTML() process utf file correctly, the meta tag should come first, before any utf string appear. For example, this HTML file.
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title> Vietnamese - Tiếng Việt</title>
</head>
<body></body>
</html>
With that new info I added an additional test that shows how the HTML being input determines how DOMDocument treats things up on calling saveHtml()
. So I think this gets us to a lot better place where I've created a working solution, but we show that it's experimental and highlight the issue with a (skipped) failing test.
30c7f40
to
e0f2540
Compare
<!DOCTYPE html> | ||
<html lang="en"> | ||
<head> | ||
<meta http-equiv="content-type" content="text/html; charset=utf-8"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the "correct" meta tag to set charset to UTF8.
This will allow DOMDocument
to preserve the raw Emoji in the document.
<!DOCTYPE html> | ||
<html lang="en"> | ||
<head> | ||
<meta charset="UTF-8"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the "incorrect" method to set charset to UTF8.
Unfortunately it will not be respected by DOMDocument
and Emoji will become HTML entities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be fixed before merge - we should just wrap all fragments in our own HTML page that uses the correct charset meta tag. This will ensure they are encoded correctly and then we can capture the fragment from our "template".
<h1>Do a quick kickflip! <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f6f9.png" alt="🛹" width="72" height="72" loading="lazy" class="twemoji"></h1> | ||
<p>This is HTML text that should be replaced, but the emoji in the head should not.</p> | ||
<h2>Time for a CRAB RAVE!</h2> | ||
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"></p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See here, how the alt
HTML attribute is still a valid Emoji.
@@ -0,0 +1 @@ | |||
<p>This is some fancy-<img src="https://twemoji.maxcdn.com/v/latest/72x72/1f483.png" alt="💃" width="72" height="72" loading="lazy" class="twemoji"> Markdown/WYSIWYG text with surrounding <p> tags enabled. <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f389.png" alt="🎉" width="72" height="72" loading="lazy" class="twemoji"></p></p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However with this fragment over here it's obvious DOMDocument
executes the normal behavior. The alt
attribute is incorrectly encoded to html entity.
Just let me know when you want me to review and merge it^^ |
@Gummibeer - Hiya, Think that I've got things in a far more functional state now. The solution I've landed on seems to handle 80% of the cases via the main "happy path" - that is works perfect without a hitch. Then there's the ones with incorrect or missing meta |
This is a WIP PR to address #2 - however I will resubmit/force-push this branch once we have #10 merged in to fix PHP 8.1 support fully.
As it stands now, here are the remaining concerns to address:
How to flag as experimental? Is there an PHPDoc annotation I can apply to the class?Might be addressed, waiting for confirmation.Twemoji
but I opted for a new Replacer class. This way the new functionality doesn't directly modify the base Twemoji behaviour. (If this works well for HTML I can work on a follow up PR to address adding a Markdown Replacer class)How should the HTML replacer filter nodes for emoji replacement? I need to do more testing around this and add more test data to the pest datasets. This will help us be more informed on how PHP will filter things out best.Solved: Now using XPath to filter only theText
type nodes from the document. In the DOM the Text node type represents the inner Text of an element. So by filtering the page first to the body, then selecting all text elements we can safely catch all emoji.