Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ProtectEsiTags Document filter #343

Merged
merged 2 commits into from
Oct 1, 2021
Merged

Conversation

ediamin
Copy link
Collaborator

@ediamin ediamin commented Sep 6, 2021

The PHP DOMDocument alters the ESI tags that use colon in the tag and remove the esi: part and hence breaks the system. This PR adds a new Document filter to fix this issue.

Fixes #202

@ediamin ediamin requested a review from schlessera September 6, 2021 13:28
@westonruter
Copy link
Member

Should this be disabled by default? The vast majority of sites wouldn't be needing it. But sites that do should easily be able to add it.

@ediamin
Copy link
Collaborator Author

ediamin commented Sep 7, 2021

Should this be disabled by default? The vast majority of sites wouldn't be needing it. But sites that do should easily be able to add it.

Good point! I think we can have the filter class in our core but make it optional by not including it in the Document filter classes list. This way, people will be able to consume the feature later in their code.

@schlessera please share your thought on this.

src/Dom/Document/Filter/ProtectEsiTags.php Show resolved Hide resolved
src/Dom/Document/Filter/ProtectEsiTags.php Outdated Show resolved Hide resolved
tests/Dom/DocumentTest.php Outdated Show resolved Hide resolved
@schlessera
Copy link
Collaborator

Should this be disabled by default? The vast majority of sites wouldn't be needing it. But sites that do should easily be able to add it.

We have a lot of work-arounds in place that only a minority will profit from. But unless this is a huge performance overhead, I'd prefer to keep it enabled by default, rather than to have unnecessary support requests and people with frustrating hard-to-diagnose breakage from time to time.

@westonruter
Copy link
Member

We have a lot of work-arounds in place that only a minority will profit from. But unless this is a huge performance overhead, I'd prefer to keep it enabled by default, rather than to have unnecessary support requests and people with frustrating hard-to-diagnose breakage from time to time.

To me, ESI seems like a different category of filter. If you aren't intending to use ESI tags in the first place, then they should not be passed through or else they'll cause validation errors.

@kristoferbaxter
Copy link

Absolutely agree with @westonruter here, are there places we would consider esi tags valid without an edge server?

Would validation be the edge server's concern in this model?

@westonruter
Copy link
Member

If someone is using ESI then it would be their responsibility to ensure they're composing a valid AMP page. Any validation would have to be happening on the edge server and it could not be done on the origin server in PHP here.

My concern was that PHP sanitization (#195) and validation (#194) wouldn't be able to see the ESI tags in the DOM. But after looking at the changes here more deeply, I see that since it's basically just converting esi: XML namespaced tags into tags with an esi- tag prefix. So then these will still be available in the DOM. So sanitization could still process them. And since the validator doesn't operate on the DOM (yet), then this isn't relevant either. When implementing sanitization/validation, there would have to be opt-in flags to ignore validation errors related to encountering esi-* tags.

I was originally thinking the implementation would replace ESI tags with placeholders anywhere they appear, as I seemed to remember that ESI tags could be used inside HTML attributes. But after trying to find examples of that, I can't find anything. I didn't think that ESI parsers actually parsed HTML, but apparently they do. I may have been led astray by old server-side includes. Or maybe I'm just wrong. 😄

So if the additional processing time for doing two search-and-replacements isn't a concern (once before parsing and once after serialization), then I'm fine with it being enabled by default.

@schlessera
Copy link
Collaborator

If you aren't intending to use ESI tags in the first place, then they should not be passed through or else they'll cause validation errors.

This is not about letting them pass through or not. This is only about making sure we're not corrupting the DOM when they are encountered. They were being passed through already, only the subsequent HTML was parsed in a broken way. This ensures that we can parse a document that contains ESI as well without creating garbage or fatals.

@schlessera
Copy link
Collaborator

Would validation be the edge server's concern in this model?

Yes, or our own validator's concern depending on setup. The above PR is really just about "not breaking during parsing".

@schlessera
Copy link
Collaborator

Absolutely agree with @westonruter here, are there places we would consider esi tags valid without an edge server?

The processing here will most likely happen on the origin server. So if directives for edge servers are included, I think we should keep these and not break them, unless we're going for strict valid AMP. For non-strict validation, ESI tags should be perfectly fine, I think, as we're not expecting the content to run from an AMP cache, but be distributed through some CDN.

@westonruter
Copy link
Member

Yes, or our own validator's concern depending on setup. The above PR is really just about "not breaking during parsing".

If this is the case, aren't there other examples of XML-namespaced tags that should also be accounted for? I'm thinking of <o:p> for example, which is in the validator spec but we don't currently handle it. That appears to be the only case of an XML-namespaced tag that the validator is expecting. Then there's also the namespaced SVG attributes, but the PHP DOM actually parses these successfully, so that's good.

There are some instances of XML namespaces being used for tags in HTML documents, however. There's the (deprecated) XFBML, for example. These would have the same parsing problem as ESI tags, so maybe the solution should be generalized from ProtectEsiTags to rather be ProtectXmlNamespacedTags?

@ediamin ediamin force-pushed the add/202-esi-document-filter branch from 50e787a to c059a53 Compare September 28, 2021 15:47
@schlessera schlessera force-pushed the add/202-esi-document-filter branch from c059a53 to 8749350 Compare October 1, 2021 08:56
@schlessera schlessera added this to the 0.8.0 milestone Oct 1, 2021
@schlessera schlessera added the DOM label Oct 1, 2021
@schlessera schlessera merged commit 462b562 into main Oct 1, 2021
@schlessera schlessera deleted the add/202-esi-document-filter branch October 1, 2021 09:06
@schlessera
Copy link
Collaborator

These would have the same parsing problem as ESI tags, so maybe the solution should be generalized from ProtectEsiTags to rather be ProtectXmlNamespacedTags?

I'd keep these as separate filters. With the ESI tags, they are technically not really XML namespace tags, they are not even meant to be part of the document. They are more like Mustache tags in that they are placeholders that have to be replaced before the "actual" HTML/XML hits the browser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Varnish <esi> tags are being stripped by DOMDocument
4 participants