Introduce `unvisitableExtensions` to remove `isHTML` implementation #1230

edwinv · 2024-03-25T16:34:03Z

Fixes #608

When Turbo navigates to a new URL, locationIsVisitable is called to check if the location is visitable. One part of the check is checking if the requested URL returns HTML. We can never determine with 100% confidence if a request will return HTML. A page with a .html extension might still return another content-type. A page with a .foobar extension might return HTML. Using a check on the file extension is therefore imprecise and prone to bugs.

Turbo already handles non-HTML responses. When a response doesn't contain an HTML part, a contentTypeMismatch is raised. This causes the page to reload to the requested URL. So we could remove the isHTML check completely and still have a functioning situation. The major downside is having two requests: one Turbo fetch and one regular browser request.

A user could prevent these double requests by adding data-turbo="false" to these links. But there might be a sensible default: never handle known non-HTML extensions as a Turbo request. This PR introduces the Turbo.unvisitableExtensions setting to enable this.

By default, a range of common non-HTML extensions are included in this setting, like .png and .jpg. If users have many links to other extensions and don't want to sprinkle their HTML with data-turbo="false" attributes, they can add new extensions to the set:

Turbo.unvisitableExtensions.add(".mp3")
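To make the idea concrete, here is a minimal sketch of how an exclusion set could drive the visitability check. It is not Turbo's actual source: the `getExtension` helper and `isVisitableByExtension` function are hypothetical stand-ins, and the set holds only an illustrative subset of the defaults.

```javascript
// Sketch only: a stand-in for the setting described above, not Turbo's source.
const unvisitableExtensions = new Set([".jpg", ".png"]) // illustrative subset

// Hypothetical helper: the last ".something" of the final path segment, or "".
function getExtension(url) {
  const lastSegment = url.pathname.split("/").pop()
  const match = lastSegment.match(/\.[^.]*$/)
  return match ? match[0] : ""
}

// A location counts as visitable unless its extension is in the exclusion set.
function isVisitableByExtension(url) {
  return !unvisitableExtensions.has(getExtension(url))
}

// Applications extend the set with their own non-HTML extensions.
unvisitableExtensions.add(".mp3")

console.log(isVisitableByExtension(new URL("https://example.com/song.mp3")))   // false
console.log(isVisitableByExtension(new URL("https://example.com/invoices/7"))) // true
```

Anything not in the set still goes through Turbo, so a missing entry costs an extra request rather than a broken navigation.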

Because these changes are added to the lower-level checks, they work both for regular visits and the newly introduced page refresh morphs.

Mentioning @packagethief @dhh @jorgemanrubia for some visibility.

packagethief · 2024-03-25T16:59:56Z

I'm wary of using an exclusion list and assuming that most things are HTML by default. I think the list will just keep growing. But I do like the API you've proposed better than isHTML's existing regex. What would you think of using that API with an HTML extension allowlist? Then applications can more easily configure the extensions they want treated as HTML:

Turbo.htmlExtensions.add(".foobar")

The implementation would be as you've proposed, but would check for inclusion in the HTML extensions, rather than exclusion. We'd ship with the defaults already defined by the regex:

export const htmlExtensions = new Set([ ".html", ".htm", ".xhtml", ".php" ])

export function locationIsVisitable(location, rootLocation) {
  return isPrefixedBy(location, rootLocation) && htmlExtensions.has(getExtension(location))
}

EDIT: I guess that won't work when there's no extension – the implementation would need to account for that.
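One way the allowlist variant might account for extensionless paths is to treat an empty extension as HTML. The sketch below assumes that rule; `getExtension` and `looksLikeHtml` are hypothetical names used for illustration, not functions confirmed by this thread.

```javascript
// Sketch of the allowlist idea, with an explicit rule for "no extension".
const htmlExtensions = new Set([".html", ".htm", ".xhtml", ".php"])

// Hypothetical helper: the last ".something" of the final path segment, or "".
function getExtension(url) {
  const lastSegment = url.pathname.split("/").pop()
  const match = lastSegment.match(/\.[^.]*$/)
  return match ? match[0] : ""
}

// Assumption: a path without any extension is treated as HTML.
function looksLikeHtml(url) {
  const extension = getExtension(url)
  return extension === "" || htmlExtensions.has(extension)
}

console.log(looksLikeHtml(new URL("https://example.com/about")))      // true
console.log(looksLikeHtml(new URL("https://example.com/index.html"))) // true
console.log(looksLikeHtml(new URL("https://example.com/report.pdf"))) // false
```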

edwinv · 2024-03-25T18:11:36Z

On the one hand, finding the extension is always an educated guess. In our application we have URLs with a filter, containing a date range representation. Example: /invoices/filter/period:202401..202403. This is a valid URL and is perfectly handled by Rails routing. Turbo thinks the extension is .202403. This is a dynamic part of the URL and can never be added to an htmlExtensions array on the frontend. Turbo concludes it is not HTML and always reloads the page. Up until recently this wasn't an issue: while navigating, the user just got a fresh page. With Turbo streams refreshing/morphing the page, the user now faces a complete refresh while we are just trying to update a small part of it. This degrades the user experience.
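The misdetection is easy to reproduce with a small extension extractor. The `getExtension` function below is an approximation written for this illustration (take everything from the last dot of the final path segment), not a copy of Turbo's code.

```javascript
// Approximation of extension extraction: last ".something" in the final segment.
function getExtension(url) {
  const lastSegment = url.pathname.split("/").pop()
  const match = lastSegment.match(/\.[^.]*$/)
  return match ? match[0] : ""
}

const samples = [
  "https://example.com/invoices/filter/period:202401..202403",
  "https://example.com/invoices/42",
  "https://example.com/report.pdf"
]

for (const sample of samples) {
  console.log(JSON.stringify(getExtension(new URL(sample))))
}
// ".202403"
// ""
// ".pdf"
```

The first result is the problem case: the tail of a date range is indistinguishable from a file extension when only the path is inspected.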

On the other hand, we need to think about fallback situations. If the treatAsNonHtml set is not covering all situations, the page will still load. It takes two requests, but the user won't notice. The major downside is wasting traffic and a bit of a delay for the user. Compare this to an incomplete htmlExtensions. Then the user will have a degraded experience due to full page reloads, while morphing is expected.

So, I'm still in favor of the proposed set for non-HTML extensions. I'm aware it is not covering all possible extensions, but when it does not cover an extension, the user won't notice and the experience stays the same.

I like the htmlExtensions name, so maybe we should rename treatAsNonHtml to a simpler name like nonHtmlExtensions or neverLoadWithTurbo.

packagethief · 2024-03-25T18:37:20Z

> So, I'm still in favor of the proposed set for non-HTML extensions. I'm aware it is not covering all possible extensions, but when it does not cover an extension, the user won't notice and the experience stays the same.

Thanks for the explanation. I hadn't considered dynamic paths like the one you described. I think your reasoning makes sense.

I think we'd need a much larger default exclusion list, however. Just looking at Basecamp and HEY, we have a pretty extensive list of extensions we'll auto-link to: zip, tar, gz, bz2, rar, 7z, dmg, exe, msi, pkg, deb, iso, bmp, mp4, mov, avi, mkv, wmv, heic, heif, mp3, wav, ogg, aac, wma, webm, ogv, mpg, mpeg. While these could return HTML, it's extremely unlikely, and we wouldn't want to start making two requests to resolve them.

packagethief · 2024-03-25T19:01:27Z

Thinking about it more, I think there are more downsides to an exclusion list than the inclusion regex we have now. It makes more sense to define the extensions that we're confident are HTML than it does to define the ones that aren't. The former is going to be a much smaller set than the latter for all practical purposes.

The example with a pathname like /period:202401..202403 feels like an uncommon case, and one that could still be supported by augmenting the regex we have now. I'd be in favor of a nicer API that didn't involve replacing the regex though, like a way to define additional extension matchers.

edwinv · 2024-03-25T19:39:18Z

For the inclusion list/regex to work, it should have perfect coverage. My situation is very specific, but there have been plenty of other examples, like FQDNs such as example.com, Rails ids in paths, .deploy, file extensions, and usernames with a dot in them. I don't see how a default list/regex is going to cover all these situations.

If the list/regex is a configuration, a user could extend the list and cover their needs. But even then it is hard to get it covered. Many paths contain dynamic parts that are hard to contain in a list or regex.

But I'm also thinking about the developer experience. A new developer starting with Turbo expects Turbo to work all the time. When some of the requests/refreshes suddenly become page reloads, it is not clear at first what is happening. This was my firsthand experience over the last few days, when users complained about random refreshes and we couldn't find the cause because the . only appears in the URL in specific situations.

Having an inclusion list/regex makes things more flexible for the developer, but also opens the door for new bugs. What if the regex matches all extensions by accident? Then all of a sudden Turbo starts loading full PDF files or large binaries.

I don't see a downside for a long(er) exclusion list. We probably can find a reasonable list of say 100 most used file extensions on the web that give no HTML response. If we miss something, the end-user won't notice other than the double request. And the developer can safely extend the list by providing a specific extension.

What I was also thinking about is the introduction of the (first) Turbo configuration. Instead of going the Turbo.someConfig route, we could introduce a specific <meta> tag for the configuration. This is more in line with how other configurations for Turbo are made.

brunoprietog · 2024-04-01T10:46:33Z

Is it too complicated to add a slash at the end? In Rails you can easily do it with the trailing_slash option, at least. It would automatically solve your problem.

edwinv · 2024-04-05T09:31:36Z

> Is it too complicated to add a slash at the end? In Rails you can easily do it with the trailing_slash option, at least. It would automatically solve your problem.

Yes, this has all kinds of side effects in our application. Furthermore, I believe a library like Turbo shouldn't dictate whether slashes are added or not. Rails might have an easy solution for this, but isn't this library intended to be used by other frameworks too?

edwinv · 2024-05-13T08:04:31Z

How are we going to move forward with this PR? I haven't read any arguments that have convinced me the exclusion list is not going to work, other than some minor inconveniences with maintaining the list. From my point of view this downside does not outweigh the downside of unexpected full page refreshes caused by some random . being interpreted as a file extension. I still consider detecting file types by their extension to be a code smell.

gjtorikian · 2024-08-06T12:28:40Z

Given the original issue, and other solutions that have been proposed (e.g. #519), what are the requirements for a solution to fix this two+ year old issue? It seems to me that we shouldn’t let perfect be the enemy of good: yes, maintaining an exclusion list sucks, and maybe the list in this PR doesn’t cover every case. But there’s currently no clean way to implement what the PR is fixing.

> I'd be in favor of a nicer API that didn't involve replacing the regex though, like a way to define additional extension matchers.

This was the last message on the topic. Is your belief the same, @packagethief? Because it sounds like the API/regex is the issue, not the overall idea. And if that is the issue, then I think that API can be improved in the future; but if you disagree, then maybe this PR can be amended to introduce functionality to make this configuration easier.

packagethief · 2024-08-06T14:59:18Z

> It seems to me that we shouldn’t let perfect be the enemy of good: yes, maintaining an exclusion list sucks

I agree with that. I've come around on the idea of an exclusion list. As long as the default is reasonably comprehensive (there's a longer list in #1230 (comment)) and there's an official way to configure it (which this PR includes), it should be less onerous than having to define your own permitted extensions.

If I may though, treatAsNonHtml feels a little opaque. This is really about whether a particular path should be visitable. With that in mind, I think a name like Turbo.unvisitableExtensions would be more intention-revealing.

gjtorikian · 2024-08-07T12:46:26Z

Sorry for the direct ping @edwinv — just wondering if you had time to tidy up the work here!

edwinv · 2024-08-07T13:01:19Z

@gjtorikian I think the PR is finished, now that we have agreed on the exclusion list. I've renamed the variable (thanks for the suggestion @packagethief) and added some more extensions to the list to cover more cases. Ready for a final review!

edwinv · 2024-08-07T13:07:48Z

CI fails on a test that does not fail for me locally, was not failing on previous commits, and is unrelated to my changes. It seems like a random failure, but I can't retry it to verify this.

gjtorikian · 2024-08-07T13:47:21Z

Yeah, a quick search shows that the same test failed last month: #1169 (comment)

I reckon it’s flakey.

gjtorikian · 2024-08-22T17:34:50Z

With the cleanup made and the flakey test flaking, is this ready to be merged/released?

packagethief · 2024-08-22T18:45:51Z

> With the cleanup made and the flakey test flaking, is this ready to be merged/released?

Yes! Sorry for losing track. Looks like there's a small conflict to resolve.

brunoprietog · 2024-08-26T02:23:56Z

The conflict is caused by #1217, which means that this should be moved to Turbo.config.drive.unvisitableExtensions.
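After that move, application code would reach the set through the config namespace. The snippet below uses a stand-in `Turbo` object so it runs on its own; in a real application `Turbo` would come from the Turbo package, and the set contents shown here are only an illustrative subset.

```javascript
// Stand-in for the real Turbo object so this snippet is self-contained.
// Assumption: the setting is a Set living at Turbo.config.drive.unvisitableExtensions.
const Turbo = {
  config: {
    drive: {
      unvisitableExtensions: new Set([".jpg", ".png"]) // illustrative subset
    }
  }
}

// Opt an additional extension out of Turbo visits.
Turbo.config.drive.unvisitableExtensions.add(".mp3")

console.log(Turbo.config.drive.unvisitableExtensions.has(".mp3")) // true
```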

edwinv · 2024-09-02T10:05:42Z

The conflict has been resolved in line with the #1217 changes.

edwinv · 2024-09-02T11:46:30Z

I've identified an issue with the tests in #1217 due to the complete config being false instead of an object. Fixed it in my last commit. This is unrelated to my changes, but caused exceptions in the tests.

On a side note: it took quite some time to find the issue because Playwright does not fail on JS errors that occur in the browser. I'm not very familiar with Playwright, but wouldn't it be better to catch all errors and console logs and raise them in the tests? Otherwise some exceptions might go unnoticed because we just ignore them.
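A rough sketch of what that could look like: Playwright pages emit `pageerror` for uncaught exceptions and `console` for console messages, so a small helper can collect both and let a test assert that nothing was recorded. The `collectBrowserErrors` name and the way it would be wired into this repository's test setup are assumptions, not something taken from the Turbo test suite.

```javascript
// Collect browser-side errors from a Playwright `page` so tests can fail on them.
function collectBrowserErrors(page) {
  const errors = []

  // Uncaught exceptions thrown inside the page.
  page.on("pageerror", (error) => {
    errors.push(`pageerror: ${error.message}`)
  })

  // console.error(...) calls made inside the page.
  page.on("console", (message) => {
    if (message.type() === "error") {
      errors.push(`console: ${message.text()}`)
    }
  })

  return errors
}

// Hypothetical usage inside a test body:
//   const errors = collectBrowserErrors(page)
//   await page.goto("/some/fixture.html")
//   assert.deepEqual(errors, [])
```

Whether every console error should fail a test is a judgment call; some suites only fail on `pageerror` and log the rest.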

edwinv · 2024-09-12T07:40:01Z

All checks are green, ready to merge I guess? I prefer to be ahead of potential new conflicts due to other changes being merged.

packagethief

Thanks @edwinv. Just a couple of style tweaks and I think we're ready to merge. Thanks for sticking with it 🙏

Introduce treatAsNonHtml to remove isHTML implementation

970f251

chillenberger mentioned this pull request Mar 26, 2024

Dan doc nav no scroll postgresml/postgresml#1388

Merged

2 tasks

edwinv added 3 commits August 7, 2024 14:49

Rename treatAsNonHtml to unvisitableExtensions and expand list

09b1e36

Merge branch 'hotwired:main' into ishtml_refresh

fe00499

Sort extensions in Set and add some more commonly used extensions to our default set.

cdd500c

edwinv changed the title ~~Introduce treatAsNonHtml to remove isHTML implementation~~ Introduce unvisitableExtensions to remove isHTML implementation Aug 7, 2024

edwinv added 2 commits September 2, 2024 12:49

Merge branch 'main' into ishtml_refresh

e16dfd4

Move unvisitableExtensions config to config

f7b4867

edwinv added 2 commits September 2, 2024 12:10

Remove semicolon

28911ad

Fix config error in test

10adafb

packagethief approved these changes Sep 12, 2024

src/tests/functional/visit_tests.js (three review suggestions, outdated and resolved)

Apply suggestions from code review

30cead8

Co-authored-by: Jeffrey Hardy <[email protected]>

packagethief merged commit b8b6662 into hotwired:main Sep 12, 2024
1 check passed

brunoprietog mentioned this pull request Dec 13, 2024

Customize isHTML url pattern not only "htm|html|xhtml|php" #789

Closed
