
Only grab necessary subresources #33

Open
Treora opened this issue Sep 13, 2018 · 4 comments
Labels: snapshot quality (Improving fidelity/size/durability/etc of the output)


Treora (Contributor) commented Sep 13, 2018

Currently, we inline all resolutions listed in an <img>'s srcset, all <audio> and <video> sources, all stylesheets, et cetera. This makes snapshots huge. The upside is that the snapshot will be as rich as the original, and more likely to work and look as intended across browsers and screen resolutions. Depending on the application, one factor or the other may be more important, so it would be nice to make it configurable how much we grab. Some preliminary thoughts on this:

  • One reasonable desire is to grab only the things that are currently in use (if this can be tested for). This could also help a lot with speeding up freeze-dry, as those things may be available from the cache.

  • For images with multiple resolutions, we could read element.currentSrc and grab only that one. And/or perhaps get the one with the highest resolution. (A sketch of this and the points below follows this list.)

  • For audio and video, the sources are usually different file formats; currentSrc seems a reasonable choice again, or some prewired preference to pick a widely supported and/or well-compressed format (again a possible trade-off).

  • For stylesheets, we may filter by media queries, both in the media attribute of a <link> (to omit a whole stylesheet) and in @media at-rules inside stylesheets (to omit the subresources they affect). The next question is then which media queries to filter for: media type (screen/print), window size; possibly, again, only take what is currently active.

  • For fonts, we could take only the ones currently used/loaded (how? the status attribute of the fonts in document.fonts?). And we could hard-code a preference for some well-compressed and/or widely supported file format.
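A minimal sketch of what grabbing only the currently active resources could look like; the helper names here are made up for illustration and are not part of freeze-dry's API:

```ts
// Hypothetical helpers, not freeze-dry's actual API.

// Images and media: prefer the resource the browser actually selected.
// Note: currentSrc is empty until the browser has made its choice.
function pickCurrentSrc(el: HTMLImageElement | HTMLMediaElement): string | null {
  return el.currentSrc || el.getAttribute('src');
}

// Stylesheets: a <link rel="stylesheet"> whose media attribute does not
// match the current environment could be omitted entirely.
function isStylesheetActive(link: HTMLLinkElement): boolean {
  const media = link.getAttribute('media');
  return !media || window.matchMedia(media).matches;
}

// Fonts: keep only the font faces the page has actually loaded.
function loadedFontFaces(doc: Document): FontFace[] {
  return [...doc.fonts].filter(face => face.status === 'loaded');
}
```

All three checks reflect the state at snapshot time, so a snapshot taken in a narrow window would permanently omit the wide-window resources; that is exactly the fidelity/size trade-off mentioned above.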

Treora added the snapshot quality label on Apr 6, 2019
JYone3A commented Feb 1, 2023

Another optimization could be done for url()s in CSS (a rough sketch follows the list):

  • include a new <style> block at the top of <head>
  • store each inlined data URL there as a CSS variable
  • use this variable in the url() definition currently being inlined
  • keep a map somewhere: URL → CSS variable
  • if the same URL is to be inlined again, directly reuse the existing CSS variable
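A rough sketch of this idea (untested; the variable naming scheme and the shared <style> handling are assumptions, not freeze-dry's actual code). One CSS subtlety: var() cannot be used inside url(...), so the custom property has to hold the entire url("data:...") value, which is then used as e.g. background-image: var(--freeze-dry-res-0):

```ts
// Hypothetical deduplication helper; the variable naming scheme is made up.
const urlToVar = new Map<string, string>();
let counter = 0;

// Returns a CSS value to use in place of an inlined url(data:...).
function cssValueFor(dataUrl: string, sharedStyle: HTMLStyleElement): string {
  let varName = urlToVar.get(dataUrl);
  if (varName === undefined) {
    varName = `--freeze-dry-res-${counter++}`;
    urlToVar.set(dataUrl, varName);
    // Append as text (rather than via the CSSOM) so the rule survives
    // serialising the DOM back to HTML.
    sharedStyle.append(`:root { ${varName}: url("${dataUrl}"); }\n`);
  }
  // The variable holds the whole url(...) value, since var() is not
  // allowed inside url().
  return `var(${varName})`;
}
```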

Could <link> elements to CSS files be replaced with <style> elements, with the CSS included as plain text rather than base64-encoded? That should save another ~30% for CSS.

Another thing I came across was a responsive image whose srcset value pointed to an image that did not exist. The website returned a 404 error page, which was then inlined instead of being omitted, adding some 2 MB to the dump. So checking the MIME type when loading image sources, and omitting non-images, could help further?

Treora (Contributor) commented Feb 2, 2023

Thanks for the tips!

> Could <link> elements to CSS files be replaced with <style> elements, with the CSS included as plain text rather than base64-encoded? That should save another ~30% for CSS.

The very first implementation of freeze-dry did convert linked stylesheets to <style> elements. I changed it to the current behaviour for two reasons:

  1. We’d have to ensure the stylesheet does not corrupt the HTML (see issue #17: Inlining corrupt stylesheets can corrupt html). This could probably be done, though.
  2. I tried to make the resulting tree match its original as closely as possible. However, there are several optimisations that would often be desirable, so this ‘as-close-as-possible’ goal should probably become optional, so that people can choose depending on their use case.

> Another thing I came across was a responsive image whose srcset value pointed to an image that did not exist. The website returned a 404 error page, which was then inlined instead of being omitted, adding some 2 MB to the dump. So checking the MIME type when loading image sources, and omitting non-images, could help further?

See issue #31. This should be easy to solve: we could e.g. check for response.ok in Resource.fromLink and throw if it is false. I don’t have time to test it now, but you’re most welcome to try it out if you like!
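For the 404 case, a minimal sketch of that check (simplified; freeze-dry's actual fetching logic does more than this):

```ts
// Sketch only: throw instead of inlining an error response (e.g. a 404 page).
async function fetchSubresource(url: string): Promise<Blob> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Fetching ${url} failed: ${response.status} ${response.statusText}`);
  }
  return await response.blob();
}
```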

PS I’m also always curious to know what people use freeze-dry for. It’s a little sad to only hear about its issues… ;)

JYone3A commented Feb 7, 2023

Besides catching 404s: it seems that setting textContent on <style> nodes does work, so this should let the browser catch anything invalid?

I will at some point try to set everything up so I can play around with the code. However, I've never used TypeScript before, so let's see when I have enough time to check that out.

> PS I’m also always curious to know what people use freeze-dry for. It’s a little sad to only hear about its issues… ;)

I can imagine. I'm making an internal Firefox extension that includes Apache WebAnnotator and freezeDry (and probably I'll also use Mozilla's Readability to save a plain-text version), and that talks to an internal tool to save the data.

The idea/plan is to use this whenever we have to research some info on the web that later goes into our database, so that we can document where this info came from. The frozen websites, including the annotations, will then be shown in an iframe within the tool.

Treora (Contributor) commented Feb 8, 2023

> Besides catching 404s: it seems that setting textContent on <style> nodes does work, so this should let the browser catch anything invalid?

Maybe? Pay attention to check the escaping of characters like > as &gt;, etc. It could even happen that updating the DOM works fine, but serialising to HTML and parsing it again breaks.
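A small illustration of that round-trip hazard, assuming the stylesheet happens to contain the character sequence </style>:

```ts
// Setting textContent itself is safe; no HTML parsing happens here.
const style = document.createElement('style');
style.textContent = '/* CSS containing the sequence </style> */';
document.head.append(style);

// But serialisation does not escape the raw text inside <style>, so the
// snapshot's HTML contains a premature "</style>"; re-parsing it would end
// the stylesheet early and spill the rest into the document body.
console.log(style.outerHTML);
```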

Nice to hear your use case! Also good to hear about another use case of Apache Annotator (you might have noticed my name a lot in there too…).
