Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make idempotent. #13

Closed
Treora opened this issue Jul 22, 2017 · 8 comments
Closed

Make idempotent. #13

Treora opened this issue Jul 22, 2017 · 8 comments

Comments

@Treora
Copy link
Contributor

Treora commented Jul 22, 2017

Freeze-drying an already freeze-dried page would ideally not have any effect. Not sure if that's the case now.

@reficul31: may be nice to add a test for this in the integration tests, that takes the output (snapshot) and applies freezeDry to it again.

@erikrose
Copy link

Differences I see in a random, complex, real-world page (number 9 in the fathom-popups corpus, for my own reference):

  1. A <base href> is added, having a file:// URL of the already-saved page on disk.
  2. A Google Ads iframe had "display: none !important;" added to its style attr. Its width attr was moved to be the last in the tag but otherwise unchanged. A hidden="" attr was added.
  3. A second Google Ads iframe had exactly the same differences.
  4. A bunch of other iframes had the order of their height and width attrs switched.
  5. One more iframe had "!important" added to its inline style, a hidden="" added, and its height and width inline style properties changed from 0 to "0px".

So really, it looks pretty good. If we…

  • figure out a heuristic for not adding another base URL,
  • watch attr ordering (or is it even important?), and
  • solve the mystery of those hiddenness properties getting added to iframes

…we're in good shape!

@erikrose
Copy link

Okay, I solved most of the mysteries: it was one of my many ad blockers scribbling on things. :-) The only differences remaining are:

  • The base href
  • The </body> tag migrated from in front of the final iframe (probably invalid) to after it. But I'm not sure how the same serializer would fail to move it the first time.

@Treora
Copy link
Contributor Author

Treora commented Mar 15, 2018

Thanks for investigating! The base tag is an issue indeed. It is more generally causing problems (see #5, #6), so we should probably think about desired behaviour and solve all these issues together.

As for the iframe you report about: I assume some script added this element in the invalid position. Freeze-dry just keeps it where it is, but when opening the snapshot your browser will fix the DOM for you, hence the next snapshot will be different. Not sure this is a problem we should try to solve, what do you think?

As a sidenote: Considering that looking at the thing may change the thing, there is something to say for actually not making freeze-dry idempotent, but instead remembering the full provenance of a page: you make a snapshot of a page, which itself was a snapshot of a page, and the final snapshot thus adds metadata to remember both steps. Things would get rather complicated however, so I guess I'd rather find a simpler but consistent model where try to stay as close as possible to the 'original' document, and accepting that minor information losses may occur. (also, I'd hope one only needs to archive a web page once, and then stores and publishes any changes using proper version control)

@erikrose
Copy link

but when opening the snapshot your browser will fix the DOM for you

Confirmed! That's what's going on. Very astute of you. :-) I'm fine with such "minor information losses".

So, actually, given that my corpus is in a VCS and I can pick and choose diffs before committing, I have no practical problems with freeze-dry's idempotency at the moment. Hooray!

@erikrose
Copy link

Gave 0.2.0 a try today. For reference, the inlined, base64'd CSS from https://grinchcentral.com/ gets changed when the initial freeze-dried page is loaded into a browser and then re-saved. See if you can spot the difference in these excerpts:

Original:

        .social a[href*='digg.com'] {background-image: url('data:text/html; charset=iso-8859-1;base64,PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9JRVRGLy9EVEQgSFRNTCAyLjAvL0VOIj4KPGh0bWw+PGhlYWQ+Cjx0aXRsZT40MDQgTm90IEZvdW5kPC90aXRsZT4KPC9oZWFkPjxib2R5Pgo8aDE+Tm90IEZvdW5kPC9oMT4KPHA+VGhlIHJlcXVlc3RlZCBVUkwgL3RoZW1lL2ltYWdlcy9pY29ucy9kaWdnLnBuZyB3YXMgbm90IGZvdW5kIG9uIHRoaXMgc2VydmVyLjwvcD4KPGhyPgo8YWRkcmVzcz5BcGFjaGUvMi40LjI1IChEZWJpYW4pIFNlcnZlciBhdCB3d3cuZ3JpbmNoY2VudHJhbC5jb20gUG9ydCA0NDM8L2FkZHJlc3M+CjwvYm9keT48L2h0bWw+Cg==');}

Re-saved:

        .social a[href*='digg.com'] {background-image: url('data:text/html;charset=iso-8859-1;base64,PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9JRVRGLy9EVEQgSFRNTCAyLjAvL0VOIj4KPGh0bWw+PGhlYWQ+Cjx0aXRsZT40MDQgTm90IEZvdW5kPC90aXRsZT4KPC9oZWFkPjxib2R5Pgo8aDE+Tm90IEZvdW5kPC9oMT4KPHA+VGhlIHJlcXVlc3RlZCBVUkwgL3RoZW1lL2ltYWdlcy9pY29ucy9kaWdnLnBuZyB3YXMgbm90IGZvdW5kIG9uIHRoaXMgc2VydmVyLjwvcD4KPGhyPgo8YWRkcmVzcz5BcGFjaGUvMi40LjI1IChEZWJpYW4pIFNlcnZlciBhdCB3d3cuZ3JpbmNoY2VudHJhbC5jb20gUG9ydCA0NDM8L2FkZHJlc3M+CjwvYm9keT48L2h0bWw+Cg==');}

It's the lousy " " after the text/html;. And that's the only difference, for the whole base64'd string. Weird!

The only other thing I notice so far is that, when resaving one of my hellish real-world example pages, full of iframes and ads and other hostility, is that the </head><body ...> tags migrate from further down in the document (after some no-doubt-evil iframes) to above the iframes. I think the latter is more valid, so I have no problem with it. But I bring it up in case you're surprised by it.

Thanks for the great work!

@Treora
Copy link
Contributor Author

Treora commented Aug 7, 2018

@erikrose As for the space inside the data URL, I have no clue why that appeared there. At least I can reproduce the result (Firefox 61, freeze-dry 0.2.0 through WebMemex 0.2.7 or 0.2.8). I guess that somewhere internally, Response.blob() creates a Blob whose .type may or may not contain a space before the charset depending on some circumstances (sounds like a bug in Firefox?); which we (using blob-util) insert into the data URL. By the way, the background image has type text/html because it is actually a 404 page; we should probably check the response type and status code! (edit: see #31)

As for the migration of invalid tags to valid places, that is expected. Browsers apply such fixes while rendering, as discussed above.

@Treora
Copy link
Contributor Author

Treora commented Aug 7, 2018

I added a test for idempotency in commit dc14ef4. At least for the simple example page, freeze-dry appears perfectly idempotent!

Probably there will be many cases where idempotency breaks down, but I will close this issue for now, and we can reopen it (or open a new one) when we find particular cases that need attention.

@Treora Treora closed this as completed Aug 7, 2018
@erikrose
Copy link

erikrose commented Aug 7, 2018

Thanks, Gerben! Sounds reasonable to me. Again, nice work; having a decent way to serialize web pages makes all the rest of my mad-scientist schemes possible. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants