Make idempotent. #13
Differences I see in a random, complex, real-world page (number 9 in the fathom-popups corpus, for my own reference):
So really, it looks pretty good. If we…
…we're in good shape! |
Okay, I solved most of the mysteries: it was one of my many ad blockers scribbling on things. :-) The only differences remaining are:
|
Thanks for investigating! The … As for the … As a sidenote: Considering that looking at the thing may change the thing, there is something to be said for not making freeze-dry idempotent at all, but instead remembering the full provenance of a page: you make a snapshot of a page, which itself was a snapshot of a page, and the final snapshot adds metadata to remember both steps. Things would get rather complicated, however, so I guess I'd rather find a simpler but consistent model where we try to stay as close as possible to the 'original' document and accept that minor information losses may occur. (Also, I'd hope one only needs to archive a web page once, and then stores and publishes any changes using proper version control.) |
Confirmed! That's what's going on. Very astute of you. :-) I'm fine with such "minor information losses". So, actually, given that my corpus is in a VCS and I can pick and choose diffs before committing, I have no practical problems with freeze-dry's idempotency at the moment. Hooray! |
Gave 0.2.0 a try today. For reference, the inlined, base64'd CSS from https://grinchcentral.com/ gets changed when the initial freeze-dried page is loaded into a browser and then re-saved. See if you can spot the difference in these excerpts: Original:
Re-saved:
It's the lousy " " after the … The only other thing I notice so far, when resaving one of my hellish real-world example pages (full of iframes and ads and other hostility), is that the … Thanks for the great work! |
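For anyone trying to spot a one-character change like this between an original and a re-saved snapshot, a small helper can do the looking. The following is only a sketch: `firstDifference` and `describeDifference` are hypothetical names, not part of freeze-dry, and the sample strings are made up rather than taken from the excerpts above.

```typescript
// Hypothetical helpers for comparing two snapshots; not part of freeze-dry.

// Returns the index of the first differing character, or -1 if the strings are equal.
function firstDifference(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    if (a[i] !== b[i]) return i;
  }
  // One string is a prefix of the other, or they are identical.
  return a.length === b.length ? -1 : limit;
}

// Describes the first difference with some surrounding context.
// JSON.stringify makes stray whitespace visible inside quotes.
function describeDifference(a: string, b: string, context = 10): string {
  const i = firstDifference(a, b);
  if (i === -1) return "identical";
  const start = Math.max(0, i - context);
  const left = JSON.stringify(a.slice(start, i + context));
  const right = JSON.stringify(b.slice(start, i + context));
  return `index ${i}: ${left} vs ${right}`;
}

// Made-up sample strings that differ by a single inserted space.
console.log(describeDifference("abc;def", "abc; def"));
```

Running the same comparison on the full original and re-saved HTML should point straight at the offending character, which is easier than eyeballing a long base64 string.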
@erikrose As for the space inside the data URL, I have no clue why it appeared there. At least I can reproduce the result (Firefox 61, freeze-dry 0.2.0 through WebMemex 0.2.7 or 0.2.8). I guess that somewhere internally, … As for the migration of invalid tags to valid places, that is expected: browsers apply such fixes while rendering, as discussed above. |
I added a test for idempotency in commit dc14ef4. At least for the simple example page, freeze-dry appears perfectly idempotent! Probably there will be many cases where idempotency breaks down, but I will close this issue for now, and we can reopen it (or open a new one) when we find particular cases that need attention. |
Thanks, Gerben! Sounds reasonable to me. Again, nice work; having a decent way to serialize web pages makes all the rest of my mad-scientist schemes possible. :-) |
Freeze-drying an already freeze-dried page would ideally not have any effect. Not sure if that's the case now.
@reficul31: may be nice to add a test for this in the integration tests, that takes the output (snapshot) and applies freezeDry to it again.
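The check suggested here could look roughly like the sketch below. It is not the test from the repository: `isIdempotent` is a hypothetical helper, and `fakeSnapshot` is a stand-in that only mimics one kind of normalisation so the example runs without a browser. In a real integration test, the snapshot function would be a wrapper that loads the HTML into a document and calls freezeDry on it.

```typescript
// A snapshot function takes HTML and resolves to the serialised snapshot.
type Snapshot = (html: string) => Promise<string>;

// Hypothetical stand-in, NOT freeze-dry itself: it collapses whitespace runs,
// loosely imitating the kind of normalisation a browser might apply.
const fakeSnapshot: Snapshot = async (html) =>
  html.replace(/\s+/g, " ").trim();

// Takes a snapshot, snapshots that output again, and reports whether
// the second pass left the result unchanged.
async function isIdempotent(snapshot: Snapshot, input: string): Promise<boolean> {
  const first = await snapshot(input);
  const second = await snapshot(first);
  return first === second;
}

isIdempotent(fakeSnapshot, "<p>a   b</p>\n").then((ok) => {
  console.log(ok ? "idempotent for this input" : "second pass changed the output");
});
```

Note that the comparison is between the first and second snapshots, not between the input and the first snapshot: the first pass is allowed to change the page, and only repeated passes are expected to be stable.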