
Document idempotency #147

Open
cooljeanius opened this issue Nov 29, 2022 · 4 comments

Comments

@cooljeanius

Is it safe to run this script multiple times on the same archive? Is it safe to run new versions of the script on versions of an archive that have already been processed by an old version of the script? Whether or not the script is idempotent should be documented in the README, IMO.

@lenaschimmel
Collaborator

These are good questions and I agree that this should be documented in the README. I'll try to give a quick (but not short) answer here, and I or someone else will probably update the README later. I've highlighted the sections which are (IMHO) most important.

Regarding updates: yes, we really try to make sure that upgrading the script and re-running it on the same folder is safe and convenient. But there's no guarantee, and with the quick pace of development and without automated software tests, there might be problems sometimes.

The script should be idempotent, and in its current version it probably is in the broad sense, but with some catches.

Say you run the script, answer its questions a certain way, and everything goes well. Now, if you run it again and give the same answers to its questions, then the result will be the same as after the first run.
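
(If you want to verify that on your own archive, one way — purely an illustration, not something the script ships with — is to checksum the output folder after each run and compare:)

```python
import hashlib
from pathlib import Path

def checksum_outputs(output_dir: str) -> dict:
    """Map every generated file to its SHA-256 hash, so two runs can be
    compared byte-for-byte. Purely illustrative, not part of the script."""
    root = Path(output_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }

# Take a snapshot after the first run, run the script again with the same
# answers, take another snapshot, and compare the two dicts: identical
# dicts mean the runs were idempotent for your archive.
```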

The script is also (somewhat) incremental: if some resources could not be downloaded on the first run, the result is incomplete. The second run may be able to fetch more online resources and generate a result that is (more) complete. This is definitely true for media resources (images and videos).
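
(As a rough sketch of the incremental part — the helper name and the use of `requests` are mine, not the script's actual code — the idea is simply to skip anything that already exists on disk and let a later run retry anything that failed:)

```python
import os
import requests

def download_if_missing(url: str, dest_path: str) -> bool:
    """Fetch a media file only if it is not already on disk.
    Returns True if the file is present after the call.
    (Illustrative sketch, not the script's real implementation.)"""
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return True  # already fetched on an earlier run, so skip it
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return False  # leave it incomplete; a later run can retry
    with open(dest_path, "wb") as f:
        f.write(response.content)
    return True
```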

But here's the catch: some online resources are not cached / saved properly, and if at the time of the second run the online availability of certain resources, e.g. Twitter user profiles, is worse than on the first run, the resulting .md and .html files might be less complete than after the first run.

My work on this script focuses on improving this, because I think it will become more important if / when Twitter's API becomes less reliable over time. Ideally, it should be safe to run this script even when Twitter is offline, or returns only empty / nonsensical responses. We're definitely not there yet, and it's not trivial.
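
(One defensive pattern that would get us closer — hypothetical names, not code from the repo — is to treat the local cache as the source of truth and only let a non-empty API response overwrite it:)

```python
import json
import os

def load_user_profile(user_id: str, cache_dir: str, fetch_fn):
    """Return the freshest non-empty profile available: prefer a good live
    response, but fall back to the cache when the API is down or returns
    nothing. (Hypothetical sketch, not the script's actual code.)"""
    cache_path = os.path.join(cache_dir, f"user_{user_id}.json")
    cached = None
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cached = json.load(f)
    try:
        fresh = fetch_fn(user_id)  # may return None or {} when Twitter misbehaves
    except Exception:
        fresh = None
    if fresh:  # only a non-empty answer is allowed to overwrite the cache
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(fresh, f)
        return fresh
    return cached  # a bad API day never makes the output less complete
```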

In issue #144 and in this branch / fork we're working on downloading and saving referenced tweets. The new tweets are merged with the ones in the archive (without modifying the original file), so that the locally available data only becomes more complete over time. But there are still some bugs, so that the tweet cache grows a bit each time, instead of settling on a "complete" state and staying there forever.
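
(The merge itself is conceptually simple — something like the following, with hypothetical names; the real code in the branch is more involved:)

```python
import json

def merge_tweet_cache(cache_path: str, new_tweets: list) -> dict:
    """Fold newly downloaded tweets into a side-car cache file, keyed by
    tweet id, without touching the original archive. Running this twice
    with the same input leaves the cache unchanged the second time.
    (Hypothetical sketch, not the branch's actual implementation.)"""
    try:
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}
    for tweet in new_tweets:
        cache.setdefault(tweet["id_str"], tweet)  # keep the first copy of each id
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)
    return cache
```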

If you want to be 100% sure: copy the full output of the script before running it again. If you want to be 99.5% sure: copy the .html and .md output. The media folder will be fine anyway.
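
(For the backup, something as simple as this would do — a sketch in Python, assuming the output lives in a single folder; a plain `cp` works just as well:)

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_outputs(output_dir: str) -> Path:
    """Copy the generated .md and .html files into a timestamped sibling
    folder before re-running the script. (Illustrative only.)"""
    src = Path(output_dir)
    dest = src.parent / f"{src.name}-backup-{datetime.now():%Y%m%d-%H%M%S}"
    dest.mkdir(parents=True, exist_ok=True)
    for pattern in ("*.md", "*.html"):
        for path in src.glob(pattern):
            shutil.copy2(path, dest / path.name)
    return dest
```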

@llemay

llemay commented Dec 3, 2022

Thank you, I had this actual literal question and just came here to ask it. 👍🏻

@cooljeanius
Author


OK, could this info go in the wiki or somewhere?

@cooljeanius
Author


@timhutton where do you think would be the best place to put such documentation?
