
Document idempotency #147

Open
cooljeanius opened this issue Nov 29, 2022 · 4 comments

Comments

@cooljeanius

Is it safe to run this script multiple times on the same archive? Is it safe to run new versions of the script on versions of an archive that have already been processed by an old version of the script? Whether or not the script is idempotent should be documented in the README, IMO.

@lenaschimmel
Collaborator

These are good questions and I agree that this should be documented in the README. I'll try to give a quick (but not short) answer here, and I or someone else will probably update the README later. I've highlighted the sections which are (IMHO) most important.

Regarding updates: yes, we really try to make sure that upgrading the script and re-running it on the same folder is safe and convenient. But there's no guarantee, and with the quick pace of development and without automated software tests, there might be problems sometimes.

The script should be idempotent, and in its current version it probably is in the broad sense, but with some catches.

Say you run the script, answer its questions a certain way, and everything goes well. Now, if you run it again and give the same answers to its questions, then the result will be the same as after the first run.
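
(If you want to verify that on your own archive, one way — purely an illustration, not something the script ships with — is to checksum the output folder after each run and compare:)

```python
import hashlib
from pathlib import Path

def checksum_outputs(output_dir: str) -> dict:
    """Map every generated file to its SHA-256 hash, so two runs can be
    compared byte-for-byte. Purely illustrative, not part of the script."""
    root = Path(output_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }

# Take a snapshot after the first run, run the script again with the same
# answers, take another snapshot, and compare the two dicts: identical
# dicts mean the runs were idempotent for your archive.
```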

The script is also (somewhat) incremental: if some resources could not be downloaded on the first run, the result is incomplete. The second run may be able to fetch more online resources and generate a result that is (more) complete. This is definitely true for media resources (images and videos).
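
(As a rough sketch of the incremental part — the helper name and the use of `requests` are mine, not the script's actual code — the idea is simply to skip anything that already exists on disk and let a later run retry anything that failed:)

```python
import os
import requests

def download_if_missing(url: str, dest_path: str) -> bool:
    """Fetch a media file only if it is not already on disk.
    Returns True if the file is present after the call.
    (Illustrative sketch, not the script's real implementation.)"""
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return True  # already fetched on an earlier run, so skip it
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return False  # leave it incomplete; a later run can retry
    with open(dest_path, "wb") as f:
        f.write(response.content)
    return True
```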

But here's the catch: some online resources are not cached / saved properly, and if at the time of the second run the online availability of certain resources, e.g. Twitter user profiles, is worse than on the first run, the resulting .md and .html files might be less complete than after the first run.

My work on this script focuses on improving this, because I think it will become more important if / when Twitter's API becomes less reliable over time. Ideally, it should be safe to run this script even when Twitter is offline, or returns only empty / nonsensical responses. We're definitely not there yet, and it's not trivial.
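
(One defensive pattern that would get us closer — hypothetical names, not code from the repo — is to treat the local cache as the source of truth and only let a non-empty API response overwrite it:)

```python
import json
import os

def load_user_profile(user_id: str, cache_dir: str, fetch_fn):
    """Return the freshest non-empty profile available: prefer a good live
    response, but fall back to the cache when the API is down or returns
    nothing. (Hypothetical sketch, not the script's actual code.)"""
    cache_path = os.path.join(cache_dir, f"user_{user_id}.json")
    cached = None
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cached = json.load(f)
    try:
        fresh = fetch_fn(user_id)  # may return None or {} when Twitter misbehaves
    except Exception:
        fresh = None
    if fresh:  # only a non-empty answer is allowed to overwrite the cache
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(fresh, f)
        return fresh
    return cached  # a bad API day never makes the output less complete
```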

In issue #144 and in this branch / fork we're working on downloading and saving referenced tweets. The new tweets are merged with the ones in the archive (without modifying the original file), so that the locally available data only becomes more complete over time. But there are still some bugs, so that the tweet cache grows a bit each time, instead of settling on a "complete" state and staying there forever.
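
(The merge itself is conceptually simple — something like the following, with hypothetical names; the real code in the branch is more involved:)

```python
import json

def merge_tweet_cache(cache_path: str, new_tweets: list) -> dict:
    """Fold newly downloaded tweets into a side-car cache file, keyed by
    tweet id, without touching the original archive. Running this twice
    with the same input leaves the cache unchanged the second time.
    (Hypothetical sketch, not the branch's actual implementation.)"""
    try:
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}
    for tweet in new_tweets:
        cache.setdefault(tweet["id_str"], tweet)  # keep the first copy of each id
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)
    return cache
```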

If you want to be 100% sure: copy the full output of the script before running it again. If you want to be 99.5% sure: copy the .html and .md output. The media folder will be fine anyway.
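
(For the backup, something as simple as this would do — a sketch in Python, assuming the output lives in a single folder; a plain `cp` works just as well:)

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_outputs(output_dir: str) -> Path:
    """Copy the generated .md and .html files into a timestamped sibling
    folder before re-running the script. (Illustrative only.)"""
    src = Path(output_dir)
    dest = src.parent / f"{src.name}-backup-{datetime.now():%Y%m%d-%H%M%S}"
    dest.mkdir(parents=True, exist_ok=True)
    for pattern in ("*.md", "*.html"):
        for path in src.glob(pattern):
            shutil.copy2(path, dest / path.name)
    return dest
```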

@llemay

llemay commented Dec 3, 2022

Thank you, I had this actual literal question and just came here to ask it. 👍🏻

@cooljeanius
Author


OK, could this info go in the wiki or somewhere?

@cooljeanius
Author


@timhutton where do you think would be the best place to put such documentation?
