filename,
file_extension,
filesize,
url,
download_time,
base_url,
archive_url (once the URLs are archived)
@cjyetman can you validate which of these fields you think are/aren't useful to output (or if some are missing)?
Also, to be clear: this manifest should relate only to the raw URL, correct?
I guess all of these are relevant... maybe file_extension is a bit overkill.
filename is critical so you know which file you're talking about
filesize is good to have so that you can verify the file you're looking at is actually the same one being described, because the file could have been modified and you wouldn't be able to tell. Maybe a checksum would be better, but that's a bit more difficult to verify for an average user
url is the precise location the file was downloaded from. I think this is pretty fundamental to recording the provenance of the data/file
download_time is the precise time that the file was downloaded. This is important because files found at URLs are not necessarily stable, and often change over time, so the URL is not really enough to precisely record the provenance of the data/file
base_url was originally included here because we're capturing a JSON file that technically is not intended for anyone to access directly, and is not linked to or findable by any "normal" web browsing. Instead, the JSON file is used to feed a table on the page found at the "base_url". I have been in situations before where someone else, or a future version of myself, asked "where did you get this from? I can't find it anywhere on that site?", and base_url was the answer.
archive_url if the page is getting archived (on archive.org), this is also a convenience for anyone in the future trying to find this file or update this process, especially if the file has moved or completely disappeared from the site. One would be able to download the file again from this URL, exactly as it was at the time the archive was made. It's also a good indicator the file WAS archived, which is good to know.
These are all things to precisely record the provenance of the file, and facilitate someone in the future trying to understand something about where it came from, what it means, how to find a new version that's equivalent, etc.
Also... I think I had file_extension because sometimes JSON files like this don't even have an extension, because they come from an AJAX request or something... so it's convenient to know what type of file the original developer of the code/archiver expected the file to be, especially if the filename/URL is some random string of characters with no discernible meaning.
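To make the field list above concrete, here is a minimal sketch in Python of how one manifest entry could be assembled for a file that has already been downloaded. It is not the archiver's actual code: the function name `manifest_entry`, its parameters, and the extra `sha256` key are assumptions for illustration only (the checksum is included just because it was floated above as a possible alternative to filesize).

```python
import hashlib
from datetime import timezone
from pathlib import Path


def manifest_entry(path, url, base_url, download_time,
                   file_extension=None, archive_url=None):
    """Build one manifest record for an already-downloaded file.

    Hypothetical helper: names and layout are assumptions, not the
    project's real implementation. `download_time` is expected to be a
    timezone-aware datetime.
    """
    path = Path(path)
    data = path.read_bytes()
    # The filename may have no extension (e.g. an AJAX endpoint), so the
    # caller can state the file type they expect; otherwise fall back to
    # whatever suffix the filename carries.
    if file_extension is None:
        file_extension = path.suffix.lstrip(".")
    return {
        "filename": path.name,
        "file_extension": file_extension,
        "filesize": len(data),  # size in bytes
        "sha256": hashlib.sha256(data).hexdigest(),  # optional extra check
        "url": url,
        # Store the timestamp in UTC, ISO 8601 format
        "download_time": download_time.astimezone(timezone.utc).isoformat(),
        "base_url": base_url,
        "archive_url": archive_url,  # None until the URL has been archived
    }
```

The returned dict can be written out with `json.dump()` alongside the downloaded file. Passing `file_extension="json"` explicitly covers the case where the filename is an opaque string with no suffix.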
Supersedes https://github.com/RMI-PACTA/pacta.data.preparation/issues/165
Full context copied manually:
@jdhoffa:
@cjyetman:
@cjyetman:
AB#9894