Downloading an entire website results in missing pages and content. #2

Open

Bryson14 opened this issue Mar 9, 2024 · 3 comments
Bryson14 commented Mar 9, 2024

I am trying to recover a website that was created for a non-profit organization with WordPress. It was hosted on a third-party site, but the organization has lost its admin access and somehow broke the site. I'm trying to recover the site as it was in January 2024, when it was still working. When I tried to recover it from archive.org with the CLI utility, it didn't download all the pages I was expecting.

I ran: `wayback_machine_downloader http://sorensonlegacyfoundation.org --to 20240101`. It downloads 250 files, but lots of HTML pages are still missing. For example, the entry file index.html is there, but /what-we-fund, how-to-apply, and about 10 other pages are not.
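
One way to see which URLs the tool considers in scope before downloading anything (a sketch; the --list flag is documented in the upstream hartator/wayback-machine-downloader CLI, and whether this build still supports it is an assumption):

```sh
# Print the file URLs and their archived timestamps as JSON without
# downloading, so the missing pages can be checked against the list.
wayback_machine_downloader http://sorensonlegacyfoundation.org --to 20240101 --list
```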

Looking over the raw text files in VS Code, I confirmed that these pages are missing, not just nested away somewhere, by searching for text unique to each page.

Is there something I'm missing, or should I just download each page individually from archive.org?
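
For reference, whether a given page was ever captured at all can be checked directly against the Wayback Machine's public CDX API, independently of this tool (a sketch; the endpoint and parameters are documented CDX server options, and the domain is the one from this issue):

```sh
# List unique URLs captured under the domain up to 2024-01-01.
# collapse=urlkey keeps one row per URL; fl= selects output columns.
curl "https://web.archive.org/cdx/search/cdx?url=sorensonlegacyfoundation.org/*&to=20240101&collapse=urlkey&fl=timestamp,original,statuscode"
```

If /what-we-fund never appears in that output, the Wayback Machine has no snapshot of it and no downloader option will recover it.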


WiiNewU commented Mar 10, 2024

@Bryson14 Hi, I created a post under "Rate limiting?" (#1) that might help with the temporary blocking from IA that you're having, so you can finish your download.

Bryson14 (Author) commented:

Adding the network limiter worked well! But I think it only worked because I'm on Linux; it wouldn't be a good solution for Mac or PC.
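
For readers who can't see the linked post: on Linux, a userspace limiter such as trickle can wrap the downloader (a guess at the kind of setup meant here; trickle is a real Linux tool, but whether it is what the "Rate limiting?" post actually used is an assumption):

```sh
# Cap the process at roughly 100 KB/s of download bandwidth so the
# Internet Archive is less likely to temporarily block the client.
trickle -d 100 wayback_machine_downloader http://sorensonlegacyfoundation.org --to 20240101
```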


morgant commented Jun 3, 2024

I have implemented some rate limiting (see Issue #1 & PR #5) and retry (see PR #6) functionality, which should help with this.
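
As a stopgap, individual pages can also be fetched with plain curl and its built-in retry handling (a sketch; the id_ modifier asks the Wayback Machine for the raw capture without its toolbar, and -L follows the redirect to the snapshot nearest the requested timestamp):

```sh
# Fetch the capture of one page closest to 2024-01-01, retrying
# transient failures with a 10-second pause between attempts.
curl -L --retry 5 --retry-delay 10 -o what-we-fund.html \
  "https://web.archive.org/web/20240101id_/http://sorensonlegacyfoundation.org/what-we-fund"
```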

That said, it looks like this utility can still try to download multiple differently timestamped versions of an individual page (especially when using the --to/--from options), which may cause issues related to your use case. I'm not sure whether the Wayback Machine API returns files in chronologically sorted order, in particular descending (newest to oldest) order. I'll look into this.
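
For what it's worth, the CDX server documentation says results are grouped by urlkey and then sorted by timestamp ascending (oldest first) by default; whether this utility depends on that ordering is a separate question, but the raw order is easy to inspect (a sketch against the documented public endpoint, which I'm assuming is the one the tool queries):

```sh
# Show the first ten captures of the homepage up to 2024-01-01;
# timestamps should come back oldest-first per the CDX defaults.
curl "https://web.archive.org/cdx/search/cdx?url=sorensonlegacyfoundation.org&to=20240101&fl=timestamp,original&limit=10"
```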
