Small implementation bug with out of global context relative paths #579

Open
that-ben opened this issue Sep 17, 2024 · 0 comments

Hi! I caught your bot scraping the MR website and I wanted to take a minute to help you improve it. I found that it has a small implementation bug that causes it to crawl incorrect relative links when they are out of global context. It's kind of rare, but as you will see below, your bot discovered MP3 sound effects after scraping this JS file: https://www.macintoshrepository.org/assets/js/ben_chat_v2.js

The issue is that your bot thinks those MP3 files are located at /assets/assets/audio/logged_in.mp3, which obviously does not exist. Your bot is lacking global context: the paths in that JS file are relative to the page that loads the JS file, not to the JS file itself, which is what your bot currently assumes. Since the JS file is loaded by pages at / or /applications/, ../assets/audio/logged_in.mp3 becomes /assets/audio/logged_in.mp3 and not /assets/assets/audio/logged_in.mp3 😁
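To make the difference concrete, here is a minimal sketch using the standard `URL` constructor (available in browsers and Node.js). The `resolvePath` helper is hypothetical and only for illustration; it is not taken from the bot's code, and the base URLs are the ones mentioned in this report.

```javascript
// Hypothetical helper (illustration only): resolve a relative reference
// against a base URL and return the resulting path.
function resolvePath(ref, base) {
  return new URL(ref, base).pathname;
}

const site = 'https://www.macintoshrepository.org';
const ref = '../assets/audio/logged_in.mp3';

// Base = the JS file itself (what the crawler appears to do).
const againstScript = resolvePath(ref, site + '/assets/js/ben_chat_v2.js');

// Base = a page that loads the JS file (what a browser does).
const againstRoot = resolvePath(ref, site + '/');
const againstApplications = resolvePath(ref, site + '/applications/');

console.log(againstScript);       // /assets/assets/audio/logged_in.mp3
console.log(againstRoot);         // /assets/audio/logged_in.mp3
console.log(againstApplications); // /assets/audio/logged_in.mp3
```

A crawler that only sees the JS file cannot know which pages load it, so one possible mitigation would be to also try resolving such paths against the URL of the page that referenced the script.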

JS file:

audio_tick = new Howl({src:['../assets/audio/chatpost.mp3']});
audio_tear_short = new Howl({src:['../assets/audio/tear_short.mp3']});
audio_tear_long = new Howl({src:['../assets/audio/tear_long.mp3']});
audio_eep = new Howl({src:['../assets/audio/eep.mp3']});
audio_logged_in = new Howl({src:['../assets/audio/logged_in.mp3']});
audio_magnet_unlock = new Howl({src:['../assets/audio/magnet_unlock.mp3']});
audio_priv_msg = new Howl({src:['../assets/audio/svrmsg.mp3']});

This produces lots of 404 errors in the logs:

domlogs/macintoshrepository.org-ssl_log:152.53.39.37 - - [17/Sep/2024:06:10:37 -0400] "GET /assets/assets/audio/logged_in.mp3 HTTP/1.1" 404 100206 "https://www.macintoshrepository.org/assets/js/ben_chat_v2.js" "ArchiveTeam ArchiveBot/20231201.ad9703c (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
domlogs/macintoshrepository.org-ssl_log:152.53.39.37 - - [17/Sep/2024:06:10:38 -0400] "GET /assets/assets/audio/magnet_unlock.mp3 HTTP/1.1" 404 100200 "https://www.macintoshrepository.org/assets/js/ben_chat_v2.js" "ArchiveTeam ArchiveBot/20231201.ad9703c (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
domlogs/macintoshrepository.org-ssl_log:152.53.39.37 - - [17/Sep/2024:06:10:38 -0400] "GET /assets/assets/audio/eep.mp3 HTTP/1.1" 404 100196 "https://www.macintoshrepository.org/assets/js/ben_chat_v2.js" "ArchiveTeam ArchiveBot/20231201.ad9703c (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
domlogs/macintoshrepository.org-ssl_log:152.53.39.37 - - [17/Sep/2024:06:10:39 -0400] "GET /assets/assets/audio/chatpost.mp3 HTTP/1.1" 404 100203 "https://www.macintoshrepository.org/assets/js/ben_chat_v2.js" "ArchiveTeam ArchiveBot/20231201.ad9703c (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
domlogs/macintoshrepository.org-ssl_log:152.53.39.37 - - [17/Sep/2024:06:10:39 -0400] "GET /assets/assets/audio/tear_short.mp3 HTTP/1.1" 404 100206 "https://www.macintoshrepository.org/assets/js/ben_chat_v2.js" "ArchiveTeam ArchiveBot/20231201.ad9703c (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
domlogs/macintoshrepository.org-ssl_log:152.53.39.37 - - [17/Sep/2024:06:10:40 -0400] "GET /assets/assets/audio/svrmsg.mp3 HTTP/1.1" 404 100194 "https://www.macintoshrepository.org/assets/js/ben_chat_v2.js" "ArchiveTeam ArchiveBot/20231201.ad9703c (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
domlogs/macintoshrepository.org-ssl_log:152.53.39.37 - - [17/Sep/2024:06:10:40 -0400] "GET /assets/assets/audio/tear_long.mp3 HTTP/1.1" 404 100210 "https://www.macintoshrepository.org/assets/js/ben_chat_v2.js" "ArchiveTeam ArchiveBot/20231201.ad9703c (wpull 2.0.3) and not Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"

BTW, I was very impressed by the real time tracker on http://www.archivebot.com
It's crazy how much real time DATA it pushes to the browser every 0.5s! Incredible work 👍
