Improve "bot-detection" evasion techniques #2198
Comments
Thanks @stevenengland, it was deleted because of the pushy, disrespectful comments from some people using this software that we all love. I want to remind everyone that the software is fully open source and I'm open to any new PRs that you want to submit. However, being a bully towards me, demanding things from me when I don't know you, when I'm donating my free software to help you in your daily life, and when I'm providing free support in my own time right here on GitHub, will absolutely not be tolerated, and you will be banned from posting in this project on GitHub. |
On the topic of "IMPLEMENT XYZ PLUGIN THAT I FOUND ON GITHUB!!": I'm open to it, but I've tried all those plugins and I am unable to see any improvement in error rates when you use the same IP address. Please try to think logically about it and find some way to prove to me that XYZ plugin reduces error/access rates, rather than just making demands that I do something for you, for free, without any evidence that it's going to help.
It is not directly changedetection's fault. The anti-robot protection across the internet (yes, remember you ARE USING A ROBOT) is getting stronger and stronger, and companies are investing hundreds of ... Please remember this. |
Part 3: please remember that BrightData are the leading proxy providers and have also invested millions of USD into solving the browser fingerprint problem. If the site is so important to you, then you really should, for now, consider their offers. Scraping Browser is the most powerful way, followed by Residential Proxies (not cheap datacentre proxies!): https://brightdata.com/integration/changedetection Once again: BrightData have spent millions of USD solving this problem, and companies like CloudFlare have also invested hundreds of millions of USD into blocking robots such as changedetection. Please remember this. Please do not just make random demands that I spend my own personal time, for you, for free, to implement some project that you found on Google, that you most likely do not understand, with zero evidence that the project may or may not help. |
Hi again, no offense. I stepped into the thread late and I hope that I am not the person you found pushy, because that was not my intention. And because I know the thread, I must say that I personally did not find the other comments really pushy (not that nice either, but not pushy or rude), but of course you may feel differently when reading them. Anyway, let me rephrase my intention: I thought that one of the main goals of the project is also to provide "stealth-ability". If so, I wanted to remark that this goal can no longer be achieved for more and more sites unless you use the paid services out there. If "stealth-ability" is not a goal, that is fine, and it is equally fine if you say you do not have the resources for it. But in the sense of a feature request, I just want to know whether you do not want to follow this path or just don't have the resources. What I understood: you would be open to this but don't have the resources, and would appreciate PRs. So the FR here could be left open? |
yes, but only where you can prove that the PR helped you, without changing your IP address |
That would be my requirement as well, because there are bot detections out there that block my request from behind a dynamic IP at the very first attempt to crawl a site with CD, whereas from a real browser behind the same dynamic IP all subsequent calls of the page succeed. So there is clearly a way of fingerprinting CD browsers without even using the IP information. |
Mentioned here a long time ago, but haven't got a PR: #1930 |
Looks interesting. Thanks for the hint. |
I've noticed this too. I'm getting banned using multiple different IPs so it is fingerprinting the CD browser somehow. If there's a way for me to help, I'd be happy to. Perhaps switching my Playwright container to use something like this? https://github.com/CheshireCaat/playwright-with-fingerprints |
I just used my own anti bot bypass solution: https://github.com/unixfox/pupflare The only issue is that the HTML code given to the client is a bit broken so you get a page without CSS. |
This is JS-only :( so it's not possible to use it. The solution needs to be something that can work with Python. I already started this port, https://github.com/dgtlmoon/pyppeteerstealth, of the existing https://www.npmjs.com/package/puppeteer-extra-plugin-stealth project, which will work with changedetection.io. Please read the https://github.com/dgtlmoon/pyppeteerstealth page. The other part that is NOT yet solved is the JA3 fingerprinting of the actual connection behaviour (the TLS handshake) of the operating system and browser. So there are four sides to this
|
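For readers unfamiliar with the JA3 part: as publicly documented, a JA3 fingerprint is the MD5 hash of five values taken from the TLS ClientHello (TLS version, cipher suites, extensions, elliptic curves, point formats), with list items joined by dashes and the five fields joined by commas. The Python sketch below only illustrates that construction. The numeric values are made up for illustration rather than captured from a real browser, and details such as GREASE filtering are skipped.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: fields comma-joined, list items dash-joined."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return the MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only (assumption, not a real capture).
print(ja3_string(771, [4865, 4866], [0, 23, 65281], [29, 23], [0]))
print(ja3_hash(771, [4865, 4866], [0, 23, 65281], [29, 23], [0]))
```

Because every input comes from the TLS handshake, changing HTTP headers or JavaScript-visible properties does not change this hash; only a different TLS client configuration does.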
This is something that bypasses Cloudflare challenges. It works pretty well with Jackett: https://github.com/FlareSolverr/FlareSolverr/blob/master/src/flaresolverr_service.py#L252 |
I actually use that now with my arrr apps. Thanks for your comment, but I don't think that's going to work in this use case. |
Hello. PS. Last week was horrible - almost all my watches stopped working due to the CloudFlare captcha:( PPS. BTW, please, correct the following on https://github.com/dgtlmoon/changedetection.io/wiki/Playwright-content-fetcher Docker Compose based to Docker Compose based |
One thing to add to this: I'm not sure if there is a way to do it automatically, but whenever I had issues with Cloudflare blocking bots, I used to find the direct IP of the server behind Cloudflare and put that into the local hosts file, so requests went directly to the end server and not via Cloudflare. I'm sure this is potentially blocked now in some form or another, but it still works quite well for some sites for me. |
How did you find the direct IP? |
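The hosts-file approach described above comes down to one line that maps the site's hostname to its origin IP. Here is a small Python sketch that validates the address and formats such a line. The IP (from a documentation-only range) and the hostname are placeholders, and `hosts_entry` is a hypothetical helper name, not part of changedetection.io.

```python
import ipaddress


def hosts_entry(ip: str, hostname: str) -> str:
    """Return a hosts-file line mapping hostname to ip, after basic validation."""
    ipaddress.ip_address(ip)  # raises ValueError for a malformed address
    if not hostname or any(ch.isspace() for ch in hostname):
        raise ValueError("hostname must be non-empty and contain no whitespace")
    return f"{ip}\t{hostname}"


# Placeholder values (assumption): replace with the real origin IP and site name.
print(hosts_entry("203.0.113.10", "example.com"))
```

The line would go into the hosts file of whatever machine or container actually runs the browser. In a Docker Compose setup, the `extra_hosts` option is one way to achieve the same mapping.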
dgtlmoon, great project first and foremost. For everyone else: I switched over to sockpuppeteer and had 10 or more watches give me varying 4xx errors. I guess it is profiled much more easily, but I just pulled headers from my actual browser to include on any watches which were failing. Changing headers has solved them all. I am not requesting the same URL more often than every 10 minutes, though. You definitely want proxies if you are hammering them. |
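The "same URL no more often than every 10 minutes" habit mentioned above can be sketched as a small per-URL interval check. This is illustrative Python only, not how changedetection.io schedules watches, and `MinIntervalGate` is a hypothetical name.

```python
import time


class MinIntervalGate:
    """Allow a request for a URL only if enough time has passed since the last one."""

    def __init__(self, min_interval_seconds=600.0, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self._last_request = {}

    def allow(self, url):
        now = self.clock()
        last = self._last_request.get(url)
        if last is not None and now - last < self.min_interval:
            return False  # too soon, skip this request
        self._last_request[url] = now
        return True


gate = MinIntervalGate(min_interval_seconds=600.0)
if gate.allow("https://example.com/page"):
    pass  # the fetch would happen here
```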
@drabgail |
Just the one from the docker compose on the changedetection repo: dgtlmoon/sockpuppetbrowser:latest I was using browserless/chrome:latest before. It worked better, didn't need any additional tweaks and was almost never flagged. I started getting 'websocket closed' errors which I couldn't debug last week and noticed the repo was showing a different container to use so switched. |
@drabgail Thanks a lot! |
If you can paste which headers+values you used that solved the access problems, that would be super nice! |
I went here on my browser (just the current version of Edge, don't judge)... ...and copied whatever I got for these headers: I'll give you my exact headers if you want, but it's probably best to keep the variation and have people use whatever their own setup provides, as this is more of a "real user" representation. If you're planning to incorporate this as more than just the same fixed headers, then maybe add a "copy my headers" button somewhere in either the main settings or the per-watch request settings? The browser would of course know them. |
I would search DNS history for when they activated Cloudflare. If it's WordPress, there was also a pingback technique you could use to make the website reveal its IP. I'll try to find the info on that if I still have it saved somewhere. These are just a few examples. |
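Turning headers copied from a browser's developer tools into something reusable mostly means parsing `Name: value` lines and dropping the ones tied to a single connection or session. Below is a hedged Python sketch. The sample values and the skip list are assumptions for illustration, not the headers used in this thread, and `parse_copied_headers` is a hypothetical helper.

```python
# Headers that describe one specific connection or session and should not be reused (assumed list).
SKIP = {"host", "content-length", "cookie", "connection"}


def parse_copied_headers(raw):
    """Parse 'Name: value' lines into a dict, skipping pseudo-headers and SKIP entries."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Ignore blanks, malformed lines, and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        name = name.strip()
        if name.lower() in SKIP:
            continue
        headers[name] = value.strip()
    return headers


# Made-up sample input (assumption), shaped like a copy from the network tab.
sample = """
:authority: example.com
Host: example.com
User-Agent: ExampleBrowser/1.0
Accept-Language: en-GB,en;q=0.9
Cookie: session=abc
"""
print(parse_copied_headers(sample))
```

The resulting name/value pairs are what one would then enter in a watch's request header settings.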
Can you tell me how you added these or where you added these headers? |
I am opening this issue because https://github.com/dgtlmoon/changedetection.io/discussions/1979 was deleted apparently without a final state (rejected/accepted).
Observation: Pure changedetection (no paid services, no proxies in place, ...) is increasingly incapable of scraping websites, because sites are able to detect that CD is a bot crawling them. There are potential countermeasures out there that need to be evaluated. More details were in the discussion mentioned above.
If this feature request is rejected, it automatically means that users will increasingly be forced to use paid services. That is fine, but I would like to know where this repository is heading.
Thanks in advance :)