Improve "bot-detection" evasion techniques #2198

Open
stevenengland opened this issue Feb 16, 2024 · 24 comments
Labels
enhancement New feature or request

Comments

@stevenengland

stevenengland commented Feb 16, 2024

I am opening this issue because https://github.com/dgtlmoon/changedetection.io/discussions/1979 was deleted apparently without a final state (rejected/accepted).

Observation: Plain changedetection (no paid services, no proxies in place, ...) is increasingly unable to scrape websites, because sites are able to detect that CD is a bot crawling them. There are potential countermeasures out there that need to be evaluated. More details were in the discussion mentioned above.

If this feature request is rejected, it automatically means that users will increasingly be forced to use paid services. Which is fine, but I would like to know where this repository is heading.

Thanks in advance :)



@stevenengland stevenengland added the enhancement New feature or request label Feb 16, 2024
@dgtlmoon
Owner

thanks @stevenengland , it was deleted because of the pushy, disrespectful comments from some people using this software that we all love

I want to remind everyone that the software is fully open source and I'm open to any new PRs that you want to submit, if you want

However, being a bully towards me, demanding things from me when I don't know you, when I'm donating my free software to help you in your daily life, when I'm providing free support in my own time right here on GitHub, will absolutely not be tolerated and you will be banned from posting in this project on GitHub

@dgtlmoon
Owner

dgtlmoon commented Feb 16, 2024

On the topic of "IMPLEMENT XYZ PLUGIN THAT I FOUND ON GITHUB!!"

I'm open to it, but I've tried all those plugins and I am unable to see any reduction in error rates when you use the same IP address

please, try to think logically about it and find some way to prove to me that XYZ plugin reduces error rates, other than just making demands that I do something for you, for free, without any evidence that it's going to help

> is more and more incapable of scraping websites. Because sites have the ability to detect that CD is a bot crawling the site.

it is not directly changedetection's fault. The anti-robot (yes, remember you ARE USING A ROBOT) protection across the internet is getting stronger and stronger, and companies are investing hundreds of millions of dollars into detecting automated browsers (robots), and I am just one free software project on the internet

please remember this

@dgtlmoon
Owner

dgtlmoon commented Feb 16, 2024

Part 3, please remember that BrightData are the leading proxy providers who have also invested millions of USD into solving the browser fingerprint problem, if the site is so important to you then you really should - for now - consider their offers

Scraping Browser is the most powerful way

https://changedetection.io/tutorial/using-bright-datas-scraping-browser-pass-captchas-and-other-protection-when-monitoring

Followed by Residential Proxies (not cheap datacentre proxies!)

https://brightdata.com/integration/changedetection

once again: BrightData have spent millions of USD solving this problem, and companies like CloudFlare have also invested hundreds of millions of USD into blocking robots such as changedetection

Please remember this. please do not just make random demands that I spend my own personal time, for you, for free to implement some project that you found on google - that you most likely do not understand, with zero evidence that the project may or may not help

@stevenengland
Author

Hi again, no offense, I stepped into the thread late and I hope that I am not the person you recognized to be pushy, because that was not my intention. And because I know the thread, I must say I personally did not find the other comments really pushy either (not thaaat nice, but also not pushy or rude), but of course you may feel differently when reading them.

Anyway: Let me rephrase my intention: I thought that it is one of the main goals of the project to also provide "stealth-ability". And if so I wanted to remark, that this goal can't be achieved for more and more sites anymore except if you use paid services out there. If "stealth-ability" is not a goal it is fine as well as it is fine if you say you do not have the resources for it. But in the sense of a feature request: I just want to know if you do not want to follow the path or just don't have the resources. I understood: You would be open for this but don't have the resources, would appreciate PRs. So the FR here could be left open?

@dgtlmoon
Owner

> I understood: You would be open for this but don't have the resources, would appreciate PRs. So the FR here could be left open?

yes, but only where you can prove that the PR helped you, without changing your IP address
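
To make "prove that it helped" a bit more concrete, here is a minimal sketch of how the evidence could be summarised. It assumes you have recorded the HTTP status codes of a batch of fetches made from the same IP address, once without and once with the change. The function names and the set of status codes counted as "blocked" are assumptions for illustration, not anything changedetection.io provides.

```python
# Status codes counted as "blocked" are an assumption for illustration;
# adjust them to whatever the target site returns when it rejects a bot.
BLOCKED_STATUSES = {403, 429, 503}


def blocked_rate(status_codes):
    """Return the fraction of fetches that ended in a 'blocked' status."""
    if not status_codes:
        raise ValueError("need at least one fetch result")
    blocked = sum(1 for code in status_codes if code in BLOCKED_STATUSES)
    return blocked / len(status_codes)


def compare_runs(baseline, candidate):
    """Compare two runs made from the same IP address against the same URLs."""
    base = blocked_rate(baseline)
    cand = blocked_rate(candidate)
    return {"baseline": base, "candidate": cand, "improvement": base - cand}


if __name__ == "__main__":
    # Hypothetical numbers, only to show the shape of the output.
    print(compare_runs([200, 403, 403, 200], [200, 200, 403, 200]))
```

A positive "improvement" over a reasonably large number of fetches, with the IP address unchanged, is the kind of evidence asked for above.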

@stevenengland
Author

That would be my requirement as well. Because there are bot detections out there that block my request from behind a dynamic IP at the very first attempt at crawling a site with CD, whereas from a real browser behind the same dynamic IP all subsequent calls of the page succeed. So there is clearly a way of fingerprinting CD browsers without even using the IP information.

@dgtlmoon
Owner

mentioned here a long time ago, but haven't got a PR yet: #1930

@stevenengland
Author

Looks interesting. Thanks for the hint.

@jlhjlh

jlhjlh commented Feb 23, 2024

I've noticed this too. I'm getting banned using multiple different IPs so it is fingerprinting the CD browser somehow.

If there's a way for me to help, I'd be happy to.

Perhaps switching my Playwright container to use something like this? https://github.com/CheshireCaat/playwright-with-fingerprints

@unixfox

unixfox commented Feb 23, 2024

I just used my own anti bot bypass solution: https://github.com/unixfox/pupflare
And it works flawlessly with changedetection.

The only issue is that the HTML code given to the client is a bit broken so you get a page without CSS.

@dgtlmoon
Owner

dgtlmoon commented Feb 23, 2024

> Perhaps switching my Playwright container to use something like this? https://github.com/CheshireCaat/playwright-with-fingerprints

This is JS-only :( so it's not possible to use it. The solution needs to be something that can work with Python. I already started this port, https://github.com/dgtlmoon/pyppeteerstealth, of the existing https://www.npmjs.com/package/puppeteer-extra-plugin-stealth project, which will work with changedetection.io

please read the https://github.com/dgtlmoon/pyppeteerstealth page

the other part that is NOT yet solved is the JA3 fingerprinting of the actual TLS handshake, i.e. the connection-level behaviour of the operating system and browser......

So there are four sides to this

  • The browser side fingerprint (user agent, Sec-CH-UA user agent header, and all the other GPU card fingerprinting etc.)
  • "smart" workarounds for getting around Cloudflare (with some head-ful browser that first grabs the right pass-through cookies etc.)
  • The fingerprint of the actual connection (the TLS handshake), look up JA3 https://github.com/LyleMi/ja3proxy
  • Final extra bit - the reputation of your IP address
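
For anyone wondering what the JA3 item in the list above actually measures: as far as the public JA3 description goes, the fingerprint is built from five fields of the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats), written as decimal numbers, joined into one string and hashed with MD5. The sketch below only shows that string/hash construction. The function names and the sample values are made up for illustration, it does not capture any real traffic, and it skips details such as filtering out GREASE values.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from ClientHello fields given as integers.

    Values inside a field are joined with "-", the five fields with ",".
    """
    fields = [
        str(tls_version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in curves),
        "-".join(str(v) for v in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Return the MD5 hex digest of the JA3 string (the usual JA3 fingerprint)."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Hypothetical ClientHello values, not taken from any real browser.
    print(ja3_string(771, [4865, 4866], [0, 11, 10], [29, 23], [0]))
    print(ja3_hash(771, [4865, 4866], [0, 11, 10], [29, 23], [0]))
```

Because the order of ciphers and extensions is part of the string, two clients offering the same options in a different order get different fingerprints, which is why changing headers alone does not change this value.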

@dgtlmoon dgtlmoon changed the title Implement anti bot detection Improve "bot-detection" evasion techniques Feb 23, 2024
@addicted-ai

This is something that bypasses Cloudflare challenges. Works pretty much fine with Jackett
https://github.com/FlareSolverr/FlareSolverr/blob/master/src/flaresolverr_service.py#L252

@jlhjlh

jlhjlh commented Feb 25, 2024 via email

@iG8R

iG8R commented Mar 20, 2024

Hello.
In the following discussion, ultrafunkamsterdam/undetected-chromedriver#1388 (comment), regarding CloudFlare detection, this project was mentioned: https://github.com/g1879/DrissionPage.
Maybe it's worth taking a look at it?

PS. Last week was horrible - almost all my watches stopped working due to the CloudFlare captcha:(
I tried working through both Chromedriver and Playwright, but no luck so far.

PPS. BTW, please, correct the following on https://github.com/dgtlmoon/changedetection.io/wiki/Playwright-content-fetcher

Docker Compose based
In docker-compose.yml uncomment PLAYWRIGHT_DRIVER_URL under environment, and the playwright-chrome section under services.

to

Docker Compose based
In docker-compose.yml uncomment environment and PLAYWRIGHT_DRIVER_URL under it, and the playwright-chrome section under services.

@siparker

One thing to add to this. Not sure if there is a way to do it automatically, but whenever I had issues with Cloudflare blocking any bots, I used to find the direct IP of the server behind Cloudflare and put that into the local hosts file, so it went direct to the end server and not via Cloudflare.
I'm sure this is potentially blocked now in some form or another, but it still works quite well for some sites for me.

@jlhjlh

jlhjlh commented Apr 16, 2024 via email

@drabgail

dgtlmoon great project first and foremost.

For everyone else: I switched over to sockpuppeteer and had 10 or more watches give me varying 4xx errors. I guess it is profiled much more easily, but I just pulled headers from my actual browser to include on any watches which were failing. Changing headers has solved them all. I am not requesting the same URL more often than once every 10 minutes, though. You definitely want proxies if you are hammering them.

@iG8R

iG8R commented May 24, 2024

@drabgail
Could you please elaborate what sockpuppeteer you use?

@drabgail

> @drabgail Could you please elaborate what sockpuppeteer you use?

Just the one from the docker compose on the changedetection repo: dgtlmoon/sockpuppetbrowser:latest
When provided with headers it works on just about everything..


I was using browserless/chrome:latest before. It worked better, didn't need any additional tweaks and was almost never flagged. I started getting 'websocket closed' errors which I couldn't debug last week and noticed the repo was showing a different container to use so switched.

@iG8R

iG8R commented May 24, 2024

@drabgail
Thanks a lot!

@dgtlmoon
Owner

dgtlmoon commented May 24, 2024 via email

@drabgail

I went here on my browser (just current version of edge, don't judge)...
https://www.supermonitoring.com/blog/check-browser-http-headers/

...and copied whatever I got for these headers:
Accept:
User-Agent:
Content-Type:
Upgrade-Insecure-Requests: 1
Sec-Ch-Ua-Platform:
Sec-Ch-Ua-Mobile:
Sec-Ch-Ua:
Cache-Control: max-age=0
Accept-Encoding:

I'll give you my exact headers if you want, but it's probably best to keep the variation and have people use whatever their setup provides, as this is more of a 'real user' representation. If you're planning to incorporate this as more than just the same headers, then maybe a 'copy my headers' button somewhere on either the main settings or the per-watch request settings? The browser would of course know them.
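
If it helps, here is a rough sketch of what "copy whatever your browser sends" can look like in code: it takes a pasted block of `Name: value` lines, drops the ones left empty, and attaches the rest to a request object. The helper names and the sample values are hypothetical, the request is only built and never sent, and this is not how changedetection.io handles headers internally.

```python
import urllib.request


def parse_header_block(text):
    """Turn pasted 'Name: value' lines into a dict, skipping empty values."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not name or not value:
            continue  # not a header line, or the value was left blank
        headers[name] = value
    return headers


def build_request(url, pasted_headers):
    """Build (but do not send) a request carrying the pasted headers."""
    return urllib.request.Request(url, headers=parse_header_block(pasted_headers))


if __name__ == "__main__":
    # Placeholder values; paste what your own browser reports instead.
    pasted = """\
User-Agent: ExampleBrowser/1.0 (replace with your own)
Accept: text/html
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0
Accept-Encoding:
"""
    request = build_request("https://example.com/", pasted)
    print(request.header_items())
```

Only the first colon splits name from value, so header values that contain colons themselves stay intact.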

@siparker

siparker commented Jun 7, 2024

How did you find the direct IP?

I would search dns changes for when they activated cloudflare.
search for any subdomains that might be on same server but cloudflare is not active for.
ftp.
dev.
cpanel.
webmail.

if it's WordPress, there was a pingback technique you could use to make the website reveal its IP as well. I'll try and find the info on that if I still have it saved somewhere.

just a few examples.
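
The subdomain check above can be sketched roughly like this. The prefix list is just the examples mentioned, the function names are made up, and the resolver is passed in as a parameter so the logic can be tried without touching the network. Whether any of these hosts exists or points at the origin server depends entirely on the site.

```python
import socket

# Prefixes taken from the examples above; extend as needed.
CANDIDATE_PREFIXES = ("ftp", "dev", "cpanel", "webmail")


def candidate_hosts(domain, prefixes=CANDIDATE_PREFIXES):
    """Return the subdomains worth checking for a given domain."""
    return [f"{prefix}.{domain}" for prefix in prefixes]


def resolve_candidates(domain, resolver=socket.gethostbyname):
    """Map each candidate host to its IPv4 address, or None if it does not resolve."""
    results = {}
    for host in candidate_hosts(domain):
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


if __name__ == "__main__":
    for host, address in resolve_candidates("example.com").items():
        print(host, address)
```

Any address that differs from the one the main domain resolves to is a candidate to try in the hosts file.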

@amdjml

amdjml commented Oct 15, 2024

> I went here on my browser (just current version of edge, don't judge)...
> https://www.supermonitoring.com/blog/check-browser-http-headers/
>
> ...and copied whatever I got for these headers:
> Accept:
> User-Agent:
> Content-Type:
> Upgrade-Insecure-Requests: 1
> Sec-Ch-Ua-Platform:
> Sec-Ch-Ua-Mobile:
> Sec-Ch-Ua:
> Cache-Control: max-age=0
> Accept-Encoding:
>
> I'll give you my exact headers if you want but it's probably best to keep the variation and have people use whatever their setup provides as this is more a 'real user' representation. If you're planning to incorporate this more than just the same headers then maybe a 'copy my headers' button somewhere on either the main settings or per watch request settings?.. The browser would of course know them.

Can you tell me how you added these or where you added these headers?

9 participants