got-scraping inefficient against Cloudflare #65

Open
Cooya opened this issue Apr 5, 2022 · 49 comments

@Cooya

Cooya commented Apr 5, 2022

Recently I have encountered some changes in Cloudflare's anti-bot protection. While using got-scraping, I am now unable to send requests to websites protected by Cloudflare. I have to use Puppeteer to get through.

It is mentioned as well in this comment.

Any idea how Cloudflare can be so good at detecting the TLS configuration generated by got-scraping?

@szmarczak
Contributor

Are you using proxies? If you aren't, you have probably hit the rate limiter, so it returns a JS challenge, which must be run in a real browser.

@Cooya
Author

Cooya commented Apr 6, 2022

Yes, I am using datacenter and residential proxies. None of them work.

I think Cloudflare has reached a point where it now sends a JS challenge to every client that does not have a common JA3 fingerprint, which explains why got-scraping is inefficient.
This article confirms that hypothesis.

Unfortunately, as Firefox and Chrome have their own SSL libraries (with different ciphers), it is impossible in Node.js to mimic the JA3 fingerprints of Firefox and Chrome.
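
For readers unfamiliar with JA3, here is a minimal sketch of how such a fingerprint is derived: five ClientHello fields are written as decimal values, joined into one string, and hashed with MD5. The `hello` values below are hypothetical and only illustrate the format; they are not captured from got-scraping or any browser.

```javascript
const crypto = require('node:crypto');

// GREASE values (RFC 8701) look like 0x0a0a, 0x1a1a, ... 0xfafa and are
// ignored by JA3 because clients pick them at random.
function isGrease(value) {
    return (value & 0x0f0f) === 0x0a0a && (value >> 8) === (value & 0xff);
}

// Builds the JA3 string: five comma-separated fields, each list dash-joined
// in the order the values appear in the ClientHello.
function ja3String({ version, ciphers, extensions, curves, pointFormats }) {
    const join = (values) => values.filter((v) => !isGrease(v)).join('-');
    return [version, join(ciphers), join(extensions), join(curves), join(pointFormats)].join(',');
}

// The JA3 fingerprint is the MD5 digest of that string.
function ja3Hash(fields) {
    return crypto.createHash('md5').update(ja3String(fields)).digest('hex');
}

// Hypothetical ClientHello values, for illustration only.
const hello = {
    version: 771, // 0x0303, the legacy version field used by TLS 1.2 and 1.3
    ciphers: [0x0a0a, 4865, 4866, 4867],
    extensions: [0, 23, 65281, 10, 11],
    curves: [0x1a1a, 29, 23, 24],
    pointFormats: [0],
};

console.log(ja3String(hello));
console.log(ja3Hash(hello));
```

Because the lists are order-sensitive, two clients offering the same ciphers in a different order, or sending different extensions, end up with different hashes.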

@yuriolive

yuriolive commented Apr 13, 2022

I have the same issue. Testing with Postman, I saw that the order of the headers is important. We should make sure the Host header is the first one. got-scraping changes the order of the headers; I'm debugging to see if this is the issue. I don't think this is related to JA3 fingerprinting, because we can also make POST and GET requests to Cloudflare websites using plain curl with the correct cookies.

@szmarczak
Contributor

Do you have an example domain? I cannot reproduce this yet

@szmarczak
Contributor

got-scraping changes the order of the headers,

Yes, it reorders them so the order is the same as the one browsers use.

@yuriolive

yuriolive commented Apr 13, 2022

Do you have an example domain? I cannot reproduce this yet

Yes, you can try in https://www.g2.com

@yuriolive

I think sortHeaders is hard-coded in this line:

this.transformRequest(request, { sortHeaders: true });

@szmarczak
Contributor

Yes, you can try in https://www.g2.com

Thanks, I was finally able to reproduce this. However it randomly goes through and randomly stops. Fixing this now.

@yuriolive

yuriolive commented Apr 13, 2022

Cloudflare protection does some crazy things. If you have the cf_clearance cookie, the order of the headers doesn't matter, but if you only have the __cf_bm cookie, the order matters. Sometimes the page sets just __cf_bm and other times it sets both. You can check here if you want to see more about Cloudflare cookies. Also, we have to make sure we use the same IP and User-Agent that we got the cookies with.
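
A minimal sketch of that bookkeeping: it checks which of the two cookies a response set and stores them together with the proxy and User-Agent they were issued for. The session field names and the proxy URL are hypothetical, chosen only for illustration.

```javascript
// Extracts cookie names from an array of Set-Cookie header values.
function cookieNames(setCookieHeaders) {
    return setCookieHeaders.map((header) => header.split(';')[0].split('=')[0].trim());
}

// Reports which of the two Cloudflare cookies mentioned above were set.
function cloudflareCookieState(setCookieHeaders) {
    const names = new Set(cookieNames(setCookieHeaders));
    return {
        hasClearance: names.has('cf_clearance'),
        hasBotManagement: names.has('__cf_bm'),
    };
}

// Keeps cookies tied to the proxy (IP) and User-Agent they were obtained with.
function createSession(proxyUrl, userAgent, setCookieHeaders) {
    return {
        proxyUrl,
        userAgent,
        // Only the name=value part of each Set-Cookie line goes into Cookie.
        cookieHeader: setCookieHeaders.map((h) => h.split(';')[0].trim()).join('; '),
        ...cloudflareCookieState(setCookieHeaders),
    };
}

const session = createSession(
    'http://proxy.example:8000', // placeholder proxy URL
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
    ['__cf_bm=abc123; path=/; HttpOnly; Secure', 'other=1; path=/'],
);
console.log(session);
```

With a record like this, a later request can reuse `proxyUrl`, `userAgent` and `cookieHeader` together, and `hasClearance` hints at whether header order is likely to matter, going by the observation above.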

@szmarczak
Contributor

szmarczak commented Apr 14, 2022

I couldn't make it work with Chrome values. They're using their own implementation of SSL so it may be impossible to fix in Node. However, it seems Firefox works very nicely:

Node 17 required

const http2 = require('http2');

const session = http2.connect('https://www.g2.com', {
    ciphers: [
        // Firefox v91
        'TLS_AES_128_GCM_SHA256',
        'TLS_CHACHA20_POLY1305_SHA256',
        'TLS_AES_256_GCM_SHA384',
        'ECDHE-ECDSA-AES128-GCM-SHA256',
        'ECDHE-RSA-AES128-GCM-SHA256',
        'ECDHE-ECDSA-CHACHA20-POLY1305',
        'ECDHE-RSA-CHACHA20-POLY1305',
        'ECDHE-ECDSA-AES256-GCM-SHA384',
        'ECDHE-RSA-AES256-GCM-SHA384',
        // Legacy:
        'ECDHE-ECDSA-AES256-SHA',
        'ECDHE-ECDSA-AES128-SHA',
        'ECDHE-RSA-AES128-SHA',
        'ECDHE-RSA-AES256-SHA',
        'AES128-GCM-SHA256',
        'AES256-GCM-SHA384',
        'AES128-SHA',
        'AES256-SHA',
    ].join(':'),
    ecdhCurve: [
        'X25519',
        'prime256v1',
        'secp384r1',
        'secp521r1',
        'ffdhe2048',
        'ffdhe3072',
    ].join(':'),
    signatureAlgorithms: [
        'ecdsa_secp256r1_sha256',
        'ecdsa_secp384r1_sha384',
        'ecdsa_secp521r1_sha512',
        'rsa_pss_rsae_sha256',
        'rsa_pss_rsae_sha384',
        'rsa_pss_rsae_sha512',
        'rsa_pkcs1_sha256',
        'rsa_pkcs1_sha384',
        'rsa_pkcs1_sha512',
        'ecdsa_sha1',
        'rsa_pkcs1_sha1',
    ].join(':'),
    minVersion: 'TLSv1.2',
    maxVersion: 'TLSv1.3',
    alpnProtocols: ['h2', 'http/1.1'],
    servername: 'www.g2.com',
});

const req = session.request({
    'Host': 'www.g2.com',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}, {endStream: false});

req.on('response', headers => {
    console.log(headers[':status']);
});

req.resume();
req.end();
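
Since the snippet above depends on the Node.js version, one way to see which entries of a cipher list the local build recognises is to compare them with `tls.getCiphers()`. This is only a sketch: it tells you whether a name is known to the runtime, not whether the resulting handshake matches a browser.

```javascript
const tls = require('node:tls');

// Splits an OpenSSL-style cipher list into the names the local Node.js
// build knows about and the ones it does not.
function partitionCiphers(cipherNames) {
    const available = new Set(tls.getCiphers()); // names are returned in lowercase
    const supported = [];
    const unsupported = [];
    for (const name of cipherNames) {
        (available.has(name.toLowerCase()) ? supported : unsupported).push(name);
    }
    return { supported, unsupported };
}

const result = partitionCiphers([
    'TLS_AES_128_GCM_SHA256',
    'ECDHE-RSA-AES128-GCM-SHA256',
    'NOT-A-REAL-CIPHER', // deliberately invalid, to show the unsupported path
]);
console.log(result);
```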

@yuriolive

yuriolive commented Apr 14, 2022

I don't think this is related to the JA3 fingerprint. I get the cookies from my browser, which has a different fingerprint than Postman, but I'm still able to make the request in Postman as long as I make sure that I have at least the Host, Cookie and User-Agent headers, in this order. Normally Postman puts the Host header at the end, so you have to override it. But having the same JA3 is a good idea; other bot protections, like DataDome and Akamai, probably use it. I saw another repo that tries to simulate the JA3 using Go (https://github.com/zedd3v/mytls). They have an extensive list of hashes at https://github.com/zedd3v/mytls/blob/master/ja3.json
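
The ordering described here (Host, then Cookie, then User-Agent) can be sketched as a small helper that rebuilds a headers object in a preferred order. `orderHeaders` is a hypothetical name, not a got-scraping function, and whether an HTTP client preserves this order on the wire depends on the client.

```javascript
// Returns a copy of `headers` whose keys follow `preferredOrder` (matched
// case-insensitively). Remaining headers keep their original relative order
// and are appended at the end.
function orderHeaders(headers, preferredOrder) {
    const byLowerName = new Map(Object.keys(headers).map((name) => [name.toLowerCase(), name]));
    const ordered = {};
    for (const wanted of preferredOrder) {
        const actual = byLowerName.get(wanted.toLowerCase());
        if (actual !== undefined) ordered[actual] = headers[actual];
    }
    for (const name of Object.keys(headers)) {
        if (!Object.prototype.hasOwnProperty.call(ordered, name)) ordered[name] = headers[name];
    }
    return ordered;
}

const ordered = orderHeaders(
    {
        Accept: '*/*',
        'User-Agent': 'ExampleAgent/1.0', // placeholder value
        Cookie: '__cf_bm=abc123', // placeholder value
        Host: 'www.g2.com',
    },
    ['host', 'cookie', 'user-agent'],
);
console.log(Object.keys(ordered));
```

This relies on JavaScript objects keeping insertion order for string keys, which is what makes a plain object usable as an ordered header list here.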

@szmarczak
Contributor

Managed to do Chrome:

const http2 = require('http2');

const session = http2.connect('https://www.g2.com', {
    ciphers: [
        // Chrome v92
        'TLS_AES_128_GCM_SHA256',
        'TLS_AES_256_GCM_SHA384',
        'TLS_CHACHA20_POLY1305_SHA256',
        'ECDHE-ECDSA-AES128-GCM-SHA256',
        'ECDHE-RSA-AES128-GCM-SHA256',
        'ECDHE-ECDSA-AES256-GCM-SHA384',
        'ECDHE-RSA-AES256-GCM-SHA384',
        'ECDHE-ECDSA-CHACHA20-POLY1305',
        'ECDHE-RSA-CHACHA20-POLY1305',
        // Legacy:
        'ECDHE-RSA-AES128-SHA',
        'ECDHE-RSA-AES256-SHA',
        'AES128-GCM-SHA256',
        'AES256-GCM-SHA384',
        'AES128-SHA',
        'AES256-SHA',
    ].join(':'),
    ecdhCurve: [
        'X25519',
        'prime256v1',
        'secp384r1',
    ].join(':'),
    signatureAlgorithms: [
        'ecdsa_secp256r1_sha256',
        'rsa_pss_rsae_sha256',
        'rsa_pkcs1_sha256',
        'ecdsa_secp384r1_sha384',
        'rsa_pss_rsae_sha384',
        'rsa_pkcs1_sha384',
        'rsa_pss_rsae_sha512',
        'rsa_pkcs1_sha512',
    ].join(':'),
    minVersion: 'TLSv1',
    maxVersion: 'TLSv1.3',
    alpnProtocols: ['h2', 'http/1.1'],
    servername: 'www.g2.com',
});

const req = session.request({
	":method": "GET",
	":authority": "www.g2.com",
	":scheme": "https",
	":path": "/",
	"sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\"",
	"sec-ch-ua-mobile": "?0",
	"sec-ch-ua-platform": "\"Linux\"",
	"upgrade-insecure-requests": "1",
	"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
	"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
	"sec-fetch-site": "none",
	"sec-fetch-mode": "navigate",
	"sec-fetch-user": "?1",
	"sec-fetch-dest": "document",
	"accept-encoding": "gzip, deflate, br",
	"accept-language": "en-US,en;q=0.9"
}, {endStream: false});

req.on('response', headers => {
    console.log(headers[':status']);
});

req.resume();
req.end();

@szmarczak
Contributor

And... it stopped working for some reason

@szmarczak
Contributor

And it works again... weird stuff is going on 🤔

@yuriolive

And it works again... weird stuff is going on 🤔

I think the g2 website is having some downtime (https://status.g2.com/). I think the GetApp website (https://www.getapp.com/) also uses Cloudflare Bot Protection.

@szmarczak
Contributor

Can you try [email protected] and run your code with all dependencies updated? Make sure header-generator uses the beta.

@yuriolive

yuriolive commented Apr 25, 2022

Can you try [email protected] and run your code with all dependencies updated? Make sure header-generator uses the beta.

Still not working; I get a different error now: "Access denied, Error code 1020". I will take a closer look later tonight.

@szmarczak
Contributor

Can you try Firefox? You'd need to pass

    headerGeneratorOptions: {
        browsers: [
            "firefox",
        ],
    },

in the options.

@yuriolive

yuriolive commented May 10, 2022

Now it is working, even without JS rendering to get the cookies. I think the TLS fingerprint of the Chrome version I was using in the application was different, and that was probably causing the issue.

@szmarczak
Contributor

szmarczak commented May 11, 2022

If you didn't explicitly specify Chrome version then it should just work out of the box. So to recap - is it working w/ Firefox fingerprint but does not work with Chrome fingerprint?

@yuriolive

@szmarczak It stopped working again, although it was working locally. I still don't know what Cloudflare uses for detection; I think it is more than just TLS and headers.

@szmarczak
Contributor

We still could get detected. The current fingerprint is not a 1:1 match, but a very close one.

For example, Chrome uses BoringSSL while Node.js uses OpenSSL. Our TLS fingerprint has improved recently; however, I think we've reached the limit and can't do better with the native tls module.

However I believe this could be worked around via NAPI.

The headers are what we can definitely keep improving. Sometimes header-generator generates a fingerprint matching old browsers, and that needs fixing. It's also missing the sec-ch-ua-platform header.

There's also a chance that they're fingerprinting the HTTP/2 session and/or stream settings, but that's very unlikely.
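
To make the HTTP/2 settings point concrete, this sketch compares the SETTINGS values Node.js sends by default with a second profile. The `browserLike` numbers are placeholders, not measured browser values; real ones would have to be captured from an actual browser.

```javascript
const http2 = require('node:http2');

// Lists the SETTINGS fields whose values differ between two settings objects.
function diffSettings(ours, theirs) {
    const keys = new Set([...Object.keys(ours), ...Object.keys(theirs)]);
    return [...keys].filter((key) => ours[key] !== theirs[key]).sort();
}

const nodeDefaults = http2.getDefaultSettings();

// Hypothetical "browser-like" profile (placeholder numbers).
const browserLike = {
    ...nodeDefaults,
    headerTableSize: 65536,
    initialWindowSize: 6291456,
};

console.log(diffSettings(nodeDefaults, browserLike));
```

An object like `browserLike` can be passed as the `settings` option of `http2.connect()`, so any field listed in the diff is something a server could in principle observe.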

Another reason it can pass locally is that the local IP address has a higher trust score, so Cloudflare is more forgiving.

I'll keep testing and will give an update tomorrow.

@szmarczak
Contributor

szmarczak commented May 14, 2022

I've tested the two websites mentioned above (with proxy on) and couldn't reproduce the issue. Can you post the options used with got-scraping? Have you used cookies?

I only got a Cloudflare challenge when I visited g2 with a real browser (got-scraping did just fine, no block 🤔).
Interestingly, on Firefox I was getting a JS challenge, while on Chrome I was hit with hCaptcha.

Edit: Changing IP didn't help when using real browsers.

Edit 2: I changed my UA to Windows and the block was gone.

@szmarczak
Contributor

@Cooya do you still experience blocks with the newest version?

@Cooya
Author

Cooya commented May 17, 2022

As I said previously, got-scraping is inefficient for my case (https://gensdeconfiance.com). Changing the UA will not fix anything, as Cloudflare relies on the JA3 signature (on this website, anyway).

I am now using a Go server to send my requests, which works much better.

@szmarczak
Contributor

As I said previously

So you haven't tried the new version?

@l10r

l10r commented May 21, 2022

I don't have any luck on duelbits.com, for instance: I get error code 403, but with Puppeteer it works and the request goes through.

@szmarczak
Contributor

Thanks for the feedback @l10r, looking into it.

@l10r

l10r commented May 22, 2022

Thanks for feedback @l10r, looking into it.

I was testing with http2 and just the ciphers and other TLS options mentioned above. After testing with gotScraping, it works just fine and I get the response. I forgot to post an update on my progress, sorry, and thanks a lot.

@l10r

l10r commented May 26, 2022

https://github.com/lwthiker/curl-impersonate
^ Seems like this could help you a lot, @szmarczak, as it appears to have a high success rate...

@szmarczak
Contributor

We've been doing exactly the same already. The difference is that that project maintains just 2 browser versions, while we support a lot more.

@l10r

l10r commented May 30, 2022

Not really. got-scraping's TLS fingerprint is still not identical to Chrome's or Firefox's TLS fingerprint, because their ClientHello message in the TLS handshake is different. The best Chrome-like TLS configuration I could find was this: https://github.com/refraction-networking/utls/blob/master/u_parrots.go
It has Chrome, Firefox, etc.

@szmarczak
Contributor

their ClientHello message in the TLS handshake is different

Can you point out what exactly is different in the message?

@corford

corford commented Jun 6, 2022

@szmarczak re: However I believe this could be worked around via NAPI.

Have you considered NAPI bindings to the native cronet lib as a potential way to leverage chrome's network stack? https://chromium.googlesource.com/chromium/src/+/refs/heads/main/components/cronet/native/ (some more context: https://medium.com/@cchiappini/discover-cronet-4c7b4812407)

@ActiniumTO

their ClientHello message in the TLS handshake is different

Can you point out what exactly is different in the message?

Compare your response with Chrome's:
https://tls.incolumitas.com/fps?detail=1

@x066it

x066it commented Jul 1, 2022

Didn't work for me with the Firefox UA: "Failed to set ECDH curve" every time. I think it's because Node doesn't support "ffdhe2048" and "ffdhe3072". How did you get it to work?
Node v18.3.0, got-scraping: 3.2.9
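
One way to narrow down an error like this is to probe which entries of the curve list the local runtime accepts, by passing each to `tls.createSecureContext()` and catching the failure. This is a diagnostic sketch only: dropping entries from the list also changes the TLS fingerprint, so it does not reproduce the Firefox configuration.

```javascript
const tls = require('node:tls');

// Returns true if the local Node.js/OpenSSL build accepts `curveList`
// (a colon-separated string) as the `ecdhCurve` option.
function acceptsCurves(curveList) {
    try {
        tls.createSecureContext({ ecdhCurve: curveList });
        return true;
    } catch {
        return false; // rejected, e.g. "Failed to set ECDH curve"
    }
}

// Keeps only the entries this runtime accepts, preserving their order.
function supportedCurves(curves) {
    return curves.filter((curve) => acceptsCurves(curve));
}

const wanted = ['X25519', 'prime256v1', 'secp384r1', 'secp521r1', 'ffdhe2048', 'ffdhe3072'];
console.log(supportedCurves(wanted).join(':'));
```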

I think you can try to use Electron to spoof Chrome, as it uses BoringSSL under the hood.

@szmarczak
Contributor

@corford Still under discussion. Haven't decided yet. That's quite hard to do because we'll need to compile for multiple targets, so I'm not confident in this.

@ActiniumTO

  1. Ciphers are the same (to the extent Node.js allows)

  2. Cannot alter EC point formats with the native tls module

  3. Cannot select extensions with the native tls module

Besides that, everything else seems to be the same.

When I was replying to @l10r, we were talking about curl-impersonate, which AFAIK doesn't differ from the got-scraping solution.

@x066it You need to update your Node.js to at least v17 (preferably v18).

@pimterry

pimterry commented Aug 3, 2022

Just to add another Cloudflare test case: https://api.pap.fr/

If you load that API endpoint in any browser you get a 404 with a basic API JSON response (that's fine - that's because it's not a full API request - but 404 is our success case).

If you load with Node.js, you get a 403 and a Cloudflare block page every time.

If you load with got-scraping, you get a 403 and a block page 90% of the time (mixed with occasional successful 404s, for some reason):

require('got-scraping').gotScraping('https://api.pap.fr/').then(res => console.log(res.statusCode))

I see the same behaviour with both Node 14 and Node 18. I've also tested responseType: 'json' but that doesn't seem to help at all (and it shouldn't be required, since it does load OK when visited directly in a browser).

I've been trying, but I haven't found any way to improve this. I have managed to use tricks from got-scraping to avoid blocks on many other Cloudflare URLs, but not this one, so I think it's a good example of some of the strictest block settings available.

I'd be very interested if anybody manages to get this working! Let me know if there's anything I can do to help.

@szmarczak
Contributor

Thank you @pimterry for the investigation! Can you try using Firefox headers only? https://crawlee.dev/docs/guides/got-scraping#headergeneratoroptions

@pimterry

pimterry commented Aug 3, 2022

Like this?

require('got-scraping').gotScraping({
    url: 'https://api.pap.fr/',
    headerGeneratorOptions:{
        browsers: ['firefox']
    }
}).then((res) => console.log(res.statusCode));

Unfortunately that doesn't work - it returns 403 every time.

@szmarczak
Contributor

Correct. Indeed, I'm getting a 1020 error as well. Interestingly the underlying server is HTTP/1.1, not HTTP/2.

@szmarczak
Contributor

@pimterry It's because of apify/fingerprint-suite#53: the Host header is set by Node and not by header-generator. If I manually override it to always be sent first, then it works. In HTTP/2 there's a pseudo-header instead, so that's probably why it works with other HTTP/2 websites.

@Cooya
Author

Cooya commented Aug 3, 2022

You have to use https://github.com/Danny-Dasilva/CycleTLS for https://api.pap.fr/.

@pimterry

pimterry commented Sep 2, 2022

You have to use https://github.com/Danny-Dasilva/CycleTLS for https://api.pap.fr/.

Just a quick update here: you actually can do this with normal Node.js HTTP, without CycleTLS. I'm not using Got-Scraping directly, but I'm using lots of the same ideas, and I've managed to get traffic working successfully for that site.

The issue is interesting: this case is different because the server doesn't support HTTP/2, but in most cases to avoid blocks you want to forcibly send HTTP/2 traffic by default, because that's what browsers do.

In my case, my requests were blocked because I was generating requests for HTTP/2 (where there is no header casing - all header names must be lowercase), and my sending code was automatically converting those to HTTP/1 when it discovered the server didn't support HTTP/2, so the header names ended up being sent as lowercase, which doesn't match normal browser headers.

When I disable HTTP/2, and send normal HTTP/1 headers with normal browser header casing, that API works perfectly and I never get blocked.
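
The casing problem described above can be sketched as a conversion from lowercase HTTP/2 header names to Title-Case HTTP/1.1 names, with `:authority` mapped to `Host`. The override table is an illustrative assumption; real browsers do not title-case every header name, so the result should be compared against captured browser traffic.

```javascript
// Names whose HTTP/1.1 casing does not follow simple Title-Case.
// Illustrative assumption, not an exhaustive browser-accurate list.
const CASING_OVERRIDES = { dnt: 'DNT', te: 'TE' };

// Converts a lowercase HTTP/2 header name to Title-Case.
function toHttp1Name(name) {
    const lower = name.toLowerCase();
    if (Object.prototype.hasOwnProperty.call(CASING_OVERRIDES, lower)) {
        return CASING_OVERRIDES[lower];
    }
    return lower
        .split('-')
        .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
        .join('-');
}

// Rewrites every header name. `:authority` becomes `Host`; the other
// pseudo-headers (`:method`, `:path`, `:scheme`) are dropped because they
// belong to the HTTP/1.1 request line, not to the header block.
function toHttp1Headers(http2Headers) {
    const result = {};
    for (const [name, value] of Object.entries(http2Headers)) {
        if (name === ':authority') result.Host = value;
        else if (!name.startsWith(':')) result[toHttp1Name(name)] = value;
    }
    return result;
}

const converted = toHttp1Headers({
    ':authority': 'api.pap.fr',
    ':path': '/',
    'user-agent': 'ExampleAgent/1.0', // placeholder value
    'accept-language': 'en-US',
    dnt: '1',
});
console.log(converted);
```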

@Cooya
Author

Cooya commented Sep 2, 2022

Very interesting, thank you for your feedback! Are you using an HTTP client, or do you build your requests from scratch?

It would be nice if we could have the option to disable HTTP/2 while using got-scraping.

@pimterry

pimterry commented Sep 2, 2022

Are you using an HTTP client, or do you build your requests from scratch?

I'm doing this as part of an HTTP intercepting proxy (https://httptoolkit.tech/), with the relevant bits implemented in Node here: https://github.com/httptoolkit/mockttp. Effectively a real browser is generating the real requests, and I handle and receive those (showing them and maybe modifying them) and then I forward most requests upstream. That requires translating details and configuring the TLS fingerprint so that this doesn't get flagged as suspicious traffic and blocked by Cloudflare et al.

It's not quite the same use case as Got-Scraping, but it uses very similar techniques, and the same concepts should apply in both scenarios.

@AndreyBykov

@Cooya there is an option to disable http2 (http2: false): see here. It was mentioned in the readme before, not sure why it was removed.

@szmarczak
Contributor

Also since it does extend Got, all the docs apply there as well. https://github.com/sindresorhus/got/blob/main/documentation/2-options.md#http2
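
Putting the two options from this thread together, a request that forces HTTP/1.1 and Firefox-style headers might be configured as sketched below. `buildOptions` is a hypothetical helper; only `http2` and `headerGeneratorOptions` come from the comments above.

```javascript
// Builds request options for got-scraping using the option names mentioned
// in this thread.
function buildOptions(url, { browser = 'firefox', useHttp2 = false } = {}) {
    return {
        url,
        http2: useHttp2, // false forces HTTP/1.1
        headerGeneratorOptions: { browsers: [browser] },
    };
}

const options = buildOptions('https://api.pap.fr/');
console.log(options);

// Usage (requires the got-scraping package and network access):
// require('got-scraping').gotScraping(options).then((res) => console.log(res.statusCode));
```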

@Strajk

Strajk commented Mar 10, 2023

If I manually override it to always be sent first

@szmarczak Where would be the best place to override this? Ideally from crawler logic, without forking got-scraping or its dependencies.

Previously, when I explicitly avoided setting the Host header from the crawler, got-scraping somehow managed to put it into the correct place (=accepted by CF).

In the past weeks/months something changed in my dev env (maybe new Node, maybe some deps), and now the Host header is again the last one.
I've been debugging the logic of got-scraping – got – header-generator – node internals for a few hours and was unable to find where the issue originates, so I've hotfixed it by modifying node_modules/got-scraping/node_modules/header-generator/data_files/headers-order.json, which is obviously not ideal.


EDIT: For now, I've "solved" it by patching the headers-order.json file in Dockerfile
https://github.com/Strajk/apify-actors-monorepo/blob/master/packages/docker-image-scraping-behemoth/Dockerfile#L14

@barjin barjin self-assigned this Mar 16, 2023
@barjin barjin pinned this issue Mar 16, 2023
@barjin barjin unpinned this issue Mar 16, 2023
@mtrunkat mtrunkat added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 12, 2023