User timeout caused connection failure - Only on specific sites as soon as download middlewares are specified #32

Open
mmotti opened this issue Dec 19, 2020 · 0 comments

mmotti commented Dec 19, 2020

Hi,

Is anybody able to help me work out what's going on? I recently set up scrapy_proxies and scrapy-fake-useragent, but for whatever reason I get timeouts on specific sites as soon as I alter the download middlewares.

In particular, the following URL: https://www.very.co.uk

I have removed all references to scrapy_proxies as part of troubleshooting, in order to rule it out.

I am using the middlewares from the instructions; however, the only way I can reach the page is to either comment out the middlewares entirely or use only the following built-in middlewares (the ones the install instructions say to disable):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 100,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 101
}

Any ideas as to what might be the issue?

Applied Settings

DOWNLOADER_MIDDLEWARES = {
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None
}

FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',  # this is the first provider we'll try
    'scrapy_fake_useragent.providers.FakerProvider',  # if FakeUserAgentProvider fails, we'll use faker to generate a user-agent string for us
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',  # fall back to USER_AGENT value
]
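
For reference, here is a minimal debugging sketch (my own addition, with a hypothetical class name and module path): a small downloader middleware that logs the User-Agent header of each outgoing request, so it is possible to see which generated UA is in play when the timeout happens.

import logging

logger = logging.getLogger(__name__)

class LogUserAgentMiddleware:
    """Log the outgoing User-Agent so the generated value can be inspected."""

    def process_request(self, request, spider):
        # Scrapy stores header values as bytes; fall back to b'' if the header is unset.
        ua = request.headers.get('User-Agent', b'').decode('utf-8', 'replace')
        logger.debug('Outgoing User-Agent for %s: %s', request.url, ua)
        # Returning None lets the request continue through the middleware chain.
        return None

Assuming the class lives in crawler.middlewares, it would be enabled with a priority above RandomUserAgentMiddleware (e.g. 450) so its process_request runs after the UA has been set:

DOWNLOADER_MIDDLEWARES['crawler.middlewares.LogUserAgentMiddleware'] = 450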

Timeout

c:\crawler>scrapy fetch https://www.very.co.uk
2020-12-19 19:46:28 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: crawler)
2020-12-19 19:46:28 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Windows-10-10.0.19041-SP0
2020-12-19 19:46:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-12-19 19:46:28 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_MAX_DELAY': 5,
 'AUTOTHROTTLE_START_DELAY': 1.5,
 'BOT_NAME': 'crawler',
 'DOWNLOAD_TIMEOUT': 15,
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'SPIDER_MODULES': ['crawler.spiders'],
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
2020-12-19 19:46:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2020-12-19 19:46:29 [faker.factory] DEBUG: Not in REPL -> leaving logger event level as is.
2020-12-19 19:46:29 [scrapy_fake_useragent.middleware] DEBUG: Loaded User-Agent provider: scrapy_fake_useragent.providers.FakeUserAgentProvider
2020-12-19 19:46:29 [scrapy_fake_useragent.middleware] INFO: Using '<class 'scrapy_fake_useragent.providers.FakeUserAgentProvider'>' as the User-Agent provider
2020-12-19 19:46:29 [scrapy_fake_useragent.middleware] DEBUG: Loaded User-Agent provider: scrapy_fake_useragent.providers.FakeUserAgentProvider
2020-12-19 19:46:29 [scrapy_fake_useragent.middleware] INFO: Using '<class 'scrapy_fake_useragent.providers.FakeUserAgentProvider'>' as the User-Agent provider
2020-12-19 19:46:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware',
 'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-12-19 19:46:29 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-12-19 19:46:30 [scrapy.middleware] INFO: Enabled item pipelines:
['crawler.pipelines.CrawlerPipeline']
2020-12-19 19:46:30 [scrapy.core.engine] INFO: Spider opened
2020-12-19 19:46:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-19 19:46:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.very.co.uk> (failed 1 times): User timeout caused connection failure: Getting https://www.very.co.uk took longer than 15.0 seconds..
2020-12-19 19:47:01 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2020-12-19 19:47:01 [scrapy.core.engine] INFO: Closing spider (shutdown)
2020-12-19 19:47:04 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.very.co.uk> (failed 2 times): User timeout caused connection failure: Getting https://www.very.co.uk took longer than 15.0 seconds..
2020-12-19 19:47:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 689,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'elapsed_time_seconds': 30.17314,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2020, 12, 19, 19, 47, 4, 612704),
 'log_count/DEBUG': 35,
 'log_count/INFO': 11,
 'retry/count': 2,
 'retry/reason_count/twisted.internet.error.TimeoutError': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2020, 12, 19, 19, 46, 34, 439564)}
2020-12-19 19:47:04 [scrapy.core.engine] INFO: Spider closed (shutdown)
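
One further check (again just a sketch, with a placeholder UA string): with the fake-useragent middlewares commented out so the built-in UserAgentMiddleware applies the setting, a specific generated string can be tested for the same timeout by overriding USER_AGENT on the command line via Scrapy's -s option:

c:\crawler>scrapy fetch -s USER_AGENT="<paste a generated UA string here>" https://www.very.co.uk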