Custom Splash responses #45

Merged: 36 commits, Apr 2, 2016

Commits
19d1170
initial implementation of custom Splash responses
kmike Mar 29, 2016
ca52e1d
small cleanup
kmike Mar 29, 2016
82ccc1c
SplashRequest: get _original_url from request meta; improve __repr__
kmike Mar 29, 2016
637595b
expose SplashRequest as scrapyjs.SplashRequest
kmike Mar 29, 2016
eeb5972
fixed creation of custom Splash response classes
kmike Mar 29, 2016
f25075e
fixed response caching and duplication detection
kmike Mar 29, 2016
67b8c5e
ignore .scrapy folder
kmike Mar 29, 2016
ddd515d
response class is fixed by middleware for cached responses
kmike Mar 29, 2016
1ec775f
DOC typo fix
kmike Mar 29, 2016
32126fd
DOC mention that SplashRequest handles URL fragments automatically
kmike Mar 29, 2016
913aff2
add an option to return unprocessed responses
kmike Mar 30, 2016
74df321
TST enable branch coverage
kmike Mar 30, 2016
e581c66
special handling of some JSON keys in Splash responses
kmike Mar 30, 2016
86427bd
TST test that caching works for SplashResponses
kmike Mar 30, 2016
c3cd10d
TST add htmlcov to .gitignore
kmike Mar 30, 2016
a5d7070
only fix Response class if it is not fixed yet
kmike Mar 30, 2016
f0519ab
fix Content-Type header for magic responses built from 'html' key
kmike Mar 30, 2016
9b7ed67
extract headers_to_scrapy function
kmike Mar 30, 2016
fee2662
set SplashResponse cookies from 'cookies' json
kmike Mar 30, 2016
c2d7153
PY2 fixed cookie handling
kmike Mar 30, 2016
643ee3c
SplashJsonResponse: extract magic response handling to a method
kmike Mar 30, 2016
0af058b
DOC fixed SplashResponse for Splash servers which use HTTP compression
kmike Mar 30, 2016
373264a
make sure SplashRequest is never handled by AjaxCrawlMiddleware
kmike Mar 31, 2016
b43ea3a
DOC readme and example improvements
kmike Mar 31, 2016
3fc7d09
DOC typo fix
kmike Mar 31, 2016
356f19f
DOC fixed a typo in example
kmike Mar 31, 2016
12e81b7
Fixed repr of SplashRequest when it hasn't reached SplashMiddleware yet
kmike Apr 1, 2016
69eb8f6
Log content of Splash Bad Request errors. See GH-37.
kmike Apr 1, 2016
6b777b5
cookies handling overhaul
kmike Apr 1, 2016
5a54996
response.cookiejar
kmike Apr 2, 2016
10162df
DOC improve cookie docs
kmike Apr 2, 2016
67c1ae8
actually set response.cookiejar
kmike Apr 2, 2016
6637779
cleanup cookie handling: drop unneeded code, add more tests
kmike Apr 2, 2016
5a8954c
TST add a test for SplashRequest repr
kmike Apr 2, 2016
d394c6f
add support for SplashRequest 'cookies' argument
kmike Apr 2, 2016
92018c4
pass headers to Splash by default
kmike Apr 2, 2016
2 changes: 2 additions & 0 deletions .coveragerc
@@ -0,0 +1,2 @@
[run]
branch = true
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,3 +5,5 @@ dist
scrapyjs.egg-info
.cache
.coverage
.scrapy
htmlcov
218 changes: 211 additions & 7 deletions README.rst
@@ -46,17 +46,24 @@ Configuration
SPLASH_URL = 'http://192.168.59.103:8050'

2. Enable the Splash middleware by adding it to ``DOWNLOADER_MIDDLEWARES``
-   in your ``settings.py`` file::
+   in your ``settings.py`` file and changing the HttpCompressionMiddleware
+   priority::

DOWNLOADER_MIDDLEWARES = {
'scrapyjs.SplashCookiesMiddleware': 723,
'scrapyjs.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

.. note::

Order ``725`` is just before ``HttpProxyMiddleware`` (750) in the default
Scrapy settings.

HttpCompressionMiddleware priority should be changed in order to allow
advanced response processing; see https://github.com/scrapy/scrapy/issues/1895
for details.

3. Set a custom ``DUPEFILTER_CLASS``::

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
@@ -81,6 +88,9 @@ Configuration
Usage
=====

Requests
--------

The easiest way to render requests with Splash is to
use ``scrapyjs.SplashRequest``::

@@ -118,6 +128,9 @@ Alternatively, you can use regular scrapy.Request and
'splash_url': '<url>', # optional; overrides SPLASH_URL
'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
'splash_headers': {}, # optional; a dict with headers sent to Splash
'dont_process_response': True, # optional, default is False
'dont_send_headers': True, # optional, default is False
'magic_response': False, # optional, default is True
}
})

@@ -140,7 +153,9 @@ it should be easier to use in most cases.

Note that by default Scrapy escapes URL fragments using the AJAX escaping
scheme. If you want to pass a URL with a fragment to Splash then set ``url``
-in ``args`` dict manually.
+in ``args`` dict manually. This is handled automatically if you use
+``SplashRequest``, but you need to keep it in mind if you use the raw
+``meta['splash']`` API.

Splash 1.8+ is required to handle POST requests; in earlier Splash versions
'http_method' and 'body' arguments are ignored. If you work with ``/execute``
@@ -184,6 +199,125 @@ it should be easier to use in most cases.
It is similar to ``SINGLE_SLOT`` policy, but can be different if you access
other services on the same address as Splash.

* ``meta['splash']['dont_process_response']`` - when set to True,
  SplashMiddleware won't change the response to a custom scrapy.Response
  subclass. By default, one of SplashResponse, SplashTextResponse or
  SplashJsonResponse is passed to the callback for Splash requests.

* ``meta['splash']['dont_send_headers']``: by default ScrapyJS passes
  request headers to Splash in the 'headers' JSON POST field. For all
  render.xxx endpoints this means Scrapy header options are respected by
  default (http://splash.readthedocs.org/en/stable/api.html#arg-headers).
  In Lua scripts you can use the ``headers`` argument of ``splash:go``
  to apply the passed headers: ``splash:go{url, headers=splash.args.headers}``.

  Set 'dont_send_headers' to True if you don't want to pass ``headers``
  to Splash.

* ``meta['splash']['magic_response']`` - when set to True and a JSON
  response is received from Splash, several attributes of the response
  (headers, body, url, status code) are filled using data returned in JSON
  (see the sketch after this list):

  * response.headers are filled from the 'headers' key;
  * response.url is set to the value of the 'url' key;
  * response.body is set to the value of the 'html' key,
    or to the base64-decoded value of the 'body' key;
  * response.status is set to the value of the 'http_status' key.
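
A minimal sketch of these options passed as ``SplashRequest`` keyword
arguments (the URL and callback names are placeholders; ``dont_send_headers``
works the same way)::

    from scrapyjs import SplashRequest

    # inside a spider method:

    # the callback receives a plain, unprocessed response
    # with raw Splash JSON in its body:
    yield SplashRequest('http://example.com', self.parse_raw,
                        endpoint='render.json', args={'html': 1},
                        dont_process_response=True)

    # the callback receives SplashJsonResponse, but JSON keys
    # are not mapped onto response attributes:
    yield SplashRequest('http://example.com', self.parse_json,
                        endpoint='render.json', args={'html': 1},
                        magic_response=False)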

Responses
---------

ScrapyJS returns Response subclasses for Splash requests:

* SplashResponse is returned for binary Splash responses - e.g. for
/render.png responses;
* SplashTextResponse is returned when the result is text - e.g. for
/render.html responses;
* SplashJsonResponse is returned when the result is a JSON object - e.g.
for /render.json responses or /execute responses when script returns
a Lua table.

To use standard Response classes set ``meta['splash']['dont_process_response']=True``
or pass the ``dont_process_response=True`` argument to SplashRequest.

All these responses set ``response.url`` to the URL of the original request
(i.e. to the URL of a website you want to render), not to the URL of the
requested Splash endpoint. The Splash endpoint URL is still available as
``response.real_url``.

SplashJsonResponse provides extra features:

* ``response.data`` attribute contains response data decoded from JSON;
you can access it like ``response.data['html']``.

* If Splash session handling is configured, you can access current cookies
as ``response.cookiejar``; it is a CookieJar instance.

* If Scrapy-Splash response magic is enabled for a request (the default),
  several response attributes (headers, body, url, status code)
  are set automatically from the original response body:

  * response.headers are filled from the 'headers' key;
  * response.url is set to the value of the 'url' key;
  * response.body is set to the value of the 'html' key,
    or to the base64-decoded value of the 'body' key;
  * response.status is set to the value of the 'http_status' key.

When ``response.body`` is updated in SplashJsonResponse
(either from the 'html' or the 'body' key), the familiar ``response.css``
and ``response.xpath`` methods are available.

To turn off special handling of JSON result keys either set
``meta['splash']['magic_response']=False`` or pass the ``magic_response=False``
argument to SplashRequest.
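
A minimal callback sketch using these features (the 'html' and 'png' keys
are only present if the endpoint or script returns them)::

    import base64

    def parse_result(self, response):
        # decoded JSON returned by Splash:
        html = response.data['html']
        png_bytes = base64.b64decode(response.data['png'])

        # with magic responses enabled (default), response.body is
        # already set from the 'html' key, so selectors work as usual:
        title = response.css('title::text').extract_first()

        # current cookies, if session handling is configured:
        if getattr(response, 'cookiejar', None) is not None:
            for cookie in response.cookiejar:
                print(cookie.name, cookie.value)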

Session Handling
================

Splash itself is stateless - each request starts from a clean state.
In order to support sessions the following is required:

1. client (Scrapy) must send current cookies to Splash;
2. Splash script should make requests using these cookies and update
them from HTTP response headers or JavaScript code;
3. updated cookies should be sent back to the client;
4. client should merge current cookies with the updated cookies.

For (2) and (3) Splash provides ``splash:get_cookies()`` and
``splash:init_cookies()`` methods which can be used in Splash Lua scripts.

ScrapyJS provides helpers for (1) and (4): to send current cookies
in the 'cookies' field and merge cookies back from the 'cookies' response
field, set ``request.meta['splash']['args']['session_id']`` to the session
identifier. If you only want a single session, use the same ``session_id``
for all requests; any value like '1' or 'foo' is fine.

For ScrapyJS session handling to work you must use the ``/execute`` endpoint
and a Lua script which accepts a 'cookies' argument and returns a 'cookies'
field in the result::

function main(splash)
splash:init_cookies(splash.args.cookies)

-- ... your script

return {
cookies = splash:get_cookies(),
-- ... other results, e.g. html
}
end

SplashRequest sets ``session_id`` automatically for the ``/execute`` endpoint,
i.e. cookie handling is enabled by default if you use SplashRequest,
the ``/execute`` endpoint and a compatible Lua rendering script.

If you want to start from the same set of cookies, but then 'fork' sessions,
set ``request.meta['splash']['args']['new_session_id']`` in addition to
``session_id``. Request cookies will be fetched from cookiejar ``session_id``,
but response cookies will be merged back to the ``new_session_id`` cookiejar.

    Contributor (review comment):

    I think both session_id and new_session_id are currently
    (https://github.com/scrapy-plugins/scrapy-splash/pull/45/files#diff-93e5c0fca1f417cfa28d48c75408be45R59)
    searched for in ``request.meta['splash']``, not in
    ``request.meta['splash']['args']``. Which is more correct, the docs
    or the code?

    Member Author (reply):

    A good catch; the code is correct. args are arguments sent to Splash;
    new_session_id is a SplashMiddleware option.

The standard Scrapy ``cookies`` argument can be used with ``SplashRequest``
to add cookies to the current Splash cookiejar.
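
A minimal sketch of the client side, assuming the Lua script above is stored
in a ``lua_script`` variable (the ``session_id`` value is arbitrary, and
SplashRequest would set one automatically for ``/execute`` anyway; the cookie
shown is a placeholder)::

    # inside a spider method:
    yield SplashRequest(url, self.parse_result,
                        endpoint='execute',
                        args={
                            'lua_source': lua_script,
                            # reuse one cookiejar across all requests:
                            'session_id': '1',
                        },
                        # the standard Scrapy 'cookies' argument adds
                        # cookies to the current Splash cookiejar:
                        cookies={'currency': 'USD'})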

Examples
========
@@ -226,9 +360,16 @@ Get HTML contents and a screenshot::

# ...
def parse_result(self, response):
-        data = json.loads(response.body_as_unicode())
-        body = data['html']
-        png_bytes = base64.b64decode(data['png'])
+        # magic responses are turned ON by default,
+        # so the result under 'html' key is available as response.body
+        html = response.body
+
+        # you can also query the html result as usual
+        title = response.css('title').extract_first()
+
+        # full decoded JSON data is available as response.data:
+        png_bytes = base64.b64decode(response.data['png'])

# ...

Run a simple `Splash Lua Script`_::
@@ -317,6 +458,52 @@ Note how arguments are passed to the script::
# ...


Use a Lua script to get an HTML response with cookies and headers set to
correct values::

import scrapy
from scrapyjs import SplashRequest

script = """
function last_response_headers(splash)
local entries = splash:history()
local last_entry = entries[#entries]
return last_entry.response.headers
end

function main(splash)
splash:init_cookies(splash.args.cookies)
assert(splash:go{splash.args.url, headers=splash.args.headers})
assert(splash:wait(0.5))

return {
headers = last_response_headers(splash),
cookies = splash:get_cookies(),
html = splash:html(),
}
end
"""

class MySpider(scrapy.Spider):


# ...
yield SplashRequest(url, self.parse_result,
endpoint='execute',
args={'lua_source': script},
headers={'X-My-Header': 'value'},
)

def parse_result(self, response):
    # here response.body contains result HTML;
    # response.headers are filled with headers from the last
    # web page loaded in Splash;
    # cookies from all responses and from JavaScript are collected
    # and put into the Set-Cookie response header, so that Scrapy
    # can remember them
    pass



.. _Splash Lua Script: http://splash.readthedocs.org/en/latest/scripting-tutorial.html


@@ -351,7 +538,7 @@ sure to read the observations after it::

def start_requests(self):
for url in self.start_urls:
body = json.dumps({"url": url, "wait": 0.5})
body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
headers = Headers({'Content-Type': 'application/json'})
yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
body=body, headers=headers)
@@ -373,10 +560,27 @@ aware of:
in unexpected ways since delays and concurrency settings are no longer
per-domain.

-3. Some options depend on each other - for example, if you use timeout_
+3. As seen by Scrapy, response.url is the URL of the Splash server.
+   scrapy-splash fixes it to be the URL of the requested page.
+   The "real" URL is still available as ``response.real_url``.
+
+4. Some options depend on each other - for example, if you use the timeout_
   Splash option then you may want to set the ``download_timeout``
   scrapy.Request meta key as well (see the sketch after this list).

5. It is easy to get things subtly wrong - e.g. if you don't use the
   ``sort_keys=True`` argument when preparing a JSON body then the binary POST
   body content can vary even if all keys and values are the same, which means
   the dupefilter and cache will work incorrectly.

6. Splash Bad Request (HTTP 400) errors are hard to debug because by default
response content is not displayed by Scrapy. SplashMiddleware logs content
of HTTP 400 Splash responses by default (this can be turned off by setting
``SPLASH_LOG_400 = False``).

7. Cookie handling is tedious to implement, and you can't use Scrapy's
   built-in cookie middleware to handle cookies when working with Splash.

ScrapyJS utilities handle such edge cases and reduce the boilerplate.
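
For observation 4 above, a minimal sketch pairing the Splash ``timeout``
argument with Scrapy's ``download_timeout`` meta key (the values are
illustrative)::

    yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                         body=json.dumps({"url": url, "wait": 0.5,
                                          "timeout": 90}, sort_keys=True),
                         headers={'Content-Type': 'application/json'},
                         # give Scrapy a slightly larger budget than Splash,
                         # so Splash can time out first and report an error:
                         meta={'download_timeout': 100})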

.. _HTTP API: http://splash.readthedocs.org/en/latest/api.html
4 changes: 4 additions & 0 deletions example/scrashtest/settings.py
@@ -6,7 +6,11 @@
NEWSPIDER_MODULE = 'scrashtest.spiders'

DOWNLOADER_MIDDLEWARES = {
# Engine side
'scrapyjs.middleware.SplashCookiesMiddleware': 723,
'scrapyjs.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# Downloader side
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
20 changes: 14 additions & 6 deletions example/scrashtest/spiders/dmoz.py
@@ -4,6 +4,8 @@
import scrapy
from scrapy.linkextractors import LinkExtractor

from scrapyjs import SplashRequest


class DmozSpider(scrapy.Spider):
name = "dmoz"
@@ -16,12 +18,18 @@ class DmozSpider(scrapy.Spider):
def parse(self, response):
le = LinkExtractor()
for link in le.extract_links(response):
-            yield scrapy.Request(link.url, self.parse_link, meta={
-                'splash': {
-                    'args': {'har': 1, 'html': 0},
+            yield SplashRequest(
+                link.url,
+                self.parse_link,
+                endpoint='render.json',
+                args={
+                    'har': 1,
+                    'html': 1,
                 }
-            })
+            )

def parse_link(self, response):
-        res = json.loads(response.body_as_unicode())
-        print(res["har"]["log"]["pages"])
+        print("PARSED", response.real_url, response.url)
+        print(response.css("title").extract())
+        print(response.data["har"]["log"]["pages"])
+        print(response.headers.get('Content-Type'))
4 changes: 3 additions & 1 deletion scrapyjs/__init__.py
@@ -1,6 +1,8 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import

-from .middleware import SplashMiddleware, SlotPolicy
+from .middleware import SplashMiddleware, SlotPolicy, SplashCookiesMiddleware
from .dupefilter import SplashAwareDupeFilter, splash_request_fingerprint
from .cache import SplashAwareFSCacheStorage
from .response import SplashResponse, SplashTextResponse, SplashJsonResponse
from .request import SplashRequest