This library provides Scrapy and JavaScript integration using Splash. The license is BSD 3-clause.
Install scrapy-splash using pip:
$ pip install scrapy-splash
Scrapy-Splash uses the Splash HTTP API, so you also need a Splash instance. To install and run Splash, something like this is usually enough:
$ docker run -p 8050:8050 scrapinghub/splash
Check Splash install docs for more info.
Add the Splash server address to the settings.py of your Scrapy project like this:
SPLASH_URL = 'http://192.168.59.103:8050'
Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file and changing HttpCompressionMiddleware priority:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.
HttpCompressionMiddleware priority should be changed in order to allow advanced response processing; see scrapy/scrapy#1895 for details.
Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
This middleware is needed to support the cache_args feature; it saves disk space by not storing duplicate Splash arguments multiple times in the disk request queue. If Splash 2.1+ is used, the middleware also saves network traffic by not sending these duplicate arguments to the Splash server multiple times.
Set a custom REQUEST_FINGERPRINTER_CLASS:

REQUEST_FINGERPRINTER_CLASS = 'scrapy_splash.SplashRequestFingerprinter'
There are also some additional options available. Put them into your settings.py if you want to change the defaults:
- SPLASH_COOKIES_DEBUG is False by default. Set it to True to enable debugging of cookies in the SplashCookiesMiddleware. This option is similar to COOKIES_DEBUG for the built-in Scrapy cookies middleware: it logs sent and received cookies for all requests.
- SPLASH_LOG_400 is True by default; it instructs scrapy-splash to log all 400 errors from Splash. These errors are important because they show errors that occurred while executing the Splash script. Set it to False to disable this logging.
- SPLASH_SLOT_POLICY is scrapy_splash.SlotPolicy.PER_DOMAIN (as an object, not just a string) by default. It specifies how concurrency & politeness are maintained for Splash requests, and sets the default value of the slot_policy argument of SplashRequest, which is described below.
- SCRAPY_SPLASH_REQUEST_FINGERPRINTER_BASE_CLASS is scrapy.settings.default_settings.REQUEST_FINGERPRINTER_CLASS by default. It changes the base class the fingerprinter uses to compute a request fingerprint.

The easiest way to render requests with Splash is to use scrapy_splash.SplashRequest:
yield SplashRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render.html
    splash_url='<url>',      # optional; overrides SPLASH_URL
    slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
)
Alternatively, you can use a regular scrapy.Request and the 'splash' Request meta key:
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,      # optional, default is False
        'magic_response': False,        # optional, default is True
    }
})
Use the request.meta['splash'] API in middlewares or when scrapy.Request subclasses are used (there is also SplashFormRequest, described below). For example, meta['splash'] makes it possible to write a middleware which enables Splash for all outgoing requests by default, as sketched below.
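For illustration, here is a minimal sketch of such a downloader middleware; the class name, the default args and the priority note are assumptions for the example, not part of scrapy-splash:

# Hypothetical example: enable Splash for every request that doesn't opt out.
# Add it to DOWNLOADER_MIDDLEWARES with a priority lower than
# SplashMiddleware (725) so it runs before scrapy-splash processes the request.
class EnableSplashMiddleware:
    def process_request(self, request, spider):
        if 'splash' not in request.meta:
            request.meta['splash'] = {
                'endpoint': 'render.html',
                'args': {'wait': 0.5},
            }
        # returning None lets Scrapy continue handling the request
        return None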
SplashRequest is a convenient utility to fill request.meta['splash']; it should be easier to use in most cases. For each request.meta['splash'] key there is a corresponding SplashRequest keyword argument: for example, to set meta['splash']['args'] use SplashRequest(..., args=myargs).
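For example, the following two requests are meant to be equivalent (a small sketch; the 0.5 second wait is an arbitrary value):

yield SplashRequest(url, self.parse_result, args={'wait': 0.5})

# roughly the same request through the raw meta['splash'] API;
# note that the default endpoint differs (render.html vs render.json),
# so it is set explicitly here
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {'wait': 0.5},
        'endpoint': 'render.html',
    }
})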
meta['splash']['args'] contains arguments sent to Splash. scrapy-splash adds some default keys/values to args:

- 'url' is set to the URL of the request;
- 'http_method' is set to 'POST' for POST requests;
- 'body' is set to the request body for POST requests.

You can override default values by setting them explicitly.
Note that by default Scrapy escapes URL fragments using the AJAX escaping scheme. If you want to pass a URL with a fragment to Splash then set url in the args dict manually. This is handled automatically if you use SplashRequest, but you need to keep it in mind if you use the raw meta['splash'] API.
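As an illustration, a minimal sketch of passing a fragment URL through the raw meta API (the URL is a made-up example):

url = 'http://example.com/page#!state=1'
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        # set 'url' explicitly so the fragment reaches Splash as-is
        'args': {'url': url, 'wait': 0.5},
        'endpoint': 'render.html',
    }
})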
Splash 1.8+ is required to handle POST requests; in earlier Splash versions the 'http_method' and 'body' arguments are ignored. If you work with the /execute endpoint and want to support POST requests you have to handle the http_method and body arguments in your Lua script manually.
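For illustration, a minimal sketch (assuming Splash 1.8+ and the /execute endpoint) of a Lua script that forwards the HTTP method and body; the POST body and the surrounding spider code are placeholders:

script = """
function main(splash)
    assert(splash:go{
        splash.args.url,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    return {html = splash:html()}
end
"""

# scrapy-splash prefills args.http_method and args.body for POST requests
yield SplashRequest(url, self.parse_result, endpoint='execute',
                    method='POST', body='a=1&b=2',
                    args={'lua_source': script})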
meta['splash']['cache_args'] is a list of argument names to cache on the Splash side. These arguments are sent to Splash only once, then cached values are used; this saves network traffic and decreases request queue disk memory usage. Use cache_args only for large arguments which don't change with each request; lua_source is a good candidate (if you don't use string formatting to build it). Splash 2.1+ is required for this feature to work.
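A minimal sketch of cache_args usage, assuming a Lua script stored in the script variable:

yield SplashRequest(url, self.parse_result, endpoint='execute',
                    cache_args=['lua_source'],
                    args={'lua_source': script})

# equivalent raw meta API form
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'endpoint': 'execute',
        'args': {'lua_source': script},
        'cache_args': ['lua_source'],
    }
})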
meta['splash']['endpoint'] is the Splash endpoint to use. For SplashRequest, render.html is used by default. If you're using a raw scrapy.Request then render.json is the default (for historical reasons). It is better to always pass the endpoint explicitly.
See Splash HTTP API docs for a full list of available endpoints and parameters.
meta['splash']['splash_url'] overrides the Splash URL set in settings.py.
meta['splash']['splash_headers'] allows you to add or change headers which are sent to the Splash server. Note that this option is not for setting headers which are sent to the remote website.
meta['splash']['slot_policy'] customizes how concurrency & politeness are maintained for Splash requests. Currently there are 3 policies available:

- scrapy_splash.SlotPolicy.PER_DOMAIN (default) - send Splash requests to downloader slots based on the URL being rendered. It is useful if you want to maintain per-domain politeness & concurrency settings.
- scrapy_splash.SlotPolicy.SINGLE_SLOT - send all Splash requests to a single downloader slot. It is useful if you want to throttle requests to Splash.
- scrapy_splash.SlotPolicy.SCRAPY_DEFAULT - don't do anything with slots. It is similar to the SINGLE_SLOT policy, but can be different if you access other services on the same address as Splash.

meta['splash']['dont_process_response'] - when set to True, SplashMiddleware won't change the response to a custom scrapy.Response subclass. By default, for Splash requests one of SplashResponse, SplashTextResponse or SplashJsonResponse is passed to the callback.
meta['splash']['dont_send_headers']: by default scrapy-splash passes request headers to Splash in the 'headers' JSON POST field. For all render.xxx endpoints it means Scrapy header options are respected by default (http://splash.readthedocs.org/en/stable/api.html#arg-headers). In Lua scripts you can use the headers argument of splash:go to apply the passed headers: splash:go{url, headers=splash.args.headers}.

Set 'dont_send_headers' to True if you don't want to pass headers to Splash.
meta['splash']['http_status_from_error_code'] - set response.status to the HTTP error code when assert(splash:go(..)) fails; it requires meta['splash']['magic_response']=True. The http_status_from_error_code option is False by default if you use the raw meta API; SplashRequest sets it to True by default.
meta['splash']['magic_response'] - when set to True and a JSON response is received from Splash, several attributes of the response (headers, body, url, status code) are filled using data returned in JSON:

- response.headers are filled from the 'headers' key;
- response.url is set to the value of the 'url' key;
- response.body is set to the value of the 'html' key, or to the base64-decoded value of the 'body' key;
- response.status is set from the value of the 'http_status' key. When meta['splash']['http_status_from_error_code'] is True and assert(splash:go(..)) fails with an HTTP error, response.status is also set to the HTTP error code.

The original URL, status and headers are available as response.real_url, response.splash_response_status and response.splash_response_headers.

This option is set to True by default if you use SplashRequest. The render.json and execute endpoints may not have all the necessary keys/values in the response. For non-JSON endpoints, only url is filled, regardless of the magic_response setting.
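A minimal sketch of a callback reading these attributes (assuming a SplashRequest to a JSON endpoint with magic responses left enabled):

def parse_result(self, response):
    page_url = response.url                           # URL of the rendered page
    splash_url = response.real_url                    # URL of the Splash endpoint
    splash_status = response.splash_response_status   # status of the Splash HTTP response
    splash_headers = response.splash_response_headers # headers of the Splash HTTP response
    # ...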
Use scrapy_splash.SplashFormRequest if you want to make a FormRequest via Splash. It accepts the same arguments as SplashRequest, and also formdata, like FormRequest from Scrapy:

>>> from scrapy_splash import SplashFormRequest
>>> SplashFormRequest('http://example.com', formdata={'foo': 'bar'})
<POST http://example.com>
SplashFormRequest.from_response is also supported, and works as described in the Scrapy documentation.
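A minimal sketch of from_response usage (the form field names and the callback are placeholders):

def parse_login_page(self, response):
    yield SplashFormRequest.from_response(
        response,
        formdata={'username': 'user', 'password': 'secret'},
        callback=self.after_login,
    )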
scrapy-splash returns Response subclasses for Splash requests:

- SplashResponse is returned for binary Splash responses;
- SplashTextResponse is returned when the result is text;
- SplashJsonResponse is returned when the result is a JSON object.

To use standard Response classes, set meta['splash']['dont_process_response']=True or pass the dont_process_response=True argument to SplashRequest.
All these responses set response.url to the URL of the original request (i.e. to the URL of the website you want to render), not to the URL of the requested Splash endpoint. The "true" URL is still available as response.real_url.
SplashJsonResponse provides extra features:

- The response.data attribute contains response data decoded from JSON; you can access it like response.data['html'].
- response.cookiejar is a CookieJar instance with the cookies returned by Splash.
- When response.body is updated in SplashJsonResponse (either from the 'html' or from the 'body' key), the familiar response.css and response.xpath methods are available.
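A minimal sketch of a callback using these features, assuming a render.json request with 'html' and 'png' enabled:

import base64

# inside your spider
def parse_result(self, response):
    html = response.data['html']                        # decoded JSON data
    png_bytes = base64.b64decode(response.data['png'])  # binary screenshot
    title = response.css('title::text').get()           # selectors work on response.body
    # ...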
To turn off special handling of JSON result keys, either set meta['splash']['magic_response']=False or pass the magic_response=False argument to SplashRequest.
Splash itself is stateless - each request starts from a clean state. In order to support sessions the following is required:

1. the client (Scrapy) must send current cookies to Splash;
2. the Splash script should make requests using these cookies and update them from HTTP response headers or JavaScript code;
3. the updated cookies should be sent back to the client;
4. the client should merge current cookies with the updated cookies.
For (2) and (3) Splash provides the splash:get_cookies() and splash:init_cookies() methods which can be used in Splash Lua scripts.
scrapy-splash provides helpers for (1) and (4): to send current cookies in the 'cookies' field and merge cookies back from the 'cookies' response field, set request.meta['splash']['session_id'] to the session identifier. If you only want a single session, use the same session_id for all requests; any value like '1' or 'foo' is fine.
For scrapy-splash session handling to work you must use the /execute endpoint and a Lua script which accepts a 'cookies' argument and returns a 'cookies' field in the result:

function main(splash)
    splash:init_cookies(splash.args.cookies)

    -- ... your script

    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end
SplashRequest sets session_id automatically for the /execute endpoint, i.e. cookie handling is enabled by default if you use SplashRequest, the /execute endpoint and a compatible Lua rendering script.
If you want to start from the same set of cookies but then 'fork' sessions, set request.meta['splash']['new_session_id'] in addition to session_id. Request cookies will be fetched from the cookiejar session_id, but response cookies will be merged back into the new_session_id cookiejar.
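A minimal sketch of forking a session through the raw meta API (the session identifiers and the script variable are placeholders):

yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'endpoint': 'execute',
        'args': {'lua_source': script},
        'session_id': 'base',        # request cookies come from this cookiejar
        'new_session_id': 'fork-1',  # response cookies are merged into this one
    }
})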
The standard Scrapy cookies argument can be used with SplashRequest to add cookies to the current Splash cookiejar.
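For example, a small sketch of adding cookies via the standard cookies argument (the cookie values and the script variable are placeholders):

yield SplashRequest(url, self.parse_result, endpoint='execute',
                    args={'lua_source': script},
                    cookies=[{'name': 'currency', 'value': 'USD',
                              'domain': 'example.com'}])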
Get HTML contents:
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "MySpider"
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
Get HTML contents and a screenshot:
import json
import base64

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):

    # ...
        splash_args = {
            'wait': 1,
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
        }
        yield SplashRequest(url, self.parse_result, endpoint='render.json',
                            args=splash_args)

    # ...
    def parse_result(self, response):
        # magic responses are turned ON by default,
        # so the result under the 'html' key is available as response.body
        html = response.body

        # you can also query the html result as usual
        title = response.css('title').extract_first()

        # full decoded JSON data is available as response.data:
        png_bytes = base64.b64decode(response.data['png'])

        # ...
Run a simple Splash Lua Script:
import json
import base64

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):

    # ...
        script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            return splash:evaljs("document.title")
        end
        """
        yield SplashRequest(url, self.parse_result, endpoint='execute',
                            args={'lua_source': script})

    # ...
    def parse_result(self, response):
        doc_title = response.text
        # ...
A more complex Splash Lua script example - get a screenshot of an HTML element by its CSS selector (it requires Splash 2.1+). Note how arguments are passed to the script:
import json
import base64

import scrapy
from scrapy_splash import SplashRequest

script = """
-- Arguments:
-- * url - URL to render;
-- * css - CSS selector to render;
-- * pad - screenshot padding size.

-- this function adds padding around region
function pad(r, pad)
    return {r[1]-pad, r[2]-pad, r[3]+pad, r[4]+pad}
end

-- main script
function main(splash)

    -- this function returns element bounding box
    local get_bbox = splash:jsfunc([[
        function(css) {
            var el = document.querySelector(css);
            var r = el.getBoundingClientRect();
            return [r.left, r.top, r.right, r.bottom];
        }
    ]])

    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))

    -- don't crop image by a viewport
    splash:set_viewport_full()

    local region = pad(get_bbox(splash.args.css), splash.args.pad)
    return splash:png{region=region}
end
"""


class MySpider(scrapy.Spider):

    # ...
        yield SplashRequest(url, self.parse_element_screenshot,
            endpoint='execute',
            args={
                'lua_source': script,
                'pad': 32,
                'css': 'a.title'
            }
        )

    # ...
    def parse_element_screenshot(self, response):
        image_data = response.body  # binary image data in PNG format
        # ...
Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values; the lua_source argument value is cached on the Splash server and is not sent with each request (it requires Splash 2.1+):

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    assert(splash:wait(0.5))

    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""


class MySpider(scrapy.Spider):

    # ...
        yield SplashRequest(url, self.parse_result,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},
            headers={'X-My-Header': 'value'},
        )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
If you need to use HTTP Basic Authentication to access Splash, use the optional SPLASH_USER and SPLASH_PASS settings:

SPLASH_USER = 'user'
SPLASH_PASS = 'userpass'
Another option is meta['splash']['splash_headers']: it allows you to set custom headers which are sent to the Splash server; add an Authorization header to splash_headers if you want to change credentials per-request:

import scrapy
from w3lib.http import basic_auth_header
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    # ...
    def start_requests(self):
        auth = basic_auth_header('user', 'userpass')
        yield SplashRequest(url, self.parse,
                            splash_headers={'Authorization': auth})
WARNING: Don't use HttpAuthMiddleware (i.e. the http_user / http_pass spider attributes) for Splash authentication: if you occasionally send a non-Splash request from your spider, you may expose Splash credentials to a remote website, as HttpAuthMiddleware sets credentials for all requests unconditionally.
The obvious alternative to scrapy-splash would be to send requests directly to the Splash HTTP API. Take a look at the example below and make sure to read the observations after it:
import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"


class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
It works and is easy enough, but there are some issues that you should be aware of:

- As seen by Scrapy, requests are sent to RENDER_HTML_URL instead of the target URLs. It affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. could behave in unexpected ways since delays and concurrency settings are no longer per-domain.
- As seen by Scrapy, the response URL is the URL of the Splash endpoint rather than the URL of the rendered page; with scrapy-splash the endpoint URL is still available as response.real_url. scrapy-splash also allows handling response.status and response.headers transparently on the Scrapy side.
- Some options depend on each other; for example, if you use a Splash timeout you may want to set the download_timeout scrapy.Request meta key as well.
- It is easy to get subtle details wrong; for example, if you don't pass the sort_keys=True argument when preparing the JSON body then the binary POST body content could vary even if all keys and values are the same, and it means the dupefilter and cache will work incorrectly.
- Splash Bad Request (HTTP 400) errors are hard to debug because Scrapy doesn't display the response content by default; scrapy-splash logs it by default (this can be turned off with the SPLASH_LOG_400 = False option).
- Large arguments which don't change with each request (e.g. lua_source) may take a lot of space when saved to Scrapy disk request queues. scrapy-splash provides a way to store such static parameters only once.
- Splash 2.1+ can cache large static arguments on the server, but the client must send proper save_args and load_args values and handle HTTP 498 responses.

scrapy-splash utilities allow handling such edge cases and reduce the boilerplate.
For any other help, the best approach is to ask a question on Stack Overflow.
Source code and bug tracker are on github: https://github.com/scrapy-plugins/scrapy-splash
To run tests, install the "tox" Python package and then run the tox command from the source checkout.
To run integration tests, start Splash and set the SPLASH_URL env variable to the Splash address before running tox:

docker run -d --rm -p8050:8050 scrapinghub/splash:3.0
SPLASH_URL=http://127.0.0.1:8050 tox -e py36