Splash can execute custom rendering scripts written in the Lua programming language. This allows us to use Splash as a browser automation tool similar to PhantomJS.
To execute a script and get the result back, send it to the execute (or run) endpoint in a lua_source argument. We'll be using the execute endpoint in this tutorial.
Note
Most likely you'll be able to follow Splash scripting examples even without knowing Lua; nevertheless, the language is worth learning. With Lua you can, for example, write Redis, Nginx, Apache, and World of Warcraft scripts, create mobile apps using Corona, or use the state-of-the-art Deep Learning framework Torch7. It is easy to get started, and there are good online resources available, such as the tutorial Learn Lua in 15 minutes and the book Programming in Lua.
Let's start with a basic example:

    function main(splash, args)
        splash:go("http://example.com")
        splash:wait(0.5)
        local title = splash:evaljs("document.title")
        return {title=title}
    end

If we submit this script to the execute endpoint in a lua_source argument, Splash will go to the example.com website, wait until it loads, wait another half-second, then get the page title (by evaluating a JavaScript snippet in page context), and then return the result as a JSON-encoded object.
Note
Splash UI provides an easy way to try scripts: there is a code editor for Lua and a button to submit a script to execute. Visit http://127.0.0.1:8050/ (or whatever host/port Splash is listening on).
To run scripts from your programming environment, you need to send HTTP requests to Splash. See the "How to send requests to Splash HTTP API?" FAQ section; it contains recipes for some common setups (e.g. Python + requests library).

Entry Point: the "main" Function

The script must provide a "main" function, which is called by Splash. The result is returned as an HTTP response. The script can contain other helper functions and statements, but "main" is required.
In the first example the "main" function returned a Lua table (an associative array similar to a JavaScript Object or a Python dict). Such results are returned as JSON.
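As a minimal sketch of a script that keeps a helper function next to main (the helper name greeting is made up for illustration and is not part of the Splash API):

```lua
-- Hypothetical helper; any name other than "main" is free for your own use.
local function greeting(name)
  return "Hello, " .. name .. "!"
end

-- Splash calls main; the returned table is sent back as JSON.
function main(splash)
  return {message = greeting("world")}
end
```

This sketch does not touch the splash object at all, so it only illustrates how the script is laid out.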
The following will return the string {"hello":"world!"} as an HTTP response:

    function main(splash)
        return {hello="world!"}
    end

The script can also return a string:

    function main(splash)
        return 'hello'
    end

Strings are returned as-is (unlike tables they are not encoded to JSON). Let's check it with curl:

    $ curl 'http://127.0.0.1:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend'
    hello

The "main" function receives an object that allows us to control the "browser tab". All Splash features are exposed using this object. By convention, this argument is called "splash", but you are not required to follow this convention:

    function main(please)
        please:go("http://example.com")
        please:wait(0.5)
        return "ok"
    end

Where Are My Callbacks?

Here is a snippet from our first example:

    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")

The code looks like standard procedural code; there are no callbacks or fancy control-flow structures. This doesn't mean Splash works synchronously; under the hood it is still async. When you call splash:wait(0.5), Splash switches from the script to other tasks and comes back after 0.5s.
Loops, conditional statements, and functions can be used as usual in Splash scripts, which makes the code more straightforward.
Let's check an example PhantomJS script:

    // Render Multiple URLs to file

    "use strict";
    var RenderUrlsToFile, arrayOfUrls, system;

    system = require("system");

    /*
    Render given urls
    @param array of URLs to render
    @param callbackPerUrl Function called after finishing each URL, including the last URL
    @param callbackFinal Function called after finishing everything
    */
    RenderUrlsToFile = function(urls, callbackPerUrl, callbackFinal) {
        var getFilename, next, page, retrieve, urlIndex, webpage;
        urlIndex = 0;
        webpage = require("webpage");
        page = null;
        getFilename = function() {
            return "rendermulti-" + urlIndex + ".png";
        };
        next = function(status, url, file) {
            page.close();
            callbackPerUrl(status, url, file);
            return retrieve();
        };
        retrieve = function() {
            var url;
            if (urls.length > 0) {
                url = urls.shift();
                urlIndex++;
                page = webpage.create();
                page.viewportSize = {
                    width: 800,
                    height: 600
                };
                page.settings.userAgent = "Phantom.js bot";
                return page.open("http://" + url, function(status) {
                    var file;
                    file = getFilename();
                    if (status === "success") {
                        return window.setTimeout((function() {
                            page.render(file);
                            return next(status, url, file);
                        }), 200);
                    } else {
                        return next(status, url, file);
                    }
                });
            } else {
                return callbackFinal();
            }
        };
        return retrieve();
    };

    arrayOfUrls = null;

    if (system.args.length > 1) {
        arrayOfUrls = Array.prototype.slice.call(system.args, 1);
    } else {
        console.log("Usage: phantomjs render_multi_url.js [domain.name1, domain.name2, ...]");
        arrayOfUrls = ["www.google.com", "www.bbc.co.uk", "phantomjs.org"];
    }

    RenderUrlsToFile(arrayOfUrls, (function(status, url, file) {
        if (status !== "success") {
            return console.log("Unable to render '" + url + "'");
        } else {
            return console.log("Rendered '" + url + "' at '" + file + "'");
        }
    }), function() {
        return phantom.exit();
    });

The code is (arguably) tricky: the RenderUrlsToFile function implements a loop by creating a chain of callbacks, and the page.open callback doesn't return a value (that would be more complex to implement); instead, the result is saved to disk.
A similar Splash script:

    function main(splash, args)
        splash:set_viewport_size(800, 600)
        splash:set_user_agent('Splash bot')
        local example_urls = {"www.google.com", "www.bbc.co.uk", "scrapinghub.com"}
        local urls = args.urls or example_urls
        local results = {}
        for _, url in ipairs(urls) do
            local ok, reason = splash:go("http://" .. url)
            if ok then
                splash:wait(0.2)
                results[url] = splash:png()
            end
        end
        return results
    end

It is not doing exactly the same work: instead of saving screenshots to files, we're returning PNG data to the client via the HTTP API.
Observations:

- instead of a page.open callback which receives a "status" argument, there is a "blocking" splash:go call which returns an "ok" flag;
- looping is done with a standard Lua for loop, without a need to create a recursive callback chain;
- some Lua knowledge helps: ipairs or string concatenation via .. could be unfamiliar;
- error handling is different: in case of an HTTP 4xx or 5xx error PhantomJS doesn't report a failure to the page.open callback - the example script will get a screenshot nevertheless because "status" won't be "fail"; in Splash this error will be detected;
- apparently, PhantomJS allows creating multiple page objects and running several page.open requests in parallel (?); Splash only provides a single "browser tab" to a script via the splash parameter of the main function (but you're free to send multiple concurrent requests with Lua scripts to Splash).

There are great PhantomJS wrappers like CasperJS and NightmareJS which (among other things) bring a sync-looking syntax to PhantomJS scripts by providing custom control flow mini-languages. However, they all have their own gotchas and edge cases (loops? moving code to helper functions? error handling?). Splash scripts are standard Lua code.
Note
PhantomJS itself and its wrappers are great and deserve lots of respect; please don't take this writeup as an attack on them. These tools are much more mature and feature-complete than Splash. Splash tries to look at the problem from a different angle, but for each unique Splash feature there is a unique PhantomJS feature.
To read more about Splash Lua API features check Splash Lua API Overview.

Living Without Callbacks

Note
For the curious, Splash uses Lua coroutines under the hood.
Internally, the "main" function is executed as a coroutine by Splash, and some of the splash:foo() methods use coroutine.yield. See http://www.lua.org/pil/9.html for a Lua coroutines tutorial.
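To make the coroutine idea concrete, here is a toy sketch in plain Lua. It is not Splash's implementation: fake_splash, run_script, and the log are invented for illustration. It only shows how a method that calls coroutine.yield lets a driver pause a linear-looking main function and resume it later.

```lua
-- Toy stand-in for the splash object; NOT the real Splash API.
local fake_splash = {}

function fake_splash:wait(seconds)
  -- Hand control back to the driver together with a request; the value
  -- passed to the next coroutine.resume becomes this call's return value.
  return coroutine.yield({op = "wait", time = seconds})
end

-- Linear-looking script, written the same way as a Splash main function.
local function main(splash)
  local first = splash:wait(0.5)
  local second = splash:wait(0.2)
  return {first = first, second = second}
end

-- Hypothetical driver: runs the script as a coroutine and answers each request.
local function run_script(script, splash)
  local log = {}
  local co = coroutine.create(script)
  local ok, value = coroutine.resume(co, splash)
  while ok and coroutine.status(co) == "suspended" do
    -- A real event loop would do other work here before resuming.
    log[#log + 1] = value.op .. " " .. value.time
    ok, value = coroutine.resume(co, true)
  end
  return ok, value, log
end

local ok, result, log = run_script(main, fake_splash)
```

The script body never mentions coroutines; only the method and the driver do, which is why scripts can read like ordinary procedural code.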
In Splash scripts it is not explicit which calls are async and which calls are blocking; this is a common criticism of coroutines/greenlets. Check this article for a good description of the problem.
However, these drawbacks have little practical impact on Splash scripts: scripts are meant to be small, shared state is minimized, and the API is designed to execute a single command at a time, so in most cases the control flow is linear.
If you want to be safe, think of all splash methods as async; assume that after you call splash:foo() the webpage being rendered can change. Often that's the point of calling a method, e.g. splash:wait(time) or splash:go(url) only make sense because the webpage changes after calling them, but still, keep it in mind.
There are async methods like splash:go, splash:wait, splash:wait_for_resume, etc.; most splash methods are currently not async, but treating them as async will allow your scripts to keep working if we ever change that.

Calling Splash Methods

Unlike in many languages, methods in Lua are usually separated from an object using a colon (:); to call the "foo" method of the "splash" object, use the splash:foo() syntax. See http://www.lua.org/pil/16.html for more details.
There are two main ways to call Lua methods in Splash scripts: using positional and named arguments. To call a method using positional arguments, use parentheses, as in splash:foo(val1, val2). To call it with named arguments, use curly braces, as in splash:foo{name1=val1, name2=val2}:

    -- Examples of positional arguments:
    splash:go("http://example.com")
    splash:wait(0.5, false)
    local title = splash:evaljs("document.title")

    -- The same using keyword arguments:
    splash:go{url="http://example.com"}
    splash:wait{time=0.5, cancel_on_redirect=false}
    local title = splash:evaljs{source="document.title"}

    -- Mixed arguments example:
    splash:wait{0.5, cancel_on_redirect=false}

For convenience, all splash methods are designed to support both styles of calling: positional and named. But since there are no "real" named arguments in Lua, most Lua functions (including the ones from the standard library) choose to support just positional arguments.
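The curly-brace form works because Lua treats obj:method{...} as shorthand for obj:method({...}), that is, a call with a single table argument. The following plain-Lua sketch shows one way a method could accept both styles; fake_splash and its argument handling are hypothetical and only illustrate the idea, not how Splash implements it.

```lua
-- Toy object; NOT the real Splash API.
local fake_splash = {}

-- Accepts wait(time, cancel_on_redirect) or wait{time=..., cancel_on_redirect=...}.
function fake_splash:wait(time, cancel_on_redirect)
  if type(time) == "table" then
    local args = time
    -- Named keys win; args[1] and args[2] cover the mixed form wait{0.5, ...}.
    time = args.time or args[1]
    cancel_on_redirect = args.cancel_on_redirect
    if cancel_on_redirect == nil then
      cancel_on_redirect = args[2]
    end
  end
  return time, cancel_on_redirect
end

-- All three calls end up with the same pair of values.
local t1, c1 = fake_splash:wait(0.5, false)
local t2, c2 = fake_splash:wait{time = 0.5, cancel_on_redirect = false}
local t3, c3 = fake_splash:wait{0.5, cancel_on_redirect = false}
```

Checking for nil rather than using "or" for cancel_on_redirect matters here, because false is a legitimate value for that argument.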
There are two ways to report errors in Lua: raise an exception and return an error flag. See http://www.lua.org/pil/8.3.html.
Splash uses the following convention: methods that can fail, such as splash:go, return ok, reason pairs, which the developer can either handle or ignore.
If main results in an unhandled exception, then Splash returns an HTTP 400 response with an error message.
It is possible to raise an exception manually using the Lua error function:

    error("A message to be returned in a HTTP 400 response")

To handle Lua exceptions (and prevent Splash from returning an HTTP 400 response), use Lua pcall; see http://www.lua.org/pil/8.4.html.
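As a rough sketch of the pattern in plain Lua (risky_step is a made-up stand-in for any code that may raise), pcall runs a function in protected mode and returns a success flag followed by either the function's results or the error value:

```lua
-- Hypothetical step that raises an exception with a table as the error value.
local function risky_step(url)
  error({reason = "could not load " .. url})
end

-- pcall catches the exception instead of letting it propagate.
local ok, err = pcall(risky_step, "http://example.com")

local result
if ok then
  result = {status = "done"}
else
  -- The exception was caught, so the script can build a normal response.
  result = {status = "failed", reason = err.reason}
end
```

In a Splash script the same structure would sit inside main, with the protected function calling splash methods and main returning the result table.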
To convert "status flag" errors to exceptions, the Lua assert function can be used. For example, if you expect a website to work and don't want to handle errors manually, assert lets you stop processing and return HTTP 400 if the assumption is wrong:

    local ok, msg = splash:go("http://example.com")
    if not ok then
        -- handle error somehow, e.g.
        error(msg)
    end

    -- a shortcut for the code above: use assert
    assert(splash:go("http://example.com"))

Sandbox

By default Splash scripts are executed in a restricted environment: not all standard Lua modules and functions are available, Lua require is restricted, and there are resource limits (quite loose though).
To disable the sandbox, start Splash with the --disable-lua-sandbox option:

    $ docker run -it -p 8050:8050 scrapinghub/splash --disable-lua-sandbox