Apify is a platform built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to compute instances (Actors), convenient request and result storages, proxies, scheduling, webhooks and more, accessible through a web interface or an API.
While we think that the Apify platform is super cool, and it's definitely worth signing up for a free account, Crawlee is and will always be open source, runnable locally or on any cloud infrastructure.
note
We do not test Crawlee in other cloud environments such as Lambda or on specific architectures such as Raspberry Pi. We strive to make it work, but there are no guarantees.
Logging into Apify platform from Crawlee
To access your Apify account from Crawlee, you must provide credentials - your API token. You can do that either by utilizing Apify CLI or with environment variables.
Once you provide credentials to your scraper, you will be able to use all the Apify platform features, such as calling actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on.
Log in with CLI
Apify CLI allows you to log in to your Apify account on your computer. If you then run your scraper using the CLI, your credentials will automatically be added.
npm install -g apify-cli
apify login -t YOUR_API_TOKEN
Log in with environment variables
Alternatively, you can always provide credentials to your scraper by setting the APIFY_TOKEN environment variable to your API token.
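As a rough illustration of the environment-variable approach, a local script could check that the token is present before starting a scraper. This is a minimal sketch: `readApifyToken` is a hypothetical helper written for this example, not part of Crawlee or the Apify SDK, which read the variable on their own.

```javascript
// Hypothetical helper (not an SDK function): returns the API token from the
// given environment object, or throws when APIFY_TOKEN is missing or blank.
function readApifyToken(env = process.env) {
    const token = env.APIFY_TOKEN;
    if (typeof token !== 'string' || token.trim() === '') {
        throw new Error('APIFY_TOKEN is not set; export it or run `apify login` first.');
    }
    return token.trim();
}

// Example check against a fake environment object instead of the real one.
console.log(readApifyToken({ APIFY_TOKEN: 'your_api_token' }));
```

Failing early like this tends to give a clearer error than a rejected API call later in the run.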
There's also the APIFY_PROXY_PASSWORD environment variable. Actor automatically infers that from your token, but it can be useful when you need to access proxies from a different account than your token represents.
Log in with Configuration
Another option is to use the Configuration instance and set your API token there.
import { Actor } from 'apify';
const sdk = new Actor({ token: 'your_api_token' });
What is an actor
When you deploy your script to the Apify platform, it becomes an actor. An actor is a serverless microservice that accepts an input and produces an output. It can run for a few seconds, hours or even infinitely. An actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset.
Actors can be shared in the Apify Store so that other people can use them. But don't worry, if you share your actor in the store and somebody uses it, it runs under their account, not yours.
Running an actor locally
First let's create a boilerplate of the new actor. You could use Apify CLI and just run:
apify create my-hello-world
The CLI will prompt you to select a project boilerplate template - let's pick "Hello world". The tool will create a directory called my-hello-world with Node.js project files. You can run the actor as follows:
cd my-hello-world
apify run
Running Crawlee code as an actor
To run Crawlee code as an actor on the Apify platform, you should either:
- use a combination of the Actor.init() and Actor.exit() functions;
- or wrap it in the Actor.main() function.
NOTE
- Adding Actor.init() and Actor.exit() to your code are the only two important things needed to run it on the Apify platform as an actor. Actor.init() is needed to initialize your actor (e.g. to set the correct storage implementation), while without Actor.exit() the process will simply never stop.
- Actor.main() is an alternative to Actor.init() and Actor.exit(), as it calls both behind the scenes.
Let's look at the CheerioCrawler example from the Quick Start guide:
Using Actor.main():
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';
await Actor.main(async () => {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $, enqueueLinks }) {
            const { url } = request;
            const title = $('title').text();
            console.log(`Title of ${url}: ${title}`);
            await enqueueLinks({
                globs: ['https://www.iana.org/*'],
            });
            await Actor.pushData({ url, title });
        },
    });
    await crawler.run(['https://www.iana.org/']);
});
Using Actor.init() and Actor.exit():
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';
await Actor.init();
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const { url } = request;
        const title = $('title').text();
        console.log(`Title of ${url}: ${title}`);
        await enqueueLinks({
            globs: ['https://www.iana.org/*'],
        });
        await Actor.pushData({ url, title });
    },
});
await crawler.run(['https://www.iana.org/']);
await Actor.exit();
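To make the relationship between the two variants more concrete, here is a simplified sketch of what a main()-style wrapper does: initialize, run the user function, then exit. It is not the SDK's actual implementation. `createFakeActor` and `runAsActor` are hypothetical names, and the fake actor only records calls, whereas the real Actor.exit() also terminates the process.

```javascript
// Hypothetical stand-in for the SDK's Actor object; it only records calls.
function createFakeActor() {
    const calls = [];
    return {
        calls,
        async init() { calls.push('init'); },
        async exit() { calls.push('exit'); },
    };
}

// Sketch of a main()-style wrapper: init first, run the user code,
// and call exit afterwards even if the user code throws.
async function runAsActor(actor, userFunc) {
    await actor.init();
    try {
        return await userFunc();
    } finally {
        await actor.exit();
    }
}

// Usage: the "crawl" step stands in for crawler.run().
const demo = createFakeActor();
runAsActor(demo, async () => { demo.calls.push('crawl'); })
    .then(() => console.log(demo.calls.join(' -> '))); // init -> crawl -> exit
```

The try/finally is a design choice of this sketch, so that the exit step cannot be skipped by an exception.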
Note that you could also run your actor (that is using Crawlee) locally with Apify CLI. You could start it via the following command in your project folder:
apify run
Deploying an actor to Apify platform
Now (assuming you are already logged in to your Apify account) you can easily deploy your code to the Apify platform by running:
apify push
Your script will be uploaded to and built on the Apify platform so that it can be run there. For more information, view the Apify Actor documentation.
Usage on Apify platform
You can also develop your actor in an online code editor directly on the platform (you'll need an Apify Account). Let's go to the Actors page in the app, click Create new and then go to the Source tab and start writing the code or paste one of the examples from the Examples section.
Storages
There are several things worth mentioning here.
Helper functions for default Key-Value Store and Dataset
To simplify access to the default storages, instead of using the helper functions of the respective storage classes, you could use:
- Actor.setValue(), Actor.getValue(), Actor.getInput() for Key-Value Store
- Actor.pushData() for Dataset
When you plan to use the platform storage while developing and running your actor locally, you should use Actor.openKeyValueStore(), Actor.openDataset() and Actor.openRequestQueue() to open the respective storage.
Each of these methods accepts OpenStorageOptions as a second argument, which has only one optional property: forceCloud. If it is set to true, cloud storage will be used instead of the folder on the local disk.
If you need to share a link to some file stored in a Key-Value Store on the Apify platform, you can use the getPublicUrl() method. It accepts only one parameter: key, the key of the item you want to share.
import { KeyValueStore } from 'apify';
const store = await KeyValueStore.open();
await store.setValue('your-file', { foo: 'bar' });
const url = store.getPublicUrl('your-file');
Exporting dataset data
When the Dataset is stored on the Apify platform, you can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. The datasets are displayed on the actor run details page and in the Storage section in the Apify Console. The actual data is exported using the Get dataset items Apify API endpoint. This way you can easily share the crawling results.
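The export described above comes down to requesting the Get dataset items endpoint with a chosen format. The sketch below only builds such a URL and makes no request. `buildDatasetExportUrl` is a hypothetical helper, and the endpoint path and `format` values reflect my understanding of the Apify API v2, so check them against the API reference before relying on them.

```javascript
// Format identifiers assumed to correspond to HTML, JSON, CSV, Excel, XML and RSS.
const EXPORT_FORMATS = ['html', 'json', 'csv', 'xlsx', 'xml', 'rss'];

// Hypothetical helper: builds a "Get dataset items" URL for the given dataset.
// Access to a private dataset would additionally need an API token, omitted here.
function buildDatasetExportUrl(datasetId, format = 'json') {
    if (!EXPORT_FORMATS.includes(format)) {
        throw new Error(`Unsupported export format: ${format}`);
    }
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    return url.toString();
}

// 'my-dataset-id' is a placeholder, not a real dataset.
console.log(buildDatasetExportUrl('my-dataset-id', 'csv'));
```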
The following are some additional environment variables specific to the Apify platform. More Crawlee-specific environment variables can be found in the Environment Variables guide.
note
It's important to notice that CRAWLEE_ environment variables don't need to be replaced with equivalent APIFY_ ones. Likewise, Crawlee understands APIFY_ environment variables after calling Actor.init() or when using Actor.main().
APIFY_TOKEN
The API token for your Apify account. It is used to access the Apify API, e.g. to access cloud storage or to run an actor on the Apify platform. You can find your API token on the Account Settings / Integrations page.
Combinations of APIFY_TOKEN and CRAWLEE_STORAGE_DIR
A description of the CRAWLEE_STORAGE_DIR environment variable can be found in the Environment Variables guide.
By combining the env vars in various ways, you can greatly influence the actor's behavior.
Env Vars                            | API | Storages
none OR CRAWLEE_STORAGE_DIR         | no  | local
APIFY_TOKEN                         | yes | Apify platform
APIFY_TOKEN AND CRAWLEE_STORAGE_DIR | yes | local + platform
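The table can also be read as a small decision function. The sketch below mirrors its three rows. `describeStorageSetup` is a hypothetical helper for illustration only; it does not show how Crawlee or the SDK resolve storages internally.

```javascript
// Hypothetical helper mirroring the table: given an environment object,
// report whether the API is usable and which storages are in play.
function describeStorageSetup(env = process.env) {
    const hasToken = Boolean(env.APIFY_TOKEN);
    const hasLocalDir = Boolean(env.CRAWLEE_STORAGE_DIR);
    if (hasToken && hasLocalDir) {
        return { api: true, storages: 'local + platform' };
    }
    if (hasToken) {
        return { api: true, storages: 'Apify platform' };
    }
    // Neither variable, or only CRAWLEE_STORAGE_DIR: everything stays local.
    return { api: false, storages: 'local' };
}

// Example with placeholder values rather than the real environment.
console.log(describeStorageSetup({ APIFY_TOKEN: 'your_api_token', CRAWLEE_STORAGE_DIR: './storage' }));
```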
When using both APIFY_TOKEN and CRAWLEE_STORAGE_DIR, you can use all the Apify platform features and your data will be stored locally by default. If you want to access platform storages, you can use the { forceCloud: true } option in their respective functions.
import { Actor } from 'apify';
import { Dataset } from 'crawlee';
const localDataset = await Actor.openDataset('my-local-data');
const remoteDataset = await Actor.openDataset('my-dataset', {
    forceCloud: true,
});
APIFY_PROXY_PASSWORD
Optional password to Apify Proxy for IP address rotation. Assuming an Apify account was already created, you can find the password on the Proxy page in the Apify Console. The password is automatically inferred using the APIFY_TOKEN env var, so in most cases, you don't need to touch it. You should use it when, for some reason, you need access to Apify Proxy but not to the Apify API, or when you need access to the proxy from a different account than your token represents.
In addition to your own proxy servers and proxy servers acquired from third-party providers used together with Crawlee, you can also rely on Apify Proxy for your scraping needs.
Apify Proxy
If you are already subscribed to Apify Proxy, you can start using it immediately in only a few lines of code (for local usage, you should first be logged in to your Apify account).
import { Actor } from 'apify';
const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl();
Note that unlike using your own proxies in Crawlee, you shouldn't use the constructor to create a ProxyConfiguration instance. For using Apify Proxy you should create an instance using the Actor.createProxyConfiguration() function instead.
With Apify Proxy, you can select specific proxy groups to use, or countries to connect from. This allows you to get better proxy performance after some initial research.
import { Actor } from 'apify';
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});
const proxyUrl = await proxyConfiguration.newUrl();
Now your crawlers will use only Residential proxies from the US. Note that you must first get access to a proxy group before you are able to use it. You can check proxy groups available to you in the proxy dashboard.
Apify Proxy vs. Own proxies
The ProxyConfiguration class covers both Apify Proxy and custom proxy URLs so that you can easily switch between proxy providers. However, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy. It's not a single proxy server, but an API endpoint that allows connection through millions of different IP addresses. So the class essentially has two modes: Apify Proxy or Own (third party) proxy.
The difference is easy to remember.
- If you're using your own proxies, create an instance with the ProxyConfiguration constructor function based on the provided ProxyConfigurationOptions.
- If you are planning to use Apify Proxy, create an instance using the Actor.createProxyConfiguration() function. ProxyConfigurationOptions.proxyUrls and ProxyConfigurationOptions.newUrlFunction enable use of your custom proxy URLs, whereas all the other options are there to configure Apify Proxy.
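One way to keep the two modes apart is to look at which options are present. The following sketch encodes that rule. `proxyMode` is a hypothetical helper written for this illustration, not a method of ProxyConfiguration, and it only inspects a plain options object.

```javascript
// Hypothetical helper: classifies a ProxyConfigurationOptions-like object.
// proxyUrls or newUrlFunction mean custom proxies; anything else is treated
// as configuration for Apify Proxy.
function proxyMode(options = {}) {
    const hasUrls = Array.isArray(options.proxyUrls) && options.proxyUrls.length > 0;
    const hasUrlFunction = typeof options.newUrlFunction === 'function';
    return hasUrls || hasUrlFunction ? 'own' : 'apify';
}

// The proxy URL below is a placeholder, not a working proxy.
console.log(proxyMode({ proxyUrls: ['http://proxy.example.com:8000'] })); // own
console.log(proxyMode({ groups: ['RESIDENTIAL'], countryCode: 'US' })); // apify
```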