The Actor class provides methods to work either with the default storages of the Actor, or with any other storage, named or unnamed.
There are three types of storages available to Actors.
First are datasets, which are append-only tables for storing the results of your Actors. You can open a dataset through the Actor.open_dataset method, and work with it through the resulting Dataset class instance.
Next there are key-value stores, which function as a read/write storage for storing file-like objects, typically the Actor state or binary results. You can open a key-value store through the Actor.open_key_value_store method, and work with it through the resulting KeyValueStore class instance.
Finally, there are request queues. These are queues into which you can put the URLs you want to scrape, and from which the Actor can dequeue them and process them. You can open a request queue through the Actor.open_request_queue method, and work with it through the resulting RequestQueue class instance.
Each Actor run has its default dataset, default key-value store and default request queue.
Local storage emulation
So that you can develop Actors locally, the storages that the Apify platform provides are emulated on the local filesystem.
The storage contents are loaded from and saved to the storage folder in the Actor's main folder. Each storage type is stored in its own subfolder, so for example datasets are stored in the storage/datasets folder.
Each storage is then stored in its own folder, named after the storage, or called default if it's the default storage. For example, a request queue with the name my-queue would be stored in storage/request_queues/my-queue.
Each dataset item, key-value store record, or request in a request queue is then stored in its own file in the storage folder. Dataset items and request queue requests are always JSON files, while key-value store records can be any file type, based on their content type. For example, the Actor input is typically stored in storage/key_value_stores/default/INPUT.json.
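The folder layout described above can be sketched with a small path helper. This is only an illustration of the naming scheme from the text, not part of the SDK: the function name local_storage_dir and the STORAGE_SUBFOLDERS mapping are hypothetical, and the storage root is assumed to be a relative folder called storage.

```python
from pathlib import Path
from typing import Optional

# Subfolder used for each storage type, as named in the text above.
STORAGE_SUBFOLDERS = {
    'dataset': 'datasets',
    'key_value_store': 'key_value_stores',
    'request_queue': 'request_queues',
}


def local_storage_dir(storage_type: str, name: Optional[str] = None, root: str = 'storage') -> Path:
    """Return the folder where a storage is expected to be emulated locally.

    Hypothetical helper: unnamed (default) storages map to the 'default' folder.
    """
    return Path(root) / STORAGE_SUBFOLDERS[storage_type] / (name or 'default')


# A named request queue and the default key-value store's input record.
print(local_storage_dir('request_queue', 'my-queue').as_posix())
print((local_storage_dir('key_value_store') / 'INPUT.json').as_posix())
```

A helper like this can be handy in tests that want to inspect what a local run wrote to disk.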
When developing locally, opening any storage uses local storage by default. To use remote storage instead, pass the force_cloud=True argument to Actor.open_dataset, Actor.open_request_queue or Actor.open_key_value_store. Used appropriately, this argument lets you work with both local and remote storages.
A typical use case for the force_cloud=True argument is calling another Actor remotely and then opening that Actor's default storages.
By default, the storage contents are persisted across multiple Actor runs. To clean up the Actor storages before running the Actor, use the --purge flag of the apify run command of the Apify CLI.
There are several methods for directly working with the default key-value store or default dataset of the Actor.
Actor.get_value('my-record') reads a record from the default key-value store of the Actor.
Actor.set_value('my-record', 'my-value') saves a new value to the record in the default key-value store.
Actor.get_input reads the Actor input from the default key-value store of the Actor.
Actor.push_data([{'result': 'Hello, world!'}, ...]) saves results to the default dataset of the Actor.
The Actor.open_dataset, Actor.open_key_value_store and Actor.open_request_queue methods can be used to open any storage for reading and writing. You can either use them without arguments to open the default storages, or you can pass a storage ID or name to open another storage.
from apify import Actor, Request


async def main() -> None:
    async with Actor:
        # Open the default dataset of the Actor run.
        dataset = await Actor.open_dataset()
        await dataset.push_data({'result': 'Hello, world!'})

        # Open a key-value store by its ID.
        key_value_store = await Actor.open_key_value_store(id='mIJVZsRQrDQf4rUAf')
        await key_value_store.set_value('record', 'Hello, world!')

        # Open a request queue by its name.
        request_queue = await Actor.open_request_queue(name='my-queue')
        await request_queue.add_request(Request.from_url('https://apify.com'))
Deleting storages
To delete a storage, you can use the Dataset.drop, KeyValueStore.drop or RequestQueue.drop methods.
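from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named key-value store and write a record to it.
        key_value_store = await Actor.open_key_value_store(name='my-cool-store')
        await key_value_store.set_value('record', 'Hello, world!')

        # Delete the whole store, including the record written above.
        await key_value_store.drop()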
Working with datasets
In this section we will show you how to work with datasets.
Reading & writing items
To write data into a dataset, you can use the Dataset.push_data method.
To read data from a dataset, you can use the Dataset.get_data method.
To get an iterator of the data, you can use the Dataset.iterate_items method.
from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named dataset and write 1000 items into it.
        dataset = await Actor.open_dataset(name='my-cool-dataset')
        await dataset.push_data([{'itemNo': i} for i in range(1000)])

        # Read the first 500 items in a single call.
        first_half = await dataset.get_data(limit=500)
        Actor.log.info(f'The first half of items = {first_half.items}')

        # Iterate over the remaining items, skipping the first 500.
        second_half = [item async for item in dataset.iterate_items(offset=500)]
        Actor.log.info(f'The second half of items = {second_half}')
Exporting items
You can also export the dataset items into a key-value store, as either a CSV or a JSON record, using the Dataset.export_to method with content_type set to 'csv' or 'json'.
from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named dataset and fill it with some items.
        dataset = await Actor.open_dataset(name='my-cool-dataset')
        await dataset.push_data([{'itemNo': i} for i in range(1000)])

        # Export the items to a key-value store as a CSV record.
        await dataset.export_to(
            content_type='csv',
            key='data.csv',
            to_kvs_name='my-cool-key-value-store',
        )

        # Export the items to the same key-value store as a JSON record.
        await dataset.export_to(
            content_type='json',
            key='data.json',
            to_kvs_name='my-cool-key-value-store',
        )

        # Read the exported records back and log them.
        store = await Actor.open_key_value_store(name='my-cool-key-value-store')

        csv_data = await store.get_value('data.csv')
        Actor.log.info(f'CSV data: {csv_data}')

        json_data = await store.get_value('data.json')
        Actor.log.info(f'JSON data: {json_data}')
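To give a rough idea of what the two export formats contain, here is a standard-library sketch that serializes a few dataset-like items to CSV and to JSON. It does not use the SDK, and the exact formatting the SDK produces (delimiters, column order, whitespace) may differ; the helper names items_to_csv and items_to_json are hypothetical.

```python
import csv
import io
import json
from typing import Any, Dict, List


def items_to_csv(items: List[Dict[str, Any]]) -> str:
    """Serialize items to CSV text, using the first item's keys as the header."""
    if not items:
        return ''
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()), lineterminator='\n')
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


def items_to_json(items: List[Dict[str, Any]]) -> str:
    """Serialize items to a single JSON array."""
    return json.dumps(items)


# Same item shape as in the example above, just fewer of them.
items = [{'itemNo': i} for i in range(3)]
print(items_to_csv(items))
print(items_to_json(items))
```

The CSV form flattens every item into one row under a shared header, while the JSON form keeps each item as an object in an array.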
Working with key-value stores
In this section we will show you how to work with key-value stores.
Reading and writing records
To read records from a key-value store, you can use the KeyValueStore.get_value method.
To write records into a key-value store, you can use the KeyValueStore.set_value method. You can set the content type of a record with the content_type argument. To delete a record, set its value to None.
from apify import Actor


async def main() -> None:
    async with Actor:
        kvs = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Write records; the content type is set explicitly only for the CSV record.
        await kvs.set_value('automatic_text', 'abcd')
        await kvs.set_value('automatic_json', {'ab': 'cd'})
        await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv')

        # Read the records back.
        automatic_text = await kvs.get_value('automatic_text')
        Actor.log.info(f'Automatic text: {automatic_text}')

        automatic_json = await kvs.get_value('automatic_json')
        Actor.log.info(f'Automatic JSON: {automatic_json}')

        explicit_csv = await kvs.get_value('explicit_csv')
        Actor.log.info(f'Explicit CSV: {explicit_csv}')

        # Delete a record by setting its value to None.
        await kvs.set_value('automatic_text', None)
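The record names in the example above suggest that a content type is chosen automatically when content_type is not passed. The following standard-library sketch shows one plausible way such inference could work. The rules here are an assumption made for illustration (bytes are binary, strings are plain text, anything else is JSON), and encode_record is a hypothetical helper, not an SDK function.

```python
import json
from typing import Any, Optional, Tuple


def encode_record(value: Any, content_type: Optional[str] = None) -> Tuple[bytes, str]:
    """Serialize a value and pick a content type (assumed rules, not the SDK's code)."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value), content_type or 'application/octet-stream'
    if isinstance(value, str):
        return value.encode('utf-8'), content_type or 'text/plain; charset=utf-8'
    # Anything else is assumed to be JSON-serializable.
    return json.dumps(value).encode('utf-8'), content_type or 'application/json; charset=utf-8'


# The same three values as in the example above.
print(encode_record('abcd'))
print(encode_record({'ab': 'cd'}))
print(encode_record('a,b\nc,d', content_type='text/csv'))
```

An explicit content_type always wins in this sketch, which is why the CSV string is not treated as plain text.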
Iterating keys
To get an iterator of the key-value store record keys, you can use the KeyValueStore.iterate_keys method.
from apify import Actor


async def main() -> None:
    async with Actor:
        kvs = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Write a few records so there is something to iterate over.
        await kvs.set_value('automatic_text', 'abcd')
        await kvs.set_value('automatic_json', {'ab': 'cd'})
        await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv')

        # Log the key and info of every record in the store.
        Actor.log.info('Records in store:')
        async for key, info in kvs.iterate_keys():
            Actor.log.info(f'key={key}, info={info}')
Public URLs of records
To get a publicly accessible URL of a key-value store record, you can use the KeyValueStore.get_public_url method.
from apify import Actor


async def main() -> None:
    async with Actor:
        store = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Get the public URL of the record stored under the key 'my_record'.
        my_record_url = await store.get_public_url('my_record')
        Actor.log.info(f'URL of "my_record": {my_record_url}')
Working with request queues
In this section we will show you how to work with request queues.
Adding requests to a queue
To add a request into the queue, you can use the RequestQueue.add_request method.
You can use the forefront boolean argument to specify whether the request should go to the beginning of the queue, or to the end.
You can use the unique_key of the request to uniquely identify a request. If you try to add more requests with the same unique key, only the first one will be added.
Check out the Request class for more information on how to create requests and what properties they have.
To fetch the next request from the queue for processing, you can use the RequestQueue.fetch_next_request method.
To get info about a specific request from the queue, you can use the RequestQueue.get_request method.
To mark a request as handled, you can use the RequestQueue.mark_request_as_handled method.
To mark a request as not handled, so that it gets retried, you can use the RequestQueue.reclaim_request method.
To check if all the requests in the queue are handled, you can use the RequestQueue.is_finished method.
import asyncio
import random

from apify import Actor, Request

# Fraction of requests that the simulated scraping below treats as failed.
FAILURE_RATE = 0.3


async def main() -> None:
    async with Actor:
        # Open the default request queue of the Actor run.
        queue = await Actor.open_request_queue()

        # Add some requests to the end of the queue.
        for i in range(1, 10):
            await queue.add_request(Request.from_url(f'http://example.com/{i}'))

        # Add a request to the beginning of the queue.
        await queue.add_request(Request.from_url('http://example.com/0'), forefront=True)

        # Adding a request returns info about the operation.
        add_request_info = await queue.add_request(
            Request.from_url('http://different-example.com/5')
        )
        Actor.log.info(f'Add request info: {add_request_info}')

        # Look up a specific request in the queue.
        processed_request = await queue.get_request(add_request_info.id)
        Actor.log.info(f'Processed request: {processed_request}')

        # Process requests until the whole queue is handled.
        while not await queue.is_finished():
            request = await queue.fetch_next_request()
            # No request is available right now, so wait and try again.
            if request is None:
                await asyncio.sleep(1)
                continue

            Actor.log.info(f'Processing request {request.unique_key}...')
            Actor.log.info(f'Scraping URL {request.url}...')

            # Simulate scraping that randomly succeeds or fails.
            await asyncio.sleep(1)
            if random.random() > FAILURE_RATE:
                Actor.log.info('Request successful.')
                await queue.mark_request_as_handled(request)
            else:
                # Return the request to the queue so that it gets retried.
                Actor.log.warning('Request failed, will retry!')
                await queue.reclaim_request(request)
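The queue semantics described in this section (deduplication by unique key, the forefront flag, and the fetch / handle / reclaim cycle) can be modelled with a small in-memory class. This is a toy model for building intuition only: ToyRequestQueue is hypothetical, it tracks plain unique-key strings instead of Request objects, and it assumes a reclaimed request goes back to the end of the queue.

```python
from collections import deque
from typing import Deque, Optional, Set


class ToyRequestQueue:
    """In-memory model of the semantics described above (not the SDK's RequestQueue)."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()  # unique keys waiting to be fetched
        self._in_progress: Set[str] = set()  # fetched but not yet handled
        self._seen: Set[str] = set()  # every unique key ever added

    def add_request(self, unique_key: str, *, forefront: bool = False) -> bool:
        """Add a request; return False if the unique key was already added."""
        if unique_key in self._seen:
            return False
        self._seen.add(unique_key)
        if forefront:
            self._pending.appendleft(unique_key)
        else:
            self._pending.append(unique_key)
        return True

    def fetch_next_request(self) -> Optional[str]:
        """Take the next pending request, or return None if nothing is pending."""
        if not self._pending:
            return None
        unique_key = self._pending.popleft()
        self._in_progress.add(unique_key)
        return unique_key

    def mark_request_as_handled(self, unique_key: str) -> None:
        self._in_progress.discard(unique_key)

    def reclaim_request(self, unique_key: str) -> None:
        """Return a request to the queue so it is retried (assumed: at the end)."""
        self._in_progress.discard(unique_key)
        self._pending.append(unique_key)

    def is_finished(self) -> bool:
        return not self._pending and not self._in_progress


queue = ToyRequestQueue()
queue.add_request('a')
queue.add_request('b')
queue.add_request('c', forefront=True)  # jumps ahead of 'a' and 'b'
duplicate_added = queue.add_request('a')  # ignored, 'a' was already added

first = queue.fetch_next_request()  # 'c', because it was added to the forefront
queue.reclaim_request(first)  # pretend processing failed

handled_order = []
while not queue.is_finished():
    key = queue.fetch_next_request()
    if key is None:
        break
    handled_order.append(key)
    queue.mark_request_as_handled(key)

print(duplicate_added, handled_order)
```

Note that a queue counts as finished only when nothing is pending and nothing is still in progress, which is why the processing loop in the example above keeps polling while fetch_next_request returns None.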