Showing content from https://docs.apify.com/sdk/python/docs/concepts/storages below:

Working with storages | SDK for Python

Working with storages

The Actor class provides methods to work either with the default storages of the Actor, or with any other storage, named or unnamed.

Types of storages

There are three types of storages available to Actors.

First are datasets, which are append-only tables for storing the results of your Actors. You can open a dataset through the Actor.open_dataset method, and work with it through the resulting Dataset class instance.

Next there are key-value stores, which function as a read/write storage for storing file-like objects, typically the Actor state or binary results. You can open a key-value store through the Actor.open_key_value_store method, and work with it through the resulting KeyValueStore class instance.

Finally, there are request queues. These are queues into which you can put the URLs you want to scrape, and from which the Actor can dequeue them and process them. You can open a request queue through the Actor.open_request_queue method, and work with it through the resulting RequestQueue class instance.

Each Actor run has its default dataset, default key-value store and default request queue.

Local storage emulation

To be able to develop Actors locally, the storages that the Apify platform provides are emulated on the local filesystem.

The storage contents are loaded from and saved to the storage folder in the Actor's main folder. Each storage type is stored in its own subfolder, so for example datasets are stored in the storage/datasets folder.

Each storage is then stored in its own folder, named after the storage, or called default if it's the default storage. For example, a request queue with the name my-queue would be stored in storage/request_queues/my-queue.

Each dataset item, key-value store record, or request in a request queue is then stored in its own file in the storage folder. Dataset items and request queue requests are always JSON files, while key-value store records can be of any file type, depending on their content type. For example, the Actor input is typically stored in storage/key_value_stores/default/INPUT.json.
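The folder layout described above can be sketched with a small helper. This is only an illustration of the naming rules in this section: local_storage_dir is a hypothetical function, not part of the SDK, and the SDK resolves these paths on its own.

```python
from pathlib import Path
from typing import Optional

# Subfolder names taken from the paths mentioned in this section.
STORAGE_SUBFOLDERS = {'datasets', 'key_value_stores', 'request_queues'}


def local_storage_dir(subfolder: str, name: Optional[str] = None, root: str = 'storage') -> Path:
    """Return the folder where a locally emulated storage would live.

    Hypothetical helper for illustration only. An unnamed (default)
    storage falls back to the 'default' folder.
    """
    if subfolder not in STORAGE_SUBFOLDERS:
        raise ValueError(f'Unknown storage subfolder: {subfolder}')
    return Path(root) / subfolder / (name or 'default')


# A request queue named 'my-queue'
print(local_storage_dir('request_queues', 'my-queue').as_posix())

# The Actor input record in the default key-value store
print((local_storage_dir('key_value_stores') / 'INPUT.json').as_posix())
```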

Local Actor run with remote storage

When developing locally, opening any storage uses local storage by default. To use remote storage instead, pass the force_cloud=True argument to Actor.open_dataset, Actor.open_request_queue or Actor.open_key_value_store. Using this argument selectively allows you to work with both local and remote storages in the same run.

A typical use case for the force_cloud=True argument is calling another remote Actor and then opening its default storages to access the results.
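As a minimal sketch of this use case, the function below opens a dataset on the platform by its ID and reads a few items from it. The function name read_remote_items and the dataset_id parameter are hypothetical. The Actor is passed in as a parameter only so that the function can be exercised with a stand-in object; in a real Actor you would pass the Actor class imported from apify.

```python
from typing import Any


async def read_remote_items(actor: Any, dataset_id: str, limit: int = 10) -> list:
    """Read up to `limit` items from a dataset stored on the Apify platform.

    `actor` is expected to be the `Actor` class from the `apify` package.
    `dataset_id` is assumed to be the ID of an existing remote dataset,
    for example the default dataset of another Actor's run.
    """
    # force_cloud=True opens the remote storage even during a local run
    dataset = await actor.open_dataset(id=dataset_id, force_cloud=True)

    # get_data returns a page object whose `items` attribute holds the records
    page = await dataset.get_data(limit=limit)
    return page.items
```

Inside an `async with Actor:` block this could be called as `items = await read_remote_items(Actor, dataset_id)`, where dataset_id is whatever remote dataset ID you want to read from.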

Local storage persistence

By default, the storage contents are persisted across multiple Actor runs. To clean up the Actor storages before running the Actor, use the --purge flag of the apify run command of the Apify CLI.

Convenience methods for working with default storages

There are several methods for directly working with the default key-value store or default dataset of the Actor.

Actor.get_value('my-record') reads a record from the default key-value store of the Actor.
Actor.set_value('my-record', 'my-value') saves a new value to the record in the default key-value store.
Actor.get_input reads the Actor input from the default key-value store of the Actor.
Actor.push_data([{'result': 'Hello, world!'}, ...]) saves results to the default dataset of the Actor.
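The four convenience methods listed above can be combined as in the following sketch. The function name greet, the 'name' input field and the 'last-greeting' record key are made up for illustration. The Actor is passed in as a parameter only so that the function can be tried with a stand-in object; in a real Actor you would pass the Actor class imported from apify.

```python
from typing import Any


async def greet(actor: Any) -> str:
    """Read the input, store a record, read it back, and push a result.

    `actor` is expected to be the `Actor` class from the `apify` package.
    """
    # Read the Actor input from the default key-value store
    actor_input = await actor.get_input() or {}
    name = actor_input.get('name', 'world')  # 'name' is a hypothetical input field

    # Save a record to the default key-value store and read it back
    greeting = f'Hello, {name}!'
    await actor.set_value('last-greeting', greeting)
    stored = await actor.get_value('last-greeting')

    # Save the result to the default dataset
    await actor.push_data([{'result': stored}])
    return stored
```

Inside an `async with Actor:` block this could be called as `await greet(Actor)`.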

Opening named and unnamed storages

The Actor.open_dataset, Actor.open_key_value_store and Actor.open_request_queue methods can be used to open any storage for reading and writing. You can either use them without arguments to open the default storages, or you can pass a storage ID or name to open another storage.

from apify import Actor, Request


async def main() -> None:
    async with Actor:
        # Work with the default dataset of the Actor
        dataset = await Actor.open_dataset()
        await dataset.push_data({'result': 'Hello, world!'})

        # Work with the key-value store with the ID 'mIJVZsRQrDQf4rUAf'
        key_value_store = await Actor.open_key_value_store(id='mIJVZsRQrDQf4rUAf')
        await key_value_store.set_value('record', 'Hello, world!')

        # Work with the request queue with the name 'my-queue'
        request_queue = await Actor.open_request_queue(name='my-queue')
        await request_queue.add_request(Request.from_url('https://apify.com'))

Deleting storages

To delete a storage, you can use the Dataset.drop, KeyValueStore.drop or RequestQueue.drop methods.

from apify import Actor


async def main() -> None:
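    async with Actor:
        # Open a key-value store with the name 'my-cool-store'
        key_value_store = await Actor.open_key_value_store(name='my-cool-store')
        await key_value_store.set_value('record', 'Hello, world!')

        # Do something with the store...

        # Then drop the store once it is no longer needed
        await key_value_store.drop()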

Working with datasets

In this section we will show you how to work with datasets.

Reading & writing items

To write data into a dataset, you can use the Dataset.push_data method.

To read data from a dataset, you can use the Dataset.get_data method.

To get an iterator of the data, you can use the Dataset.iterate_items method.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a dataset and write some data into it
        dataset = await Actor.open_dataset(name='my-cool-dataset')
        await dataset.push_data([{'itemNo': i} for i in range(1000)])

        # Read the first half of the items
        first_half = await dataset.get_data(limit=500)
        Actor.log.info(f'The first half of items = {first_half.items}')

        # Iterate over the second half of the items
        second_half = [item async for item in dataset.iterate_items(offset=500)]
        Actor.log.info(f'The second half of items = {second_half}')

Exporting items

You can also export the dataset items into a key-value store, as either a CSV or a JSON record. The example below does this with the Dataset.export_to method, setting its content_type argument to 'csv' or 'json'.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a dataset and write some data into it
        dataset = await Actor.open_dataset(name='my-cool-dataset')
        await dataset.push_data([{'itemNo': i} for i in range(1000)])

        # Export the data as CSV into the key-value store
        await dataset.export_to(
            content_type='csv',
            key='data.csv',
            to_kvs_name='my-cool-key-value-store',
        )

        # Export the data as JSON into the same key-value store
        await dataset.export_to(
            content_type='json',
            key='data.json',
            to_kvs_name='my-cool-key-value-store',
        )

        # Read the exported records back from the key-value store
        store = await Actor.open_key_value_store(name='my-cool-key-value-store')

        csv_data = await store.get_value('data.csv')
        Actor.log.info(f'CSV data: {csv_data}')

        json_data = await store.get_value('data.json')
        Actor.log.info(f'JSON data: {json_data}')

Working with key-value stores

In this section we will show you how to work with key-value stores.

Reading and writing records

To read records from a key-value store, you can use the KeyValueStore.get_value method.

To write records into a key-value store, you can use the KeyValueStore.set_value method. You can set the content type of a record with the content_type argument. To delete a record, set its value to None.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named key-value store
        kvs = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Write some records; the last one sets its content type explicitly
        await kvs.set_value('automatic_text', 'abcd')
        await kvs.set_value('automatic_json', {'ab': 'cd'})
        await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv')

        # Read the records back
        automatic_text = await kvs.get_value('automatic_text')
        Actor.log.info(f'Automatic text: {automatic_text}')

        automatic_json = await kvs.get_value('automatic_json')
        Actor.log.info(f'Automatic JSON: {automatic_json}')

        explicit_csv = await kvs.get_value('explicit_csv')
        Actor.log.info(f'Explicit CSV: {explicit_csv}')

        # Delete a record by setting its value to None
        await kvs.set_value('automatic_text', None)

Iterating keys

To get an iterator of the key-value store record keys, you can use the KeyValueStore.iterate_keys method.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named key-value store
        kvs = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Write some records into the store
        await kvs.set_value('automatic_text', 'abcd')
        await kvs.set_value('automatic_json', {'ab': 'cd'})
        await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv')

        # Log the keys of all records in the store
        Actor.log.info('Records in store:')

        async for key, info in kvs.iterate_keys():
            Actor.log.info(f'key={key}, info={info}')

Public URLs of records

To get a publicly accessible URL of a key-value store record, you can use the KeyValueStore.get_public_url method.

from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a named key-value store
        store = await Actor.open_key_value_store(name='my-cool-key-value-store')

        # Get the public URL of the record with the key 'my_record'
        my_record_url = await store.get_public_url('my_record')
        Actor.log.info(f'URL of "my_record": {my_record_url}')

Working with request queues

In this section we will show you how to work with request queues.

Adding requests to a queue

To add a request into the queue, you can use the RequestQueue.add_request method.

You can use the forefront boolean argument to specify whether the request should go to the beginning of the queue, or to the end.

You can use the unique_key of the request to uniquely identify a request. If you try to add more requests with the same unique key, only the first one will be added.

Check out the Request class for more information on how to create requests and what properties they have.
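The effect of the forefront argument and of the unique key can be illustrated with a toy in-memory model. This is not how RequestQueue is implemented: ToyQueue is a hypothetical class that works with plain URL strings, and it assumes, as a simplification, that the unique key falls back to the URL when none is given.

```python
from collections import deque
from typing import Deque, Optional, Set


class ToyQueue:
    """In-memory stand-in that mimics only the ordering and deduplication rules."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self._seen: Set[str] = set()

    def add_request(self, url: str, *, unique_key: Optional[str] = None, forefront: bool = False) -> bool:
        """Add a URL and return True, or return False if its unique key was already added."""
        key = unique_key or url  # simplifying assumption: the key defaults to the URL
        if key in self._seen:
            return False  # only the first request with a given unique key is kept
        self._seen.add(key)
        if forefront:
            self._pending.appendleft(url)  # goes to the beginning of the queue
        else:
            self._pending.append(url)  # goes to the end of the queue
        return True

    def fetch_next_request(self) -> Optional[str]:
        """Return the next pending URL, or None if nothing is pending."""
        return self._pending.popleft() if self._pending else None


queue = ToyQueue()
queue.add_request('http://example.com/1')
queue.add_request('http://example.com/2')
queue.add_request('http://example.com/0', forefront=True)

# Same unique key as the first request, so this one is ignored
queue.add_request('http://different-example.com/1', unique_key='http://example.com/1')

print(queue.fetch_next_request())  # http://example.com/0
print(queue.fetch_next_request())  # http://example.com/1
print(queue.fetch_next_request())  # http://example.com/2
```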

Reading requests

To fetch the next request from the queue for processing, you can use the RequestQueue.fetch_next_request method.

To get info about a specific request from the queue, you can use the RequestQueue.get_request method.

Handling requests

To mark a request as handled, you can use the RequestQueue.mark_request_as_handled method.

To mark a request as not handled, so that it gets retried, you can use the RequestQueue.reclaim_request method.

To check if all the requests in the queue are handled, you can use the RequestQueue.is_finished method.

Full example

import asyncio
import random

from apify import Actor, Request

FAILURE_RATE = 0.3


async def main() -> None:
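    async with Actor:
        # Open the default request queue of the Actor
        queue = await Actor.open_request_queue()

        # Add some requests to the queue
        for i in range(1, 10):
            await queue.add_request(Request.from_url(f'http://example.com/{i}'))

        # Add a request to the beginning of the queue, so it is processed first
        await queue.add_request(Request.from_url('http://example.com/0'), forefront=True)

        # Add one more request and log the info returned about the operation
        add_request_info = await queue.add_request(
            Request.from_url('http://different-example.com/5')
        )
        Actor.log.info(f'Add request info: {add_request_info}')

        # Look up the request in the queue by the ID from the returned info
        processed_request = await queue.get_request(add_request_info.id)
        Actor.log.info(f'Processed request: {processed_request}')

        # Process the queue until all requests are handled
        while not await queue.is_finished():
            # Fetch the next unhandled request from the queue
            request = await queue.fetch_next_request()

            # No request may be available yet even though the queue
            # is not finished, so wait a moment and try again
            if request is None:
                await asyncio.sleep(1)
                continue

            Actor.log.info(f'Processing request {request.unique_key}...')
            Actor.log.info(f'Scraping URL {request.url}...')

            # Simulate the work, which randomly succeeds or fails
            await asyncio.sleep(1)
            if random.random() > FAILURE_RATE:
                # The request succeeded, mark it as handled
                Actor.log.info('Request successful.')
                await queue.mark_request_as_handled(request)
            else:
                # The request failed, reclaim it so that
                # it gets retried later
                Actor.log.warning('Request failed, will retry!')
                await queue.reclaim_request(request)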
