Using Selenium | SDK for Python

Selenium is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would.

To create Actors that use Selenium, start from the Selenium & Python Actor template.

On the Apify platform, the Actor will already have Selenium and the necessary browsers preinstalled in its Docker image, including the tools and setup needed to run browsers in headful mode.

When running the Actor locally, you'll need to install the Selenium browser drivers yourself. Refer to the Selenium documentation for installation instructions.
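
To verify a local setup quickly, you can try launching a browser outside the Actor first. Here is a minimal sketch, assuming Selenium 4.6 or newer, whose bundled Selenium Manager downloads a matching driver automatically; on older versions you must install ChromeDriver yourself and put it on your PATH:

from selenium import webdriver

# Selenium Manager (bundled with Selenium 4.6+) resolves the driver automatically.
driver = webdriver.Chrome()
driver.get('http://www.example.com')
print(driver.title)  # Prints 'Example Domain' if the setup works.
driver.quit()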

This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input.

It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load.

from __future__ import annotations

import asyncio
from urllib.parse import urljoin

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.common.by import By

from apify import Actor, Request

# To run this Actor locally, you need to have the Selenium ChromeDriver installed.
# Follow the installation guide in the Selenium documentation.
# When running on the Apify platform, the driver is already included
# in the Actor's Docker image.


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        max_depth = actor_input.get('max_depth', 1)

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in actor input, exiting...')
            await Actor.exit()

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs with an initial crawl depth of 0.
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            new_request = Request.from_url(url, user_data={'depth': 0})
            await request_queue.add_request(new_request)

        # Launch a new Selenium Chrome WebDriver and configure it.
        Actor.log.info('Launching Chrome WebDriver...')
        chrome_options = ChromeOptions()

        if Actor.config.headless:
            chrome_options.add_argument('--headless')

        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        driver = webdriver.Chrome(options=chrome_options)

        # Test that the WebDriver works by opening an example page.
        driver.get('http://www.example.com')
        if driver.title != 'Example Domain':
            raise ValueError('Failed to open example page.')

        # Process the URLs from the request queue.
        while request := await request_queue.fetch_next_request():
            url = request.url

            if not isinstance(request.user_data['depth'], (str, int)):
                raise TypeError('Request.depth is an unexpected type.')

            depth = int(request.user_data['depth'])
            Actor.log.info(f'Scraping {url} (depth={depth}) ...')

            try:
                # Open the URL in the Selenium WebDriver. Selenium calls are
                # blocking, so run them in a thread to keep the event loop free.
                await asyncio.to_thread(driver.get, url)

                # If the current depth is below the maximum, enqueue the targets
                # of all anchor elements found on the page.
                if depth < max_depth:
                    for link in driver.find_elements(By.TAG_NAME, 'a'):
                        link_href = link.get_attribute('href')
                        link_url = urljoin(url, link_href)

                        if link_url.startswith(('http://', 'https://')):
                            Actor.log.info(f'Enqueuing {link_url} ...')
                            new_request = Request.from_url(
                                link_url,
                                user_data={'depth': depth + 1},
                            )
                            await request_queue.add_request(new_request)

                # Extract the desired data: the page URL and its title.
                data = {
                    'url': url,
                    'title': driver.title,
                }

                # Store the extracted data in the default dataset.
                await Actor.push_data(data)

            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')

            finally:
                # Mark the request as handled so it is not processed again.
                await request_queue.mark_request_as_handled(request)

        driver.quit()
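
For reference, the code above reads two input fields, start_urls and max_depth, and pushes one dataset item per scraped page. Here is a sketch of the shapes involved, written as Python literals; on the platform the input is supplied as JSON, and the values below are illustrative:

# What Actor.get_input() would return for this illustrative input:
actor_input = {
    'start_urls': [{'url': 'http://www.example.com'}],
    'max_depth': 1,
}

# One dataset item pushed by Actor.push_data() for each scraped page:
item = {
    'url': 'http://www.example.com',
    'title': 'Example Domain',
}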

In this guide you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!

