Showing content from https://python.langchain.com/docs/integrations/tools/hyperbrowser_web_scraping_tools below:

Hyperbrowser Web Scraping Tools | 🦜️🔗 LangChain

Hyperbrowser Web Scraping Tools

Hyperbrowser is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.

Key Features:

This notebook provides a quick overview for getting started with Hyperbrowser web tools.

For more information about Hyperbrowser, visit the Hyperbrowser website or read the Hyperbrowser docs.

Key Capabilities

Scrape

Hyperbrowser provides powerful scraping capabilities that allow you to extract data from any webpage. The scraping tool can convert web content into structured formats like markdown or HTML, making it easy to process and analyze the data.

Crawl

The crawling functionality enables you to navigate through multiple pages of a website automatically. You can set parameters like page limits to control how extensively the crawler explores the site, collecting data from each page it visits.

Extract

Hyperbrowser's extraction capabilities use AI to pull specific information from webpages according to a schema you define. This lets you transform unstructured web content into structured data that matches your requirements.

Overview

Integration details

Tool          Package                 Local  Serializable  JS support
Crawl Tool    langchain-hyperbrowser  ❌     ❌            ❌
Scrape Tool   langchain-hyperbrowser  ❌     ❌            ❌
Extract Tool  langchain-hyperbrowser  ❌     ❌            ❌

Setup

To access the Hyperbrowser web tools, install the langchain-hyperbrowser integration package, create a Hyperbrowser account, and get an API key.

Credentials

Head to Hyperbrowser to sign up and generate an API key. Once you've done this, set the HYPERBROWSER_API_KEY environment variable:

export HYPERBROWSER_API_KEY=<your-api-key>
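In a notebook, exporting a shell variable is not always convenient, so the key can also be set from Python. The following is a minimal sketch using only the standard library. `ensure_api_key` is a hypothetical helper name, not part of langchain-hyperbrowser, and it assumes only what is stated above: that the key is read from the HYPERBROWSER_API_KEY environment variable.

```python
import getpass
import os


def ensure_api_key(env=os.environ, prompt=getpass.getpass):
    """Return the Hyperbrowser API key, prompting for it if it is not set.

    Illustrative helper only; it is not part of langchain-hyperbrowser.
    """
    key = env.get("HYPERBROWSER_API_KEY", "").strip()
    if not key:
        # Ask interactively so the key never has to be written into the notebook.
        key = prompt("Enter your Hyperbrowser API key: ").strip()
        if not key:
            raise ValueError("HYPERBROWSER_API_KEY is required")
        env["HYPERBROWSER_API_KEY"] = key
    return key
```

Calling `ensure_api_key()` once before creating any tool leaves the variable set for the rest of the session.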
Installation

Install langchain-hyperbrowser.

%pip install -qU langchain-hyperbrowser
Instantiation

Crawl Tool

The HyperbrowserCrawlTool is a powerful tool that can crawl entire websites, starting from a given URL. It supports configurable page limits and scraping options.

from langchain_hyperbrowser import HyperbrowserCrawlTool
tool = HyperbrowserCrawlTool()
Scrape Tool

The HyperbrowserScrapeTool is a tool that can scrape content from web pages. It supports both markdown and HTML output formats, along with metadata extraction.

from langchain_hyperbrowser import HyperbrowserScrapeTool
tool = HyperbrowserScrapeTool()

Extract Tool

The HyperbrowserExtractTool is a powerful tool that uses AI to extract structured data from web pages. It can extract information based on predefined schemas.

from langchain_hyperbrowser import HyperbrowserExtractTool
tool = HyperbrowserExtractTool()
Invocation

Basic Usage

Crawl Tool

from langchain_hyperbrowser import HyperbrowserCrawlTool

result = HyperbrowserCrawlTool().invoke(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {"formats": ["markdown"]},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
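The crawl output above is a dict whose "data" entry is a list of page objects. As a sketch of how that result might be post-processed, the helper below maps each completed page's URL to its markdown. `collect_markdown` is a hypothetical name, the attribute names (`url`, `status`, `markdown`) are read off the printed output, and the `"failed"` status in the sample is an assumption made for illustration. Stand-in objects replace a real crawl so the block runs without an API key.

```python
from types import SimpleNamespace


def collect_markdown(result):
    """Map each successfully crawled URL to its markdown text.

    Assumes the shape printed above: {"data": [page, ...], "error": ...},
    where each page exposes `url`, `status`, and `markdown` attributes.
    """
    if result.get("error"):
        raise RuntimeError(f"crawl failed: {result['error']}")
    pages = {}
    for page in result.get("data") or []:
        # Skip pages that did not complete or that carry no markdown.
        if getattr(page, "status", None) == "completed" and getattr(page, "markdown", None):
            pages[page.url] = page.markdown
    return pages


# Stand-in for a real crawl result; the "failed" status value is hypothetical.
sample = {
    "data": [
        SimpleNamespace(url="https://example.com", status="completed",
                        markdown="# Example Domain", error=None),
        SimpleNamespace(url="https://example.com/missing", status="failed",
                        markdown=None, error="not found"),
    ],
    "error": None,
}
print(collect_markdown(sample))  # {'https://example.com': '# Example Domain'}
```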
Scrape Tool
from langchain_hyperbrowser import HyperbrowserScrapeTool

result = HyperbrowserScrapeTool().invoke(
    {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
Extract Tool

from langchain_hyperbrowser import HyperbrowserExtractTool
from pydantic import BaseModel


# Schema describing the fields to extract from the page
class SimpleExtractionModel(BaseModel):
    title: str


result = HyperbrowserExtractTool().invoke(
    {
        "url": "https://example.com",
        "schema": SimpleExtractionModel,
    }
)
print(result)
{'data': {'title': 'Example Domain'}, 'error': None}
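In the output above, "data" comes back as a plain dict rather than as a SimpleExtractionModel instance. A minimal sketch of turning it into a typed object follows. It uses a standard-library dataclass as a stand-in so that it runs without pydantic. `ExtractedPage` and `to_typed` are hypothetical names, and with Pydantic the same dict could instead be passed to the model's constructor.

```python
from dataclasses import dataclass, fields


@dataclass
class ExtractedPage:
    """Stand-in mirroring SimpleExtractionModel (a single `title` field)."""

    title: str


def to_typed(result, cls=ExtractedPage):
    """Build `cls` from result["data"], ignoring keys that `cls` does not declare."""
    if result.get("error") is not None:
        raise RuntimeError(f"extraction failed: {result['error']}")
    wanted = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in result["data"].items() if k in wanted})


# The dict printed by the extract example above.
page = to_typed({"data": {"title": "Example Domain"}, "error": None})
print(page.title)  # Example Domain
```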
With Custom Options

Crawl Tool with Custom Options

result = HyperbrowserCrawlTool().run(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
Scrape Tool with Custom Options
result = HyperbrowserScrapeTool().run(
    {
        "url": "https://example.com",
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html='<html><head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8">\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n        \n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.</p>\n    <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n\n\n</body></html>', markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
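The custom-option inputs above share the same handful of keys, so they can be assembled in one place. The builder below is a sketch, not part of the package: `build_tool_input` is a hypothetical name, and it emits only keys that appear in the examples above (`url`, `max_pages`, `scrape_options`, `session_options`, `use_proxy`, `solve_captchas`). The validation rules are assumptions added for illustration.

```python
def build_tool_input(url, formats=("markdown",), max_pages=None,
                     use_proxy=False, solve_captchas=False):
    """Assemble an input dict using only the keys shown in the examples above."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    payload = {"url": url, "scrape_options": {"formats": list(formats)}}
    if max_pages is not None:
        # max_pages appears only in the crawl examples.
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        payload["max_pages"] = max_pages
    if use_proxy or solve_captchas:
        payload["session_options"] = {
            "use_proxy": use_proxy,
            "solve_captchas": solve_captchas,
        }
    return payload


crawl_input = build_tool_input(
    "https://example.com",
    formats=("markdown", "html"),
    max_pages=2,
    use_proxy=True,
    solve_captchas=True,
)
print(crawl_input)
```

The resulting dict has the same shape as the one passed to `HyperbrowserCrawlTool().run(...)` in the crawl example above.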
Extract Tool with Custom Options

from typing import List

from pydantic import BaseModel


class ProductSchema(BaseModel):
    title: str
    price: float


class ProductsSchema(BaseModel):
    products: List[ProductSchema]


result = HyperbrowserExtractTool().run(
    {
        "url": "https://dummyjson.com/products?limit=10",
        "schema": ProductsSchema,
        # Flags go directly under session_options, as in the crawl and scrape examples
        "session_options": {"use_proxy": True},
    }
)
print(result)
{'data': {'products': [{'price': 9.99, 'title': 'Essence Mascara Lash Princess'}, {'price': 19.99, 'title': 'Eyeshadow Palette with Mirror'}, {'price': 14.99, 'title': 'Powder Canister'}, {'price': 12.99, 'title': 'Red Lipstick'}, {'price': 8.99, 'title': 'Red Nail Polish'}, {'price': 49.99, 'title': 'Calvin Klein CK One'}, {'price': 129.99, 'title': 'Chanel Coco Noir Eau De'}, {'price': 89.99, 'title': "Dior J'adore"}, {'price': 69.99, 'title': 'Dolce Shine Eau de'}, {'price': 79.99, 'title': 'Gucci Bloom Eau de'}]}, 'error': None}
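Because the extracted "data" is ordinary Python data, it can be processed without further tooling. As a small illustration, the sketch below ranks a subset of the products printed above by price. `cheapest` is a hypothetical helper name.

```python
# A subset of the "data" dict printed above.
data = {
    "products": [
        {"price": 9.99, "title": "Essence Mascara Lash Princess"},
        {"price": 19.99, "title": "Eyeshadow Palette with Mirror"},
        {"price": 14.99, "title": "Powder Canister"},
        {"price": 12.99, "title": "Red Lipstick"},
        {"price": 8.99, "title": "Red Nail Polish"},
    ]
}


def cheapest(products, limit=3):
    """Return the titles of the `limit` lowest-priced products, cheapest first."""
    ranked = sorted(products, key=lambda p: p["price"])
    return [p["title"] for p in ranked[:limit]]


print(cheapest(data["products"]))
# ['Red Nail Polish', 'Essence Mascara Lash Princess', 'Red Lipstick']
```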
Async Usage

All tools support async usage:

from typing import List

from langchain_hyperbrowser import (
    HyperbrowserCrawlTool,
    HyperbrowserExtractTool,
    HyperbrowserScrapeTool,
)
from pydantic import BaseModel


class ExtractionSchema(BaseModel):
    popular_library_name: List[str]


async def web_operations():
    # Crawl
    crawl_tool = HyperbrowserCrawlTool()
    crawl_result = await crawl_tool.arun(
        {
            "url": "https://example.com",
            "max_pages": 5,
            "scrape_options": {"formats": ["markdown"]},
        }
    )

    # Scrape
    scrape_tool = HyperbrowserScrapeTool()
    scrape_result = await scrape_tool.arun(
        {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
    )

    # Extract
    extract_tool = HyperbrowserExtractTool()
    extract_result = await extract_tool.arun(
        {
            "url": "https://npmjs.com",
            "schema": ExtractionSchema,
        }
    )

    return crawl_result, scrape_result, extract_result


# Top-level await works in a notebook; in a script, use asyncio.run(web_operations())
results = await web_operations()
print(results)
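The function above awaits the three calls one after another. Since they do not depend on each other, they could also be issued concurrently with `asyncio.gather`. The sketch below shows that pattern with a stand-in coroutine, `fake_arun` (a hypothetical name), instead of the real tools, so that it runs without an API key. With the real tools, each argument to `gather` would be a `tool.arun({...})` call like those above.

```python
import asyncio


async def fake_arun(name, delay):
    """Stand-in for `tool.arun(...)`; sleeps instead of calling Hyperbrowser."""
    await asyncio.sleep(delay)
    return {"data": f"{name} finished", "error": None}


async def web_operations_concurrently():
    # gather() schedules all three coroutines at once and returns their
    # results in the order the coroutines were passed in.
    return await asyncio.gather(
        fake_arun("crawl", 0.03),
        fake_arun("scrape", 0.01),
        fake_arun("extract", 0.02),
    )


if __name__ == "__main__":
    # In a plain script, asyncio.run() replaces the notebook's top-level await.
    crawl_result, scrape_result, extract_result = asyncio.run(
        web_operations_concurrently()
    )
    print(crawl_result, scrape_result, extract_result)
```

Results come back in argument order regardless of which call finishes first, so the tuple unpacking matches the sequential version.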
Use within an agent

Here's how to use any of the web tools within an agent:

from langchain_hyperbrowser import HyperbrowserCrawlTool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


crawl_tool = HyperbrowserCrawlTool()


llm = ChatOpenAI(temperature=0)

agent = create_react_agent(llm, [crawl_tool])
user_input = "Crawl https://example.com and get content from up to 5 pages"
for step in agent.stream(
    {"messages": user_input},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()
================================ Human Message =================================

Crawl https://example.com and get content from up to 5 pages
================================== Ai Message ==================================
Tool Calls:
hyperbrowser_crawl_data (call_G2ofdHOqjdnJUZu4hhbuga58)
Call ID: call_G2ofdHOqjdnJUZu4hhbuga58
Args:
url: https://example.com
max_pages: 5
scrape_options: {'formats': ['markdown']}
================================= Tool Message =================================
Name: hyperbrowser_crawl_data

{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
================================== Ai Message ==================================

I have crawled the website [https://example.com](https://example.com) and retrieved content from the first page. Here is the content in markdown format:

\`\`\`
Example Domain

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)
\`\`\`

If you would like to crawl more pages or need additional information, please let me know!
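Configuration Options

Common Options

All tools support these basic configuration options, both of which appear in the examples above:

url: the URL to process
session_options: browser session settings, such as use_proxy and solve_captchas

Tool-Specific Options

Crawl Tool

max_pages: the maximum number of pages to crawl
scrape_options: output settings for each crawled page, such as formats

Scrape Tool

scrape_options: output settings for the scraped page, such as formats

Extract Tool

schema: the Pydantic model describing the data to extract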

For more details, see the respective API references:

API reference
