Hyperbrowser is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping need, such as scraping a single page or crawling an entire site.
This notebook provides a quick overview for getting started with Hyperbrowser web tools.
For more information about Hyperbrowser, visit the Hyperbrowser website, or check out the Hyperbrowser docs.
Key Capabilities

Scrape

Hyperbrowser provides powerful scraping capabilities that allow you to extract data from any webpage. The scraping tool can convert web content into structured formats like markdown or HTML, making it easy to process and analyze the data.
Crawl

The crawling functionality enables you to navigate through multiple pages of a website automatically. You can set parameters like page limits to control how extensively the crawler explores the site, collecting data from each page it visits.
Extract

Hyperbrowser's extraction capabilities use AI to pull specific information from webpages according to your defined schema. This allows you to transform unstructured web content into structured data that matches your exact requirements.
Overview

Integration details

| Tool | Package | Local | Serializable | JS support |
| --- | --- | --- | --- | --- |
| Crawl Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
| Scrape Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
| Extract Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |

Setup

To access the Hyperbrowser web tools you'll need to install the langchain-hyperbrowser integration package, create a Hyperbrowser account, and get an API key.
Head to Hyperbrowser to sign up and generate an API key. Once you've done this, set the HYPERBROWSER_API_KEY environment variable:
export HYPERBROWSER_API_KEY=<your-api-key>
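If you prefer to set the key from inside Python (for example, in a notebook), a common pattern is to prompt for it and populate os.environ:

import getpass
import os

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("HYPERBROWSER_API_KEY"):
    os.environ["HYPERBROWSER_API_KEY"] = getpass.getpass("Hyperbrowser API key:\n")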
Installation
Install langchain-hyperbrowser.
%pip install -qU langchain-hyperbrowser
Instantiation

Crawl Tool
The HyperbrowserCrawlTool is a powerful tool that can crawl entire websites, starting from a given URL. It supports configurable page limits and scraping options.
from langchain_hyperbrowser import HyperbrowserCrawlTool
tool = HyperbrowserCrawlTool()
Scrape Tool
The HyperbrowserScrapeTool is a tool that can scrape content from web pages. It supports both markdown and HTML output formats, along with metadata extraction.
from langchain_hyperbrowser import HyperbrowserScrapeTool
tool = HyperbrowserScrapeTool()
Extract Tool

The HyperbrowserExtractTool is a powerful tool that uses AI to extract structured data from web pages. It can extract information based on predefined schemas.
from langchain_hyperbrowser import HyperbrowserExtractTool
tool = HyperbrowserExtractTool()
Invocation

Basic Usage

Crawl Tool
from langchain_hyperbrowser import HyperbrowserCrawlTool
result = HyperbrowserCrawlTool().invoke(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {"formats": ["markdown"]},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
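The tool returns a dict with data (a list of CrawledPage objects, as shown above) and error. A minimal sketch of iterating over the crawled pages, with attribute names taken from the output repr above:

for page in result["data"]:
    # Each CrawledPage carries the page metadata plus the requested formats
    print(page.metadata["url"])
    print(page.markdown)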
Scrape Tool
from langchain_hyperbrowser import HyperbrowserScrapeTool
result = HyperbrowserScrapeTool().invoke(
    {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
Extract Tool

from langchain_hyperbrowser import HyperbrowserExtractTool
from pydantic import BaseModel
class SimpleExtractionModel(BaseModel):
    title: str


result = HyperbrowserExtractTool().invoke(
    {
        "url": "https://example.com",
        "schema": SimpleExtractionModel,
    }
)
print(result)
{'data': {'title': 'Example Domain'}, 'error': None}
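The extract tool also supports a natural-language extraction_prompt (see Configuration Options below). A hedged sketch, assuming the prompt can be supplied in place of a schema:

result = HyperbrowserExtractTool().invoke(
    {
        "url": "https://example.com",
        # extraction_prompt is listed under Configuration Options; using it
        # without a schema is an assumption here
        "extraction_prompt": "Extract the page title and the main paragraph of text",
    }
)
print(result)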
With Custom Options

Crawl Tool with Custom Options
result = HyperbrowserCrawlTool().run(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
Scrape Tool with Custom Options
result = HyperbrowserScrapeTool().run(
    {
        "url": "https://example.com",
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html='<html><head>\n <title>Example Domain</title>\n\n <meta charset="utf-8">\n <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n <meta name="viewport" content="width=device-width, initial-scale=1">\n \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.</p>\n <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n\n\n</body></html>', markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
Extract Tool with Custom Options

from typing import List
from pydantic import BaseModel
class ProductSchema(BaseModel):
    title: str
    price: float


class ProductsSchema(BaseModel):
    products: List[ProductSchema]


result = HyperbrowserExtractTool().run(
    {
        "url": "https://dummyjson.com/products?limit=10",
        "schema": ProductsSchema,
        "session_options": {"use_proxy": True},
    }
)
print(result)
{'data': {'products': [{'price': 9.99, 'title': 'Essence Mascara Lash Princess'}, {'price': 19.99, 'title': 'Eyeshadow Palette with Mirror'}, {'price': 14.99, 'title': 'Powder Canister'}, {'price': 12.99, 'title': 'Red Lipstick'}, {'price': 8.99, 'title': 'Red Nail Polish'}, {'price': 49.99, 'title': 'Calvin Klein CK One'}, {'price': 129.99, 'title': 'Chanel Coco Noir Eau De'}, {'price': 89.99, 'title': "Dior J'adore"}, {'price': 69.99, 'title': 'Dolce Shine Eau de'}, {'price': 79.99, 'title': 'Gucci Bloom Eau de'}]}, 'error': None}
Async Usage
All tools support async usage:
from typing import List
from langchain_hyperbrowser import (
    HyperbrowserCrawlTool,
    HyperbrowserExtractTool,
    HyperbrowserScrapeTool,
)
from pydantic import BaseModel
class ExtractionSchema(BaseModel):
    popular_library_name: List[str]


async def web_operations():
    # Crawl
    crawl_tool = HyperbrowserCrawlTool()
    crawl_result = await crawl_tool.arun(
        {
            "url": "https://example.com",
            "max_pages": 5,
            "scrape_options": {"formats": ["markdown"]},
        }
    )

    # Scrape
    scrape_tool = HyperbrowserScrapeTool()
    scrape_result = await scrape_tool.arun(
        {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
    )

    # Extract
    extract_tool = HyperbrowserExtractTool()
    extract_result = await extract_tool.arun(
        {
            "url": "https://npmjs.com",
            "schema": ExtractionSchema,
        }
    )

    return crawl_result, scrape_result, extract_result
results = await web_operations()
print(results)
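The top-level await above relies on the notebook's event loop; in a standalone script you would drive the coroutine with asyncio.run instead:

import asyncio

# Run the async pipeline outside a notebook
results = asyncio.run(web_operations())
print(results)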
Use within an agent
Here's how to use any of the web tools within an agent:
from langchain_hyperbrowser import HyperbrowserCrawlTool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
crawl_tool = HyperbrowserCrawlTool()
llm = ChatOpenAI(temperature=0)
agent = create_react_agent(llm, [crawl_tool])
user_input = "Crawl https://example.com and get content from up to 5 pages"
for step in agent.stream(
    {"messages": user_input},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()
================================ Human Message =================================

Crawl https://example.com and get content from up to 5 pages
================================== Ai Message ==================================
Tool Calls:
hyperbrowser_crawl_data (call_G2ofdHOqjdnJUZu4hhbuga58)
Call ID: call_G2ofdHOqjdnJUZu4hhbuga58
Args:
url: https://example.com
max_pages: 5
scrape_options: {'formats': ['markdown']}
================================= Tool Message =================================
Name: hyperbrowser_crawl_data
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
================================== Ai Message ==================================
I have crawled the website [https://example.com](https://example.com) and retrieved content from the first page. Here is the content in markdown format:
```
Example Domain
# Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
```
If you would like to crawl more pages or need additional information, please let me know!
Configuration Options

Common Options

All tools support these basic configuration options:

- url: The URL to process
- session_options: Browser session configuration
  - use_proxy: Whether to use a proxy
  - solve_captchas: Whether to automatically solve CAPTCHAs
  - accept_cookies: Whether to accept cookies

Crawl Tool Options

- max_pages: Maximum number of pages to crawl
- scrape_options: Options for scraping each page
  - formats: List of output formats (markdown, html)

Scrape Tool Options

- scrape_options: Options for scraping the page
  - formats: List of output formats (markdown, html)

Extract Tool Options

- schema: Pydantic model defining the structure to extract
- extraction_prompt: Natural language prompt for extraction

For more details, see the respective API references.
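Before heading to the API references, here is a combined sketch of these options using the scrape tool. The accept_cookies flag comes from the list above; the rest mirrors the earlier examples.

from langchain_hyperbrowser import HyperbrowserScrapeTool

result = HyperbrowserScrapeTool().invoke(
    {
        "url": "https://example.com",
        "scrape_options": {"formats": ["markdown", "html"]},
        "session_options": {
            "use_proxy": True,
            "solve_captchas": True,
            "accept_cookies": True,
        },
    }
)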