ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It extracts web page data as LLM-friendly markdown or text.
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your ScrapFly API key",
    continue_on_failure=True,  # Skip pages that cannot be processed instead of raising
)

# Scrape the URLs and load the results as documents
documents = scrapfly_loader.load()
print(documents)
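Hard-coding the key is fine for a quick test. As a minimal sketch, the key could instead be read from an environment variable. The variable name `SCRAPFLY_API_KEY` and the helper `get_scrapfly_api_key` are assumptions made for illustration, not part of the loader's documented interface.

```python
import os


def get_scrapfly_api_key(env_var: str = "SCRAPFLY_API_KEY") -> str:
    """Return the API key stored in `env_var` (an assumed variable name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message rather than sending an empty key.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your ScrapFly API key."
        )
    return key
```

The returned value can then be passed as `api_key=get_scrapfly_api_key()` when constructing the loader.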
The ScrapflyLoader also accepts a ScrapeConfig object for customizing the scrape request. See the ScrapFly documentation for full details on the available features and their API parameters: https://scrapfly.io/docs/scrape-api/getting-started
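from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,  # Bypass anti-bot and scraping protection
    "render_js": True,  # Render JavaScript with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Proxy pool to route requests through
    "country": "us",  # Proxy location
    "auto_scroll": True,  # Scroll the page automatically
    "js": "",  # Custom JavaScript for the headless browser to run (none here)
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your ScrapFly API key",
    continue_on_failure=True,  # Skip pages that cannot be processed instead of raising
    scrape_config=scrapfly_scrape_config,  # Custom scrape request settings
    scrape_format="markdown",  # Result format: markdown or text
)

# Scrape the URLs and load the results as documents
documents = scrapfly_loader.load()
print(documents)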
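To inspect the result rather than printing the whole list, a small helper along these lines may help. It assumes each returned item exposes `page_content` and `metadata` like a LangChain `Document`, and that the page address is stored under a `url` metadata key. Both the helper name `summarize_documents` and that key are assumptions made for illustration.

```python
from typing import Any, Iterable, List, Tuple


def summarize_documents(
    documents: Iterable[Any], preview_chars: int = 80
) -> List[Tuple[str, int, str]]:
    """Return (source, content length, preview) for each document-like object."""
    rows = []
    for doc in documents:
        metadata = getattr(doc, "metadata", None) or {}
        # "url" is an assumed metadata key; fall back to a placeholder if absent.
        source = metadata.get("url", "<unknown>")
        text = getattr(doc, "page_content", "") or ""
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(text.split())[:preview_chars]
        rows.append((source, len(text), preview))
    return rows
```

Calling `summarize_documents(documents)` on the loader's output gives one row per scraped page, which is easier to scan than the raw list.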