RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://python.langchain.com/docs/how_to/document_loader_html/ below:

How to load HTML | 🦜️🔗 LangChain

How to load HTML

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.

This covers how to load HTML documents into a LangChain Document objects that we can use downstream.

Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as Azure AI Document Intelligence or FireCrawl.

Loading HTML with Unstructured

%pip install unstructured

from langchain_community.document_loaders import UnstructuredHTMLLoader

file_path = "../../docs/integrations/document_loaders/example_data/fake-content.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html'})]

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader. This will extract the text from the HTML into page_content, and the page title as title into metadata.

from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4