Showing content from https://www.geeksforgeeks.org/python/html5lib-and-lxml-parsers-in-python/ below:

html5lib and lxml parsers in Python

Last Updated : 11 Jul, 2025

Parsers in Python:

Parsing means breaking a blob of text down into smaller, meaningful parts. How the text is broken down depends on the rules that a particular parser defines. Parsers range from native string methods that process text line by line to libraries like html5lib, which can parse almost all the elements of an HTML document, breaking it down into tags and pieces that can be filtered for various use cases. The two parsers this article focuses on are html5lib and lxml. So, before diving into their pros, cons and differences, let's have an overview of both libraries.
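To make "breaking text down into tags and pieces" concrete, here is a minimal sketch that uses the standard library's html.parser module rather than either of the two parsers discussed in this article. The class name TagCollector and the sample markup are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag, end tag, and text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(("text", data.strip()))


collector = TagCollector()
collector.feed("<ul><li>first</li><li>second</li></ul>")
collector.close()

# Each event is one of the "smaller, meaningful parts" of the input.
for event in collector.events:
    print(event)
```

The event list is the raw material a tree-building parser such as html5lib or lxml would assemble into a document tree.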

html5lib:

A pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement.

lxml:

A Pythonic, mature binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API that is mostly compatible with, but superior to, the well-known ElementTree API.
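Because the lxml API is mostly compatible with ElementTree, simple code can often run against either one. The sketch below assumes nothing beyond the standard library: it prefers lxml.etree when it is installed and otherwise falls back to xml.etree.ElementTree. The sample XML is made up for illustration.

```python
try:
    from lxml import etree  # C-backed implementation, if installed
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback

# fromstring() and findall() exist in both implementations.
root = etree.fromstring("<ul><li>one</li><li>two</li></ul>")
items = [li.text for li in root.findall("li")]

print(root.tag, items)
```

Only the shared subset of the API is used here; lxml-specific features such as XSLT would not survive the fallback.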

Key point:

Since html5lib is a pure-Python library, it is an external Python dependency, while lxml, being a binding for C libraries, brings an external C dependency.
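Since both parsers are external dependencies, it can be useful to check whether they are installed before relying on them. The following is a minimal sketch using only the standard library; the helper name parser_available is hypothetical.

```python
import importlib.util


def parser_available(module_name):
    """Return True if the named top-level module can be imported."""
    return importlib.util.find_spec(module_name) is not None


# Check both parser libraries without actually importing them.
availability = {name: parser_available(name) for name in ("html5lib", "lxml")}

for name, installed in availability.items():
    print(f"{name}: {'installed' if installed else 'not installed'}")
```

If either one is missing, it is typically installed with pip, for example pip install html5lib lxml.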

Pros and Cons:

html5lib:

lxml:

Differences with BeautifulSoup:

To highlight how the two parsers differ in the way they build the tree when fixing a document that is not perfectly formed, we'll feed the same example to both parsers.

<li></p>
html5lib:
from bs4 import BeautifulSoup

soup_html5lib = BeautifulSoup("<li></p>", "html5lib")

print(soup_html5lib)
Output:
<html><head></head><body><li><p></p></li></body></html>

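What we find: html5lib does not discard the dangling </p> tag. It pairs it with an opening <p> tag inside the <li> element, and it also adds an empty <head> element.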

lxml:
from bs4 import BeautifulSoup

soup_lxml = BeautifulSoup("<li></p>", "lxml")

print(soup_lxml)
Output:
<html><body><li></li></body></html>

What we find: lxml drops the dangling </p> tag altogether and does not add a <head> element.

We can easily see how the two libraries differ in the final tree they build from the same document, and how much more complete the result produced by html5lib is.
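The two runs can also be combined into one side-by-side comparison. This is only a sketch: the helper name parse_with is hypothetical, and it assumes Beautiful Soup is installed along with at least one of the two parsers. Any parser that is missing is skipped rather than raising an error.

```python
def parse_with(markup, parsers=("html5lib", "lxml")):
    """Return {parser_name: repaired_markup} for each parser that is usable."""
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        return {}  # Beautiful Soup itself is not installed

    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            continue  # this parser library is not installed
    return results


# Feed the same malformed snippet to every available parser.
results = parse_with("<li></p>")
for name, repaired in results.items():
    print(f"{name:9} {repaired}")
```

Printing the repaired markup on adjacent lines makes it easier to spot where the trees diverge, such as the extra <head> and <p> elements in the html5lib result.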


