The HTMLRewriter class allows developers to build comprehensive and expressive HTML parsers inside of a Cloudflare Workers application. It can be thought of as a jQuery-like experience directly inside of your Workers application. Leaning on a powerful JavaScript API to parse and transform HTML, HTMLRewriter allows developers to build deeply functional applications.
The HTMLRewriter class should be instantiated once in your Workers script, with a number of handlers attached using the on and onDocument functions.
new HTMLRewriter()
  .on("*", new ElementHandler())
  .onDocument(new DocumentHandler());
Throughout the HTMLRewriter API, there are a few consistent types that many properties and methods use:
Content
string | Response | ReadableStream
Content inserted in the output stream should be a string, Response, or ReadableStream.
ContentOptions
Object
{ html: Boolean }
Controls the way the HTMLRewriter treats inserted content. If the html boolean is set to true, content is treated as raw HTML. If the html boolean is set to false or not provided, content will be treated as text and proper HTML escaping will be applied to it.
There are two handler types that can be used with HTMLRewriter: element handlers and document handlers.
An element handler responds to any incoming element, when attached using the .on function of an HTMLRewriter instance. The element handler should respond to element, comments, and text. The following example processes div elements with an ElementHandler class.
class ElementHandler {
  element(element) {
    // An incoming element, such as `div`
    console.log(`Incoming element: ${element.tagName}`);
  }
  comments(comment) {
    // An incoming comment
  }
  text(text) {
    // An incoming piece of text
  }
}
async function handleRequest(req) {
  const res = await fetch(req);
  return new HTMLRewriter().on("div", new ElementHandler()).transform(res);
}
A document handler represents the incoming HTML document. A number of functions can be defined on a document handler to query and manipulate a document's doctype, comments, text, and end. Unlike an element handler, a document handler's doctype, comments, text, and end functions are not scoped by a particular selector. A document handler's functions are called for all the content on the page, including the content outside of the top-level HTML tag:
class DocumentHandler {
  doctype(doctype) {
    // An incoming doctype, such as <!DOCTYPE html>
  }
  comments(comment) {
    // An incoming comment
  }
  text(text) {
    // An incoming piece of text
  }
  end(end) {
    // The end of the document
  }
}
All functions defined on both element and document handlers can return either void or a Promise<void>. Making your handler function async allows you to access external resources such as an API via fetch, Workers KV, Durable Objects, or the cache.
class UserElementHandler {
  async element(element) {
    // Workers have no base URL, so the request URL must be absolute.
    // https://example.com is a placeholder for your own API origin.
    let response = await fetch(new Request("https://example.com/user"));
    // fill in user info using response
  }
}
async function handleRequest(req) {
  const res = await fetch(req);
  // run the user element handler via HTMLRewriter on a div with ID `user_info`
  return new HTMLRewriter()
    .on("div#user_info", new UserElementHandler())
    .transform(res);
}
The element argument, used only in element handlers, is a representation of a DOM element. A number of methods exist on an element to query and manipulate it:
tagName
string
The name of the tag, such as "h1" or "div". This property can be assigned different values, to modify an element's tag.
attributes
Iterator read-only
A [name, value] pair of the tag's attributes.
removed
boolean
namespaceURI
String
getAttribute(name: string): string | null
Returns the value of the given attribute, or null if it is not found.
hasAttribute(name: string): boolean
setAttribute(name: string, value: string): Element
removeAttribute(name: string): Element
before(content: Content, contentOptions?: ContentOptions): Element
after(content: Content, contentOptions?: ContentOptions): Element
prepend(content: Content, contentOptions?: ContentOptions): Element
append(content: Content, contentOptions?: ContentOptions): Element
replace(content: Content, contentOptions?: ContentOptions): Element
setInnerContent(content: Content, contentOptions?: ContentOptions): Element
remove(): Element
removeAndKeepContent(): Element
onEndTag(handler: Function<void>): void
Refer to Global types for more information on Content and ContentOptions.
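As a minimal sketch of how several of these methods combine, the handler below adds attributes to absolute links and inserts content both as escaped text and as raw HTML. The class name ExternalLinkHandler, the attribute values, and the inserted strings are illustrative assumptions, not part of the API.

```javascript
// Hypothetical handler; the class name and inserted strings are illustrative only.
class ExternalLinkHandler {
  element(element) {
    const href = element.getAttribute("href");
    // getAttribute returns null when the attribute is not present.
    if (href === null || !/^https?:\/\//.test(href)) {
      return;
    }
    // setAttribute returns the element, so calls can be chained.
    element
      .setAttribute("target", "_blank")
      .setAttribute("rel", "noopener");
    // No ContentOptions: the content is escaped and inserted as text.
    element.append(" <external>");
    // { html: true }: the content is inserted as raw HTML.
    element.after("<!-- external link -->", { html: true });
  }
}

// In a Worker, this could be attached with:
// new HTMLRewriter().on("a", new ExternalLinkHandler()).transform(response);
```

Because the handler only touches the element argument it is given, it can be exercised outside a Worker with a small stand-in object that implements the same methods.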
The endTag argument, used only in handlers registered with element.onEndTag, is a limited representation of a DOM element.
name
string
The name of the tag, such as "h1" or "div". This property can be assigned different values, to modify an element's tag.
before(content: Content, contentOptions?: ContentOptions): EndTag
after(content: Content, contentOptions?: ContentOptions): EndTag
remove(): EndTag
Refer to Global types for more information on Content and ContentOptions.
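The sketch below shows one way an end tag handler might be registered from inside an element handler, so that content can be placed at both ends of an element. The class name SectionMarker and the marker comments are hypothetical.

```javascript
// Hypothetical handler; the class name and marker strings are illustrative only.
class SectionMarker {
  element(element) {
    // prepend places content right after the element's start tag.
    element.prepend("<!-- section start -->", { html: true });
    // The end tag arrives later in the stream, so a callback is registered for it.
    element.onEndTag((endTag) => {
      // before places content right before the end tag.
      endTag.before("<!-- section end -->", { html: true });
    });
  }
}

// In a Worker, this could be attached with:
// new HTMLRewriter().on("section", new SectionMarker()).transform(response);
```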
Since Cloudflare performs zero-copy streaming parsing, text chunks are not the same thing as text nodes in the lexical tree. A lexical tree text node can be represented by multiple chunks, as they arrive over the wire from the origin.
Consider the following markup: <div>Hey. How are you?</div>. It is possible that the Workers script will not receive the entire text node from the origin at once; instead, the text element handler will be invoked for each received part of the text node. For example, the handler might be invoked with "Hey. How ", then "are you?". When the last chunk arrives, the text's lastInTextNode property will be set to true. Developers should make sure to concatenate these chunks together.
removed
boolean
text
string read-only
lastInTextNode
boolean read-only
before(content: Content, contentOptions?: ContentOptions): Element
after(content: Content, contentOptions?: ContentOptions): Element
replace(content: Content, contentOptions?: ContentOptions): Element
remove(): Element
Refer to Global types for more information on Content and ContentOptions.
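One way to concatenate chunks, sketched here under the assumption that the whole text node should be rewritten at once, is to buffer each chunk, remove the partial chunks from the output, and replace the final chunk with the transformed buffer. The class name TextNodeRewriter and its transform callback are hypothetical.

```javascript
// Hypothetical handler; buffers text chunks until the text node is complete.
class TextNodeRewriter {
  constructor(transform) {
    this.transform = transform; // function applied to the full text of a node
    this.buffer = "";
  }

  text(text) {
    this.buffer += text.text;
    if (text.lastInTextNode) {
      // The node is complete: emit the transformed text in place of the last chunk.
      text.replace(this.transform(this.buffer));
      this.buffer = "";
    } else {
      // Drop the partial chunk; its content is held in the buffer.
      text.remove();
    }
  }
}

// In a Worker, this could be attached with:
// new HTMLRewriter()
//   .on("div", new TextNodeRewriter((s) => s.replace("Hey", "Hello")))
//   .transform(response);
```

Since replace is called without ContentOptions, the transformed text is escaped and inserted as text rather than as HTML.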
The comments function on an element handler allows developers to query and manipulate HTML comment tags.
class ElementHandler {
  comments(comment) {
    // An incoming comment element, such as <!-- My comment -->
  }
}
comment.removed
boolean
comment.text
string
before(content: Content, contentOptions?: ContentOptions): Element
after(content: Content, contentOptions?: ContentOptions): Element
replace(content: Content, contentOptions?: ContentOptions): Element
remove(): Element
Refer to Global types for more information on Content and ContentOptions.
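As a small illustration of the comment API, the handler below removes comments from the output unless they start with a marker. The class name CommentStripper and the "keep:" prefix are assumed conventions invented for this sketch.

```javascript
// Hypothetical handler; the "keep:" prefix is an invented convention.
class CommentStripper {
  comments(comment) {
    // Skip comments that an earlier handler already removed.
    if (comment.removed) {
      return;
    }
    // comment.text holds the text between <!-- and -->, including whitespace.
    if (comment.text.trim().startsWith("keep:")) {
      return;
    }
    comment.remove();
  }
}

// In a Worker, this could be attached with:
// new HTMLRewriter().on("*", new CommentStripper()).transform(response);
```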
The doctype function on a document handler allows developers to query a document's doctype.
class DocumentHandler {
  doctype(doctype) {
    // An incoming doctype element, such as
    // <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
  }
}
doctype.name
string | null read-only
doctype.publicId
string | null read-only
doctype.systemId
string | null read-only
The quoted string in the doctype after the SYSTEM atom or immediately after the publicId.
The end function on a document handler allows developers to append content to the end of a document.
class DocumentHandler {
  end(end) {
    // The end of the document
  }
}
append(content: Content, contentOptions?: ContentOptions): DocumentEnd
Refer to Global types for more information on Content and ContentOptions.
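The following sketch shows one possible use of end: a single object acts as both an element handler and a document handler, counting matched elements and appending a summary comment once the document is finished. The class name ImageCounter and the comment format are hypothetical.

```javascript
// Hypothetical handler; counts matched elements and reports the total at the end.
class ImageCounter {
  constructor() {
    this.count = 0;
  }

  // Called for each element matched by the selector passed to .on().
  element(element) {
    this.count += 1;
  }

  // Called once, after the whole document has been processed.
  end(end) {
    end.append(`<!-- images: ${this.count} -->`, { html: true });
  }
}

// In a Worker, a fresh counter could be created for each response:
// const counter = new ImageCounter();
// new HTMLRewriter().on("img", counter).onDocument(counter).transform(response);
```

The counter holds per-document state, so this sketch assumes a new instance is created for every response rather than shared across requests.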
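Selectors determine which elements an element handler is applied to. The following selector types are supported:
*                 Any element.
E                 Any element of type E.
E:nth-child(n)    An E element that is the n-th child of its parent.
E:first-child     An E element that is the first child of its parent.
E:nth-of-type(n)  An E element that is the n-th sibling of its type.
E:first-of-type   An E element that is the first sibling of its type.
E:not(s)          An E element that does not match the compound selector s.
E.warning         An E element belonging to the class warning.
E#myid            An E element with ID equal to myid.
E[foo]            An E element with a foo attribute.
E[foo="bar"]      An E element whose foo attribute value is exactly equal to bar.
E[foo="bar" i]    An E element whose foo attribute value is equal to any (ASCII-range) case-permutation of bar.
E[foo="bar" s]    An E element whose foo attribute value is exactly and case-sensitively equal to bar.
E[foo~="bar"]     An E element whose foo attribute value is a whitespace-separated list of values, one of which is exactly equal to bar.
E[foo^="bar"]     An E element whose foo attribute value begins exactly with the string bar.
E[foo$="bar"]     An E element whose foo attribute value ends exactly with the string bar.
E[foo*="bar"]     An E element whose foo attribute value contains the substring bar.
E[foo|="en"]      An E element whose foo attribute value is a hyphen-separated list of values beginning with en.
E F               An F element descendant of an E element.
E > F             An F element child of an E element.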
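To illustrate how a few of these selectors might be used together, the sketch below registers three handlers on any object with a chainable on method, such as an HTMLRewriter instance. The function name registerHandlers, the menu class, and the specific rewrites are assumptions made for this example.

```javascript
// Hypothetical helper; accepts any object exposing a chainable .on(),
// such as an HTMLRewriter instance.
function registerHandlers(rewriter) {
  return rewriter
    // E[foo^="bar"]: anchors whose href begins with "http://".
    .on('a[href^="http://"]', {
      element(element) {
        const href = element.getAttribute("href");
        element.setAttribute("href", href.replace("http://", "https://"));
      },
    })
    // E.warning combined with E > F: li children of ul elements with class menu.
    .on("ul.menu > li", {
      element(element) {
        element.setAttribute("role", "menuitem");
      },
    })
    // E:not(s) with E[foo]: images that have no alt attribute.
    .on("img:not([alt])", {
      element(element) {
        element.setAttribute("alt", "");
      },
    });
}

// In a Worker: registerHandlers(new HTMLRewriter()).transform(response);
```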
If a handler throws an exception, parsing is immediately halted, the transformed response body is errored with the thrown exception, and the untransformed response body is canceled (closed). If the transformed response body was already partially streamed back to the client, the client will see a truncated response.
async function handle(request) {
  let oldResponse = await fetch(request);
  let newResponse = new HTMLRewriter()
    .on("*", {
      element(element) {
        throw new Error("A really bad error.");
      },
    })
    .transform(oldResponse);
  // At this point, an expression like `await newResponse.text()`
  // will throw `new Error("A really bad error.")`.
  // Thereafter, any use of `newResponse.body` will throw the same error,
  // and `oldResponse.body` will be closed.
  // Alternatively, this will produce a truncated response to the client:
  return newResponse;
}
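If a truncated response is worse than a missed rewrite, one option is to catch exceptions inside the handlers so they never reach the parser. The sketch below wraps a handler object for that purpose; guardHandler and its onError callback are hypothetical helpers, not part of the HTMLRewriter API.

```javascript
// Hypothetical helper; wraps each handler function so a thrown error is
// reported through onError instead of halting the rewrite.
function guardHandler(handler, onError = console.error) {
  const wrapped = {};
  for (const name of ["element", "comments", "text", "doctype", "end"]) {
    if (typeof handler[name] !== "function") {
      continue;
    }
    wrapped[name] = (arg) => {
      try {
        const result = handler[name](arg);
        // Async handlers report failures through a rejected promise.
        if (result && typeof result.then === "function") {
          return Promise.resolve(result).catch(onError);
        }
        return result;
      } catch (err) {
        onError(err);
      }
    };
  }
  return wrapped;
}

// In a Worker:
// new HTMLRewriter().on("*", guardHandler(new ElementHandler())).transform(response);
```

The tradeoff in this design is that a failing handler leaves its content unmodified, so errors should still be logged or reported through onError.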