Showing content from https://www.mail-archive.com/html5lib-discuss@googlegroups.com/msg00050.html below:

About the serializer, filters and tree walkers...

Hi everybody,

The HTMLSerializer was merely a port from genshihtml5 [1], and that's
why I went with tree walkers (i.e. generating a stream of tokens from
a tree). Then Sam put the methods for determining whether a tag is
optional into a new class OptionalTagFilter, and we started talking
about filters in a more general way. I even proposed using tree
walkers as the "tokenizer" for the html5parser.HTMLParser.
Having thought more about it, I must say I somewhat regret having
created tree walkers, because you're moving from a tree to a stream of
tokens, where StartTag and EndTag tokens *should* be balanced, but
that cannot be guaranteed (all the more so if you consider the
OptionalTagFilter as a true filter, which could be used anywhere in a
pipeline). That's what bothers me: you cannot guarantee to a filter of
the serializer that the "token stream" is "wellformed" (yes, this is
wellformedness similar to the XML one).
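
To make the concern concrete, here is a minimal Python sketch. The
dict-based token shape and the names drop_optional_end_tags,
check_balanced and is_balanced are hypothetical and are not html5lib's
actual API; only the StartTag/EndTag token types come from the
discussion above. Void elements are ignored for simplicity.

```python
# Assumption: tokens are plain dicts with a "type" key; a tiny,
# illustrative subset of elements is treated as having optional end tags.
OPTIONAL_END_TAGS = {"p", "li"}


def drop_optional_end_tags(tokens):
    """Toy stand-in for an optional-tag filter: drops some end tags."""
    for token in tokens:
        if token["type"] == "EndTag" and token["name"] in OPTIONAL_END_TAGS:
            continue
        yield token


def check_balanced(tokens):
    """Pass tokens through, raising ValueError if tags are unbalanced."""
    open_elements = []
    for token in tokens:
        if token["type"] == "StartTag":
            open_elements.append(token["name"])
        elif token["type"] == "EndTag":
            if not open_elements or open_elements[-1] != token["name"]:
                raise ValueError("unexpected end tag: %s" % token["name"])
            open_elements.pop()
        yield token
    if open_elements:
        raise ValueError("unclosed elements: %s" % ", ".join(open_elements))


def is_balanced(tokens):
    """Consume the stream and report whether it was balanced."""
    try:
        for _ in check_balanced(tokens):
            pass
    except ValueError:
        return False
    return True


stream = [
    {"type": "StartTag", "name": "div"},
    {"type": "StartTag", "name": "p"},
    {"type": "Characters", "data": "hello"},
    {"type": "EndTag", "name": "p"},
    {"type": "EndTag", "name": "div"},
]
```

The original stream passes the check, but the same stream run through
the toy filter first does not, so any filter placed after it can no
longer rely on balanced tags.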

I know there are precedents: SAX first, but also xmlpull, python's
xml.dom.pulldom and .NET's XMLReader.

What I'd prefer: merging the tree-walking algorithm from
html5lib.treewalkers._base.NonRecursiveTreeWalker into the
HTMLSerializer, along with the OptionalTagFilter (yes, back into the
serializer).
Treewalkers would then just be implementations of the four abstract
methods from NonRecursiveTreeWalker: getNodeDetails, getFirstChild,
getNextSibling and getParent. The guarantee that we'll give to tree
walker implementations is that the algorithm will always go forward
(effectively "walking" the tree).
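
A rough Python sketch of that split, under stated assumptions: Node,
SimpleWalker and serialize are hypothetical names for illustration,
not the existing html5lib classes, and text escaping, attributes and
optional tags are left out. The walking loop lives in the serializer
and only ever moves forward through the four methods.

```python
class Node:
    """Hypothetical minimal tree node, used only for this sketch."""

    def __init__(self, kind, value, children=()):
        self.kind = kind      # "ELEMENT" or "TEXT"
        self.value = value    # tag name or text data
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


class SimpleWalker:
    """The four methods a tree walker implementation would provide."""

    def getNodeDetails(self, node):
        return node.kind, node.value

    def getFirstChild(self, node):
        return node.children[0] if node.children else None

    def getNextSibling(self, node):
        if node.parent is None:
            return None
        siblings = node.parent.children
        index = siblings.index(node) + 1
        return siblings[index] if index < len(siblings) else None

    def getParent(self, node):
        return node.parent


def serialize(walker, root):
    """Walk forward only; end tags come from the tree, not from tokens."""
    out = []
    node = root
    while node is not None:
        kind, value = walker.getNodeDetails(node)
        first_child = None
        if kind == "ELEMENT":
            out.append("<%s>" % value)
            first_child = walker.getFirstChild(node)
        else:
            out.append(value)
        if first_child is not None:
            node = first_child
            continue
        # No children: close this node, then climb until a sibling exists.
        while node is not None:
            kind, value = walker.getNodeDetails(node)
            if kind == "ELEMENT":
                out.append("</%s>" % value)
            if node is root:
                node = None
                break
            sibling = walker.getNextSibling(node)
            if sibling is not None:
                node = sibling
                break
            node = walker.getParent(node)
    return "".join(out)


tree = Node("ELEMENT", "div", [
    Node("ELEMENT", "p", [Node("TEXT", "hello")]),
    Node("TEXT", "bye"),
])
```

Because every end tag is emitted by the loop itself when it leaves an
element, the output is balanced by construction, whatever the walker
implementation returns.
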
That way, non-tree implementations (such as genshistream or pulldom)
could still be implemented but we'd "emulate" a tree from a stream
rather than generating a stream from a tree. And we wouldn't need the
"lint" filter anymore.

Filters could still be created, as implementations of the same four
abstract methods (i.e. the whitespace filter would return trimmed text
from getNodeDetails; same for an HTML sanitizer which would call the
wrapped tree walker's getFirstChild/getNextSibling/getParent twice to
skip comments and/or return TEXT nodes from getNodeDetails where the
underlying tree walker returned an ELEMENT node).
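
As a sketch of what such filters might look like in Python (all class
names here are hypothetical, and the tiny Node/BaseWalker pair exists
only to make the example self-contained), a filter simply wraps
another walker and overrides some of the four methods:

```python
class Node:
    """Hypothetical minimal node: kind is ELEMENT, TEXT or COMMENT."""

    def __init__(self, kind, value, children=()):
        self.kind, self.value, self.parent = kind, value, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


class BaseWalker:
    def getNodeDetails(self, node):
        return node.kind, node.value

    def getFirstChild(self, node):
        return node.children[0] if node.children else None

    def getNextSibling(self, node):
        if node.parent is None:
            return None
        siblings = node.parent.children
        index = siblings.index(node) + 1
        return siblings[index] if index < len(siblings) else None

    def getParent(self, node):
        return node.parent


class WalkerFilter:
    """Delegates all four methods to the wrapped walker."""

    def __init__(self, walker):
        self.walker = walker

    def getNodeDetails(self, node):
        return self.walker.getNodeDetails(node)

    def getFirstChild(self, node):
        return self.walker.getFirstChild(node)

    def getNextSibling(self, node):
        return self.walker.getNextSibling(node)

    def getParent(self, node):
        return self.walker.getParent(node)


class WhitespaceFilter(WalkerFilter):
    """Returns trimmed text from getNodeDetails."""

    def getNodeDetails(self, node):
        kind, value = self.walker.getNodeDetails(node)
        if kind == "TEXT":
            value = value.strip()
        return kind, value


class CommentSkippingFilter(WalkerFilter):
    """Advances past COMMENT nodes when moving to a child or sibling."""

    def _skip(self, node):
        while (node is not None
               and self.walker.getNodeDetails(node)[0] == "COMMENT"):
            node = self.walker.getNextSibling(node)
        return node

    def getFirstChild(self, node):
        return self._skip(self.walker.getFirstChild(node))

    def getNextSibling(self, node):
        return self._skip(self.walker.getNextSibling(node))


root = Node("ELEMENT", "div", [
    Node("COMMENT", "skip me"),
    Node("TEXT", "  hi  "),
    Node("COMMENT", "me too"),
])
walker = WhitespaceFilter(CommentSkippingFilter(BaseWalker()))
first = walker.getFirstChild(root)
```

Here the stacked filters present the div as having a single child, the
text "hi", without ever producing an intermediate token stream.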

It has been suggested that a tree walker could be passed as an
input-stream/tokenizer to the parser using a different tree builder,
to transform, say, a SimpleTree tree into a DOM one. I guess this is
really an edge case, but a TreeConverter could still be made, taking
as input a treebuilder and a treewalker.

The major drawback of this new approach is that the sanitizer would
need two implementations: one to work as a tokenizer to sanitize while
parsing, the other to sanitize an already built tree (actually maybe
only the latter would be needed, as sanitizing could be done on the
built tree just after it's been parsed). But this means that a
sanitizer working on a tree would require either building a new
sanitized tree (except when sanitizing while serializing) or adding
tree-manipulation methods to treewalkers.

The advantages? hmm, well, merely better design wrt working with and
serializing trees [2] (trees, rather than streams of tokens that may
not be wellformed). For example, Kid claims [3] that output is
guaranteed to be wellformed XML. That's wrong as soon as you start
using your own
filters and/or serializers. This is due to the streaming/pipelining
that's listed as a feature [4]. Genshi inherits the same design from
Kid [5]. If we weren't using event streams (à la SAX), we wouldn't
need "lint" filters such as gnu.xml.pipeline.WellFormednessFilter. [6]

The other possible solution is to plug the "lint" filter into the
stream at the beginning of HTMLSerializer.serialize (just before
optionally adding the OptionalTagFilter). This prevents using an
OptionalTagFilter-filtered token stream as the serializer input but I
think it's actually better design too.

Or maybe I'm thinking too much and should be more pragmatic... ;-)

[1] http://code.google.com/p/genshihtml5/
[2] http://hsivonen.iki.fi/producing-xml/#stack
[3] http://www.kid-templating.org/trac/wiki/AboutKid#XMLBased
[4] http://www.kid-templating.org/trac/wiki/AboutKid#StreamingPipelining
[5] http://genshi.edgewall.org/wiki/GenshiVsKid
[6] http://www.gnu.org/software/classpathx/jaxp/apidoc/gnu/xml/pipeline/WellFormednessFilter.html

-- 
Thomas Broyer
