Showing content from https://www.mail-archive.com/html5lib-discuss@googlegroups.com/msg00055.html below:

About the serializer, filters and tree walkers...

Thomas Broyer wrote:
> Hi everybody,
> 
> The HTMLSerializer was merely a port from genshihtml5 [1], and that's
> why I went with tree walkers (i.e. generating a stream of tokens from
> a tree). Then Sam put the methods for determining whether a tag is
> optional into a new class OptionalTagFilter, and we started talking
> about filters in a more general way. I even proposed using tree
> walkers as the "tokenizer" for the html5parser.HTMLParser.

In case anybody is interested in the background, what I did wasn't so 
much part of a master plan as a byproduct of the Ruby port.  The 
semantics of generators are subtly different in Ruby than in 
Python, enough so that I had to actually understand what that portion of 
the serializer was attempting to do in order to port it properly.

When I started, that file had a small slider helper class, and a 
somewhat larger serializer class.  In order to make the generator an 
object, I created a new class, and then pulled in the methods it needed.
By the time I was done, it had inhaled both the slider class and about 
a third of the serializer class.  And when I took a look at the result, 
it seemed much cleaner - each class had a much more focused purpose, and 
related methods were grouped together.  Enough so that I thought it 
worthwhile to port the resulting design back to Python.

> Having thought more about it, I must say I somehow regret having
> created tree walkers, because you're moving from a tree to a stream of
> tokens, where StartTag and EndTag tokens *should* be balanced, but
> that cannot be guaranteed (moreover if you consider the
> OptionalTagFilter as a true filter, which could be used anywhere in a
> pipeline). That's what bothers me: you cannot guarantee to a filter of
> the serializer that the "token stream" is "wellformed" (yes, this is
> wellformedness similar to the XML one).
> 
> I know there are precedents: SAX first, but also xmlpull, python's
> xml.dom.pulldom and .NET's XMLReader.

Those aren't so much filters as sources.

If we are rethinking abstractions, it seems to me that the natural order 
of things is to have sources, filters, and sinks.

Sources (a.k.a. tokenizers) take a foreign object, and produce a series 
of tokens.  This includes the existing tokenizer, all treewalkers, and 
adaptors.  From the consumer's point of view, there is no difference.

In fact, we could have a module level function named html5lib.tokenizer 
which, when passed an object, would find the right tokenizer and 
instantiate it (passing along **kwargs).  The tokenizer for string would 
be HTMLTokenizer (of course), but all the classes would need to register 
the types of classes that they handle.
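
A minimal sketch of how such a module-level function and registration 
step might look.  Every name here (register, tokenizer, StringTokenizer, 
ListAdaptor) is hypothetical and not part of html5lib; StringTokenizer 
is only a stand-in for HTMLTokenizer and merely splits on whitespace.

```python
# Hypothetical registry: none of these names are html5lib API.
_TOKENIZERS = {}  # maps an input type to the class that tokenizes it


def register(*types):
    """Class decorator: declare which input types a tokenizer handles."""
    def decorator(cls):
        for handled_type in types:
            _TOKENIZERS[handled_type] = cls
        return cls
    return decorator


def tokenizer(source, **kwargs):
    """Find the tokenizer registered for source's type and instantiate it."""
    # Walk the MRO so subclasses of a registered type are handled too.
    for klass in type(source).__mro__:
        if klass in _TOKENIZERS:
            return _TOKENIZERS[klass](source, **kwargs)
    raise TypeError("no tokenizer registered for %s" % type(source).__name__)


@register(str)
class StringTokenizer:
    """Stand-in for HTMLTokenizer: emits one Characters token per word."""
    def __init__(self, source, **kwargs):
        self.source = source
        self.options = kwargs

    def __iter__(self):
        for word in self.source.split():
            yield {"type": "Characters", "data": word}


@register(list, tuple)
class ListAdaptor:
    """Adaptor for input that is already a sequence of tokens."""
    def __init__(self, source, **kwargs):
        self.source = source
        self.options = kwargs

    def __iter__(self):
        return iter(self.source)


tokens = list(tokenizer("hello world", encoding="utf-8"))
```

The consumer only ever calls tokenizer(obj) and iterates; which class 
did the work is invisible to it.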

Filters would continue to consume a token stream and produce a token stream.

Sinks consume a token stream and produce something else: a string, a 
tree, a set of SAX events, whatever.
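
The source / filter / sink split above can be sketched with plain 
generators.  This is only an illustration: the token dictionaries and 
the function names (source, collapse_whitespace, to_string) are 
assumptions made for the example, not html5lib's actual interfaces, and 
the sink does no escaping.

```python
def source(items):
    """Source: turn a foreign object (a list of strings) into tokens."""
    for item in items:
        if item.startswith("</"):
            yield {"type": "EndTag", "name": item[2:-1]}
        elif item.startswith("<"):
            yield {"type": "StartTag", "name": item[1:-1]}
        else:
            yield {"type": "Characters", "data": item}


def collapse_whitespace(tokens):
    """Filter: consume a token stream and produce a token stream."""
    for token in tokens:
        if token["type"] == "Characters":
            # Copy the token rather than mutating the caller's object.
            token = dict(token, data=" ".join(token["data"].split()))
        yield token


def to_string(tokens):
    """Sink: consume a token stream and produce something else (a string)."""
    out = []
    for token in tokens:
        if token["type"] == "StartTag":
            out.append("<%s>" % token["name"])
        elif token["type"] == "EndTag":
            out.append("</%s>" % token["name"])
        else:
            out.append(token["data"])
    return "".join(out)


html = to_string(collapse_whitespace(source(["<p>", "hello   world", "</p>"])))
```

Any number of filters can be stacked between the source and the sink, 
since each one has the same shape on both sides.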

> What I'd prefer: merging the tree-walking algorithm from
> html5lib.treewalkers._base.NonRecursiveTreeWalker into the
> HTMLSerializer, along with the OptionalTagFilter (yes, back into the
> serializer).
> Treewalkers would then just be implementations of the four abstract
> methods from NonRecursiveTreeWalker: getNodeDetails, getFirstChild,
> getNextSibling and getParent. The guarantee that we'll give to tree
> walker implementations is that the algorithm will always go forward
> (effectively "walking" the tree).
> That way, non-tree implementations (such as genshistream or pulldom)
> could still be implemented but we'd "emulate" a tree from a stream
> rather than generating a stream from a tree. And we wouldn't need the
> "lint" filter anymore.

Why can't genshistream or pulldom simply be a tokenizer?
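
For reference, the four-method design proposed in the quoted text can be 
sketched roughly as follows.  The method names come from the message; 
the Node class, the SimpleWalker class and the serialize function are 
hypothetical, and the sketch ignores attributes, void elements and 
escaping.  Because the serializer drives the walk and emits the end tags 
itself, its output is balanced by construction.

```python
class Node:
    """Hypothetical minimal tree node; a text node has name=None."""
    def __init__(self, name=None, text=None, children=()):
        self.name, self.text, self.parent = name, text, None
        self.children = list(children)
        for child in self.children:
            child.parent = self


class SimpleWalker:
    """The four methods named in the proposal, implemented for Node trees."""
    def getNodeDetails(self, node):
        if node.name is None:
            return ("TEXT", node.text)
        return ("ELEMENT", node.name)

    def getFirstChild(self, node):
        return node.children[0] if node.children else None

    def getNextSibling(self, node):
        if node.parent is None:
            return None
        siblings = node.parent.children
        index = siblings.index(node) + 1
        return siblings[index] if index < len(siblings) else None

    def getParent(self, node):
        return node.parent


def serialize(root, walker):
    """Non-recursive, forward-only walk that emits the tags itself."""
    out = []
    node = root
    while node is not None:
        kind, value = walker.getNodeDetails(node)
        child = None
        if kind == "TEXT":
            out.append(value)
        else:
            out.append("<%s>" % value)
            child = walker.getFirstChild(node)
        if child is not None:
            node = child
            continue
        # Leaf reached: close elements while climbing until a sibling exists.
        while node is not None:
            kind, value = walker.getNodeDetails(node)
            if kind == "ELEMENT":
                out.append("</%s>" % value)
            if node is root:
                node = None
                break
            sibling = walker.getNextSibling(node)
            if sibling is not None:
                node = sibling
                break
            node = walker.getParent(node)
    return "".join(out)


tree = Node("p", children=[Node(text="a"), Node("b", children=[Node(text="c")])])
result = serialize(tree, SimpleWalker())
```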

> Filters could still be created, as implementations of the same four
> abstract methods (i.e. the whitespace filter would return trimmed text
> from getNodeDetails; same for an HTML sanitizer which would call the
> wrapped tree walker's getFirstChild/getNextSibling/getParent twice to
> skip comments and/or return TEXT nodes from getNodeDetails where the
> underlying tree walker returned an ELEMENT node).
> 
> It has been suggested that a tree walker could be passed as an
> input-stream/tokenizer to the parser using a different tree builder,
> to transform say a SimpleTree tree into a DOM one. I guess this is
> really an edge case, but a TreeConverter could still be made, taking
> as input a treebuilder and a treewalker.
> 
> The major drawback of this new approach is that the sanitizer would
> need two implementations: one to work as a tokenizer to sanitize while
> parsing, the other to sanitize an already built tree (actually maybe
> only the latter could be needed, as sanitizing could be done on the
> built tree just after it's been parsed). But this means that a
> sanitizer working on a tree would either require building a new
> sanitized tree (except when sanitizing while serializing) or add
> tree-manipulation methods to treewalkers.
> 
> The advantages? hmm, well, merely better design wrt working with and
> serializing trees [2] (trees, not possibly not-wellformed streams of
> tokens). For example, Kid claims [3] that output is guaranteed to be
> wellformed XML. That's wrong as soon as you start using your own
> filters and/or serializers. This is due to the streaming/pipelining
> that's listed as a feature [4]. Genshi inherits the same design from
> Kid [5]. If we weren't using event streams (à la SAX), we wouldn't
> need "lint" filters such as gnu.xml.pipeline.WellFormednessFilter. [6]

Part of the premise of HTML5 is that the general case of building a 
well-formed result from a stream of tokens requires building a tree, 
complete with adoption agency algorithms and voodoo modes.

> The other possible solution is to plug the "lint" filter on the stream
> at the beginning of HTMLSerializer.serialize (just before optionally
> adding the OptionalTagFilter). This prevents using an
> OptionalTagFilter-filtered token stream as the serializer input but I
> think it's actually better design too.
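
A rough sketch of what such a "lint" filter could check, assuming tokens 
are dictionaries with "type" and "name" keys.  LintError and lint are 
hypothetical names used for illustration, not the html5lib 
implementation.  The filter passes tokens through unchanged and raises 
as soon as the StartTag/EndTag nesting is violated.

```python
class LintError(Exception):
    """Raised when the token stream is not well-formed."""


def lint(tokens):
    """Pass tokens through unchanged, checking that tags are balanced."""
    open_elements = []  # stack of element names still awaiting an EndTag
    for token in tokens:
        if token["type"] == "StartTag":
            open_elements.append(token["name"])
        elif token["type"] == "EndTag":
            if not open_elements:
                raise LintError("unexpected end tag </%s>" % token["name"])
            expected = open_elements.pop()
            if expected != token["name"]:
                raise LintError("expected </%s>, got </%s>"
                                % (expected, token["name"]))
        yield token
    if open_elements:
        raise LintError("unclosed elements: %s" % ", ".join(open_elements))


good = [
    {"type": "StartTag", "name": "p"},
    {"type": "Characters", "data": "hi"},
    {"type": "EndTag", "name": "p"},
]
checked = list(lint(good))  # consuming the generator runs the checks
```

Since this is a generator, nothing is checked until a downstream 
consumer pulls tokens, and an error surfaces only when the offending 
token is reached.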

Why don't we simply subclass HTMLSerializer to create an XHTMLSerializer? 
The only differences would be that it would default a number of 
keyword arguments differently, and would apply the lint filter.
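
Something along these lines, as a sketch only.  The HTMLSerializer below 
is a toy stand-in rather than the real class, the use_trailing_solidus 
option name is used purely for illustration, and check_balanced is a 
hypothetical, simplified lint step.

```python
def check_balanced(tokens):
    """Simplified lint step: raise ValueError on unbalanced tags."""
    stack = []
    for token in tokens:
        if token["type"] == "StartTag":
            stack.append(token["name"])
        elif token["type"] == "EndTag":
            if not stack or stack.pop() != token["name"]:
                raise ValueError("unbalanced end tag: %s" % token["name"])
        yield token
    if stack:
        raise ValueError("unclosed elements: %s" % ", ".join(stack))


class HTMLSerializer:
    """Toy stand-in for the real serializer."""
    defaults = {"use_trailing_solidus": False}

    def __init__(self, **kwargs):
        # Explicit keyword arguments override the class-level defaults.
        self.options = dict(self.defaults, **kwargs)

    def filters(self):
        return []

    def serialize(self, tokens):
        for apply_filter in self.filters():
            tokens = apply_filter(tokens)
        for token in tokens:
            if token["type"] == "StartTag":
                yield "<%s>" % token["name"]
            elif token["type"] == "EndTag":
                yield "</%s>" % token["name"]
            elif token["type"] == "EmptyTag":
                solidus = " /" if self.options["use_trailing_solidus"] else ""
                yield "<%s%s>" % (token["name"], solidus)
            else:
                yield token["data"]


class XHTMLSerializer(HTMLSerializer):
    """Same serializer, different defaults, plus the lint step."""
    defaults = dict(HTMLSerializer.defaults, use_trailing_solidus=True)

    def filters(self):
        return [check_balanced] + super().filters()


stream = [
    {"type": "StartTag", "name": "p"},
    {"type": "EmptyTag", "name": "br"},
    {"type": "EndTag", "name": "p"},
]
as_html = "".join(HTMLSerializer().serialize(stream))
as_xhtml = "".join(XHTMLSerializer().serialize(stream))
```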

> Or maybe I'm thinking too much and should be more pragmatic... ;-)
> 
> [1] http://code.google.com/p/genshihtml5/
> [2] http://hsivonen.iki.fi/producing-xml/#stack
> [3] http://www.kid-templating.org/trac/wiki/AboutKid#XMLBased
> [4] http://www.kid-templating.org/trac/wiki/AboutKid#StreamingPipelining
> [5] http://genshi.edgewall.org/wiki/GenshiVsKid
> [6] http://www.gnu.org/software/classpathx/jaxp/apidoc/gnu/xml/pipeline/WellFormednessFilter.html

- Sam Ruby
