Hi everybody,

The HTMLSerializer was merely a port from genshihtml5 [1], and that's why I went with tree walkers (i.e. generating a stream of tokens from a tree). Then Sam put the methods for determining whether a tag is optional into a new class, OptionalTagFilter, and we started talking about filters in a more general way. I even proposed using tree walkers as the "tokenizer" for the html5parser.HTMLParser.

Having thought more about it, I must say I somewhat regret having created tree walkers, because you're moving from a tree to a stream of tokens, where StartTag and EndTag tokens *should* be balanced, but that cannot be guaranteed (all the more so if you consider the OptionalTagFilter a true filter, which could be used anywhere in a pipeline). That's what bothers me: you cannot guarantee to a filter of the serializer that the "token stream" is "wellformed" (yes, this is wellformedness similar to the XML one). I know there are precedents: SAX first, but also xmlpull, Python's xml.dom.pulldom and .NET's XmlReader.

What I'd prefer: merging the tree-walking algorithm from html5lib.treewalkers._base.NonRecursiveTreeWalker into the HTMLSerializer, along with the OptionalTagFilter (yes, back into the serializer). Treewalkers would then just be implementations of the four abstract methods from NonRecursiveTreeWalker: getNodeDetails, getFirstChild, getNextSibling and getParent. The guarantee that we'd give to tree walker implementations is that the algorithm will always go forward (effectively "walking" the tree). That way, non-tree implementations (such as genshistream or pulldom) could still be implemented, but we'd "emulate" a tree from a stream rather than generating a stream from a tree. And we wouldn't need the "lint" filter anymore.

Filters could still be created, as implementations of the same four abstract methods (e.g. the whitespace filter would return trimmed text from getNodeDetails; same for an HTML sanitizer, which would call the wrapped tree walker's getFirstChild/getNextSibling/getParent twice to skip comments and/or return TEXT nodes from getNodeDetails where the underlying tree walker returned an ELEMENT node).

It has been suggested that a tree walker could be passed as an input-stream/tokenizer to the parser using a different tree builder, to transform, say, a SimpleTree tree into a DOM one.
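
To make the proposal above a bit more concrete, here is a minimal Python sketch of a serializer that owns the forward-only walk and only asks the tree walker for the four methods named above. It is not html5lib's actual code: Node, SimpleWalker, WhitespaceFilter and serialize are hypothetical names invented for illustration, the node-details tuple is an assumed shape, the whitespace filter is assumed to collapse runs of whitespace, and escaping, attributes and optional tags are left out.

```python
from abc import ABC, abstractmethod

ELEMENT, TEXT = "element", "text"


class Node:
    """Hypothetical minimal tree node, used only for this sketch."""

    def __init__(self, kind, value, children=()):
        self.kind, self.value = kind, value
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


class TreeWalker(ABC):
    """The four abstract methods named in the message."""

    @abstractmethod
    def getNodeDetails(self, node): ...

    @abstractmethod
    def getFirstChild(self, node): ...

    @abstractmethod
    def getNextSibling(self, node): ...

    @abstractmethod
    def getParent(self, node): ...


class SimpleWalker(TreeWalker):
    """Walker over the hypothetical Node tree."""

    def getNodeDetails(self, node):
        return (node.kind, node.value)

    def getFirstChild(self, node):
        return node.children[0] if node.children else None

    def getNextSibling(self, node):
        if node.parent is None:
            return None
        siblings = node.parent.children
        index = siblings.index(node) + 1
        return siblings[index] if index < len(siblings) else None

    def getParent(self, node):
        return node.parent


class WhitespaceFilter(TreeWalker):
    """A filter written as a walker that wraps another walker."""

    def __init__(self, source):
        self.source = source

    def getNodeDetails(self, node):
        kind, value = self.source.getNodeDetails(node)
        if kind == TEXT:
            # Assumption: trim and collapse runs of whitespace.
            value = " ".join(value.split())
        return (kind, value)

    def getFirstChild(self, node):
        return self.source.getFirstChild(node)

    def getNextSibling(self, node):
        return self.source.getNextSibling(node)

    def getParent(self, node):
        return self.source.getParent(node)


def serialize(walker, root):
    """Forward-only walk; start and end tags balance by construction."""
    out = []
    node = root
    while node is not None:
        kind, value = walker.getNodeDetails(node)
        child = None
        if kind == ELEMENT:
            out.append("<%s>" % value)
            child = walker.getFirstChild(node)
        else:
            out.append(value)  # no escaping in this sketch
        if child is not None:
            node = child
            continue
        # No child to descend into: close nodes while climbing
        # until a next sibling exists or the root has been closed.
        while node is not None:
            kind, value = walker.getNodeDetails(node)
            if kind == ELEMENT:
                out.append("</%s>" % value)
            if node is root:
                node = None
                break
            sibling = walker.getNextSibling(node)
            if sibling is not None:
                node = sibling
                break
            node = walker.getParent(node)
    return "".join(out)


tree = Node(ELEMENT, "p", [Node(TEXT, "  Hello   "),
                           Node(ELEMENT, "b", [Node(TEXT, "world")])])
print(serialize(SimpleWalker(), tree))
print(serialize(WhitespaceFilter(SimpleWalker()), tree))
```

Because the serializer emits each end tag itself while climbing back up, a filter in this shape can change what a node looks like but has no way to hand the serializer an unbalanced sequence, which is the point of the design.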
I guess this is really an edge case, but a TreeConverter could still be made, taking as input a treebuilder and a treewalker.

The major drawback of this new approach is that the sanitizer would need two implementations: one to work as a tokenizer to sanitize while parsing, the other to sanitize an already built tree (actually, maybe only the latter would be needed, as sanitizing could be done on the built tree just after it's been parsed). But this means that a sanitizer working on a tree would require either building a new sanitized tree (except when sanitizing while serializing) or adding tree-manipulation methods to treewalkers.

The advantages? Hmm, well, merely better design wrt working with and serializing trees [2] (trees, rather than possibly non-wellformed streams of tokens). For example, Kid claims [3] that its output is guaranteed to be wellformed XML. That's wrong as soon as you start using your own filters and/or serializers. This is due to the streaming/pipelining that's listed as a feature [4]. Genshi inherits the same design from Kid [5]. If we weren't using event streams (à la SAX), we wouldn't need "lint" filters such as gnu.xml.pipeline.WellFormednessFilter [6].

The other possible solution is to plug the "lint" filter into the stream at the beginning of HTMLSerializer.serialize (just before optionally adding the OptionalTagFilter). This prevents using an OptionalTagFilter-filtered token stream as the serializer input, but I think that's actually better design too.

Or maybe I'm thinking too much and should be more pragmatic...
;-)

[1] http://code.google.com/p/genshihtml5/
[2] http://hsivonen.iki.fi/producing-xml/#stack
[3] http://www.kid-templating.org/trac/wiki/AboutKid#XMLBased
[4] http://www.kid-templating.org/trac/wiki/AboutKid#StreamingPipelining
[5] http://genshi.edgewall.org/wiki/GenshiVsKid
[6] http://www.gnu.org/software/classpathx/jaxp/apidoc/gnu/xml/pipeline/WellFormednessFilter.html

--
Thomas Broyer