2007/6/5, Sam Ruby:
>
> I'd like to repackage the sanitizer, highlighter, and serializer filters
> (like the optional tag filter) as standalone filters. These could be
> used independent of the builder or serializer used. Each filter can
> become an option of parse.

Wrt what you started to do in ruby, I'd like to go a little farther: refactor the HTMLParser API so that the tokenizer passed is an instance rather than a class (and then add convenience functions such as parse and parseFragment).

This would allow passing an instance of an HTMLTokenizer or a treewalker (useful to convert a tree from DOM to ElementTree, for example), proxied by an HTMLSanitizer and/or the OptionalTagFilter (might be useful for tests) and/or the highlighter, etc.

HTMLParser and HTMLTokenizer would need to be refactored a bit so that HTMLParser doesn't assume the tokenizer is an HTMLTokenizer (e.g. it shouldn't access tokenizer.stream directly but use methods such as tokenizer.position()). I haven't thought about it much so I don't know if it's doable, but it would be cool (there might be a problem with the HTMLParser setting the tokenizer's contentModelFlag when the tokenizer is not an HTMLTokenizer).

> I'd like the input stream to actually stream for the common use cases of
> windows-1252 or utf-8 input.

Wrt r674, why not use a codecs.StreamReader?

Also, re "unreading" chars, maybe we could follow a pattern similar to Java IO's mark()/reset(). In Twintsam, I used PeekChar/PeekChars and ReadChar/ReadChars methods instead of "unreading" chars:
<http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs>
There's also a SkipChars method, identical to ReadChars but "optimized" because it does not collect chars into a buffer.

This could be used for encoding detection too (especially for non-seekable streams): use the internal buffer (queue) for bytes read from rawStream when detecting the encoding; then use it for chars read from the decoded stream when parsing.

Finally, HTMLInputStream.reset() should call rawStream.seek(0), which means it would only be available if the stream is seekable (if you want to reparse a non-seekable stream, it's up to you to buffer it).

Any thoughts?

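To make the peek/read/skip idea a bit more concrete, here is a rough Python sketch of a reader that wraps a codecs.StreamReader and keeps an internal queue of decoded chars, so nothing ever has to be "unread". The class name PeekableReader, its method names, and the chunk_size parameter are hypothetical; this is neither html5lib's actual HTMLInputStream nor a port of the Twintsam code, just an illustration of the pattern under those assumptions.

```python
import codecs
import io
from collections import deque
from itertools import islice


class PeekableReader:
    """Hypothetical char reader offering peek/read/skip instead of unread."""

    def __init__(self, raw_stream, encoding="utf-8", chunk_size=1024):
        # codecs.getreader() returns the StreamReader class for the encoding;
        # instantiating it wraps the byte stream and decodes incrementally.
        self._reader = codecs.getreader(encoding)(raw_stream)
        self._queue = deque()          # decoded chars not yet consumed
        self._chunk_size = chunk_size  # assumed tuning knob, not from the thread

    def _fill(self, count):
        # Pull decoded chars until the queue holds `count` chars or we hit EOF.
        while len(self._queue) < count:
            chunk = self._reader.read(self._chunk_size)
            if not chunk:
                break
            self._queue.extend(chunk)

    def peek_chars(self, count=1):
        # Look ahead without consuming anything.
        self._fill(count)
        return "".join(islice(self._queue, count))

    def read_chars(self, count=1):
        # Consume and return up to `count` chars.
        self._fill(count)
        n = min(count, len(self._queue))
        return "".join(self._queue.popleft() for _ in range(n))

    def skip_chars(self, count=1):
        # Like read_chars, but does not build a result string.
        self._fill(count)
        n = min(count, len(self._queue))
        for _ in range(n):
            self._queue.popleft()
        return n  # number of chars actually skipped


reader = PeekableReader(io.BytesIO("<p>caf\u00e9</p>".encode("utf-8")))
print(reader.peek_chars(3))   # look at the start tag without consuming it
print(reader.read_chars(3))   # now consume it
```

The same queue could in principle hold raw bytes during encoding detection and decoded chars afterwards, as suggested above; the sketch only covers the decoded side.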
--
Thomas Broyer