Thomas Broyer wrote:
> 2007/6/5, Sam Ruby:
>> I'd like to repackage the sanitizer, highlighter, and serializer filters
>> (like the optional tag filter) as standalone filters.  These could be
>> used independent of the builder or serializer used.  Each filter can
>> become an option of parse.
>
> Wrt what you started to do in ruby, I'd like to go a little farther:
> refactor the HTMLParser API so that the tokenizer passed is an
> instance rather than a class (and then add convenience functions such
> as parse and parseFragment). This would allow passing an instance of a
> HTMLTokenizer or a treewalker (useful to convert a tree from DOM to
> ElementTree for example), proxied by an HTMLSanitizer and/or the
> OptionalTagFilter (might be useful for tests) and/or hilighter, etc.

I've been trying to keep the ruby and python versions roughly in sync...

> HTMLParser and HTMLTokenizer would need to be refactored a bit so that
> HTMLParser don't assume the tokenizer is an HTMLTokenizer (e.g. don't
> access tokenizer.stream directly but use methods such as
> tokenizer.position()).

Yes.

> I haven't thought about it much so I don't know if it's doable, but it
> would be cool (there might be a problem with the HTMLParser setting
> the tokenizer's contentModelFlag, when the tokenizer is not an
> HTMLTokenizer).

The easiest way to proceed is for:

1) the _parse method to check for hasattr(stream, 'contentModelFlag'),
   and if so, use the stream as the tokenizer; otherwise construct a
   tokenizer for the stream.

2) define a base class for filters that has a constructor which accepts
   a stream, and default implementations of contentModelFlag and
   position which simply proxy/forward the calls onto the stream.

At the moment, HTMLSanitizer is a subclass of HTMLTokenizer, so the
forwarding is not required, but changing this to a mechanism based on
forwarding allows one to construct a pipe with an arbitrary number of
filters in any order that you like.

I'd also like to see all filters moved into a "filters" directory/module.

>> I'd like the input stream to actually stream for the common use cases of
>> windows-1252 or utf-8 input.
>
> Wrt to r674, why not use a codecs.StreamReader?
>
> Also, re "unreading" chars, maybe we could follow a pattern similar to
> Java-IO's mark()/reset().
> In Twintsam, I used PeekChar/PeekChars and ReadChar/ReadChars methods
> instead of "unreading" chars
> <http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs>
> There's also a SkipChars method, identical to ReadChars but
> "optimized" because it does not collect chars into a buffer.
>
> This could be used for encoding detection too (especially for
> non-seekable streams): use the internal buffer (queue) for bytes read
> from rawStream when detecting the encoding; then use it for chars read
> from the decoded stream when parsing.
>
> Finally, HTMLInputStream.reset() should call rawStream.seek(0), which
> means it would be only available if the stream is seekable (if you
> want to reparse a non-seekable stream, it's up to you to buffer it).
>
> Any thoughts?

All sounds good, but I'd like to get rid of the HTMLInputStream reset
method entirely.

The parse.py and parse.rb programs should accept a '-' as a filename,
and interpret that as meaning stdin.

Is there a reason why twintsam isn't simply placed in html5lib?  I'd
love to see a set of html5 implementations all sharing a common test
suite base (augmented by language-specific additions, where
appropriate).  C# would be a reasonable next language to tackle.

- Sam Ruby

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "html5lib-discuss" group.
To post to this group, send email to html5lib-discuss@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/html5lib-discuss?hl=en-GB
-~----------~----~----~----~------~----~------~--~---
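
For illustration only, the two steps proposed above (the hasattr check
in _parse, and a base class for filters that forwards contentModelFlag
and position to the wrapped stream) might look roughly like the Python
sketch below. This is not html5lib code: FakeTokenizer, Filter,
UppercaseFilter and make_tokenizer are hypothetical names invented for
the example, and the token dictionaries are a simplified assumption.

```python
class FakeTokenizer:
    """Hypothetical stand-in for HTMLTokenizer: one token per character."""

    def __init__(self, stream):
        self.stream = stream
        self.contentModelFlag = "PCDATA"
        self._offset = 0

    def position(self):
        # Number of characters consumed so far.
        return self._offset

    def __iter__(self):
        for index, char in enumerate(self.stream):
            self._offset = index + 1
            yield {"type": "Characters", "data": char}


class Filter:
    """Hypothetical filter base class: wraps a source and forwards to it."""

    def __init__(self, source):
        self.source = source

    @property
    def contentModelFlag(self):
        return self.source.contentModelFlag

    @contentModelFlag.setter
    def contentModelFlag(self, value):
        # Pass the parser's setting down the pipe to the real tokenizer.
        self.source.contentModelFlag = value

    def position(self):
        return self.source.position()

    def __iter__(self):
        # Default behaviour: pass every token through unchanged.
        return iter(self.source)


class UppercaseFilter(Filter):
    """Toy filter used only to show that tokens can be rewritten."""

    def __iter__(self):
        for token in self.source:
            yield dict(token, data=token["data"].upper())


def make_tokenizer(stream):
    """Step 1: use the stream as the tokenizer if it looks like one."""
    if hasattr(stream, "contentModelFlag"):
        return stream
    return FakeTokenizer(stream)
```

Because every filter exposes the same small interface as the tokenizer,
something like UppercaseFilter(Filter(FakeTokenizer("ab"))) can be handed
to make_tokenizer unchanged, and filters can be stacked in any order.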