2007/6/7, Sam Ruby:
>
> > I haven't thought about it much so I don't know if it's doable, but it
> > would be cool (there might be a problem with the HTMLParser setting
> > the tokenizer's contentModelFlag, when the tokenizer is not an
> > HTMLTokenizer).
>
> The easiest way to proceed is for:
> 1) the _parse method to check for hasattr(stream, 'contentModelFlag'),
>    and if so, use the stream as the tokenizer; otherwise construct
>    a tokenizer for the stream.
> 2) define a base class for filters that has a constructor which
>    accepts a stream, and default implementations of contentModelFlag
>    and position which simply proxy/forward the calls onto the stream.
>
> At the moment, HTMLSanitizer is a subclass of HTMLTokenizer, so the
> forwarding is not required, but changing this to a mechanism based on
> forwarding allows one to construct a pipe with an arbitrary number of
> filters in any order that you like.
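
For illustration only, the two steps quoted above could be sketched roughly as follows. This is not html5lib's actual code: `Filter`, `StubTokenizer` and `make_tokenizer` are hypothetical names, the stub stands in for the real HTMLTokenizer, and treating `position` as a method is an assumption.

```python
class Filter(object):
    """Hypothetical filter base class (step 2): wraps a token source and
    forwards contentModelFlag and position to it."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        # Default behaviour: pass every token through unchanged.
        return iter(self.source)

    @property
    def contentModelFlag(self):
        return self.source.contentModelFlag

    @contentModelFlag.setter
    def contentModelFlag(self, value):
        # Writes from the parser travel down the pipe to the tokenizer.
        self.source.contentModelFlag = value

    def position(self):
        # Assumption: position is a method on the wrapped source.
        return self.source.position()


class StubTokenizer(object):
    """Stand-in for HTMLTokenizer, only here to make the sketch runnable."""

    def __init__(self, stream):
        self.stream = stream
        self.contentModelFlag = "PCDATA"

    def position(self):
        return (1, 0)

    def __iter__(self):
        for ch in self.stream:
            yield {"type": "Characters", "data": ch}


def make_tokenizer(stream, tokenizer_class=StubTokenizer):
    """Step 1: use the stream itself if it already looks like a tokenizer,
    otherwise construct a tokenizer for it."""
    if hasattr(stream, "contentModelFlag"):
        return stream
    return tokenizer_class(stream)


# Filters can be stacked in any order; the flag still reaches the tokenizer.
tokenizer = StubTokenizer("ab")
pipe = Filter(Filter(tokenizer))
pipe.contentModelFlag = "RCDATA"
```

With forwarding done through a property, `hasattr(pipe, 'contentModelFlag')` is true only when something at the bottom of the pipe actually carries the flag, so a filter wrapped around a source without one would not be mistaken for a tokenizer.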

I was about to propose 2) too. 1) is a useful addition (though a "def
parse" à la xml.sax, xml.dom.minidom and xml.dom.pulldom would be bad
IMO).

...but there's still a problem with contentModelFlag and treewalkers:
what should the HTMLParser do? Should treewalkers have a
contentModelFlag property? If so, what would it mean (if it ever means
something)?
Maybe HTMLParser needs a wrapper around filters if they don't have a
contentModelFlag? But then 1) cannot be implemented (and "def parse"
becomes more than useful for the common case of parsing a file-like
object or string).

Just thinking out loud.

> I'd also like to see all filters moved into a "filters" directory/module.

Just to say that I have a small problem with OptionalTagFilter: it
breaks the symmetry of "StartTag"/"EndTag" tokens (that's the reason why
I did not initially make it a "filter"). This might cause problems for
other filters in the pipe. So: "use with care", but it could still be
placed into a filters submodule.

> All sounds good, but I'd like to get rid of the HTMLInputStream reset
> method entirely.

That'd be far better of course.

> The parse.py and parse.rb programs should accept a '-' as a filename,
> and interpret that as meaning stdin.

Yes.

> Is there a reason why twintsam isn't simply placed in html5lib? I'd
> love to see a set of html5 implementations all sharing a common test
> suite base (augmented by language specific additions, where
> appropriate). C# would be reasonable next language to tackle.

Because Twintsam is not a port of html5lib? ;-)
Also, when I started working on it last December, I didn't know
Python, so it didn't occur to me to share the same repository (and
same issues list) as html5lib. Actually, I would rather have proposed
putting the Ruby port into another repository ;-)

Wrt the test suite, don't worry, Twintsam is using the tests from
the html5lib repository (thanks to svn:externals).
I finally learnt Python and plugged a small script into the build
process to generate C# test classes suitable for NUnit and Microsoft's
UnitTestFramework (same code, using compiler directives to switch
imports between testing frameworks).

-- 
Thomas Broyer