Showing content from https://www.mail-archive.com/html5lib-discuss@googlegroups.com/msg00033.html below:

Re: Minor html5lib refactoring

2007/6/7, Sam Ruby:
>
> > I haven't thought about it much so I don't know if it's doable, but it
> > would be cool (there might be a problem with the HTMLParser setting
> > the tokenizer's contentModelFlag, when the tokenizer is not an
> > HTMLTokenizer).
>
> The easiest way to proceed is for:
>    1) the _parse method to check for hasattr(stream, 'contentModelFlag'),
>       and if so, use the stream as the tokenizer; otherwise construct
>       a tokenizer for the stream.
>    2) define a base class for filters that has a constructor which
>       accepts a stream, and default implementations of contentModelFlag
>       and position which simply proxy/forward the calls onto the stream.
>
> At the moment, HTMLSanitizer is a subclass of HTMLTokenizer, so the
> forwarding is not required, but changing this to a mechanism based on
> forwarding allows one to construct a pipe with an arbitrary number of
> filters in any order that you like.

I was about to propose 2) too. 1) is a useful addition (though a "def
parse" à la xml.sax, xml.dom.minidom and xml.dom.pulldom would be bad
IMO).
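
For what it's worth, here is a rough sketch of 1) and 2) together. All the names below (DummyTokenizer, FilterBase, UppercaseFilter, make_tokenizer) are made up for illustration and are not the actual html5lib classes; the only things taken from the thread are the hasattr check on contentModelFlag and the forwarding of contentModelFlag and position to the wrapped stream.

```python
class DummyTokenizer:
    """Stand-in for a real tokenizer (hypothetical, not html5lib's HTMLTokenizer)."""

    def __init__(self, stream):
        self.stream = stream
        self.contentModelFlag = "PCDATA"

    def position(self):
        # A real tokenizer would report the current (line, column); fixed here.
        return (1, 0)

    def __iter__(self):
        # Emit one Characters token per whitespace-separated word.
        for word in self.stream.split():
            yield {"type": "Characters", "data": word}


class FilterBase:
    """Base filter: wraps a source and forwards tokenizer state to it."""

    def __init__(self, source):
        self.source = source

    @property
    def contentModelFlag(self):
        return self.source.contentModelFlag

    @contentModelFlag.setter
    def contentModelFlag(self, value):
        # Setting the flag on any filter in the pipe reaches the tokenizer.
        self.source.contentModelFlag = value

    def position(self):
        return self.source.position()

    def __iter__(self):
        # Default behaviour: pass tokens through unchanged.
        return iter(self.source)


class UppercaseFilter(FilterBase):
    """Example filter that upper-cases token data."""

    def __iter__(self):
        for token in self.source:
            yield dict(token, data=token["data"].upper())


def make_tokenizer(stream):
    """Point 1): use the stream as the tokenizer if it already looks like one."""
    if hasattr(stream, "contentModelFlag"):
        return stream
    return DummyTokenizer(stream)
```

With this, UppercaseFilter(FilterBase(DummyTokenizer("a b"))) is accepted as-is by make_tokenizer, while a plain string gets wrapped in a tokenizer first. Filters can be stacked in any order because each one only talks to the object it wraps.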

...but there's still a problem with contentModelFlag and treewalkers:
what should the HTMLParser do? should treewalkers have a
contentModelFlag property? if so, what would it mean? (if it ever
means something)

Maybe HTMLParser needs a wrapper around filters if they don't have a
contentModelFlag? but then 1) cannot be implemented (and "def parse"
becomes more than useful for the common case of parsing a file-like
object or string).

Just thinking out loud.

> I'd also like to see all filters moved into a "filters" directory/module.

Just to say that I have a small problem with OptionalTagFilter: it
breaks the symmetry of "StartTag"/"EndTag" tokens (that's the reason
why I did not initially make it a "filter"). This might cause problems
for other filters in the pipe. So: "use with care", but it could still
be placed into a filters submodule.
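
A small illustration of the kind of problem I mean. The max_depth function is a hypothetical downstream filter that counts nesting by pairing start and end tags, and the "unbalanced" list assumes the optional-tag filter dropped the optional </li> end tags; neither is code from html5lib.

```python
def max_depth(tokens):
    """Hypothetical downstream logic that assumes every StartTag has an EndTag."""
    depth = deepest = 0
    for token in tokens:
        if token["type"] == "StartTag":
            depth += 1
            deepest = max(deepest, depth)
        elif token["type"] == "EndTag":
            depth -= 1
    return deepest


# <ul><li></li><li></li></ul> as a balanced token stream.
balanced = [
    {"type": "StartTag", "name": "ul"},
    {"type": "StartTag", "name": "li"},
    {"type": "EndTag", "name": "li"},
    {"type": "StartTag", "name": "li"},
    {"type": "EndTag", "name": "li"},
    {"type": "EndTag", "name": "ul"},
]

# The same stream once the optional </li> end tags have been removed.
unbalanced = [
    t for t in balanced
    if not (t["type"] == "EndTag" and t["name"] == "li")
]

print(max_depth(balanced))    # 2
print(max_depth(unbalanced))  # 3: the second <li> looks nested in the first
```

So any filter that relies on balanced tokens has to run before the optional-tag filter in the pipe, not after it.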

> All sounds good, but I'd like to get rid of the HTMLInputStream reset
> method entirely.

That'd be far better of course.

> The parse.py and parse.rb programs should accept a '-' as a filename,
> and interpret that as meaning stdin.

Yes.
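
On the Python side this could be as small as the sketch below. The open_input name is made up for illustration and is not something parse.py currently defines.

```python
import sys


def open_input(filename):
    """Return a binary file-like object; '-' means standard input."""
    if filename == "-":
        # Prefer the underlying byte stream so the parser can sniff the encoding;
        # fall back to sys.stdin itself if it has been replaced by an object
        # without a .buffer attribute.
        return getattr(sys.stdin, "buffer", sys.stdin)
    return open(filename, "rb")
```

The caller would then hand the returned object to the parser exactly as it does for a regular file.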

> Is there a reason why twintsam isn't simply placed in html5lib?  I'd
> love to see a set of html5 implementations all sharing a common test
> suite base (augmented by language specific additions, where
> appropriate).  C# would be reasonable next language to tackle.

Because Twintsam is not a port of html5lib? ;-)
Also, when I started working on it last December, I didn't know
Python, so it didn't occur to me to share the same repository (and
same issues list) as html5lib.
Actually, I would have rather proposed putting the Ruby port into
another repository ;-)

Wrt the test suite, don't worry, Twintsam is using the tests from
the html5lib repository (thanks to svn:externals). I finally learnt
Python and plugged a small script into the build process to generate
C# test classes suitable for NUnit and Microsoft's UnitTestFramework
(same code, using compiler directives to switch imports between
testing frameworks).

-- 
Thomas Broyer

