Showing content from https://www.mail-archive.com/html5lib-discuss@googlegroups.com/msg00031.html below:

Re: Minor html5lib refactoring

2007/6/5, Sam Ruby:
>
> I'd like to repackage the sanitizer, highlighter, and serializer filters
> (like the optional tag filter) as standalone filters.  These could be
> used independent of the builder or serializer used.  Each filter can
> become an option of parse.

Wrt what you started to do in Ruby, I'd like to go a little further:
refactor the HTMLParser API so that the tokenizer passed is an
instance rather than a class (and then add convenience functions such
as parse and parseFragment). This would allow passing an instance of a
HTMLTokenizer or a treewalker (useful to convert a tree from DOM to
ElementTree for example), proxied by an HTMLSanitizer and/or the
OptionalTagFilter (might be useful for tests) and/or highlighter, etc.
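
To make the idea concrete, here is a rough sketch of a parser that takes a
token source instance instead of a tokenizer class. All names (toy_tokenizer,
sanitizer, ToyParser) and the token shape are made up for illustration; this
is not the current html5lib API.

```python
def toy_tokenizer(text):
    """Hypothetical stand-in for an HTMLTokenizer instance: yields token dicts."""
    for word in text.split():
        if word.startswith("</"):
            yield {"type": "EndTag", "name": word[2:-1]}
        elif word.startswith("<"):
            yield {"type": "StartTag", "name": word[1:-1]}
        else:
            yield {"type": "Characters", "data": word}


def sanitizer(tokens, allowed=("p", "b")):
    """Hypothetical filter: proxies any token source, dropping disallowed tags."""
    for token in tokens:
        if token["type"] in ("StartTag", "EndTag") and token["name"] not in allowed:
            continue
        yield token


class ToyParser:
    """Consumes any iterable of tokens; it never instantiates a tokenizer itself."""

    def __init__(self, tokens):
        self.tokens = tokens

    def parse(self):
        # A real parser would build a tree; this one only records what it saw.
        return [(t["type"], t.get("name", t.get("data"))) for t in self.tokens]


def parse(text):
    """Convenience function wiring the default pieces together."""
    return ToyParser(sanitizer(toy_tokenizer(text))).parse()
```

Since the parser only iterates over its argument, a treewalker or any other
filter could be passed in place of sanitizer(toy_tokenizer(...)).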

HTMLParser and HTMLTokenizer would need to be refactored a bit so that
HTMLParser doesn't assume the tokenizer is an HTMLTokenizer (e.g. don't
access tokenizer.stream directly but use methods such as
tokenizer.position()).

I haven't thought about it much so I don't know if it's doable, but it
would be cool (there might be a problem with the HTMLParser setting
the tokenizer's contentModelFlag, when the tokenizer is not an
HTMLTokenizer).
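
One possible way around both points, as a sketch only: the parser talks to
the token source through position() and a contentModelFlag attribute, and a
proxy forwards both to whatever it wraps, silently ignoring the flag when the
wrapped source has none. ToyTokenizer and TokenFilter are hypothetical names,
and the (line, column) return value of position() is my assumption.

```python
class ToyTokenizer:
    """Hypothetical tokenizer that owns its stream and reports its own position."""

    def __init__(self, text):
        self._text = text
        self._offset = 0
        self.contentModelFlag = "PCDATA"

    def position(self):
        # (line, column) of the next unread character; line is 1-based.
        line = self._text.count("\n", 0, self._offset) + 1
        column = self._offset - (self._text.rfind("\n", 0, self._offset) + 1)
        return (line, column)

    def __iter__(self):
        for ch in self._text:
            self._offset += 1
            yield {"type": "Characters", "data": ch}


class TokenFilter:
    """Proxy exposing the same surface as the tokenizer it wraps."""

    def __init__(self, source):
        self.source = source

    def position(self):
        return self.source.position()

    @property
    def contentModelFlag(self):
        return getattr(self.source, "contentModelFlag", None)

    @contentModelFlag.setter
    def contentModelFlag(self, value):
        # Sources without a content model (e.g. a treewalker) ignore the flag.
        if hasattr(self.source, "contentModelFlag"):
            self.source.contentModelFlag = value

    def __iter__(self):
        return iter(self.source)
```

Whether ignoring the flag is acceptable for non-tokenizer sources is exactly
the open question above; the sketch only shows that the parser itself need
not care.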

> I'd like the input stream to actually stream for the common use cases of
> windows-1252 or utf-8 input.

Wrt r674, why not use a codecs.StreamReader?
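
Something along these lines, as a minimal sketch: codecs.getreader() returns
the StreamReader class for an encoding, and the reader decodes incrementally
as it is read from. The iter_chars helper and its chunk_size parameter are
hypothetical, not part of html5lib.

```python
import codecs
import io


def iter_chars(raw_stream, encoding="utf-8", chunk_size=4):
    """Yield decoded text chunks without loading the whole stream at once."""
    reader = codecs.getreader(encoding)(raw_stream, errors="replace")
    while True:
        # The reader keeps incomplete multi-byte sequences buffered until
        # the remaining bytes arrive, so chunks never split a character.
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        yield chunk


text = "caf\u00e9 \u20ac5"
utf8_chunks = list(iter_chars(io.BytesIO(text.encode("utf-8"))))
cp1252_chunks = list(
    iter_chars(io.BytesIO(text.encode("windows-1252")), "windows-1252")
)
```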

Also, re "unreading" chars, maybe we could follow a pattern similar to
Java-IO's mark()/reset().
In Twintsam, I used PeekChar/PeekChars and ReadChar/ReadChars methods
instead of "unreading" chars:
<http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs>
There's also a SkipChars method, identical to ReadChars but
"optimized" because it does not collect chars into a buffer.

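A Python rendition of that peek/read/skip pattern might look like the
following. It is a sketch, not a port of the Twintsam code: PeekableReader
and its method names are made up here, and the underlying stream is assumed
to be a text stream with a read(n) method.

```python
import io
from collections import deque
from itertools import islice


class PeekableReader:
    """Look ahead without consuming, so chars never need to be 'unread'."""

    def __init__(self, stream):
        self._stream = stream
        self._buffer = deque()  # chars read from the stream but not yet consumed

    def _fill(self, count):
        while len(self._buffer) < count:
            ch = self._stream.read(1)
            if not ch:  # end of stream
                break
            self._buffer.append(ch)

    def peek_chars(self, count):
        self._fill(count)
        return "".join(islice(self._buffer, count))

    def read_chars(self, count):
        self._fill(count)
        available = min(count, len(self._buffer))
        return "".join(self._buffer.popleft() for _ in range(available))

    def skip_chars(self, count):
        """Like read_chars, but nothing is queued or returned; gives the count skipped."""
        skipped = 0
        while skipped < count and self._buffer:
            self._buffer.popleft()
            skipped += 1
        if skipped < count:
            skipped += len(self._stream.read(count - skipped))
        return skipped
```

For example, a tokenizer could peek_chars(2) to test for "<!" and only then
read_chars(2), rather than reading and pushing the chars back.
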
This could be used for encoding detection too (especially for
non-seekable streams): use the internal buffer (queue) for bytes read
from rawStream when detecting the encoding; then use it for chars read
from the decoded stream when parsing.
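
Roughly, and with BOM sniffing standing in for the real detection algorithm:
the bytes read for detection are kept and fed to the decoder first, so the
raw stream is never rewound. open_decoded, its parameters, and the
windows-1252 fallback are assumptions of this sketch.

```python
import codecs
import io

BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def open_decoded(raw_stream, default="windows-1252", sniff=512):
    """Return (encoding, iterator of text chunks) without seeking raw_stream."""
    head = raw_stream.read(sniff)  # bytes consumed for detection, kept in memory
    encoding = default
    for bom, name in BOMS:
        if head.startswith(bom):
            encoding, head = name, head[len(bom):]
            break
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")

    def chunks():
        data = head  # replay the sniffed bytes before reading any further
        while True:
            text = decoder.decode(data)
            if text:
                yield text
            data = raw_stream.read(sniff)
            if not data:
                break
        tail = decoder.decode(b"", final=True)  # flush incomplete trailing bytes
        if tail:
            yield tail

    return encoding, chunks()
```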

Finally, HTMLInputStream.reset() should call rawStream.seek(0), which
means it would be only available if the stream is seekable (if you
want to reparse a non-seekable stream, it's up to you to buffer it).
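
As a sketch of that behaviour (ToyInputStream is a hypothetical stand-in for
HTMLInputStream, and raising io.UnsupportedOperation is my choice, not
something the current code does):

```python
import io


class ToyInputStream:
    """Wraps a raw byte stream; reset() is only supported on seekable streams."""

    def __init__(self, raw_stream):
        self.raw_stream = raw_stream

    def reset(self):
        if not self.raw_stream.seekable():
            # Callers who want to reparse a pipe or socket must buffer it
            # themselves, e.g. by copying it into an io.BytesIO first.
            raise io.UnsupportedOperation("reset() requires a seekable stream")
        self.raw_stream.seek(0)
```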

Any thoughts?

-- 
Thomas Broyer
