Showing content from https://www.mail-archive.com/html5lib-discuss@googlegroups.com/msg00539.html below:

HTMLSanitizer can't be used as a tokenizer

Comment #1 on issue 183 by akvadr...@gmail.com: HTMLSanitizer can't be used as a tokenizer

http://code.google.com/p/html5lib/issues/detail?id=183

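This is a workaround with a slightly safer design. The tokenizer subclass holds an HTMLSanitizerMixin instance instead of inheriting from it, so there is no need to mix the class in or to hardcode the __init__ arguments: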

from html5lib import HTMLParser
from html5lib.tokenizer import HTMLTokenizer
from html5lib.sanitizer import HTMLSanitizerMixin

class Sanitizer(HTMLTokenizer):
    def __init__(self, *a, **kw):
        # Pass every argument through, so the tokenizer signature is not hardcoded.
        HTMLTokenizer.__init__(self, *a, **kw)
        # Hold the mixin as a helper object instead of inheriting from it.
        self._saner = HTMLSanitizerMixin()

    def __iter__(self):
        for token in HTMLTokenizer.__iter__(self):
            saner = self._saner.sanitize_token(token)
            # Skip tokens for which the sanitizer returns a falsy value.
            if saner:
                yield saner

# The parser receives the tokenizer class, not an instance.
PARSER = HTMLParser(tokenizer=Sanitizer)

def sanitize(html):
    return PARSER.parseFragment(html).toxml()
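
The same wrapping pattern can be tried without html5lib installed. The sketch below is only an illustration under stated assumptions: ToyTokenizer, ToyFilter and FilteringTokenizer are hypothetical stand-ins invented for this example, not html5lib classes. It shows a subclass that forwards its constructor arguments untouched and filters tokens through a helper object it owns.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer: yields one dict token per word."""

    def __init__(self, text, lowercase=False):
        self.text = text
        self.lowercase = lowercase

    def __iter__(self):
        for word in self.text.split():
            if self.lowercase:
                word = word.lower()
            yield {"type": "Characters", "data": word}


class ToyFilter:
    """Hypothetical stand-in for the sanitizer: returns the token or None."""

    blocked = {"script"}

    def sanitize_token(self, token):
        if token["data"] in self.blocked:
            return None  # signal that the token should be dropped
        return token


class FilteringTokenizer(ToyTokenizer):
    def __init__(self, *args, **kwargs):
        # Forward everything unchanged; the parent signature is never repeated here.
        super().__init__(*args, **kwargs)
        # Composition: the filter is an attribute, not a base class.
        self._filter = ToyFilter()

    def __iter__(self):
        for token in super().__iter__():
            cleaned = self._filter.sanitize_token(token)
            if cleaned:
                yield cleaned


# The keyword argument reaches ToyTokenizer without FilteringTokenizer naming it.
tokens = [t["data"] for t in FilteringTokenizer("Hello SCRIPT world", lowercase=True)]
print(tokens)
```

Because the subclass never spells out the parent's parameters, a change to the parent's constructor does not require touching the wrapper, which is the point of the *a, **kw forwarding in the snippet above.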


