Hi. Yesterday I asked about being able to strip HTML tags instead of merely escaping them. Somebody mentioned that it was an oft-requested feature and that a patch would be great.
I have attached the patch for both the 0.10 branch (which I currently use) and trunk, for the Python implementation only. Basically, I added a class variable to HTMLSanitizer called strip_tokens, which is set to False for the normal behaviour (i.e. escaping tags). An example use would look something like the following:

    import html5lib
    from html5lib import treebuilders, treewalkers, serializer, sanitizer

    def sanitizer_factory(*args, **kwargs):
        san = sanitizer.HTMLSanitizer(*args, **kwargs)
        # switch the sanitizer from escaping disallowed tags to dropping them
        san.strip_tokens = True
        return san

    def sanitize(buf):
        buf = buf.strip()
        p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
                                tokenizer=sanitizer_factory)
        dom_tree = p.parseFragment(buf)
        walker = treewalkers.getTreeWalker("dom")
        stream = walker(dom_tree)
        s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False,
                                                     quote_attr_values=True)
        return s.render(stream)

There appeared to be an alternate way to implement this too, via the html5parser.HTMLParser parse(...) and parseFragment(...) methods. Both call the _parse method, which initializes the tokenizer and passes **kwargs on to the tokenizer constructor, so parse(...) and parseFragment(...) would need a refactor to accept **kwargs and pass them on (a rough sketch of what that might have looked like follows at the end of this message). I felt that addition would be semantically confusing: at the call site it is not clear what the kwargs are for or what they mean. Keeping the signatures as they are seemed cleaner and simpler. Moreover, had I taken that route, then in the interest of completeness I would also have had to give filters.sanitizer.Filter a strip_tokens class member, and possibly modify filters._base.Filter to accept **kwargs in its constructor for a more generalized solution; that seemed a more complex approach with not much to gain.

At this time I have not added any tests for this, but will do so as soon as I understand the tests :)

Thanks.
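For reference, here is a rough sketch of what that rejected refactor might have looked like. This is hypothetical code based on the description above, not the actual html5lib source; the exact signatures and the tokenizer_class attribute are illustrative assumptions:

    # Hypothetical sketch only -- not actual html5lib code.
    class HTMLParser(object):
        def parse(self, stream, encoding=None, **kwargs):
            # Anything like strip_tokens=True would travel through here,
            # with nothing at the call site saying what it means.
            return self._parse(stream, innerHTML=False, encoding=encoding,
                               **kwargs)

        def parseFragment(self, stream, container="div", encoding=None,
                          **kwargs):
            return self._parse(stream, innerHTML=True, container=container,
                               encoding=encoding, **kwargs)

        def _parse(self, stream, innerHTML=False, container="div",
                   encoding=None, **kwargs):
            # _parse already forwards leftover kwargs to the tokenizer
            # constructor; only parse/parseFragment need to grow **kwargs.
            self.tokenizer = self.tokenizer_class(stream, encoding=encoding,
                                                  **kwargs)
            # ... tree construction elided ...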
Patch for branch 0.10:

Index: sanitizer.py
===================================================================
--- sanitizer.py        (revision 36654)
+++ sanitizer.py        (working copy)
@@ -130,7 +130,7 @@
     #  => &lt;script> do_nasty_stuff() &lt;/script>
     # sanitize_html('<a href="javascript: sucker();">Click here for $100</a>')
     #  => <a>Click here for $100</a>
-    def sanitize_token(self, token):
+    def sanitize_token(self, token, strip_tokens=False):
         if token["type"] in ["StartTag", "EndTag", "EmptyTag"]:
             if token["name"] in self.allowed_elements:
                 if token.has_key("data"):
@@ -145,6 +145,8 @@
                     token["data"] = [[name,val] for name,val in attrs.items()]
                 return token
             else:
+                if strip_tokens:
+                    return None
                 if token["type"] == "EndTag":
                     token["data"] = "</%s>" % token["name"]
                 elif token["data"]:
@@ -188,6 +190,10 @@
         return ' '.join(clean)

 class HTMLSanitizer(HTMLTokenizer, HTMLSanitizerMixin):
+    # strip tokens instead of escaping them, if you set it at the class
+    # level make sure to unset it before using it to simply escape tokens
+    strip_tokens = False
+
     def __init__(self, stream, encoding=None, parseMeta=True,
                  lowercaseElementName=False, lowercaseAttrName=False):
         #Change case matching defaults as we only output lowercase html anyway
@@ -197,6 +203,6 @@

     def __iter__(self):
         for token in HTMLTokenizer.__iter__(self):
-            token = self.sanitize_token(token)
+            token = self.sanitize_token(token, self.strip_tokens)
             if token:
                 yield token

Patch for trunk (the only difference is the extra useChardet argument in
the HTMLSanitizer initializer):

Index: sanitizer.py
===================================================================
--- sanitizer.py        (revision 1093)
+++ sanitizer.py        (working copy)
@@ -130,7 +130,7 @@
     #  => &lt;script> do_nasty_stuff() &lt;/script>
     # sanitize_html('<a href="javascript: sucker();">Click here for $100</a>')
     #  => <a>Click here for $100</a>
-    def sanitize_token(self, token):
+    def sanitize_token(self, token, strip_tokens=False):
         if token["type"] in ["StartTag", "EndTag", "EmptyTag"]:
             if token["name"] in self.allowed_elements:
                 if token.has_key("data"):
@@ -145,6 +145,8 @@
                     token["data"] = [[name,val] for name,val in attrs.items()]
                 return token
             else:
+                if strip_tokens:
+                    return None
                 if token["type"] == "EndTag":
                     token["data"] = "</%s>" % token["name"]
                 elif token["data"]:
@@ -188,6 +190,10 @@
         return ' '.join(clean)

 class HTMLSanitizer(HTMLTokenizer, HTMLSanitizerMixin):
+    # strip tokens instead of escaping them, if you set it at the class
+    # level make sure to unset it before using it to simply escape tokens
+    strip_tokens = False
+
     def __init__(self, stream, encoding=None, parseMeta=True, useChardet=True,
                  lowercaseElementName=False, lowercaseAttrName=False):
         #Change case matching defaults as we only output lowercase html anyway
@@ -197,6 +203,6 @@

     def __iter__(self):
         for token in HTMLTokenizer.__iter__(self):
-            token = self.sanitize_token(token)
+            token = self.sanitize_token(token, self.strip_tokens)
             if token:
                 yield token
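For concreteness, here is a hypothetical before/after using the sanitize() helper defined above. The escaped form follows the pattern in the sanitize_token docstring; the exact quoting and whitespace produced by the serializer may differ slightly:

    html = '<b>bold</b><script>do_nasty_stuff()</script>'

    # Default behaviour (strip_tokens left False): the disallowed tags
    # are escaped, roughly:
    #   <b>bold</b>&lt;script>do_nasty_stuff()&lt;/script>
    #
    # With the patch and strip_tokens = True: the disallowed start and
    # end tags are dropped outright (note that the character tokens
    # between them still pass through):
    #   <b>bold</b>do_nasty_stuff()
    print sanitize(html)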