Showing content from https://www.mail-archive.com/html5lib-discuss@googlegroups.com/msg00158.html below:

Patch for optional stripping tags during sanitization

Hi.

Yesterday I asked about being able to strip HTML tags instead of merely
escaping them. Somebody mentioned that it was an oft-requested feature
and that a patch would be welcome.
I have attached patches for both the 0.10 branch (which I currently
use) and trunk, for the Python implementation only. Basically, I added
a class variable to HTMLSanitizer called strip_tokens, which is set to
False for the normal behaviour (i.e. escaping tags). An example use
would look something like the following:

import html5lib
from html5lib import sanitizer, serializer, treebuilders, treewalkers

def sanitizer_factory(*args, **kwargs):
    san = sanitizer.HTMLSanitizer(*args, **kwargs)
    san.strip_tokens = True
    return san

def sanitize(buf):
    buf = buf.strip()

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
                            tokenizer=sanitizer_factory)
    dom_tree = p.parseFragment(buf)

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False,
                                                 quote_attr_values=True)
    return s.render(stream)
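To illustrate the intended difference in behaviour, here is a minimal, self-contained sketch (not html5lib's actual code; the token shape and whitelist are simplified) of how escaping versus stripping a disallowed tag token plays out:

```python
# Hypothetical, simplified stand-in for HTMLSanitizerMixin.sanitize_token:
# it only distinguishes allowed tags, disallowed tags, and text.
from xml.sax.saxutils import escape

ALLOWED = {"a", "b", "em", "strong"}  # toy whitelist, not html5lib's

def sanitize_token(token, strip_tokens=False):
    kind, name, text = token
    if kind in ("StartTag", "EndTag"):
        if name in ALLOWED:
            return token
        if strip_tokens:
            return None  # drop the disallowed tag token entirely
        # otherwise turn the tag into escaped character data
        slash = "/" if kind == "EndTag" else ""
        return ("Characters", None, escape("<%s%s>" % (slash, name)))
    return token

def render(tokens, strip_tokens=False):
    out = []
    for tok in tokens:
        tok = sanitize_token(tok, strip_tokens)
        if tok is None:
            continue
        kind, name, text = tok
        if kind == "StartTag":
            out.append("<%s>" % name)
        elif kind == "EndTag":
            out.append("</%s>" % name)
        else:
            out.append(text)
    return "".join(out)

tokens = [("StartTag", "script", None),
          ("Characters", None, "evil()"),
          ("EndTag", "script", None)]

print(render(tokens))                     # &lt;script&gt;evil()&lt;/script&gt;
print(render(tokens, strip_tokens=True))  # evil()
```

Note that, as in the patch, only the disallowed tag tokens are dropped; the character data between them still comes through.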

There appeared to be an alternate way to implement this too: via the
html5parser.HTMLParser parse(...) and parseFragment(...) methods. Both
methods call the _parse method, which initializes the tokenizer and
passes **kwargs to the tokenizer constructor, so parse(...) and
parseFragment(...) would need a refactor to accept and pass on
**kwargs. I felt that addition would be semantically confusing (what
would the kwargs be for, and what would they mean?), and that keeping
the signatures as they are would be cleaner and simpler.
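For reference, the rejected alternative would have looked roughly like this. This is only a sketch with hypothetical, simplified class and method names, not html5lib's actual code:

```python
# Sketch of the rejected **kwargs-forwarding refactor (hypothetical names).
class Tokenizer(object):
    def __init__(self, stream, strip_tokens=False):
        self.stream = stream
        self.strip_tokens = strip_tokens

class Parser(object):
    tokenizer_class = Tokenizer

    def _parse(self, stream, **kwargs):
        # _parse forwards **kwargs straight into the tokenizer constructor,
        # so parse()/parseFragment() would each have to grow **kwargs too.
        self.tokenizer = self.tokenizer_class(stream, **kwargs)
        return self.tokenizer

    def parse(self, stream, **kwargs):          # refactored signature
        return self._parse(stream, **kwargs)

    def parseFragment(self, stream, **kwargs):  # refactored signature
        return self._parse(stream, **kwargs)

p = Parser()
tok = p.parseFragment("<b>hi</b>", strip_tokens=True)
print(tok.strip_tokens)  # True
```

The objection above is visible here: nothing in the parse(...) signature says which keyword arguments are meaningful, since they are just passed through opaquely.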

Moreover, if I had taken the above-mentioned route then, in the
interest of completeness, I would also have had to modify
filters.sanitizer.Filter to have a strip_tokens class member, and
possibly modify filters._base.Filter to accept **kwargs in its
constructor for a more generalized solution. That seemed a more
complex approach with not much to gain. At this time I have not added
any tests for this, but will do so as soon as I can understand the
tests :)
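For completeness, the filter-based variant mentioned above might look something like this. Again, this is only a hypothetical sketch; the real filters.sanitizer.Filter interface differs:

```python
# Hypothetical token-stream filter in the spirit of filters.sanitizer.Filter:
# it wraps any token iterator and drops disallowed tag tokens.
class StripFilter(object):
    strip_tokens = False

    def __init__(self, source, **kwargs):
        # a generalized filters._base.Filter might accept **kwargs like this
        self.source = source
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __iter__(self):
        for token in self.source:
            if (self.strip_tokens
                    and token.get("type") in ("StartTag", "EndTag", "EmptyTag")
                    and token.get("name") not in ("a", "b")):  # toy whitelist
                continue
            yield token

tokens = [{"type": "StartTag", "name": "script"},
          {"type": "Characters", "data": "evil()"},
          {"type": "EndTag", "name": "script"}]

print(list(StripFilter(tokens, strip_tokens=True)))
# only the Characters token survives
```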

Thanks.
AM

Patch for branch 0.10

Index: sanitizer.py
===================================================================
--- sanitizer.py        (revision 36654)
+++ sanitizer.py        (working copy)
@@ -130,7 +130,7 @@
     #    => <script> do_nasty_stuff() </script>
    #   sanitize_html('<a href="javascript: sucker();">Click here for $100</a>')
     #    => <a>Click here for $100</a>
-    def sanitize_token(self, token):
+    def sanitize_token(self, token, strip_tokens=False):
         if token["type"] in ["StartTag", "EndTag", "EmptyTag"]:
             if token["name"] in self.allowed_elements:
                 if token.has_key("data"):
@@ -145,6 +145,8 @@
                     token["data"] = [[name,val] for name,val in attrs.items()]
                 return token
             else:
+                if strip_tokens:
+                    return None
                 if token["type"] == "EndTag":
                     token["data"] = "</%s>" % token["name"]
                 elif token["data"]:
@@ -188,6 +190,10 @@
         return ' '.join(clean)

 class HTMLSanitizer(HTMLTokenizer, HTMLSanitizerMixin):
+    # strip tokens instead of escaping them; if you set it at the
+    # class level, make sure to unset it before using it to simply
+    # escape tokens
+    strip_tokens = False
+
     def __init__(self, stream, encoding=None, parseMeta=True,
                  lowercaseElementName=False, lowercaseAttrName=False):
         #Change case matching defaults as we only output lowercase html anyway
@@ -197,6 +203,6 @@

     def __iter__(self):
         for token in HTMLTokenizer.__iter__(self):
-            token = self.sanitize_token(token)
+            token = self.sanitize_token(token, self.strip_tokens)
             if token:
                 yield token


Patch for trunk (the only difference is the extra useChardet argument
to the sanitizer's __init__)

Index: sanitizer.py
===================================================================
--- sanitizer.py        (revision 1093)
+++ sanitizer.py        (working copy)
@@ -130,7 +130,7 @@
    #    => <script> do_nasty_stuff() </script>
    #   sanitize_html('<a href="javascript: sucker();">Click here for $100</a>')
     #    => <a>Click here for $100</a>
-    def sanitize_token(self, token):
+    def sanitize_token(self, token, strip_tokens=False):
         if token["type"] in ["StartTag", "EndTag", "EmptyTag"]:
             if token["name"] in self.allowed_elements:
                 if token.has_key("data"):
@@ -145,6 +145,8 @@
                     token["data"] = [[name,val] for name,val in attrs.items()]
                 return token
             else:
+                if strip_tokens:
+                    return None
                 if token["type"] == "EndTag":
                     token["data"] = "</%s>" % token["name"]
                 elif token["data"]:
@@ -188,6 +190,10 @@
         return ' '.join(clean)

 class HTMLSanitizer(HTMLTokenizer, HTMLSanitizerMixin):
+    # strip tokens instead of escaping them; if you set it at the
+    # class level, make sure to unset it before using it to simply
+    # escape tokens
+    strip_tokens = False
+
     def __init__(self, stream, encoding=None, parseMeta=True, useChardet=True,
                  lowercaseElementName=False, lowercaseAttrName=False):
         #Change case matching defaults as we only output lowercase html anyway
@@ -197,6 +203,6 @@

     def __iter__(self):
         for token in HTMLTokenizer.__iter__(self):
-            token = self.sanitize_token(token)
+            token = self.sanitize_token(token, self.strip_tokens)
             if token:
                 yield token




