Comment #1 on issue 113 by eromirou: cannot handle mailformed attribute names with html5lib and lxml http://code.google.com/p/html5lib/issues/detail?id=113
I found out that using 'sanitizer.HTMLSanitizer' as the tokenizer works fine: import html5lib from html5lib import treebuilders from html5lib import sanitizer html_code = "<a 123=456></a>" html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder('lxml'), tokenizer=sanitizer.HTMLSanitizer) print(html_parser.parse(html_code)) I don't know if this is the right way to do it, but it saved me. Cheers -- You received this message because you are listed in the owner or CC fields of this issue, or because you starred this issue. You may adjust your issue notification preferences at: http://code.google.com/hosting/settings --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "html5lib-discuss" group. To post to this group, send email to html5lib-discuss@googlegroups.com To unsubscribe from this group, send email to html5lib-discuss+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/html5lib-discuss?hl=en-GB -~----------~----~----~----~------~----~------~--~---
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4