I've submitted two patches for HTMLParser.py and test_htmlparser.py. They fix two problems I hit while lexing some HTML pages found in the wild.

1. Allow "," in attributes.
   A page had the attribute "color=rgb(1,2,3)", and the parser choked on the ",". I added "," to the list of allowed characters.

2. More robust <SCRIPT> processing.
   The eBay homepage has unprotected JavaScript, including the line 'vb += "</SCR"+"IPT>"'. The parser choked on that line. I modified the source to use a more robust regex for script and style end tags. A side effect is that any "<!--" .. "-->" within a script/style will be parsed as a comment. If that behavior is incorrect, the regex can be modified.
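To illustrate both problems, here is a minimal sketch using the standard re module. The patterns and helper names below (VALUE_STRICT, VALUE_WITH_COMMA, NAIVE_END, ROBUST_END, matches_whole_value, find_end) are assumptions made up for this example; they are not the regexes in HTMLParser.py or in the patches, and the example leaves out the comment handling mentioned above.

```python
import re

# Hypothetical character classes for an unquoted attribute value.
# These are illustrative only, not the patterns used by HTMLParser.py.
VALUE_STRICT = re.compile(r"[-a-zA-Z0-9./:;+*%?!&$()_#=~]*")
VALUE_WITH_COMMA = re.compile(r"[-a-zA-Z0-9./,:;+*%?!&$()_#=~]*")


def matches_whole_value(pattern, value):
    """Return True if the pattern accepts the entire attribute value."""
    return pattern.fullmatch(value) is not None


# Hypothetical ways to look for the end of script/style content.
# NAIVE_END stops at any "</"; ROBUST_END only at a real end tag.
NAIVE_END = re.compile(r"</")
ROBUST_END = re.compile(r"</(script|style)\s*>", re.IGNORECASE)


def find_end(pattern, text):
    """Return the offset of the first match, or -1 if there is none."""
    match = pattern.search(text)
    return match.start() if match else -1


# The line from the report, followed by the genuine end tag.
script_body = 'vb += "</SCR"+"IPT>";\n</SCRIPT>'

if __name__ == "__main__":
    print(matches_whole_value(VALUE_STRICT, "rgb(1,2,3)"))
    print(matches_whole_value(VALUE_WITH_COMMA, "rgb(1,2,3)"))
    print(find_end(NAIVE_END, script_body))
    print(find_end(ROBUST_END, script_body))
```

In this sketch the naive pattern stops inside the string literal, while the stricter pattern skips it and stops at the genuine </SCRIPT> tag.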