Hi all, I'm using BeautifulSoup to parsing an HTML page and find it refused to parse the page. By looking at the backtrace, I found it is a problem with the python built-in HTMLParser.py. In fact, the web page I'm parsing is with some Chinese characters. there is a tag like <img src=/foo/bar.png alt=中文> , note this is legacy html page where the attributes are not quoted. However, the regexp defined in HTMLParser.py is : attrfind = re.compile( r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*' r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?') Note that the Chinese character (also any other non-english characters), so it fire an error parsing this. I'm not sure whether the HTML standard allow un-quoted non-ASCII characters in the attributes. If it allows, this seems to be a bug. and the regexp to better be [^>\s] IMHO. BTW: It seems something like : <script> var st = "<a></"; </script> can not be parsed. :-/ -- pluskid http://blog.pluskid.org
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4