Hello Zhang Chiyuan,

Can you file a bug on the Python issue tracker please:

http://bugs.python.org

Thanks

Michael Foord

Zhang Chiyuan wrote:
> Hi all,
>
> I'm using BeautifulSoup to parse an HTML page and found that it refuses
> to parse the page. Looking at the backtrace, I found that the problem is
> in Python's built-in HTMLParser.py. The web page I'm parsing contains
> some Chinese characters. There is a tag like
> <img src=/foo/bar.png alt=中文>; note that this is a legacy HTML page
> where the attributes are not quoted. However, the regexp defined in
> HTMLParser.py is:
>
> attrfind = re.compile(
>     r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
>     r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
>
> Note that the character class for unquoted values does not include
> Chinese characters (or any other non-English characters), so the parser
> raises an error on this tag. I'm not sure whether the HTML standard
> allows unquoted non-ASCII characters in attributes. If it does, this
> seems to be a bug, and the regexp for unquoted values had better be
> [^>\s] IMHO.
>
> BTW: It seems something like:
>
> <script>
> var st = "<a></";
> </script>
>
> cannot be parsed. :-/
>
> --
> pluskid
> http://blog.pluskid.org
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk
>

--
http://www.ironpythoninaction.com/
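
For anyone filing the report, the behaviour described in the quoted message can probably be reproduced in isolation with a short script like the one below. It is only a sketch: the first pattern is copied from the message, while `attrfind_relaxed`, `parse_attrs` and `SAMPLE` are hypothetical names introduced here for illustration, and the loop is a simplified stand-in for the parser's attribute scan, not the actual HTMLParser.py code.

```python
import re

# Pattern as quoted in the message above.
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical variant following the [^>\s] suggestion: an unquoted value is
# any run of characters other than whitespace and '>'.
attrfind_relaxed = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

# Attribute portion of the example tag; the escapes spell the alt text 中文.
SAMPLE = ' src=/foo/bar.png alt=\u4e2d\u6587'


def parse_attrs(pattern, text):
    """Match attributes one after another from the start of `text`.

    Returns the (name, value) pairs found and whatever text was left
    unconsumed when the pattern stopped matching.
    """
    attrs = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            break
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs, text[pos:]


if __name__ == '__main__':
    print(parse_attrs(attrfind, SAMPLE))
    print(parse_attrs(attrfind_relaxed, SAMPLE))
```

With the quoted pattern, the `alt` value comes back empty and the two Chinese characters are left unconsumed, which is presumably the point where the parser gives up. The relaxed variant consumes the whole attribute string.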