> This has nothing to do with Python. UTF-8 marks the codes
> from 128-191 as illegal prefix.
[...]
> Perhaps the parser should catch the UnicodeError and
> instead return a not-wellformed exception ?!

Right on both counts. If no encoding is specified, and if the document
appears not to be UTF-16 in any endianness, an XML processor shall
assume it is UTF-8. As Marc-Andre explains, your document is not
proper UTF-8, hence the error.

The confusing thing is that expat itself does not care about it not
being UTF-8; that is only detected when the callback is invoked in
pyexpat, and conversion to a Unicode object is therefore attempted.

The right solution would probably be to change expat so that it checks
the correctness of the encoding for each string it gets as part of the
wellformedness analysis, and produces illformedness exceptions when an
encoding error occurs. Patches are welcome, although they should
probably go to sourceforge.net/projects/expat.

Regards,
Martin
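The suggestion quoted above (catch the UnicodeError and report a not-wellformed error instead) can be sketched in plain Python. This is only an illustration of the idea, not pyexpat's or expat's actual code: NotWellFormedError and decode_xml_text are hypothetical names invented for the example, and the only library call used is bytes.decode.

```python
class NotWellFormedError(Exception):
    """Hypothetical stand-in for a parser's not-well-formed exception."""


def decode_xml_text(data: bytes) -> str:
    """Decode data as UTF-8, reporting failures as a well-formedness error.

    Hypothetical helper: it mirrors the conversion to a Unicode object
    described above, but turns the UnicodeError into a parser-style error.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte.
        raise NotWellFormedError(
            f"invalid UTF-8 at byte offset {exc.start}: {exc.reason}"
        ) from exc


# 0x80 lies in the 128-191 range: it is a continuation byte and
# cannot start a UTF-8 sequence.
stray_continuation = b"<doc>\x80</doc>"

# 0xE9 is "e" with an acute accent in Latin-1, but in UTF-8 it opens a
# three-byte sequence, so the "<" that follows is not a valid continuation.
latin1_doc = b"<doc>caf\xe9</doc>"

for sample in (stray_continuation, latin1_doc):
    try:
        decode_xml_text(sample)
    except NotWellFormedError as err:
        print(err)
```

Both samples are rejected, while the same text encoded as real UTF-8 decodes without complaint. Chaining with "from exc" keeps the original UnicodeDecodeError available for anyone who wants the low-level details.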