> > This has nothing to do with Python. UTF-8 marks the codes
> > from 128-191 as illegal prefix.
> [...]
> > Perhaps the parser should catch the UnicodeError and
> > instead return a not-wellformed exception ?!
>
> Right on both accounts. If no encoding is specified, and if the
> document appears not to be UTF-16 in any endianness, an XML processor
> shall assume it is UTF-8. As Marc-Andre explains, your document is not
> proper UTF-8, hence the error.
>
> The confusing thing is that expat itself does not care about it not
> being UTF-8; that is only detected when the callback is invoked in
> pyexpat, and therefore conversion to a Unicode object is attempted.

Pyexpat violates the XML spec here. XML parsers are not allowed to
"recover" from well-formedness errors, and I would classify blithely
reporting the character data as "recovery".

However, I'm amazed that this hasn't come up before, considering the
pedigree of expat. I'll poke around, and raise a bug on the expat site
if need be.

-- 
Uche Ogbuji, Principal Consultant
uche.ogbuji@fourthought.com    +1 303 583 9900 x 101
Fourthought, Inc.              http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python
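
For illustration, here is a minimal sketch of the behaviour suggested in the quoted text: turn the UnicodeError raised during conversion into a not-well-formed error instead of letting it escape as a codec error. The names NotWellFormedError, decode_character_data and is_continuation_byte are hypothetical and are not part of pyexpat or expat; the sketch only assumes that character data arrives as raw bytes and that the document is supposed to be UTF-8.

```python
class NotWellFormedError(ValueError):
    """Hypothetical error type for this sketch; pyexpat does not define it."""


def is_continuation_byte(b: int) -> bool:
    # Bytes 128-191 (0x80-0xBF) are UTF-8 continuation bytes, so they
    # may never appear as the first byte of an encoded character.
    return 0x80 <= b <= 0xBF


def decode_character_data(raw: bytes) -> str:
    # Assumption: no encoding was declared, so the data must be UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Report the problem as a well-formedness error rather than
        # leaking the codec error to the caller.
        raise NotWellFormedError(
            f"invalid UTF-8 at byte offset {exc.start}"
        ) from exc
```

With this sketch, decode_character_data(b"\xa9") raises NotWellFormedError, because 0xA9 (the copyright sign in Latin-1) falls in the 128-191 range and cannot start a UTF-8 sequence.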