> "M.-A. Lemburg" wrote:
> ...
> > > The codes from 192 to 236, 238-243 produce
> > > "UTF-8 decoding error: invalid data",
> > > the rest gives "not well-formed".
> > >
> > > I would like to know if this happens with your (Tim) modified
> > > version as well. I'm using plain vanilla BeOpen Python 2.0 .
> >
> > This has nothing to do with Python. UTF-8 marks the codes
> > from 128-191 as illegal prefix. See Object/unicodeobject.c:
> ...
>
> A pity.
>
> > Perhaps the parser should catch the UnicodeError and
> > instead return a not-wellformed exception ?!
>
> I believe it would be better.

Yes, and given that there is not much time before the 2.1 release, doing so
is an acceptable stop-gap.

However, I think the real fix has to lie in expat. I just had a *very*
quick and dirty perusal of expat 1.2 and 1.95.1, and not only do the UTF-8
validity checks (at the top of xmltok.c) seem wrong, but it doesn't look as
if they're ever invoked.

I'll try to find some time to look into this more closely, or perhaps
someone will straighten me out if I'm on the wrong trail.

-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python
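
To make the two points quoted above concrete, here is a rough Python sketch. It is not the code in unicodeobject.c or expat. It classifies a byte by its role in UTF-8 using the current definition (RFC 3629, which is stricter than the UTF-8 of the Python 2.0 era) and shows the stop-gap of turning a UnicodeError into a not-well-formed error. The names classify_utf8_byte, NotWellFormedError and decode_or_not_wellformed are made up for illustration.

```python
class NotWellFormedError(Exception):
    """Hypothetical stand-in for a parser's 'not well-formed' error."""


def classify_utf8_byte(b):
    """Classify a byte value (0-255) by its role in UTF-8 per RFC 3629."""
    if b < 0x80:
        return "ascii"          # 0-127: a complete one-byte character
    if b < 0xC0:
        return "continuation"   # 128-191: may only follow a lead byte
    if 0xC2 <= b <= 0xDF:
        return "lead-2"         # starts a two-byte sequence
    if 0xE0 <= b <= 0xEF:
        return "lead-3"         # starts a three-byte sequence
    if 0xF0 <= b <= 0xF4:
        return "lead-4"         # starts a four-byte sequence
    return "invalid"            # 0xC0, 0xC1, 0xF5-0xFF never appear


def decode_or_not_wellformed(data):
    """Decode bytes as UTF-8, reporting bad input as 'not well-formed'."""
    try:
        return data.decode("utf-8")
    except UnicodeError as exc:
        # The stop-gap discussed above: hide the codec error behind
        # the parser's own error type.
        raise NotWellFormedError("not well-formed (invalid UTF-8): %s" % exc) from exc
```

With this sketch every value from 128 to 191 comes back as "continuation", which is why such a byte cannot begin a character, and input starting with one is reported through the single parser-level error type instead of a bare UnicodeError.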