Martin v. Löwis wrote:

>> ci = codecs.lookup("xml-auto-detect")
>> p = expat.ParserCreate()
>> e = "utf-32"
>> s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
>> s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
>> p.Parse(s, True)
>
> So how come the document being parsed is recognized as UTF-8?

Because you can force the encoder to use a specified encoding. If you do
this and the unicode string starts with an XML declaration, the encoder
will put the specified encoding into the declaration:

import codecs
e = codecs.getencoder("xml-auto-detect")
print e(u"<?xml version='1.0' encoding='iso-8859-1'?><foo/>", encoding="utf-8")[0]

This prints:

<?xml version='1.0' encoding='utf-8'?><foo/>

>> OK, so should I put the C code into a _xml module?
>
> I don't see the need for C code at all.

Doing the bit fiddling for Modules/_codecsmodule.c::detect_xml_encoding_str()
in C felt like the right thing to do.

Servus,
   Walter
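[Editor's note: for readers following along, here is a minimal sketch of the
round trip under discussion, in one place. It assumes the proposed
"xml-auto-detect" codec from Walter's patch is registered (it is not part of
the standard library) and simply mirrors the snippet quoted above: decode()
detects the source encoding from the BOM/declaration, and encode() with an
explicit encoding rewrites the declaration, which is why expat ends up seeing
a self-describing UTF-8 document.]

# Minimal sketch of the round trip discussed above (Python 2 syntax).
# Assumption: the proposed "xml-auto-detect" codec from the patch is
# registered; it is not part of the standard Python library.
import codecs
import xml.parsers.expat as expat

ci = codecs.lookup("xml-auto-detect")

# A UTF-32 document whose declaration claims UTF-32.
src = u"<?xml version='1.0' encoding='utf-32'?><foo/>".encode("utf-32")

# decode() auto-detects UTF-32 from the BOM/declaration; encode() with an
# explicit encoding rewrites the declaration, so the bytes handed to expat
# describe themselves as UTF-8.
text = ci.decode(src)[0]
dst = ci.encode(text, encoding="utf-8")[0]

p = expat.ParserCreate()
p.Parse(dst, True)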