Playing around with xml.dom.minidom, I noticed that this beast is perfectly able to read HTML which it can't print: >>> import xml.dom.minidom as md >>> d=md.parseString("<foo>bߐ</foo>")) >>> d.writexml(sys.stdout) ... UnicodeError: ASCII encoding error: ordinal not in range(128) Ouch. Scanning the sources, which revealed various ways to replace '&' with '&' but no generic codec for [ht|x]ml-escaped character entities. Thus, my proposal (which I'm going to implement since I need it...) is to write such a codec. For simplicity, I propose to accept ü and € and friends, but to emit them as Ӓ (or whatever). After this codec is written, all occurrences of string.replace('&','&') (and vice versa) within the standard library can be replaced with the appropriate encode/decode methods. Thoughts? Or am I totally blind, such a codec already exists, and I have missed it? -- Matthias Urlichs | noris network AG | http://smurf.noris.de/
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4