On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
> > and the decimal character reference isn't really more useful than
> > the named entity reference.
>
> really? converting a decimal character reference to a unicode character
> is trivial, but how do you convert a named entity reference to a unicode
> character? (look it up in the htmlentitydefs?)
>
> here's a trivial piece of code that converts the entitydefs dictionary to
> a entity->unicode mapping:
>
> entitydefs_unicode = {}
> for entity, char in entitydefs.items():
>     if char[:2] == "&#":
>         char = unichr(int(char[2:-1]))
>     else:
>         char = unicode(char, "iso-8859-1")
>     entitydefs_unicode[entity] = char

Sure, it's trivial, but why should I be forced to do this conversion?

I'm sorry if I didn't explain myself so well. What I meant is not that
the entitydefs dictionary is useless, but that decimal character
references are not useful by themselves - they are just another
intermediate form. Why does the dictionary convert from "&alpha;" to
"&#945;" and not to the fully decoded form, which is the single unicode
character u'\u03b1'?

I can't think of a case where numeric references are really useful by
themselves and not as some intermediate form. Browsers understand
"&#945;" and "&alpha;" equally well. Humans find the named references
easier to understand. Processing programs can't understand "&#945;"
without first isolating the digits and converting them to a number.

About case sensitivity you're right - smashing case does lose some
information. If a parser needs to understand sloppy manually-generated
HTML with tags like "&GT;", it should be a little smarter than that.

	Oren
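To make the point above concrete, here is a minimal Python 2-era sketch
(not part of the original message) that extends Fredrik's mapping and
decodes both named and decimal references straight to unicode characters,
the fully decoded form Oren is asking for. The decode_references function
and the regular expression are illustrative assumptions, not an API
provided by htmlentitydefs:

    import re
    from htmlentitydefs import entitydefs

    # Build the entity -> unicode mapping, as in Fredrik's snippet above.
    entitydefs_unicode = {}
    for entity, char in entitydefs.items():
        if char[:2] == "&#":
            char = unichr(int(char[2:-1]))
        else:
            char = unicode(char, "iso-8859-1")
        entitydefs_unicode[entity] = char

    # Matches a decimal reference like "&#945;" or a named one like "&alpha;".
    _reference = re.compile(r"&(?:#(\d+)|(\w+));")

    def decode_references(text):
        def replace(match):
            digits, name = match.groups()
            if digits:
                # Numeric reference: isolate the digits, convert to a number.
                return unichr(int(digits))
            # Named reference: unknown names are passed through untouched.
            return entitydefs_unicode.get(name, match.group(0))
        return _reference.sub(replace, text)

    assert decode_references(u"&alpha;") == u"\u03b1"
    assert decode_references(u"&#945;") == u"\u03b1"

This skips the intermediate decimal form entirely. (In modern Python 3
the standard library ships the fully decoded mapping as
html.entities.html5, and html.unescape() performs the whole conversion.)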