On Mon, Aug 05, 2002 at 03:57:10PM +0200, Fredrik Lundh wrote:
> > and the decimal character reference isn't really more useful than
> > the named entity reference.
>
> really? converting a decimal character reference to a unicode character
> is trivial, but how do you convert a named entity reference to a unicode
> character? (look it up in the htmlentitydefs?)
>
> here's a trivial piece of code that converts the entitydefs dictionary to
> a entity->unicode mapping:
>
> entitydefs_unicode = {}
> for entity, char in entitydefs.items():
>     if char[:2] == "&#":
>         char = unichr(int(char[2:-1]))
>     else:
>         char = unicode(char, "iso-8859-1")
>     entitydefs_unicode[entity] = char

Sure, it's trivial, but why should I be forced to do this conversion?

I'm sorry if I didn't explain myself so well. What I meant is not that
the entitydefs dictionary is useless, but that decimal character
references are not useful by themselves - they are just another
intermediate form. Why does the dictionary convert from "&alpha;" to
"&#945;" and not to the fully decoded form, which is the single unicode
character u'\u03b1'?

I can't think of a case where numeric references are really useful by
themselves and not as some intermediate form. Browsers understand
"&#945;" and "&alpha;" equally well. Humans find the named references
easier to understand. Processing programs can't understand "&#945;"
without first isolating the digits and converting them to a number.

About case sensitivity you're right - smashing case does lose some
information. If a parser needs to understand sloppy manually-generated
HTML with tags like "&GT;", it should be a little smarter than that.

	Oren
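To make the point above concrete, here is a minimal Python 2-era sketch
(not part of the original message) that extends Fredrik's mapping and
decodes both named and decimal references straight to unicode characters,
the fully decoded form Oren is asking for. The decode_references function
and the regular expression are illustrative assumptions, not an API
provided by htmlentitydefs:

    import re
    from htmlentitydefs import entitydefs

    # Build the entity -> unicode mapping, as in Fredrik's snippet above.
    entitydefs_unicode = {}
    for entity, char in entitydefs.items():
        if char[:2] == "&#":
            char = unichr(int(char[2:-1]))
        else:
            char = unicode(char, "iso-8859-1")
        entitydefs_unicode[entity] = char

    # Matches a decimal reference like "&#945;" or a named one like "&alpha;".
    _reference = re.compile(r"&(?:#(\d+)|(\w+));")

    def decode_references(text):
        def replace(match):
            digits, name = match.groups()
            if digits:
                # Numeric reference: isolate the digits, convert to a number.
                return unichr(int(digits))
            # Named reference: unknown names are passed through untouched.
            return entitydefs_unicode.get(name, match.group(0))
        return _reference.sub(replace, text)

    assert decode_references(u"&alpha;") == u"\u03b1"
    assert decode_references(u"&#945;") == u"\u03b1"

This skips the intermediate decimal form entirely. (In modern Python 3
the standard library ships the fully decoded mapping as
html.entities.html5, and html.unescape() performs the whole conversion.)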