Jack Jansen <Jack.Jansen@oratrix.com> writes:

> Why is this hard work? I would guess that a simple table lookup would
> suffice, after all there are only a finite number of unicode
> characters that can be split up, and each one can be split up in only
> a small number of ways.

Canonical decomposition requires more than that: you not only need to
apply the canonical decomposition mapping, but also need to put the
resulting characters into canonical order (if more than one combining
character applies to a base character). In addition, a naïve
implementation will consume large amounts of memory. Hangul
decomposition is better done algorithmically, as we are talking about
11172 precombined characters for Hangul alone.

> Wouldn't something like
>
>     for c in input:
>         if not canbestartofcombiningsequence.has_key(c):
>             output.append(c)
>         nlookahead = MAXCHARSTOCOMBINE
>         while nlookahead > 1:
>             attempt = lookahead next nlookahead bytes from input
>             if combine.has_key(attempt):
>                 output.append(combine[attempt])
>                 skip the lookahead in input
>                 break
>         else:
>             output.append(c)
>
> do the trick, if the two dictionaries are initialized intelligently?

No, that doesn't do canonical ordering. There is a lot more to
normalization; the hard work is really in understanding what has to be
done.

Regards,
Martin
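
To make the canonical-ordering point concrete: after the decomposition
mapping has been applied, each run of combining marks has to be sorted
(stably) by canonical combining class, which is exactly what a plain
table lookup misses. A minimal sketch of just that reordering step,
assuming Python's unicodedata module (illustrative only, not the actual
normalization code):

    import unicodedata

    def canonical_order(s):
        # Stable-sort each run of combining marks (combining class != 0)
        # by canonical combining class, as required after decomposition.
        chars = list(s)
        i = 0
        while i < len(chars):
            if unicodedata.combining(chars[i]):
                j = i
                while j < len(chars) and unicodedata.combining(chars[j]):
                    j += 1
                chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
                i = j
            else:
                i += 1
        return "".join(chars)

    # COMBINING DOT BELOW (class 220) must precede COMBINING CIRCUMFLEX
    # ACCENT (class 230) in canonical order, whatever order the input used.
    assert canonical_order("o\u0302\u0323") == "o\u0323\u0302"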
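
And for the Hangul point: the Unicode Standard defines the decomposition
of the 11172 precomposed syllables arithmetically, so no table entries
are needed for them at all. A sketch of that arithmetic, using the
constants published in the standard (again, just an illustration):

    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
    N_COUNT = V_COUNT * T_COUNT        # 588
    S_COUNT = L_COUNT * N_COUNT        # 11172 precomposed syllables

    def decompose_hangul(ch):
        # Map one precomposed syllable to its leading/vowel/trailing jamo.
        s_index = ord(ch) - S_BASE
        if not 0 <= s_index < S_COUNT:
            return ch                  # not a precomposed Hangul syllable
        l = chr(L_BASE + s_index // N_COUNT)
        v = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
        t_index = s_index % T_COUNT
        return l + v + (chr(T_BASE + t_index) if t_index else "")

    # U+D55C HANGUL SYLLABLE HAN -> U+1112 U+1161 U+11AB
    assert decompose_hangul("\uD55C") == "\u1112\u1161\u11AB"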