On Wednesday, August 14, 2002, at 08:33, Martin v. Loewis wrote:

>> Do I misunderstand something, or is this a bug (limitation?) in the
>> unicode->latin-1 decoder?
>
> It's a limitation, in all codecs. Contributions of normalization code
> are welcome. Since this is hard work, this is unlikely to be fixed in
> Python 2.3 - unless somebody has a really good incentive for fixing
> it.

Why is this hard work? I would guess that a simple table lookup would
suffice; after all, there are only a finite number of Unicode characters
that can be split up, and each one can be split up in only a small
number of ways.

Wouldn't something like

    for c in input:
        if not canbestartofcombiningsequence.has_key(c):
            output.append(c)
            continue
        nlookahead = MAXCHARSTOCOMBINE
        while nlookahead > 1:
            attempt = lookahead next nlookahead characters from input
            if combine.has_key(attempt):
                output.append(combine[attempt])
                skip the lookahead in input
                break
            nlookahead = nlookahead - 1
        else:
            output.append(c)

do the trick, if the two dictionaries are initialized intelligently?
--
- Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -
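
For illustration, here is a minimal runnable sketch of the table-lookup idea from the message above, written in present-day Python 3. It is not code from the thread. The names build_combine_table, COMBINE, CAN_START and compose are hypothetical, and two simplifying assumptions are made: a combining sequence is at most two characters long (one base character plus one combining mark), and only precomposed characters in the Latin-1 range are of interest. The table is derived from the standard unicodedata module rather than written by hand.

```python
import unicodedata

# Assumption: a sequence is one base character plus one combining mark.
MAX_CHARS_TO_COMBINE = 2


def build_combine_table(limit=0x100):
    """Map decomposed sequences to their precomposed character.

    Only code points below `limit` (Latin-1 by default) are considered.
    """
    combine = {}
    for code in range(limit):
        ch = chr(code)
        decomp = unicodedata.decomposition(ch)
        # Skip characters with no decomposition, and compatibility
        # decompositions, which are written with a leading "<tag>".
        if not decomp or decomp.startswith("<"):
            continue
        seq = "".join(chr(int(part, 16)) for part in decomp.split())
        if 1 < len(seq) <= MAX_CHARS_TO_COMBINE:
            combine[seq] = ch
    return combine


COMBINE = build_combine_table()
# Characters that can begin a combining sequence found in the table.
CAN_START = {seq[0] for seq in COMBINE}


def compose(text):
    """Replace known decomposed sequences with precomposed characters."""
    output = []
    i = 0
    while i < len(text):
        c = text[i]
        if c in CAN_START:
            # Try the longest candidate sequence first.
            for n in range(MAX_CHARS_TO_COMBINE, 1, -1):
                attempt = text[i:i + n]
                if attempt in COMBINE:
                    output.append(COMBINE[attempt])
                    i += n
                    break
            else:
                # No sequence matched: copy the character unchanged.
                output.append(c)
                i += 1
        else:
            output.append(c)
            i += 1
    return "".join(output)
```

With this sketch, compose("cafe\u0301") yields a string that can be encoded with .encode("latin-1"), whereas the decomposed input cannot. The sketch does not handle several combining marks on one base character or marks given in a different order, which a full normalizer has to deal with; the standard library's unicodedata.normalize("NFC", text) covers those cases.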