Jack Jansen <Jack.Jansen@oratrix.com> writes:

> Why is this hard work? I would guess that a simple table lookup would
> suffice, after all there are only a finite number of unicode
> characters that can be split up, and each one can be split up in only
> a small number of ways.

Canonical decomposition requires more than that: you not only need to
apply the canonical decomposition mapping, but also need to put the
resulting characters into canonical order (if more than one combining
character applies to a base character). In addition, a naïve
implementation will consume large amounts of memory. Hangul
decomposition is better done algorithmically, as we are talking about
11172 precombined characters for Hangul alone.

> Wouldn't something like
>
>     for c in input:
>         if not canbestartofcombiningsequence.has_key(c):
>             output.append(c)
>         nlookahead = MAXCHARSTOCOMBINE
>         while nlookahead > 1:
>             attempt = lookahead next nlookahead bytes from input
>             if combine.has_key(attempt):
>                 output.append(combine[attempt])
>                 skip the lookahead in input
>                 break
>         else:
>             output.append(c)
>
> do the trick, if the two dictionaries are initialized intelligently?

No, that doesn't do canonical ordering. There is a lot more to
normalization; the hard work is really in understanding what has to be
done.

Regards,
Martin
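
To make the canonical-ordering point concrete: after the decomposition
mapping has been applied, each run of combining marks has to be sorted
(stably) by canonical combining class, which is exactly what a plain
table lookup misses. A minimal sketch of just that reordering step,
assuming Python's unicodedata module (illustrative only, not the actual
normalization code):

    import unicodedata

    def canonical_order(s):
        # Stable-sort each run of combining marks (combining class != 0)
        # by canonical combining class, as required after decomposition.
        chars = list(s)
        i = 0
        while i < len(chars):
            if unicodedata.combining(chars[i]):
                j = i
                while j < len(chars) and unicodedata.combining(chars[j]):
                    j += 1
                chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
                i = j
            else:
                i += 1
        return "".join(chars)

    # COMBINING DOT BELOW (class 220) must precede COMBINING CIRCUMFLEX
    # ACCENT (class 230) in canonical order, whatever order the input used.
    assert canonical_order("o\u0302\u0323") == "o\u0323\u0302"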
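
And for the Hangul point: the Unicode Standard defines the decomposition
of the 11172 precomposed syllables arithmetically, so no table entries
are needed for them at all. A sketch of that arithmetic, using the
constants published in the standard (again, just an illustration):

    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
    N_COUNT = V_COUNT * T_COUNT        # 588
    S_COUNT = L_COUNT * N_COUNT        # 11172 precomposed syllables

    def decompose_hangul(ch):
        # Map one precomposed syllable to its leading/vowel/trailing jamo.
        s_index = ord(ch) - S_BASE
        if not 0 <= s_index < S_COUNT:
            return ch                  # not a precomposed Hangul syllable
        l = chr(L_BASE + s_index // N_COUNT)
        v = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
        t_index = s_index % T_COUNT
        return l + v + (chr(T_BASE + t_index) if t_index else "")

    # U+D55C HANGUL SYLLABLE HAN -> U+1112 U+1161 U+11AB
    assert decompose_hangul("\uD55C") == "\u1112\u1161\u11AB"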