>>>>> "Fredrik" == Fredrik Lundh <fredrik at pythonware.com> writes:

    Fredrik> M.-A. Lemburg wrote:

    >>> (google for "stringlib" for some work I'm doing in this area)

    >> Ah, now I know where you're coming from :-) Shift tables don't
    >> work well in the Unicode world with its large alphabet.

    Fredrik> since most real-life text use characters from only a
    Fredrik> small number of regions in that alphabet,

This is true of "most real-life text", but it's going to be false most
of the time for a large (and rapidly growing) minority of users: those
working with texts composed mostly of Asian ideographs.

Unihan (spread over about 80 256-character rows) has a potentially big
problem: because it is ordered by root, then stroke count, the simpler
(and usually more frequently used) ideographs with a common root
cluster near the root.  Whether those clusters frequently overlap
under a simple compression method like "lowest 5 bits" I don't know
offhand.

I don't know whether the composed Hangul (~ 40 rows) would show
clustering; that would depend on phonetic frequencies in the Korean
language.

Of course the find algorithm you present is almost surely a big win
over the brute-force method, even in the presence of some degree of
clustering in Unihan and Hangul.  But I worry that it's an exceptional
example when you rely on assumptions like "real-life text uses
characters drawn from a small number of short contiguous regions in
the alphabet."

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
        Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
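
To make the "lowest 5 bits" concern concrete, here is a minimal sketch. It assumes the compression simply folds each character into a 32-bit mask using the low five bits of its code point. The function names are hypothetical and chosen for illustration; this is not taken from the stringlib code under discussion.

```python
def low5_mask(text):
    """Fold every character of text into a 32-bit mask (assumed scheme:
    one bit per value of the low 5 bits of the code point)."""
    mask = 0
    for ch in text:
        mask |= 1 << (ord(ch) & 0x1F)
    return mask


def might_contain(mask, ch):
    """True if ch maps to a bit that is set in mask.

    A True result can be a false positive: any two characters whose
    code points differ by a multiple of 32 share the same bit.
    """
    return bool(mask & (1 << (ord(ch) & 0x1F)))


def bits_set(mask):
    """Number of distinct low-5-bit values recorded in mask."""
    return bin(mask).count("1")


# U+4E00 and U+4E20 differ by exactly 32, so they collide;
# U+4E01 lands on a different bit.
m = low5_mask("\u4e00")
print(might_contain(m, "\u4e20"))  # collision (false positive)
print(might_contain(m, "\u4e01"))  # correctly rejected
```

Whether frequently used ideographs actually collide often enough to matter is an empirical question. One way to check would be to run `bits_set(low5_mask(...))` over samples of real text and see how quickly the mask saturates toward all 32 bits.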