[MAL]
> If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> signal failure of this assertion at Unicode object construction time
> via an exception. That way we are within the standard, can use
> reasonably fast code for Unicode manipulation and add those extra 1M
> characters at a later stage.

I think this is reasonable.  Using UTF-8 internally is also reasonable, and if it's being rejected on the grounds of supposed slowness, that deserves a closer look (it's an ingenious encoding scheme that works correctly with a surprising number of existing 8-bit string routines as-is).

Indexing UTF-8 strings is greatly speeded by adding a simple finger: store along with the string an index+offset pair identifying the most recent position indexed to.  Since string indexing is overwhelmingly sequential, this makes most indexing constant-time; and UTF-8 can be scanned either forward or backward from a random internal point, because "the first byte" of each encoding is recognizable as such.

I expect either would work well.  It's at least curious that Perl and Tcl both went with UTF-8 -- does anyone think they know *why*?  I don't.  The people here saying UCS-2 is the obviously better choice are all from the Microsoft camp <wink>.  It's not obvious to me, but then neither do I claim that UTF-8 is obviously better.
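[A minimal sketch of the finger technique described above, written in Python for concreteness.  The FingeredUTF8 class and its names are hypothetical, not from any actual implementation; the only ideas taken from the post are the cached index+offset pair and resynchronizing on UTF-8 lead bytes, which lets the scan run in either direction from the finger.]

    class FingeredUTF8:
        """UTF-8 byte string with a finger for fast sequential indexing."""

        def __init__(self, data: bytes):
            self.data = data       # UTF-8 encoded bytes
            self.finger = (0, 0)   # (character index, byte offset) last used

        @staticmethod
        def _is_lead(byte: int) -> bool:
            # Continuation bytes are 0b10xxxxxx; any other byte starts a
            # character, so a scan can resynchronize at a random offset.
            return (byte & 0xC0) != 0x80

        def _offset_of(self, index: int) -> int:
            # Walk forward or backward from the finger to the requested
            # character; mostly-sequential access makes this near O(1).
            char, off = self.finger
            while char < index:
                off += 1
                while off < len(self.data) and not self._is_lead(self.data[off]):
                    off += 1
                char += 1
            while char > index:
                off -= 1
                while off > 0 and not self._is_lead(self.data[off]):
                    off -= 1
                char -= 1
            self.finger = (char, off)
            return off

        def __getitem__(self, index: int) -> str:
            start = self._offset_of(index)
            end = start + 1
            while end < len(self.data) and not self._is_lead(self.data[end]):
                end += 1
            return self.data[start:end].decode("utf-8")

    s = FingeredUTF8("Łódź rocks".encode("utf-8"))
    assert s[0] == "Ł" and s[3] == "ź"   # mixed 1- and 2-byte characters

[After s[3], the finger sits at character 3, so s[4] costs one lead-byte scan rather than a walk from the start of the string.]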