Antoine Pitrou wrote:
>>There are many design alternatives:
>
> Wouldn't it be simpler to use:
> - one-byte representation if every character <= 0xFF
> - two-byte representation if every character <= 0xFFFF
> - four-byte representation otherwise

As I said: there are many alternatives. This one has the
disadvantage of requiring a copy every time you pass the string
to a Win32 function (which expects UTF-16). Whether or not this is
a significant disadvantage, I don't know.

In any case, a multi-representations implementation has the
disadvantage of making the C API more difficult to use, in particular
for writing codecs. On encoding, it is difficult to fetch the individual
characters which you need for the lookup table; on decoding, it is
difficult to know in advance what representation to use (unless
you know there is an upper bound on the decoded character ordinals).

Regards,
Martin
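
The width-selection rule quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from CPython: the names `pick_width` and `encode_fixed_width` are hypothetical. It also hints at why decoding is awkward: the largest character ordinal has to be known before the buffer can be laid out, which here costs a full scan of the text.

```python
def pick_width(text: str) -> int:
    """Return the bytes per character (1, 2 or 4) of the narrowest
    fixed-width form that can hold every code point in text."""
    # A full pass over the text is needed before the width is known.
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def encode_fixed_width(text: str) -> tuple[int, bytes]:
    """Pack text as little-endian units of the width chosen by pick_width.

    Hypothetical helper: it mimics a multi-representation buffer,
    not any real C API layout.
    """
    width = pick_width(text)
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data
```

For example, `pick_width("abc")` gives 1, while any string containing U+20AC needs 2 bytes per character, so a single wide character widens the whole buffer.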