[Greg Stein]
> ...
> Things will be a lot faster if we have a fixed-size character.  Variable
> length formats like UTF-8 are a lot harder to slice, search, etc.

The initial byte of any UTF-8 encoded character never appears in a
*non*-initial position of any UTF-8 encoded character.  Which means
searching is not only tractable in UTF-8, but also that whatever optimized
8-bit clean string searching routines you happen to have sitting around
today can be used as-is on UTF-8 encoded strings.  This is not true of
UCS-2 encoded strings (in which "the first" byte is not distinguished, so
an 8-bit search is vulnerable to finding a hit starting "in the middle" of
a character).

More, to the extent that the bulk of your text is plain ASCII, the UTF-8
search will run much faster than when using a 2-byte encoding, simply
because it has half as many bytes to chew over.

UTF-8 is certainly slower for random-access indexing, including slicing.

I don't know what "etc" means, but if it follows the pattern so far,
sometimes it's faster and sometimes it's slower <wink>.

> (IMO) a big reason for this new type is for interaction with the
> underlying OS/platform.  I don't know of any platforms right now that
> really use UTF-8 as their Unicode string representation (meaning we'd
> have to convert back/forth from our UTF-8 representation to talk to the
> OS).

No argument here.
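A minimal sketch in Python (not part of the original mail) of the point
about 8-bit clean search: a plain byte-level search over UTF-8 data can
only hit at real character boundaries, while the same byte-level search
over UCS-2/UTF-16 data can report a bogus hit that straddles two
characters.  The sample strings are arbitrary illustrations.

    # Byte-level substring search on UTF-8 vs. a 2-byte encoding.
    haystack = "na\u00efve r\u00e9sum\u00e9"   # "naïve résumé"
    needle = "\u00e9"                          # "é" -> bytes C3 A9 in UTF-8

    # UTF-8: a lead byte never occurs in a continuation position, so an
    # 8-bit clean search finds a byte-level hit exactly when there is a
    # real character-level hit.
    assert (needle.encode("utf-8") in haystack.encode("utf-8")) == (needle in haystack)

    # UCS-2/UTF-16 (big-endian, no BOM): no byte value is reserved for
    # "first byte of a character", so a byte-level search can match
    # starting "in the middle" of a character.
    text = "\u0041\u4142"    # encodes to bytes 00 41 41 42
    bogus = "\u4141"         # encodes to bytes 41 41
    assert bogus not in text                                      # no character-level hit...
    assert bogus.encode("utf-16-be") in text.encode("utf-16-be")  # ...but a byte-level one

The false hit in the second case lands at an odd byte offset, splitting
both neighbouring characters, which is exactly the failure an 8-bit
search routine can't rule out for a 2-byte encoding.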