M.-A. Lemburg wrote:
> There seems to be a general misunderstanding here: even if you
> have UCS4 storage, it is still possible to slice a Unicode
> string in a way which makes rendering it correctly. [impossible?]
> Unicode has the concept of combining code points, e.g. you can
> store an "é" (e with an accent) as "e" + "'". Now if you slice
> off the accent, you'll break the character that you encoded
> using combining code points.

While this is all true, I agree with Neil that it should do whatever
it does consistently across implementations, i.e. len("\U00010000")
should always give the same result, and I think this result should
always be 1.

How to best implement this efficiently is an entirely different
question, as is the question of whether you can render arbitrary
substrings in a meaningful way.

Regards,
Martin
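
The two cases discussed above can be checked directly. The following is a minimal sketch, assuming a CPython 3.3 or later interpreter, where len() counts code points regardless of internal storage; older narrow builds could report 2 for the first case. U+0301 (COMBINING ACUTE ACCENT) stands in for the "'" in the quoted text, and all variable names are illustrative.

```python
import unicodedata

# A single code point outside the Basic Multilingual Plane.
astral = "\U00010000"
astral_len = len(astral)                               # counted in code points
utf16_units = len(astral.encode("utf-16-le")) // 2     # a surrogate pair in UTF-16

# "é" stored as a base letter plus a combining accent (assumed: U+0301).
decomposed = "e\u0301"
decomposed_len = len(decomposed)

# Slicing off the accent leaves a valid string, but a different character.
sliced = decomposed[:1]

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
composed_len = len(composed)

print(astral_len, utf16_units, decomposed_len, repr(sliced), composed_len)
```

So a consistent len() for non-BMP characters and the slicing of combining sequences are separate concerns: the first is about code points versus code units, the second about code points versus what a user perceives as one character.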