For the record:

> I also don't see how this could save a lot of memory. As an example
> take a French text with say 10mio code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used).

Typical French text seems to have 5% non-ASCII characters. So the number
of UTF-8 bytes needed to represent a French text would only be 5% higher
than the number of code points.

Anyway, it's quite obvious that Martin's goal is that only one
representation gets created most of the time. To quote the draft:

“All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created.”

Regards

Antoine.
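As a rough check of the 5% figure above, here is a minimal Python sketch. It assumes every non-ASCII character takes exactly 2 bytes in UTF-8, which holds for the accented Latin letters common in French but not for characters such as the euro sign or curly quotes (3 bytes each). The function names `utf8_size_estimate` and `measure` are made up for this illustration and are not part of the draft.

```python
def utf8_size_estimate(code_points, non_ascii_fraction, bytes_per_non_ascii=2):
    """Estimate the UTF-8 size in bytes of a text.

    Assumption: ASCII characters take 1 byte and every non-ASCII
    character takes `bytes_per_non_ascii` bytes (2 for U+0080..U+07FF).
    """
    non_ascii = round(code_points * non_ascii_fraction)
    ascii_count = code_points - non_ascii
    return ascii_count + non_ascii * bytes_per_non_ascii


def measure(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without BOM) for a string."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


if __name__ == "__main__":
    # 10 million code points with 5% non-ASCII characters.
    print(utf8_size_estimate(10_000_000, 0.05))
    # A tiny French sample: "deja vu" with its two accented letters.
    print(measure("d\u00e9j\u00e0 vu"))
```

Under that assumption, 10 million code points with 5% non-ASCII characters come to 10.5 million UTF-8 bytes, rather than the roughly 15MB guessed in the quoted message.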