I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"
Why do I ask this question?
How many programmers are aware of the fact that UTF-16 is actually a variable length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one element.
I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, the Win32 APIs, the Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that should be encoded using two UTF-16 elements).
For example, try to edit one of these characters:
You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference.
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:
u'X'!=unicode('X','utf-16')
on some platforms when X is a character outside of the BMP. It seems that such bugs are extremely easy to find in many applications that use UTF-16.
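(Not part of the original question, but a minimal C++11 sketch of the surrogate-pair point: one code point outside the BMP occupies two UTF-16 code units, so any code that equates "one element" with "one character" breaks on it.)
#include <iostream>
#include <string>

int main() {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
    std::u16string utf16 = u"\U0001D11E";  // stored as a surrogate pair
    std::u32string utf32 = U"\U0001D11E";  // stored as a single code unit

    std::cout << "UTF-16 code units: " << utf16.size() << "\n"; // prints 2
    std::cout << "UTF-32 code units: " << utf32.size() << "\n"; // prints 1
    return 0;
}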
So... Do you think that UTF-16 should be considered harmful?
This is an old answer.
See UTF-8 Everywhere for the latest updates.
Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is because some time ago there used to be a misguided belief that widechar is going to be what UCS-4 now is.
Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do, text is not only for human readers.
On the other hand, UTF-8 overhead is a small price to pay, while it has significant advantages, such as compatibility with unaware code that just passes strings with char*. This is a great thing. There are few useful characters which are SHORTER in UTF-16 than they are in UTF-8.
I believe that all other encodings will die eventually. This means that MS-Windows, Java, ICU, and Python will stop using it as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except in OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly.
To people who say "use what is needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations, though, is that every std::string or char* parameter be considered Unicode-compatible.
I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having said the above, I am convinced that programmers must finally reach a consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I'd be the last one expected to attack UTF-16 on religious grounds.)
I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time checked Unicode correctness, ease of use and better multi-platformness of the code. The suggestion substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations resulted in the same conclusion. So here goes:
- Do not use wchar_t or std::wstring in any place other than adjacent to APIs accepting UTF-16.
- Don't use _T("") or L"" UTF-16 literals (these should IMO be taken out of the standard, as part of UTF-16 deprecation).
- Don't use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
- Yet, _UNICODE is always defined, to avoid char* strings passed to WinAPI getting silently compiled.
- std::strings and char* anywhere in the program are considered UTF-8 (if not said otherwise).
- All my strings are std::string, though you can pass a char* or a string literal to convert(const std::string &).
- Only use Win32 functions that accept wide chars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:
::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str())
(The policy uses the conversion functions below.)
With MFC strings:
CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
Working with files, filenames and fstream on Windows:
- Do not pass std::string or const char* filename arguments to the fstream family. The MSVC STL does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:
- Convert std::string arguments to std::wstring with Utils::Convert:
std::ifstream ifs(Utils::Convert("hello"),
std::ios_base::in |
std::ios_base::binary);
We'll have to manually remove the convert when MSVC's attitude to fstream changes.
- See the fstream unicode research/discussion case 4215 for more info.
- Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and the WinAPI conventions above.
// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
// Ask me for implementation..
...
}
std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
// Ask me for implementation..
...
}
// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
return Utils::convert(std::wstring(mfcString.GetString()));
#else
return mfcString.GetString(); // This branch is deprecated.
#endif
}
CString convert(const std::string &s)
{
#ifdef UNICODE
return CString(Utils::convert(s).c_str());
#else
Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
return s.c_str();
#endif
}
Unicode codepoints are not characters! Sometimes they are not even glyphs (visual forms).
Some examples:
The only ways to get Unicode editing right are to use a library written by an expert, or to become an expert and write one yourself. If you are just counting codepoints, you are living in a state of sin.
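As a hedged illustration of this point (my example, not the answer's): the same visible character can be one code point or two, so counting code points does not count characters.
#include <iostream>
#include <string>

int main() {
    std::u32string precomposed = U"\u00C1";   // U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
    std::u32string combining   = U"A\u0301";  // 'A' followed by U+0301 COMBINING ACUTE ACCENT

    // Both render as the same character, yet the code point counts differ.
    std::cout << precomposed.size() << "\n"; // prints 1
    std::cout << combining.size()   << "\n"; // prints 2
    return 0;
}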
There is a simple rule of thumb on what Unicode Transformation Form (UTF) to use:
- utf-8 for storage and communication
- utf-16 for data processing
- you might go with utf-32 if most of the platform API you use is utf-32 (common in the UNIX world)
Most systems today use utf-16 (Windows, Mac OS, Java, .NET, ICU, Qt). Also see this document: http://unicode.org/notes/tn12/
Back to "UTF-16 as harmful", I would say: definitely not.
People who are afraid of surrogates (thinking that they transform Unicode into a variable-length encoding) don't understand the other (way bigger) complexities that make mapping between characters and a Unicode code point very complex: combining characters, ligatures, variation selectors, control characters, etc.
Just read this series here http://www.siao2.com/2009/06/29/9800913.aspx and see how UTF-16 becomes an easy problem.
Yes, absolutely.
Why? It has to do with exercising code.
If you look at these codepoint usage statistics on a large corpus by Tom Christiansen, you'll see that trans-8-bit BMP codepoints are used several orders of magnitude more than non-BMP codepoints:
2663710 U+002013 ‹–› GC=Pd EN DASH
1065594 U+0000A0 ‹ › GC=Zs NO-BREAK SPACE
1009762 U+0000B1 ‹±› GC=Sm PLUS-MINUS SIGN
784139 U+002212 ‹−› GC=Sm MINUS SIGN
602377 U+002003 ‹ › GC=Zs EM SPACE
544 U+01D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
450 U+01D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
385 U+01D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
292 U+01D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
285 U+01D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
Take the TDD dictum: "Untested code is broken code", and rephrase it as "unexercised code is broken code", and think how often programmers have to deal with non-BMP codepoints.
Bugs related to not dealing with UTF-16 as a variable-width encoding are much more likely to go unnoticed than the equivalent bugs in UTF-8. Some programming languages still don't guarantee to give you UTF-16 instead of UCS-2, and some so-called high-level programming languages offer access to code units instead of code points (even C is supposed to give you access to codepoints if you use wchar_t, regardless of what some platforms may do).
I would suggest that thinking UTF-16 might be considered harmful says that you need to gain a greater understanding of unicode.
Since I've been downvoted for presenting my opinion on a subjective question, let me elaborate. What exactly is it that bothers you about UTF-16? Would you prefer it if everything were encoded in UTF-8? UTF-7? Or how about UCS-4? Of course certain applications are not designed to handle every single character code out there - but they are necessary, especially in today's global information domain, for communication across international boundaries.
But really, if you feel UTF-16 should be considered harmful because it's confusing or can be improperly implemented (unicode certainly can be), then what method of character encoding would be considered non-harmful?
EDIT: To clarify: Why consider improper implementations of a standard a reflection of the quality of the standard itself? As others have subsequently noted, merely because an application uses a tool inappropriately does not mean that the tool itself is defective. If that were the case, we could probably say things like "var keyword considered harmful", or "threading considered harmful". I think the question confuses the quality and nature of the standard with the difficulties many programmers have in implementing and using it properly, which I feel stem more from their lack of understanding of how Unicode works than from Unicode itself.
There is nothing wrong with the UTF-16 encoding. But languages that treat the 16-bit units as characters should probably be considered badly designed. Having a type named 'char' which does not always represent a character is pretty confusing. Since most developers will expect a char type to represent a code point or character, much code will probably break when exposed to characters beyond the BMP.
Note however that even using utf-32 does not mean that each 32-bit code point will always represent a character. Due to combining characters, an actual character may consist of several code points. Unicode is never trivial.
BTW. There is probably the same class of bugs with platforms and applications which expect characters to be 8-bit, which are fed Utf-8.
My personal choice is to always use UTF-8. It's the standard on Linux for nearly everything. It's backwards compatible with many legacy apps. There is a very minimal overhead in terms of extra space used for non-latin characters vs the other UTF formats, and there is a significant savings in space for latin characters. On the web, latin languages reign supreme, and I think they will for the foreseeable future. And to address one of the main arguments in the original post: nearly every programmer is aware that UTF-8 will sometimes have multi-byte characters in it. Not everyone deals with this correctly, but they are usually aware, which is more than can be said for UTF-16. But, of course, you need to choose the one most appropriate for your application. That's why there's more than one in the first place.
Well, there is an encoding that uses fixed-size symbols. I certainly mean UTF-32. But 4 bytes for each symbol is too much wasted space; why would we use it in everyday situations?
To my mind, most problems appear from the fact that some software fell behind the Unicode standard but was not quick to correct the situation. Opera, Windows, Python, Qt - all of them appeared before UTF-16 became widely known or even came into existence. I can confirm, though, that in Opera, Windows Explorer, and Notepad there are no problems with characters outside the BMP anymore (at least on my PC). But anyway, if programs don't recognise surrogate pairs, then they don't use UTF-16. Whatever problems arise from dealing with such programs, they have nothing to do with UTF-16 itself.
However, I think that the problems of legacy software with only BMP support are somewhat exaggerated. Characters outside the BMP are encountered only in very specific cases and areas. According to the Unicode official FAQ, "even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average". Of course, characters outside the BMP shouldn't be neglected, because a program is not Unicode-conformant otherwise, but most programs are not intended for working with texts containing such characters. That's why if they don't support it, it is unpleasant, but not a catastrophe.
Now let's consider the alternative. If UTF-16 didn't exist, then we wouldn't have an encoding that is well suited for non-ASCII text, and all the software created for UCS-2 would have had to be completely redesigned to remain Unicode-compliant. That would most likely only have slowed Unicode adoption. Also, we wouldn't have been able to maintain compatibility with text in UCS-2 the way UTF-8 does in relation to ASCII.
Now, putting aside all the legacy issues, what are the arguments against the encoding itself? I really doubt that developers nowadays don't know that UTF-16 is variable length; it is written everywhere, starting with Wikipedia. UTF-16 is much less difficult to parse than UTF-8, if someone pointed out complexity as a possible problem. Also, it is wrong to think that it is easy to mess up determining the string length only in UTF-16. If you use UTF-8 or UTF-32, you should still be aware that one Unicode code point doesn't necessarily mean one character. Other than that, I don't think there's anything substantial against the encoding.
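To illustrate the parsing claim above, here is a hedged sketch (mine, not the answer's; the function name is illustrative) of the only extra arithmetic UTF-16 requires, assuming a well-formed surrogate pair:
#include <cstdint>

// Combine a UTF-16 surrogate pair into a Unicode code point.
// Assumes high is in [0xD800, 0xDBFF] and low is in [0xDC00, 0xDFFF].
std::uint32_t combineSurrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}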
Therefore I don't think the encoding itself should be considered harmful. UTF-16 is a compromise between simplicity and compactness, and there's no harm in using what is needed where it is needed. In some cases you need to remain compatible with ASCII and you need UTF-8, in some cases you want to work with Han ideographs and conserve space using UTF-16, in some cases you need universal representations of characters using a fixed-length encoding. Use what's more appropriate, just do it properly.
Years of Windows internationalization work, especially in East Asian languages, might have corrupted me, but I lean toward UTF-16 for internal-to-the-program representations of strings, and UTF-8 for network or file storage of plaintext-like documents. UTF-16 can usually be processed faster on Windows, though, so that's the primary benefit of using UTF-16 in Windows.
Making the leap to UTF-16 dramatically improved the adequacy of average products handling international text. There are only a few narrow cases when the surrogate pairs need to be considered (deletions, insertions, and line breaking, basically) and the average-case is mostly straight pass-through. And unlike earlier encodings like JIS variants, UTF-16 limits surrogate pairs to a very narrow range, so the check is really quick and works forward and backward.
Granted, it's roughly as quick for correctly encoded UTF-8, too. But there are also many broken UTF-8 applications that incorrectly encode surrogate pairs as two UTF-8 sequences. So UTF-8 doesn't guarantee salvation either.
IE has handled surrogate pairs reasonably well since 2000 or so, even though it typically converts them from UTF-8 pages to an internal UTF-16 representation; I'm fairly sure Firefox has got it right too, so I don't really care what Opera does.
UTF-32 (aka UCS4) is pointless for most applications since it's so space-demanding, so it's pretty much a nonstarter.
UTF-8 is definitely the way to go, possibly accompanied by UTF-32 for internal use in algorithms that need high-performance random access (but that ignores combining chars).
Both UTF-16 and UTF-32 (as well as their LE/BE variants) suffer from endianness issues, so they should never be used externally.
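A small illustration of the endianness point (mine, not the answer's): the same UTF-16 code unit serializes to different byte sequences depending on byte order, which is why external exchange needs a BOM or an agreed convention.
#include <cstdio>

int main() {
    unsigned char utf16le[2] = { 0x41, 0x00 }; // U+0041 'A' in UTF-16LE
    unsigned char utf16be[2] = { 0x00, 0x41 }; // U+0041 'A' in UTF-16BE
    std::printf("LE: %02X %02X   BE: %02X %02X\n",
                utf16le[0], utf16le[1], utf16be[0], utf16be[1]);
    return 0;
}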
UTF-16? Definitely harmful. Just my grain of salt here, but there are exactly three acceptable encodings for text in a program:
integer codepoints ("CP"?): an array of the largest integers that are convenient for your programming language and platform (decays to ASCII in the limit of low resources). Should be int32 on older computers and int64 on anything with 64-bit addressing.
Obviously interfaces to legacy code use what encoding is needed to make the old code work right.
Unicode defines code points up to 0x10FFFF (1,114,112 codes); all applications running in a multilingual environment and dealing with strings, file names, etc. should handle that correctly.
UTF-16: covers only 1,112,064 codes, although those at the end of Unicode are from planes 15-16 (Private Use Area). It cannot grow any further in the future without breaking the UTF-16 concept.
UTF-8: covers theoretically 2,216,757,376 codes. The current range of Unicode codes can be represented by a sequence of at most 4 bytes. It does not suffer from the byte order problem, and it is "compatible" with ASCII.
UTF-32: covers theoretically 2^32 = 4,294,967,296 codes. Currently it is not variable-length encoded and probably will not be in the future.
Those facts are self-explanatory. I do not understand advocating general use of UTF-16. It is variable-length encoded (it cannot be accessed by index), it has problems covering the whole Unicode range even at present, byte order must be handled, etc. I do not see any advantage except that it is natively used in Windows and some other places. When writing multi-platform code it is probably better to use UTF-8 natively and make conversions only at the end points in a platform-dependent way (as already suggested). When direct access by index is necessary and memory is not a problem, UTF-32 should be used.
The main problem is that many programmers dealing with Windows Unicode = UTF-16 do not even know, or ignore, the fact that it is variable-length encoded.
The way it is usually done on *nix platforms is pretty good: C strings (char *) are interpreted as UTF-8 encoded, and wide C strings (wchar_t *) as UTF-32.
Add this to the list:
The presented scenario is simple (even more simple as I will present it here than it was originally!):
1. A WinForms TextBox sits on a Form, empty. It has a MaxLength set to 20.
2. The user types into the TextBox, or maybe pastes text into it.
3. No matter what you type or paste into the TextBox, you are limited to 20, though it will sympathetically beep at text beyond the 20 (YMMV here; I changed my sound scheme to give me that effect!).
4. The small packet of text is then sent somewhere else, to start an exciting adventure.
Now this is an easy scenario, and anyone can write this up, in their spare time. I just wrote it up myself in multiple programming languages using WinForms, because I was bored and had never tried it before. And with text in multiple actual languages because I am wired that way and have more keyboard layouts than possibly anyone in the entire freaking universe.
I even named the form Magic Carpet Ride, to help ameliorate the boredom.
This did not work, for what it's worth.
So instead, I entered the following 20 characters into my Magic Carpet Ride form:
0123401234012340123𠀀
Uh oh.
That last character is U+20000, the first Extension B ideograph of Unicode (aka U+d840 U+dc00, to its close friends who he is not ashamed to be disrobed, as it were, in front of)....
And now we have a ball game.
Because when TextBox.MaxLength talks about
Gets or sets the maximum number of characters that can be manually entered into the text box.
what it really means is
Gets or sets the maximum number of UTF-16 LE code units that can be manually entered into the text box and will mercilessly truncate the living crap out of any string that tries to play cutesy games with the linguistic character notion that only someone as obsessed as that Kaplan fellow will find offensive (geez he needs to get out more!).
I'll try and see about getting the document updated....
Regular readers who remember my UCS-2 to UTF-16 series will note my unhappiness with the simplistic notion of TextBox.MaxLength and how it should handle at a minimum this case where its draconian behavior creates an illegal sequence, one that other parts of the .Net Framework may throw a
- System.Text.EncoderFallbackException: Unable to translate Unicode character \uD850 at index 0 to specified code page.*
exception if you pass this string elsewhere in the .Net Framework (as my colleague Dan Thompson was doing).
Now okay, perhaps the full UCS-2 to UTF-16 series is out of the reach of many.
But isn't it reasonable to expect that TextBox.Text will not produce a System.String that won't cause another piece of the .Net Framework to throw? I mean, it isn't like there is a chance in the form of some event on the control that tells you of the upcoming truncation where you can easily add the smarter validation -- validation that the control itself does not mind doing. I would go so far as to say that this punk control is breaking a safety contract that could even lead to security problems if you can class causing unexpected exceptions to terminate an application as a crude sort of denial of service. Why should any WinForms process or method or algorithm or technique produce invalid results?
Source : Michael S. Kaplan MSDN Blog
I wouldn't necessarily say that UTF-16 is harmful. It's not elegant, but it serves its purpose of backwards compatibility with UCS-2, just like GB18030 does with GB2312, and UTF-8 does with ASCII.
But making a fundamental change to the structure of Unicode in midstream, after Microsoft and Sun had built huge APIs around 16-bit characters, was harmful. The failure to spread awareness of the change was more harmful.
I've never understood the point of UTF-16. If you want the most space-efficient representation, use UTF-8. If you want to be able to treat text as fixed-length, use UTF-32. If you want neither, use UTF-16. Worse yet, since all of the common (basic multilingual plane) characters in UTF-16 fit in a single code unit, bugs that assume that UTF-16 is fixed-length will be subtle and hard to find, whereas if you try to do this with UTF-8, your code will fail fast and loudly as soon as you try to internationalize.
Since I cannot yet comment, I post this as an answer, since it seems I cannot otherwise contact the authors of utf8everywhere.org. It's a shame I don't automatically get the comment privilege, since I have enough reputation on other Stack Exchange sites.
This is meant as a comment to the Opinion: Yes, UTF-16 should be considered harmful answer.
One little correction: to prevent one from accidentally passing a UTF-8 char* into ANSI-string versions of Windows API functions, one should define UNICODE, not _UNICODE. _UNICODE maps functions like _tcslen to wcslen, not MessageBox to MessageBoxW. Instead, the UNICODE define takes care of the latter. For proof, this is from MS Visual Studio 2005's WinUser.h header:
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif // !UNICODE
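For contrast, the _UNICODE macro drives the <tchar.h> mappings instead; roughly (paraphrased from memory, so treat it as an approximation rather than a verbatim quote of the header):
#ifdef _UNICODE
#define _tcslen wcslen   // TCHAR routines map to the wide-string versions
#else
#define _tcslen strlen   // ...or to the ANSI-string versions
#endif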
At the very minimum, this error should be corrected on utf8everywhere.org.
Perhaps the guide should contain an example of explicit use of the Wide-string version of a data structure, to make it less easy to miss/forget it. Using Wide-string versions of data structures on top of using Wide-string versions of functions makes it even less likely that one accidentally calls an ANSI-string version of such a function.
Example of the example:
WIN32_FIND_DATAW data; // Note the W at the end.
HANDLE hSearch = FindFirstFileW(widen("*.txt").c_str(), &data);
if (hSearch != INVALID_HANDLE_VALUE)
{
FindClose(hSearch);
MessageBoxW(nullptr, data.cFileName, nullptr, MB_OK);
}
Someone said UCS4 and UTF-32 were the same. Not so, but I know what you mean. One of them is an encoding of the other, though. I wish they'd thought to specify endianness from the first, so we wouldn't have the endianness battle fought out here too. Couldn't they have seen that coming? At least UTF-8 is the same everywhere (unless someone is following the original spec with 6 bytes).
If you use UTF-16 you have to include handling for multibyte chars. You can't go to the Nth character by indexing 2N into a byte array. You have to walk it, or have character indices. Otherwise you've written a bug.
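A hedged sketch (mine, not the answer's; the function name is illustrative) of what "walking it" means in practice: count code points by skipping the low surrogate whenever a high surrogate is encountered.
#include <cstddef>
#include <string>

std::size_t codePointCount(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        // A high surrogate means this code point spans two code units.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.size())
            ++i; // skip the low surrogate
        ++count;
    }
    return count;
}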
The current draft spec of C++ says that UTF-32 and UTF-16 can have little-endian, big-endian, and unspecified variants. Really? If Unicode had specified that everyone had to do little-endian from the beginning then it would have all been simpler. (I would have been fine with big-endian as well.) Instead, some people implemented it one way, some the other, and now we're stuck with silliness for nothing. Sometimes it's embarrassing to be a software engineer.
I don't think it's harmful if the developer is careful enough.
And they should accept this trade-off if they know it well, too.
As a Japanese software developer, I find UCS-2 large enough, and limiting the space apparently simplifies the logic and reduces runtime memory, so using UTF-16 under the UCS-2 limitation is good enough.
There are filesystems and other applications which assume code points and bytes to be proportional, so that the raw code point count can be guaranteed to fit into some fixed-size storage.
One example is NTFS and VFAT specifying UCS-2 as their filename storage encoding.
If those examples really want to extend to support UCS-4, I could agree to using UTF-8 for everything anyway, but fixed length has good points like:
In the future, when memory/processing power is cheap even in embedded devices, we may accept the device being a bit slow for extra cache misses or page faults and extra memory usage, but this won't happen in the near future, I guess...
1"Should one of the most popular encodings, UTF-16, be considered harmful?"
Quite possibly, but the alternatives should not necessarily be viewed as being much better.
The fundamental issue is that there are many different concepts about: glyphs, characters, codepoints and byte sequences. The mapping between each of these is non-trivial, even with the aid of a normalization library. (For example, some characters in European languages that are written with a Latin-based script are not written with a single Unicode codepoint. And that's at the simpler end of the complexity!) What this means is that to get everything correct is quite amazingly difficult; bizarre bugs are to be expected (and instead of just moaning about them here, tell the maintainers of the software concerned).
The only way in which UTF-16 can be considered to be harmful as opposed to, say, UTF-8 is that it has a different way of encoding code points outside the BMP (as a pair of surrogates). If code is wishing to access or iterate by code point, that means it needs to be aware of the difference. OTOH, it does mean that a substantial body of existing code that assumes "characters" can always be fit into a two-byte quantity — a fairly common, if wrong, assumption — can at least continue to work without rebuilding it all. In other words, at least you get to see those characters that aren't being handled right!
I'd turn your question on its head and say that the whole damn shebang of Unicode should be considered harmful and everyone ought to use an 8-bit encoding, except I've seen (over the past 20 years) where that leads: horrible confusion over the various ISO 8859 encodings, plus the whole set of ones used for Cyrillic, and the EBCDIC suite, and… well, Unicode for all its faults beats that. If only it wasn't such a nasty compromise between different countries' misunderstandings.