This page is a snapshot from the LWG issues list; see the Library Active Issues List for more information and the meaning of New status.
2959. char_traits<char16_t>::eof is a valid UTF-16 code unit
Section: 27.2.4.4 [char.traits.specializations.char16.t] Status: New Submitter: Jonathan Wakely Opened: 2017-05-05 Last modified: 2019-04-02
Priority: 3
Discussion:
The standard requires that char_traits<char16_t>::int_type is uint_least16_t, so when that has the same representation as char16_t there are no bits left to represent the eof value.
"The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit."
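The representation problem can be checked at compile time. The following is a minimal sketch: the alias traits16 and the helper int_type_has_no_spare_values are names invented for illustration, and the helper's result depends on the platform (it is expected to be true wherever uint_least16_t is exactly 16 bits wide).

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <type_traits>

using traits16 = std::char_traits<char16_t>;

// Required by the standard: int_type is uint_least16_t.
static_assert(std::is_same<traits16::int_type, std::uint_least16_t>::value,
              "char_traits<char16_t>::int_type must be uint_least16_t");

// Hypothetical helper (not a library function): true when every value of
// int_type lies in the UTF-16 code unit range [0, 0xFFFF], so no value is
// left over to serve as an eof() distinct from all code units.
constexpr bool int_type_has_no_spare_values() {
    return std::numeric_limits<traits16::int_type>::max() == 0xFFFF;
}
```

When the helper yields true, whatever constant eof() returns is necessarily also the value of some code unit.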
Existing practice is to use the "noncharacter" u'\uffff' for this value, but the Unicode spec is clear that U+FFFF and other noncharacters are valid, and their appearance in a UTF-16 string does not make it ill-formed. See here and here:
The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.
In practice this means there's no way to tell if basic_streambuf<char16_t>::sputc(u'\uffff') succeeded or not. If it can insert the character it returns to_int_type(u'\uffff') and otherwise it returns eof(), which is the same value.
char_traits<char16_t>::to_int_type(char_type c) can be defined to transform U+FFFF into U+FFFD, so that the invariant eq_int_type(eof(), to_int_type(c)) == false holds for any c (and the return value of sputc will be distinct from eof). I don't think any implementation currently meets that invariant. I think at the very least we need to correct the statement "The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit", because there are no such constants if sizeof(uint_least16_t) == sizeof(char16_t). This issue is closely related to LWG 1200, but there it's a slightly different statement of the problem, and neither the submitter's recommendation nor the proposed resolution solves this issue here. It seems that was closed as NAD before the Unicode corrigendum existed, so at the time our standard just gave "surprising results" but wasn't strictly wrong. Now it makes a normative statement that conflicts with Unicode.
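The to_int_type remapping described above could look roughly like the sketch below. remapping_traits and never_eof are hypothetical names, not part of any implementation or of a proposed resolution, and the sketch assumes that the inherited eof() is 0xFFFF, as in existing practice.

```cpp
#include <string>

// Hypothetical traits class that inherits everything from the standard
// specialization and replaces only to_int_type.
struct remapping_traits : std::char_traits<char16_t> {
    // Map U+FFFF to U+FFFD so the result never compares equal to eof(),
    // assuming eof() is 0xFFFF.
    static constexpr int_type to_int_type(char_type c) noexcept {
        return c == char_type(0xFFFF) ? int_type(0xFFFD) : int_type(c);
    }
};

// The invariant from the discussion: eq_int_type(eof(), to_int_type(c)) == false.
constexpr bool never_eof(char16_t c) {
    return !remapping_traits::eq_int_type(remapping_traits::eof(),
                                          remapping_traits::to_int_type(c));
}
```

The cost of this design is that to_int_type is no longer injective: U+FFFF and U+FFFD map to the same int_type value, so converting back with to_char_type cannot recover U+FFFF.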
[2017-07 Toronto Wed Issue Prioritization]
Priority 3
Proposed resolution: