> But I don't know whether the ambiguity in cp875 is a bug or an
> undocumented feature

The official (as in "as official as it gets") mapping between CP 875
and Unicode is at

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP875.TXT

This is also the file that served as input for generating cp875.py.

Character 1A, which is what these characters are mapped to, is indeed
known by the name "SUBSTITUTE", apparently following the definition in

http://www.its.bldrdoc.gov/fs-1037/dir-035/_5170.htm

# substitute character (SUB): A control character that is used in the
# place of a character that is recognized to be invalid or in error or
# that cannot be represented on a given device.

That would suggest that these characters in EBCDIC 875 do not have
equivalents in Unicode. However,

http://www.kostis.net/charsets/ebc875.htm

suggests that the characters in question (3F, DC, E1, EC, ED, FC, and
FD) have no character meaning at all.

It seems that IBM's ICU library also maps U+001A to character 3F, see

http://oss.software.ibm.com/developerworks/opensource/cvs/icu/data/ibm-875_P100-2000.ucm?rev=1.1&content-type=text/x-cvsweb-markup

It appears, from looking at

http://www.natural-innovations.com/boo/asciiebcdic.html

that byte 3F *is* the substitution character in EBCDIC. So it is a bug
in the CP875 codec to encode Unicode SUBSTITUTE as an arbitrary one of
the EBCDIC bytes that decode to SUBSTITUTE; I think cp875 should be
corrected to always map U+001A to 3F. That is not something the
generator can currently do, though.

So I think we can take one of two approaches:

1. admit that CP 875 is not round-trippable, and exclude it from the
   test (although when looking at the first 128 characters only, it is
   round-trippable).

2. remove the SUBSTITUTE mappings from CP875, acknowledging that
   apparently these characters have no meaning in that code page.
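For anyone who wants to see the ambiguity directly, here is a minimal sketch. It assumes a current Python 3 interpreter (bytes objects, f-strings), which is not what the codec was originally generated for, and the helper names bytes_decoding_to and non_roundtrip_bytes are made up for illustration; they are not part of the codecs machinery. It only lists which bytes decode to U+001A and which bytes fail a decode/encode round trip; it does not attempt either of the two fixes.

```python
def bytes_decoding_to(encoding, char):
    """Return every byte value that the codec decodes to `char`."""
    hits = []
    for b in range(256):
        try:
            if bytes([b]).decode(encoding) == char:
                hits.append(b)
        except UnicodeDecodeError:
            pass  # byte is undefined in this code page
    return hits


def non_roundtrip_bytes(encoding):
    """Return defined byte values that do not survive decode -> encode."""
    failures = []
    for b in range(256):
        raw = bytes([b])
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # undefined byte: nothing to round-trip
        try:
            if text.encode(encoding) != raw:
                failures.append(b)
        except UnicodeEncodeError:
            failures.append(b)  # decoded character cannot be encoded back
    return failures


if __name__ == "__main__":
    subs = bytes_decoding_to("cp875", "\x1a")
    print("bytes decoding to U+001A:", [f"{b:02X}" for b in subs])
    print("U+001A encodes to:", "\x1a".encode("cp875").hex().upper())
    print("round-trip failures:",
          [f"{b:02X}" for b in non_roundtrip_bytes("cp875")])
```

If several bytes decode to U+001A, all but at most one of them must show up among the round-trip failures, since U+001A can only encode to a single byte.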
Unfortunately, I could not find any official IBM documentation page
that lists the characters supported in each of the EBCDIC code pages.

The second approach seems more correct to me, although it deviates
from the Unicode Consortium's published mapping.

Regards,
Martin