[Martin v. Loewis, whose encyclopedic knowledge of encoding details
 still isn't enough to get a clear answer (it's like somebody asking
 me for a simple answer to a floating point question <wink>)]
> ...
> So I think we can take one of two approaches:
>
> 1. admit that CP 875 is not round-trippable, and exclude it from the
>    test (although when looking at the first 128 characters only, it
>    is round-trippable).

As I noted later, 875 is already excluded from the roundtrip test across
range(128, 256).  What it's failing is the roundtrip test across
range(128):  after unicode("?", "cp875") produces u'\x1a', the following
.encode('cp875') has no way to know which range the original input came
from.  So it's not really round-trippable across range(128) either unless
more info is given to .encode().

> 2. remove the SUBSTITUTE mappings from CP875, acknowledging that
>    apparently these characters have no meaning in that code page.
>    Unfortunately, I could not find any official IBM documentation
>    page that lists the characters supported in each of the EBCDIC
>    code pages.
>
> The second seems to be more correct to me, although it is a deviation
> from the Unicode consortium publications.

Until you and MAL agree on the best thing to do (I have no opinion:  my
only exposure to Unicode in daily programming life remains the Python test
suite), I'm going to opt for #1:  as cp875.py stands today, it's simply a
fact that it's not round-trippable across any range including 0x3f.
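
To make the failure mode concrete, here is a minimal sketch of why a charmap with duplicated targets can't round-trip. It's written for modern Python 3 rather than the unicode()/u'' spelling above, and it does not touch the real codec: toy_decoding_map is a hypothetical four-entry table (only the 0x3f -> u'\x1a' entry comes from the discussion above; the second SUBSTITUTE byte, 0xFD, is an assumption picked for illustration), and ambiguous_targets() and round_trips() are made-up helper names.

```python
from collections import defaultdict

# Hypothetical toy table, NOT the real cp875 data.  Two different bytes
# decode to U+001A (SUBSTITUTE); 0xFD is an arbitrary stand-in.
toy_decoding_map = {
    0x3F: "\x1a",
    0x40: " ",
    0xC1: "A",
    0xFD: "\x1a",
}


def ambiguous_targets(decoding_map):
    """Map each character produced by more than one byte to those bytes."""
    sources = defaultdict(list)
    for byte, char in sorted(decoding_map.items()):
        sources[char].append(byte)
    return {char: byts for char, byts in sources.items() if len(byts) > 1}


def round_trips(byte, decoding_map):
    """Decode one byte, re-encode it through a naive inverse, and compare."""
    # Naive inversion: when several bytes share a character, the highest
    # byte value wins, so the others cannot be recovered.
    encoding_map = {char: b for b, char in sorted(decoding_map.items())}
    return encoding_map[decoding_map[byte]] == byte


if __name__ == "__main__":
    print(ambiguous_targets(toy_decoding_map))
    print([hex(b) for b in sorted(toy_decoding_map)
           if not round_trips(b, toy_decoding_map)])
```

Whichever byte the inverse table picks for u'\x1a', at least one of the others is lost, which is the "no way to know which range the original input came from" problem. Dropping the duplicate entries from the table (approach 2) or skipping the codec in the test (approach 1) are the two ways out.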