"Fred L. Drake, Jr." wrote: > > M.-A. Lemburg writes: > > Access to this mark will go into sys: sys.bom. > > Can the name in sys be a little more descriptive? > sys.byte_order_mark would be reasonable. The abbreviation BOM is quite common w/r to Unicode. > I think that a support module (possibly unicodec) should provide > constants for all four byte order marks as strings (2- & 4-byte, > little- and big-endian). Names could be short BOM_2_LE, BOM_4_LE, > etc. Good idea... sys.bom should return the byte order mark (BOM) for the format used internally. The unicodec module should provide symbols for all possible values of this variable: BOM_BE: '\376\377' (corresponds to Unicode 0x0000FEFF in UTF-16 == ZERO WIDTH NO-BREAK SPACE) BOM_LE: '\377\376' (corresponds to Unicode 0x0000FFFE in UTF-16 == illegal Unicode character) BOM4_BE: '\000\000\377\376' (corresponds to Unicode 0x0000FEFF in UCS-4) BOM4_LE: '\376\377\000\000' (corresponds to Unicode 0x0000FFFE in UCS-4) Note that Unicode sees big endian byte order as being "correct". The swapped order is taken to be an indicator for a "wrong" format, hence the illegal character definition. -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: 49 days left Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/