Some thoughts on the codecs...

1. Stream interface

At the moment a codec has dump and load methods which read a (slice of a) stream into a string in memory and vice versa. As the proposal notes, this could lead to errors if you take a slice out of a stream. This is not just due to character truncation; some Asian encodings are modal and have shift-in and shift-out sequences as they move from Western single-byte characters to double-byte ones. It also seems a bit pointless to me, since the source (or target) is still a Unicode string in memory.

This is a real problem - a filter to convert big files between two encodings should be possible without any knowledge of the particular encodings, as should a filter on the input/output of some server. We can still give a default implementation for single-byte encodings.

What's a good API for real stream conversion? Just Codec.encodeStream(infile, outfile)? Or is it more useful to feed the codec data a chunk at a time? (I've put a rough sketch of the chunked version at the end of this message.)

2. Data driven codecs

I really like codecs being objects, and I believe we could build support for a lot more encodings, a lot sooner than is otherwise possible, by making them data-driven rather than making each one compiled C code with static mapping tables. What do people think about the approach below?

First of all, the ISO 8859-1 series are straight mappings to Unicode code points, so one Python script could parse the published mapping files and build the tables, and a very small data file could hold these encodings. A compiled helper function analogous to string.translate() could deal with most of them.

Secondly, the double-byte ones involve a mixture of algorithms and data. The worst cases I know are modal encodings, which need a single-byte lookup table, a double-byte lookup table, and some very simple rules about the escape sequences that switch between them. A simple state machine could still handle these (and the single-byte mappings above become extra-simple special cases); I could imagine feeding it a totally data-driven set of rules.

Third, we can massively compress the mapping tables using a notation which just lists contiguous ranges, and very often there are relationships between encodings - for example, "cpXYZ is just like cpXYY but with an extra 'smiley' at 0xFE32". In these cases, a script can build a family of related codecs in an auditable manner. (Sketches of the state machine and the range notation also follow at the end.)

3. What encodings to distribute?

The only clean answers to this are "almost none" or "everything that Unicode 3.0 has a mapping for". The latter is going to add some weight to the distribution. What are people's feelings? Do we ship any at all apart from the Unicode ones? Should new encodings be downloadable from www.python.org? Should there be an optional package outside the main distribution?

Thanks,

Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
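P.S. To make point 1 concrete, here is a minimal sketch of the chunked filter loop I have in mind. The decode_chunk/encode_chunk interface (returning the decoded text plus any bytes that could not yet be decoded) is entirely hypothetical, and Latin1Codec exists only so the example runs; none of this is part of the current proposal.

    import io

    class Latin1Codec:
        # Toy, stateless codec - only here so the sketch actually runs.
        # A real codec object would carry per-stream state (shift mode,
        # a partial multi-byte character, etc.) between calls.
        def decode_chunk(self, data, final=False):
            # Return (decoded text, undecodable trailing bytes).
            return data.decode("latin-1"), b""

        def encode_chunk(self, text, final=False):
            return text.encode("latin-1")

    def recode_stream(src_codec, dst_codec, infile, outfile, chunk_size=4096):
        # Generic filter: it never looks inside the data, it just pushes
        # chunks through two codec objects and lets them carry any state
        # across chunk boundaries.
        pending = b""
        while True:
            raw = infile.read(chunk_size)
            final = not raw
            text, pending = src_codec.decode_chunk(pending + raw, final=final)
            outfile.write(dst_codec.encode_chunk(text, final=final))
            if final:
                break

    src = io.BytesIO(b"caf\xe9")          # Latin-1 bytes
    dst = io.BytesIO()
    recode_stream(Latin1Codec(), Latin1Codec(), src, dst)
    assert dst.getvalue() == b"caf\xe9"   # round trip, no knowledge of the encoding

The final flag is what lets a modal codec flush a pending shift sequence when the input ends.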
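For the modal encodings in point 2, the state machine really is tiny. Every value below (the shift bytes and both tables) is invented purely to show the shape; a real codec would read the tables and the shift rules from a data file generated by the build script.

    # An imaginary modal encoding: SHIFT_OUT switches to a double-byte
    # table, SHIFT_IN switches back to a single-byte table. All of the
    # mappings are made up for illustration.
    SHIFT_OUT, SHIFT_IN = 0x0E, 0x0F
    SINGLE = {b: b for b in range(0x20, 0x7F)}                 # ASCII passthrough
    DOUBLE = {(0x21, 0x21): 0x3000, (0x21, 0x22): 0x3001}      # two invented pairs

    def decode_modal(data):
        # The decoder's entire state is one flag: which table we are in.
        out = []
        double = False
        i = 0
        while i < len(data):
            b = data[i]
            if b == SHIFT_OUT:
                double = True
                i += 1
            elif b == SHIFT_IN:
                double = False
                i += 1
            elif double:
                out.append(chr(DOUBLE[(data[i], data[i + 1])]))
                i += 2
            else:
                out.append(chr(SINGLE[b]))
                i += 1
        return "".join(out)

    sample = bytes([0x41, 0x42, SHIFT_OUT, 0x21, 0x21, 0x21, 0x22, SHIFT_IN, 0x43])
    assert decode_modal(sample) == "AB\u3000\u3001C"

A single-byte encoding is just the degenerate case where the double-byte table is empty and there are no shift bytes, so the same engine (and the same data format) covers both.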
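And here is the range notation from point 2, with a derived code page built on top of it. Again, the names and the 0xFE override are invented; the point is only that a full table becomes a handful of tuples plus a short list of exceptions, which a script can build and audit easily.

    # A single-byte mapping expressed as contiguous runs:
    # (first_byte, last_byte, first_codepoint).
    ISO8859_1_RANGES = [(0x00, 0xFF, 0x0000)]    # Latin-1 is one straight run

    def expand(ranges, overrides=()):
        # Build the full byte -> code point table from the runs, then
        # apply any per-byte exceptions for a derived encoding.
        table = {}
        for lo, hi, start in ranges:
            for offset in range(hi - lo + 1):
                table[lo + offset] = start + offset
        table.update(overrides)
        return table

    # An invented "code page": identical to Latin-1 except that one byte
    # is remapped to WHITE SMILING FACE (U+263A).
    CP_SMILEY = expand(ISO8859_1_RANGES, overrides={0xFE: 0x263A})

    assert CP_SMILEY[0x41] == ord("A")
    assert CP_SMILEY[0xFE] == 0x263A

The expanded 256-entry table is what a compiled helper in the spirit of string.translate() would actually consume; the ranges are just the storage and audit format.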