> Andy Robinson wrote:
> >
> > Some thoughts on the codecs...
> >
> > 1. Stream interface
> > At the moment a codec has dump and load methods which
> > read a (slice of a) stream into a string in memory and
> > vice versa.  As the proposal notes, this could lead to
> > errors if you take a slice out of a stream.  This is
> > not just due to character truncation; some Asian
> > encodings are modal and have shift-in and shift-out
> > sequences as they move from Western single-byte
> > characters to double-byte ones.  It also seems a bit
> > pointless to me as the source (or target) is still a
> > Unicode string in memory.
> >
> > This is a real problem - a filter to convert big files
> > between two encodings should be possible without
> > knowledge of the particular encoding, as should one on
> > the input/output of some server.  We can still give a
> > default implementation for single-byte encodings.
> >
> > What's a good API for real stream conversion?  Just
> > Codec.encodeStream(infile, outfile)?  Or is it more
> > useful to feed the codec with data a chunk at a time?

M.-A. Lemburg responds:

> The idea was to use Unicode as intermediate for all
> encoding conversions.
>
> What you envision here are stream recoders.  They can
> easily be implemented as a useful addition to the Codec
> subclasses, but I don't think that these have to go
> into the core.

What I wanted was a codec API that acts somewhat like a buffered file;
the buffer makes it possible to efficiently handle shift states.  This
is not exactly what Andy shows, but it's not what Marc's current spec
has either.

I had thought of something more like what Java does: an output stream
codec's constructor takes a writable file object, and the object
returned by the constructor has a write() method, a flush() method and
a close() method.  It acts as a buffering interface to the underlying
file; this allows it to generate the minimal number of shift
sequences.  Similar for input stream codecs.

Andy's file translation example could then be written as follows:

    # assuming variables input_file, input_encoding, output_file,
    # output_encoding, and constant BUFFER_SIZE

    f = open(input_file, "rb")
    f1 = unicodec.codecs[input_encoding].stream_reader(f)
    g = open(output_file, "wb")
    g1 = unicodec.codecs[output_encoding].stream_writer(g)

    while 1:
        buffer = f1.read(BUFFER_SIZE)
        if not buffer:
            break
        g1.write(buffer)

    g1.close()
    f1.close()

Note that we could possibly make these the only API that a codec needs
to provide; the string object <--> unicode object conversions can be
done using this and the cStringIO module.  (On the other hand it seems
a common case that would be quite useful.)
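For the common case, something along these lines would do, assuming
the stream_reader/stream_writer constructors above and using cStringIO
to fake the file objects (untested, and the unicodec names are of
course still hypothetical):

    import cStringIO, unicodec   # unicodec: the hypothetical registry module

    def encode(u, encoding):
        # Unicode object -> encoded 8-bit string
        out = cStringIO.StringIO()
        writer = unicodec.codecs[encoding].stream_writer(out)
        writer.write(u)
        writer.flush()            # force out any pending shift sequences
        return out.getvalue()

    def decode(s, encoding):
        # encoded 8-bit string -> Unicode object
        reader = unicodec.codecs[encoding].stream_reader(cStringIO.StringIO(s))
        return reader.read()      # no size argument: read everything

Whether flush() or only close() is allowed to emit a final shift-out
sequence is exactly the sort of interface detail we would have to pin
down.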
> > 2. Data driven codecs
> > I really like codecs being objects, and believe we
> > could build support for a lot more encodings, a lot
> > sooner than is otherwise possible, by making them data
> > driven rather than making each one compiled C code with
> > static mapping tables.  What do people think about the
> > approach below?
> >
> > First of all, the ISO8859-1 series are straight
> > mappings to Unicode code points.  So one Python script
> > could parse these files and build the mapping table,
> > and a very small data file could hold these encodings.
> > A compiled helper function analogous to
> > string.translate() could deal with most of them.

> The problem with these large tables is that currently
> Python modules are not shared among processes since
> every process builds its own table.
>
> Static C data has the advantage of being shareable at
> the OS level.

Don't worry about it.  128K is too small to care, I think...

> You can of course implement Python based lookup tables,
> but these would be too large...

> > Secondly, the double-byte ones involve a mixture of
> > algorithms and data.  The worst cases I know are modal
> > encodings which need a single-byte lookup table, a
> > double-byte lookup table, and have some very simple
> > rules about escape sequences in between them.  A
> > simple state machine could still handle these (and the
> > single-byte mappings above become extra-simple special
> > cases); I could imagine feeding it a totally
> > data-driven set of rules.
> >
> > Third, we can massively compress the mapping tables
> > using a notation which just lists contiguous ranges;
> > and very often there are relationships between
> > encodings.  For example, "cpXYZ is just like cpXYY but
> > with an extra 'smiley' at 0xFE32".  In these cases, a
> > script can build a family of related codecs in an
> > auditable manner.

> These are all great ideas, but I think they unnecessarily
> complicate the proposal.

Agreed, let's leave the *implementation* of codecs out of the current
efforts.  However I want to make sure that the *interface* to codecs
is defined right, because changing it will be expensive.  (This is
Linus Torvalds's philosophy on drivers -- he doesn't care about bugs
in drivers, as they will get fixed; however he greatly cares about
defining the driver APIs correctly.)
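Purely as an illustrative aside (implementation, not interface):
Andy's first point really needs very little machinery.  Something like
the following plain Python would turn one of the unicode.org mapping
files (8859-2.TXT, say) into a decoding table; how such a table gets
hooked up to a codec object is exactly what the interface discussion
has to settle, so that part is left open here:

    import string

    def load_decoding_table(filename):
        # one slot per byte value; None means "no mapping defined"
        table = [None] * 256
        for line in open(filename).readlines():
            line = string.strip(line)
            if not line or line[0] == '#':
                continue
            fields = string.split(line)
            # e.g. "0xA1  0x0104  # LATIN CAPITAL LETTER A WITH OGONEK"
            if len(fields) < 2 or fields[1][:2] != '0x':
                continue          # undefined position in this code page
            table[string.atoi(fields[0], 16)] = string.atoi(fields[1], 16)
        return table

The encoding direction is just the inverse of this table.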
> > 3. What encodings to distribute?
> > The only clean answers to this are 'almost none', or
> > 'everything that Unicode 3.0 has a mapping for'.  The
> > latter is going to add some weight to the
> > distribution.  What are people's feelings?  Do we ship
> > any at all apart from the Unicode ones?  Should new
> > encodings be downloadable from www.python.org?  Should
> > there be an optional package outside the main
> > distribution?

> Since Codecs can be registered at runtime, there is quite
> some potential there for extension writers coding their
> own fast codecs.  E.g. one could use mxTextTools as codec
> engine working at C speeds.

(Do you think you'll be able to extort some money from HP for these? :-)

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
>
>   'utf-8':           8-bit variable length encoding
>   'utf-16':          16-bit variable length encoding (little/big endian)
>   'utf-16-le':       utf-16 but explicitly little endian
>   'utf-16-be':       utf-16 but explicitly big endian
>   'ascii':           7-bit ASCII codepage
>   'latin-1':         Latin-1 codepage
>   'html-entities':   Latin-1 + HTML entities;
>                      see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
>                      Japanese character encoding
>   'unicode-escape':  See Unicode Constructors for a definition
>   'native':          Dump of the Internal Format used by Python
>
> Perhaps not even 'html-entities' (even though it would make
> a cool replacement for cgi.escape()) and maybe we should
> also place the JIS encoding into a separate Unicode package.

I'd drop html-entities, it seems too cutesie.  (And who uses these
anyway, outside browsers?)

For JIS (shift-JIS?) I hope that Andy can help us with some pointers
and validation.

And unicode-escape: now that you mention it, this is a section of the
proposal that I don't understand.  I quote it here:

| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
|
| u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What do you mean by this notation?  Since encoding names are not
always legal Python identifiers (most contain hyphens), I don't
understand what you really meant here.  Do you mean to say that it has
to be a keyword argument?  I would disagree; and then I would have
expected the notation [,encoding=<default encoding>].

| With the 'unicode-escape' encoding being defined as:
|
| u = u'<unicode-escape encoded Python string>'
|
| · for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
|
| · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
|   instead, e.g. \u03C0 to represent the character Pi.

I've looked at this several times and I don't see the difference
between the two bullets.  (Ironically, you are using a non-ASCII
character here that doesn't always display, depending on where I look
at your mail :-).

Can you give some examples?

Is u'\u0020' different from u'\x20' (a space)?

Does '\u0020' (no u prefix) have a meaning?

Also, I remember Tim Peters suggesting that a "raw unicode" notation
(ur"...") might be necessary to encode regular expressions.  I tend to
agree.

While I'm on the topic, I don't see in your proposal a description of
the source file character encoding.  Currently, this is undefined, and
in fact can be (ab)used to enter non-ASCII in string literals.  For
example, a programmer named François might write a file containing
this statement:

    print "Written by François." # (There's a cedilla in there!)

(He assumes his source character encoding is Latin-1, and he doesn't
want to have to type \347 when he can type a cedilla on his keyboard.)

If his source file (or .pyc file!) is executed by a Japanese user,
this will probably print some garbage.

Using the new Unicode strings, François could change his program as
follows:

    print unicode("Written by François.", "latin-1")

Assuming that François sets his sys.stdout to use Latin-1, while the
Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).

But when the Japanese user views François' source file, he will again
see garbage.  If he uses a generic tool to translate latin-1 files to
shift-JIS (assuming shift-JIS has a cedilla character) the program
will no longer work correctly -- the string "latin-1" has to be
changed to "shift-jis".

What should we do about this?  The safest and most radical solution is
to disallow non-ASCII source characters; François will then have to
type

    print u"Written by Fran\u00E7ois."

but, knowing François, he probably won't like this solution very much
(since he didn't like the \347 version either).
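For concreteness: "sets his sys.stdout to use Latin-1" above could be
as simple as wrapping stdout in one of the stream writers from the
beginning of this message.  A sketch (hypothetical names again, and it
assumes that print passes Unicode objects straight through to the
file's write() method):

    import sys, unicodec     # unicodec: the hypothetical registry module

    # Francois, somewhere in his startup code:
    sys.stdout = unicodec.codecs['latin-1'].stream_writer(sys.stdout)

    # the Japanese user would wrap it with the 'shift-jis' codec instead;
    # after that, both can run the same program unchanged:
    print unicode("Written by Fran\347ois.", "latin-1")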
--Guido van Rossum (home page: http://www.python.org/~guido/)