[Responding to some lingering mails]

[/F]
> > >>> u = unicode("å i åa ä e ö", "iso-latin-1")
> > >>> s = u.encode("html-entities")
> > >>> d = decoder("html-entities")
> > >>> d.decode(s[:-1])
> > "å i åa ä e "
> > >>> d.flush()
> > "ö"

[MAL]
> Ah, ok. So the .flush() method checks for proper
> string endings and then either returns the remaining
> input or raises an error.

No, please.  See my previous post on flush().

> > input: read chunks of data, decode, and
> > keep extra data in a local buffer.
> >
> > output: encode data into suitable chunks,
> > and write to the output stream (that's why
> > there's a buffersize argument to encode --
> > if someone writes a 10mb unicode string to
> > an encoded stream, python shouldn't allocate
> > an extra 10-30 megabytes just to be able to
> > encode the darn thing...)
>
> So the stream codecs would be wrappers around the
> string codecs.

No -- the other way around.  Think of the stream encoder as a little
FSM engine that you feed with unicode characters and which sends bytes
to the backend stream.  When a unicode character comes in that
requires a particular shift state, and the FSM isn't in that shift
state, it emits the escape sequence to enter that shift state first.
It should use standard buffered writes to the output stream; i.e. one
call to feed the encoder could cause several calls to write() on the
output stream, or vice versa (if you fed the encoder a single
character it might keep it in its own buffer).  That's all up to the
codec implementation.

The flush() forces the FSM into the "neutral" shift state, possibly
writing an escape sequence to leave the current shift state, and
empties the internal buffer.

The string codec CONCEPTUALLY uses the stream codec to a cStringIO
object, using flush() to force the final output.  However the
implementation may take a shortcut.  For stateless encodings the
stream codec may call on the string codec, but that's all an
implementation issue.
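A minimal sketch of the encoder FSM described above, written in present-day Python rather than the API under discussion in this thread. Everything named here is hypothetical: the class `ToyShiftEncoder`, its `feed()` method, the two escape bytes, and the toy encoding itself (ASCII in the neutral state, two big-endian bytes per character in a "wide" state). It only illustrates the shift-state bookkeeping, the buffered writes, and flush() returning to the neutral state.

```python
import io

SHIFT_OUT = b"\x0e"  # hypothetical escape: enter the "wide" shift state
SHIFT_IN = b"\x0f"   # hypothetical escape: return to the neutral state


class ToyShiftEncoder:
    """Toy stream encoder: a small FSM with 'neutral' and 'wide' states.

    Not a real encoding: it does not escape literal 0x0e/0x0f characters.
    """

    def __init__(self, stream, buffersize=16):
        self.stream = stream          # any object with a write(bytes) method
        self.buffersize = buffersize  # write to the stream once this many bytes pile up
        self.state = "neutral"
        self.pending = bytearray()    # the encoder's own internal buffer

    def _emit(self, data):
        # Buffered write: one feed() may cause zero, one, or many write() calls.
        self.pending += data
        if len(self.pending) >= self.buffersize:
            self.stream.write(bytes(self.pending))
            self.pending.clear()

    def feed(self, text):
        for ch in text:
            code = ord(ch)
            if code > 0xFFFF:
                raise ValueError("toy encoding only covers U+0000..U+FFFF")
            if code < 0x80:
                if self.state != "neutral":
                    self._emit(SHIFT_IN)   # leave the wide state first
                    self.state = "neutral"
                self._emit(bytes([code]))
            else:
                if self.state != "wide":
                    self._emit(SHIFT_OUT)  # enter the wide state first
                    self.state = "wide"
                self._emit(code.to_bytes(2, "big"))

    def flush(self):
        # Force the FSM back to neutral, then empty the internal buffer.
        if self.state != "neutral":
            self.pending += SHIFT_IN
            self.state = "neutral"
        if self.pending:
            self.stream.write(bytes(self.pending))
            self.pending.clear()


def encode_string(text):
    """String codec expressed through the stream codec plus flush()."""
    sink = io.BytesIO()
    encoder = ToyShiftEncoder(sink)
    encoder.feed(text)
    encoder.flush()
    return sink.getvalue()
```

Here `encode_string()` plays the role of the string codec that conceptually drives the stream codec into an in-memory buffer; a stateless encoding could skip the FSM entirely, which is the shortcut the post allows for.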
For input, things are slightly different: you don't know how much
encoded data you must read to give you N Unicode characters, so you
may have to make a guess and hold on to some data that you read
unnecessarily -- either in encoded form or in Unicode form, at the
discretion of the implementation.  Using seek() on the input stream is
forbidden (it could be a pipe or socket).

--Guido van Rossum (home page: http://www.python.org/~guido/)
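The input side described above can be sketched with the incremental decoders in today's Python standard library (`codecs.getincrementaldecoder`), which postdate this thread and are not the API being designed in it. The helper name `decode_chunks` and the tiny `chunk_size` are assumptions for illustration. The decoder keeps incomplete trailing bytes in its own buffer between calls, and the stream is only ever read forward, never seek()ed.

```python
import codecs
import io


def decode_chunks(stream, encoding="utf-8", chunk_size=4):
    """Yield decoded text from a byte stream without ever calling seek()."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)  # a guess at how much to read
        if not chunk:
            break
        # Bytes that end mid-character stay buffered inside the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Usage: chunk boundaries fall inside multi-byte characters,
# yet the joined output matches the original text.
data = "å i åa ä e ö".encode("utf-8")
result = "".join(decode_chunks(io.BytesIO(data), chunk_size=3))
```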