Andrew McNamara wrote:

> Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv
> files, so a reasonable starting point would be the ability to read and
> parse, as well as the ability to generate, one of these.

I see. That would be reasonable, indeed. Notice that this is not so much a
"Unicode issue" as an "encoding" issue: if you solve the "arbitrary
encodings" problem, you solve UTF-16 as a side effect.

> The reader interface currently returns a row at a time, consuming as many
> lines from the supplied iterable (with the most common iterable being
> a file). This suggests to me that we will need an optional "encoding"
> argument to the reader constructor, and that the reader will need to
> decode the source lines.

Ok. In this context, I see two possible implementation strategies:

1. Implement the csv module twice: once for bytes, and once for Unicode
   characters. It is likely that the source code would be the same for each
   case; you just need to make sure the "Dialect and Formatting Parameters"
   change their width accordingly. If you use the SRE approach, you would do

      #define CSV_ITEM_T char
      #define CSV_NAME_PREFIX byte_
      #include "csvimpl.c"

      #define CSV_ITEM_T Py_UNICODE
      #define CSV_NAME_PREFIX unicode_
      #include "csvimpl.c"

2. Use just the existing _csv module, and represent non-byte encodings as
   UTF-8. This will work as long as the delimiters and other markup
   characters always occupy a single byte in UTF-8, which is the case for
   the quote, apostrophe, colon, backslash and comma characters ("':\,),
   as well as for \r and \n. Then, when processing with an explicit
   encoding, first convert the input into Unicode objects, encode the
   Unicode objects into UTF-8, and pass that to _csv. For the results you
   get back, convert each element from UTF-8 back into a Unicode object.
   This could be implemented as

      import codecs
      import itertools
      import _csv

      def reader(f, encoding=None):
          if encoding is None:
              return _csv.reader(f)
          enc, dec, stream_reader, stream_writer = codecs.lookup(encoding)
          utf8_enc, utf8_dec, utf8_r, utf8_w = codecs.lookup("UTF-8")
          # Recoder used only for reading: the input is decoded from the
          # source encoding and re-encoded as UTF-8 for _csv.
          utf8_stream = codecs.StreamRecoder(f, utf8_enc, utf8_dec,
                                             stream_reader, stream_writer)
          csv_reader = _csv.reader(utf8_stream)
          # For performance reasons, map_result could be implemented in C
          def map_result(t):
              result = [None]*len(t)
              for i, val in enumerate(t):
                  # each decoder call returns (unicode_object, length)
                  result[i] = utf8_dec(val)[0]
              return tuple(result)
          return itertools.imap(map_result, csv_reader)
      # This code is untested

   This approach has the disadvantage of performing three recodings: from
   the input charset to Unicode, from Unicode to UTF-8, and from UTF-8 back
   to Unicode. One could:

   - skip the initial recoding if the encoding is already known to be
     _csv-safe (i.e. if it is a pure ASCII superset). This would be valid
     for ASCII, iso-8859-n, UTF-8, ...
   - offer the user the option of keeping the results in the input
     encoding, instead of always returning Unicode objects.

   Apart from this disadvantage, I think this gives people what they want:
   they can specify the encoding of the input, and they get the results
   not only csv-separated, but also unicode-decoded. This approach is the
   same one that is used for Python source code encodings: the source is
   first recoded into UTF-8, then parsed, then recoded back. (Generating
   csv in a given encoding could reuse the same recoding in reverse; a
   sketch follows below.)

> That said, I'm hardly a unicode expert, so I
> may be overlooking something (could a utf-16 encoded character span a
> line break, for example).

This cannot happen: \r, in UTF-16, is also 2 bytes (0D 00, if UTF-16LE).
There is also the issue that Unicode has additional line break characters,
but that is probably irrelevant here.

Regards,
Martin
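Generating encoded output can follow the reader sketch in reverse: encode
each Unicode field to UTF-8, let _csv produce the UTF-8 line, and recode
that line into the target encoding on its way to the file. A minimal sketch
under the same assumptions (equally untested; the writer function and the
UnicodeWriter wrapper class are illustrative names, not part of _csv, and
the fields are assumed to be Unicode objects):

    import codecs
    import _csv

    def writer(f, encoding=None):
        if encoding is None:
            return _csv.writer(f)
        enc, dec, stream_reader, stream_writer = codecs.lookup(encoding)
        utf8_enc, utf8_dec, utf8_r, utf8_w = codecs.lookup("UTF-8")
        # Recoder used only for writing: _csv writes UTF-8 bytes into it,
        # and they reach the underlying file in the target encoding.
        utf8_stream = codecs.StreamRecoder(f, utf8_enc, utf8_dec,
                                           stream_reader, stream_writer)
        csv_writer = _csv.writer(utf8_stream)

        class UnicodeWriter:
            # Each Unicode field is encoded to UTF-8 before the row is
            # handed to the byte-oriented _csv writer.
            def writerow(self, row):
                csv_writer.writerow([utf8_enc(field)[0] for field in row])
            def writerows(self, rows):
                for row in rows:
                    self.writerow(row)

        return UnicodeWriter()
    # This code is untested as well

The same shortcut as on the reading side applies: if the target encoding is
a pure ASCII superset, the detour through UTF-8 could be dropped and the
fields encoded directly.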