Here are the features I'd like to see in a Python Internationalisation Toolkit. I'm very open to persuasion about APIs and how to do it, but this is roughly the functionality I would have wanted for the last year (see separate post "Internationalization Case Study"):

Built-in types:
---------------
"Unicode String" and "Normal String". The normal string can hold all 256 possible byte values and is analogous to Java's byte array - in other words, an ordinary Python string. Unicode strings iterate (and are manipulated) per character, not per byte. You knew that already. To manipulate anything in a funny encoding, you convert it to Unicode, manipulate it there, then convert it back.

Easy Conversions
----------------------
This is modelled on Java, which I think has it right. When you construct a Unicode string, you may supply an optional encoding argument. I'm not bothered if conversion happens in a global function, a constructor method or whatever.

MyUniString = ToUnicode('hello')                                  # assumes ASCII
MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS')   # specified

The converse applies when converting back. The encoding designators should agree with Java's. If data is encountered which is not valid for the encoding, there are several strategies, and it would be nice if they could be specified explicitly:

1. replace offending characters with a question mark
2. try to recover intelligently (possible in some cases)
3. raise an exception

A 'Unicode' designator is needed which performs a dummy conversion.

File Opening:
---------------
It should be possible to work with files as we do now - just streams of binary data. It should also be possible to read, say, a file of locally encoded addresses into a Unicode string, e.g. open(myfile, 'r', 'ShiftJIS'). It should also be possible to open a raw Unicode file and read the bytes into ordinary Python strings, or Unicode strings. In this case one needs to watch out for the byte-order marks at the beginning of the file. I am not sure of a good API to do this; we could have OrdinaryFile objects and UnicodeFile objects, or proliferate the arguments to 'open'.

Doing the Conversions
----------------------------
All conversions should go through Unicode as the central point. Here is where we can start to define the territory. Some conversions are algorithmic, some are lookups, and many are a mixture with some simple state transitions (e.g. shift characters to denote switches from double-byte to single-byte). I'd like to see an 'encoding engine' modelled on something like mxTextTools - a state machine with a few simple actions, effectively a mini-language for doing simple operations. Then a new encoding can be added in a data-driven way, and still go at C-like speeds. Making this open and extensible (and preferably not needing to code C to do it) is the only way I can see to get a really good, solid encodings library. Not all encodings need go in the standard distribution, but all should be downloadable from www.python.org. A generalized two-byte-to-two-byte mapping is 128kb, but there are compact forms which can reduce these to a few kb and also make the data intelligible. It is obviously desirable to store the data compactly if we can unpack it fast.
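To make the simplest case concrete, here is a very rough pure-Python sketch of a table-driven decode that pivots through Unicode and supports two of the error strategies listed under Easy Conversions. The function name, the toy mapping table and the 'errors' keyword are invented for illustration only, not a proposed API.

SAMPLE_TABLE = {              # toy single-byte lookup table; real tables would be
    0x41: 0x0041,             # unpacked from a compact, pickled data file
    0x42: 0x0042,
    0xA1: 0xFF61,             # a half-width Japanese punctuation character
}

def decode_with_table(raw, table, errors='strict'):
    "Decode a sequence of byte values into a Unicode string via a lookup table."
    result = []
    for byte in raw:                      # each element is a byte value 0-255
        if byte in table:
            result.append(chr(table[byte]))
        elif errors == 'replace':
            result.append('?')            # strategy 1: question marks
        else:
            raise ValueError('byte %d is not valid for this encoding' % byte)
    return ''.join(result)

print(decode_with_table(b'AB\xa1', SAMPLE_TABLE))                    # 'AB' plus one Japanese character
print(decode_with_table(b'AB\xff', SAMPLE_TABLE, errors='replace'))  # 'AB?'

The real engine would also have to handle the stateful, shift-character cases and the compact table storage mentioned above; this only shows the shape of the simple lookup case.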
Typed Strings
----------------
When you are writing data conversion tools to sit in the middle of a bunch of databases, you could save a lot of grief with a string that knows its encoding. What follows could be done as a Python wrapper around ordinary strings rather than as a new type, and thus need not be part of the language. This is analogous to Martin Fowler's Quantity pattern in Analysis Patterns, where a number knows its units and you cannot add dollars and pounds accidentally. These would do implicit conversions, and they would stop you assigning or confusing differently encoded strings. They would also validate when constructed. 'Typecasting' would be allowed but would require explicit code. So maybe something like...

>>> ts1 = TypedString('hello', 'cp932ms')    # specify encoding, it remembers it
>>> ts2 = TypedString('goodbye', 'cp5035')
>>> ts1 + ts2                                # or any of a host of other encoding options
EncodingError
>>> ts3 = TypedString(ts1, 'cp5035')         # converts it implicitly, going via Unicode
>>> ts4 = ts1.cast('ShiftJIS')               # the developer knows that in this case the string is compatible

Going Deeper
----------------
The project I describe involved many more issues than just a straight conversion. I envisage an encodings package or module which power users could get at directly. We should be able to answer the questions:

'is string X a valid instance of encoding Y?'

'is string X nearly a valid instance of encoding Y, maybe with a little corruption, or is it something totally different?' - this one might be a task left to a programmer, but the toolkit should help where it can.

'can string X be converted from encoding Y to encoding Z without loss of data? If not, exactly what will get trashed?' - this is a really useful utility.

More generally, I want tools to reason about character sets and encodings. I have 'Character Set' and 'Character Mapping' classes - very app-specific and proprietary - which let me express and answer questions about whether one character set is a superset of another, and reason about round trips. I'd like to do these properly for the toolkit. They would need some C support for speed, but I think they could still be data-driven. So we could have an Encoding object which could be pickled, and we could keep a directory full of them as our database. There might actually be two encoding objects - one for single-byte, one for multi-byte - with the same API. There are so many subtle differences between encodings (even within the Shift-JIS family) - company X has ten extra characters, and that is technically a new encoding - so it would be really useful to reason about these and say 'find me all JIS-compatible encodings', or 'report on the differences between Shift-JIS and cp932ms'.
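As a very rough illustration of the kind of reasoning API I have in mind - the class, the method names and the toy repertoires below are invented for this sketch, not a design - a pickleable Encoding object could carry the set of code points it covers and answer the validity, superset and 'what gets trashed' questions directly:

class Encoding:
    "Toy model: an encoding is its name plus the set of Unicode code points it covers."

    def __init__(self, name, repertoire):
        self.name = name
        self.repertoire = set(repertoire)     # set of Unicode code points

    def is_valid(self, text):
        "Is every character of this Unicode string representable in this encoding?"
        return all(ord(c) in self.repertoire for c in text)

    def is_superset_of(self, other):
        "Can everything representable in 'other' also be represented here?"
        return other.repertoire <= self.repertoire

    def trashed_by_conversion_to(self, other, text):
        "Exactly which characters of 'text' would be lost converting to 'other'?"
        return [c for c in text if ord(c) not in other.repertoire]

ascii_enc  = Encoding('ascii', range(128))       # toy repertoires for illustration
latin1_enc = Encoding('latin-1', range(256))
print(latin1_enc.is_superset_of(ascii_enc))                          # True
print(latin1_enc.trashed_by_conversion_to(ascii_enc, 'na\xefve'))    # the one accented character

A real version would need the compact, data-driven table formats and C support described above, plus a multi-byte variant with the same API, but the querying side could stay about this simple.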
GUI Issues
-------------
The new Pythonwin breaks somewhat on Japanese - editor windows are fine, but console output is shown as single-byte garbage. I will try to evaluate IDLE on a Japanese test box this week. I think these two need to work for double-byte languages for our credibility.

Verifiability and printing
-----------------------------
We will need to prove it all works. This means looking at text on a screen or on paper. A really wicked demo utility would be a GUI which could open files and convert encodings in an editor window or spreadsheet window, and specify conversions on copy/paste. If it could save a page as HTML (just an encoding tag and data between <PRE> tags), then we could use Netscape/IE for verification. Better still, a web server demo could convert on python.org and tag the pages appropriately - browsers support most common encodings. All the encoding stuff is ultimately a bit meaningless without a way to display a character.

I am hoping that PDF and PDFgen may add a lot of value here. Adobe (and Ken Lunde) have spent years coming up with a general architecture for this stuff in PDF. Basically, the multi-byte fonts they use are encoding-independent and come with a whole bunch of mapping tables, so I can ask for the same Japanese font in any of about ten encodings - the font name is a combination of face name and encoding - and the font itself does the remapping. They now make downloadable font packs for Acrobat 4.0 available for most languages; these are good places to raid for building encoding databases. It also means that I can write a Python script to crank out beautiful-looking code page charts for all of our encodings from the database, and input and output for regression tests. I've done it for Shift-JIS at Fidelity, and would have to rewrite it once I am out of here. But I think that some good graphic design here would lead to a product that blows people away - an encodings library that can print out its own contents for viewing, and thus help demonstrate its own correctness (or make errors stick out like a sore thumb).

Am I mad? Have I put you off forever? What I outline above would be a serious project needing months of work; I'd be really happy to take a role, if we could find sponsors for the project. But I believe we could define the standard for years to come. Furthermore, it would go a long way towards making Python the corporate choice for data cleaning and transformation - territory I think we should own.

Regards,

Andy Robinson
Robinson Analytics Ltd.

My opinions are the official policy of Robinson Analytics Ltd. They just vary from day to day.