Mark Hammond wrote:
>
> I would like to discuss Unicode on the Windows platform, and how it relates
> to MBCS that Windows uses.
>
> My main goal here is to ensure that Unicode on Windows can make a round-trip
> to and from native Unicode stores. As an example, let's take the registry -
> a Windows user should be able to read a Unicode value from the registry then
> write it back. The value written back should be _identical_ to the value
> read. Ditto for the file system: If the filesystem is Unicode, then I would
> expect the following code:
>   for fname in os.listdir():
>       f = open(fname + ".tmp", "w")
>
> To create filenames on the filesystem with the exact base name even when the
> basename contains non-ascii characters.
>
> However, the Unicode patches do not appear to make this possible. open()
> uses PyArg_ParseTuple(args, "s..."); PyArg_ParseTuple() will automatically
> convert a Unicode object to UTF-8, so we end up passing a UTF-8 encoded
> string to the C runtime fopen function.

Right. The idea with open() was to write a special version (using
#ifdefs) for use on Windows platforms which does all the needed
magic to convert Unicode to whatever the native format and locale
is...

Using parser markers for this is obviously *not* the right way
to get to the core of the problem. Basically, you will have to
write a helper which takes a string, Unicode or some other
"t" compatible object as the name object and then converts it to
the system's view of things.

I think we had a private discussion about this a few months ago:
there was some way to convert Unicode to a platform independent
format which then got converted to MBCS -- I don't remember the
details, though.

> The end result of all this is that we end up with UTF-8 encoded names in the
> registry/on the file system. It does not seem possible to get a true
> Unicode string onto either the file system or in the registry.
>
> Unfortunately, I'm not experienced enough to know the full ramifications, but
> it _appears_ that on Windows the default "unicode to string" translation
> should be done via the WideCharToMultiByte() API. This will then pass an
> MBCS encoded ascii string to Windows, and the "right thing" should magically
> happen. Unfortunately, MBCS encoding is dependent on the current locale
> (ie, one MBCS sequence will mean completely different things depending on
> the locale). I don't see a portability issue here, as the documentation
> could state that "Unicode->ASCII conversions use the most appropriate
> conversion for the platform. If the platform is not Unicode aware, then
> UTF-8 will be used."

No, no, no... :-) The default should be (and is) UTF-8 on all platforms
-- whether the platform supports Unicode or not. If a platform
uses a different encoding, an encoder should be used which applies
the needed transformation.

> This issue is the final one before I release the win32reg module. It seems
> _critical_ to me that if Python supports Unicode and the platform supports
> Unicode, then Python unicode values must be capable of being passed to the
> platform. For the win32reg module I could quite possibly hack around the
> problem, but the more general problem (categorized by the open() example
> above) still remains...
>
> Any thoughts?

Can't you use the wchar_t interfaces for the task (see
the unicodeobject.h file for details)? Perhaps you can
first transfer Unicode to wchar_t and then on to MBCS
using a win32 API?!

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
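
The encoding mismatch discussed in this thread can be reproduced in a few lines. What follows is only a rough sketch in present-day Python, not the approach taken in the Unicode patches. It assumes cp1252 and cp1251 as stand-ins for two locale-dependent Windows code pages (the "mbcs" codec itself is only available on Windows), and to_native() is a hypothetical helper name made up for this illustration.

```python
def to_native(name, codepage):
    """Hypothetical helper: turn a name into the bytes a byte-oriented
    system API would expect, given an assumed code page."""
    if isinstance(name, bytes):
        return name  # already bytes; pass through unchanged
    return name.encode(codepage)


name = "caf\u00e9"  # "café", a base name with one non-ASCII character

# The problem from the thread: UTF-8 bytes are handed to an API that
# interprets them in the locale's code page (cp1252 assumed here).
utf8_bytes = name.encode("utf-8")
seen_by_system = utf8_bytes.decode("cp1252")

# Encoding with the code page the system expects survives the round trip.
native_bytes = to_native(name, "cp1252")
round_trip = native_bytes.decode("cp1252")

# The same bytes read under a different locale's code page (cp1251
# assumed) yield a different name, which is the locale dependence
# mentioned above.
other_locale = native_bytes.decode("cp1251")

print(repr(seen_by_system), repr(round_trip), repr(other_locale))
```

On Windows, passing "mbcs" as the code page would presumably be the closest match to what WideCharToMultiByte() does with the active code page, but that is an assumption and not something the sketch exercises.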