Thank you, Martin, for your help. I've been taking a closer look at the codecs module as a way to move escaped Unicode characters/strings from an 8-bit text file into lists. The excellent document "What's New in Python 2.0" by A.M. Kuchling and Moshe Zadka has the best examples I've seen so far. I've had to supplement it with Marc-Andre Lemburg's "Python Unicode Integration" (version 1.8) for listings of the parameters to use.

Martin von Loewis wrote:

> Maurice Bauhahn <bauhahnm at clara.net> writes:
>
> > My imports of escaped Unicode (u'\u1780' or '\u1780') end up in my lists
> > as:
> >
> > ["u'\\u1780'"]
>
> I very much doubt this. This looks more like the repr of a list,
> instead of like the list itself. That could be an incompatibility of
> repr for Unicode objects in Python, but I assume that the list is
> still build correctly.

It could be that, because Jython's default encoding is 'ascii', my reader did not interpret those escapes. I used sys.getdefaultencoding() to check that encoding. Subsequently I tried the following:

(UTF8_encode, UTF8_decode, UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
(UNIESCAPE_encode, UNIESCAPE_decode, UNIESCAPE_streamreader, UNIESCAPE_streamwriter) = codecs.lookup('unicode-escape')
oneencoding = UNIESCAPE_streamreader(open('H:\\jy\\encodings\\KSCIIOne.txt', 'r'))
outdocument = UTF8_streamwriter(open('h:\\jy\\outtest.txt', 'wb'))
for encodingline in oneencoding.readlines():

The error returned from this last line is:

SyntaxError: invalid syntax
>>> execfile('h:\\jy\\test.py')
Traceback (innermost last):
  File "<console>", line 1, in ?
  File "h:\jy\test.py", line 408, in ?
  File "h:\jy\test.py", line 80, in loadencode
  File "D:\Java\jython\Lib\codecs.py", line 269, in readlines
TypeError: unicode_escape_decode(): expected 2 args; got 1

Which two arguments were expected, and where?

> > and .write as u'\u1780'.
>
> In CPython, that would give an exception. You cannot write a Unicode
> object onto a stream without encoding it first.

Which encoding would you recommend for the write() function (if I want to use regular expressions on the output)? I like UTF-8 because it leaves ASCII characters pretty much as they were; however, I'm afraid that parsing and regular expression tools will have problems with the variable length of its characters. Next I want to do letter-pair frequency studies on the output.

> > From the command line I can get something useful by writing:
> >
> > u'\u1780'.encode('utf-8')
> >
> > but it does not appear to work within my jython script.
>
> That should work. How does it fail?

The problem is probably back at my input: my list of strings read from the file still has that u'\\u1780' format.

> Regards,
> Martin

--
Maurice Bauhahn
United Kingdom
Home: bauhahnm at clara dot net
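
A minimal sketch of the decode-then-write flow discussed above, assuming a current CPython rather than the Jython release in the traceback, and an in-memory stand-in for the input file. The sample bytes and all variable names are hypothetical. It decodes each line with the unicode_escape codec and writes the result through a UTF-8 stream writer.

```python
import codecs
import io

# Hypothetical stand-in for the 8-bit input file: plain ASCII bytes that
# contain backslash-u escapes, one entry per line.
raw = io.BytesIO(b"\\u1780\\u1781\n\\u1782\n")

# Decode line by line so the escapes become real Unicode characters.
lines = []
for raw_line in raw.readlines():
    lines.append(raw_line.decode("unicode_escape"))

# Wrap a binary stream so that Unicode text is encoded as UTF-8 on write.
out = io.BytesIO()
writer = codecs.getwriter("utf-8")(out)
for line in lines:
    writer.write(line)
```

Matching against the decoded strings in `lines`, rather than against the UTF-8 bytes, should sidestep the variable-length concern, since the pattern then sees whole characters.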