Tim Peters wrote:
>
> [MAL, on raw Unicode strings]
> > ...
> > Agreed... note that you could also write your own codec for just this
> > reason and then use:
> >
> >     u = unicode('....\u1234...\...\...','raw-unicode-escaped')
> >
> > Put that into a function called 'ur' and you have:
> >
> >     u = ur('...\u4545...\...\...')
> >
> > which is not that far away from ur'...' w/r to cosmetics.
>
> Well, not quite.  In general you need to pass raw strings:
>
>     u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
>                 ^
>     u = ur(r'...\u4545...\...\...')
>            ^
>
> else Python will replace all the other backslash sequences.  This is a
> crucial distinction at times; e.g., else \b in a Unicode regexp will expand
> into a backspace character before the regexp processor ever sees it (\b is
> supposed to be a word boundary assertion).

Right.

Here is a sample implementation of what I had in mind:

""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):

    l = map(None,s)
    for i in range(len(l)):
        l[i] = struct.pack(pack_format,ord(l[i]))
    return l

u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def unicode_unescape(s):

    l = []
    start = 0
    while start < len(s):
        m = u_escape.search(s,start)
        if not m:
            l[len(l):] = convert_string(s[start:])
            break
        m_start,m_end = m.span()
        if m_start > start:
            l[len(l):] = convert_string(s[start:m_start])
        hexcode = m.group(1)
        #print hexcode,start,m_start
        if len(hexcode) != 4:
            raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
        ordinal = string.atoi(hexcode,16)
        l.append(struct.pack(pack_format,ordinal))
        start = m_end
    #print l
    return string.join(l,'')

def hexstr(s,sep=''):

    return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
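
For anyone trying the demo above on Python 3, where string.atoi, map(None, s) and the "raise X, msg" form are gone, a rough port might look like the sketch below. It is not MAL's code: the upper-case names are renamed for illustration, it returns bytes, it raises ValueError instead of SyntaxError, and it assumes every input character fits in 16 bits. The last lines illustrate Tim's point about raw strings, with the caveat that a Python 3 compiler already expands \u1234 in a non-raw literal.

```python
import re
import struct

PACK_FORMAT = '>H'  # big-endian unsigned 16-bit, same format as the demo above

# Same pattern as the demo: {0,4} lets short sequences match so they can be rejected.
U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{0,4})')


def convert_string(s):
    """Pack each character of s as a 16-bit big-endian value (assumes ord(c) <= 0xFFFF)."""
    return [struct.pack(PACK_FORMAT, ord(c)) for c in s]


def unicode_unescape(s):
    """Expand only \\uXXXX escapes in s; every other character is packed unchanged."""
    chunks = []
    start = 0
    while start < len(s):
        m = U_ESCAPE.search(s, start)
        if m is None:
            chunks.extend(convert_string(s[start:]))
            break
        chunks.extend(convert_string(s[start:m.start()]))
        hexcode = m.group(1)
        if len(hexcode) != 4:
            # ValueError is a substitution; the original demo raises SyntaxError.
            raise ValueError('illegal \\uXXXX sequence: \\u%s' % hexcode)
        chunks.append(struct.pack(PACK_FORMAT, int(hexcode, 16)))
        start = m.end()
    return b''.join(chunks)


# Raw literal: the backslash in \b survives and reaches the function.
raw = unicode_unescape(r'a\u1234\b')
# Non-raw literal: the compiler turns \b into a backspace (0x08) first.
cooked = unicode_unescape('a\u1234\b')

print(raw.hex())     # 00611234005c0062
print(cooked.hex())  # 006112340008
```

For this input, Python 3's built-in raw_unicode_escape codec gives the same bytes as the sketch when the result is re-encoded as UTF-16-BE: br'a\u1234\b'.decode('raw_unicode_escape').encode('utf-16-be').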