Stefan Rank wrote: > <wishful thinking> > > re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase])) This would (almost) work, but it would be terribly inefficient (time linear to the number of alternatives). You can realistically do uppers = [u'['] for i in range(sys.maxunicode): c = unichr(i) if c.isupper(): uppers.append(c) uppers.append(u']') uppers = u"".join(uppers) uppers_re = re.compile(uppers) Compiling this expression is quite expensive; matching it is fairly efficient (time independent of the number of characters in the class). To save startup cost, consider pickling the compiled expression. (syntax note: this only works because none of the characters special to a RE class (]-^\) is an uppercase letter; otherwise, escaping might be needed) > for the latter two, to work on utf-8 strings, would I have to set the > defaultencoding to utf-8? For Unicode things, you should avoid using byte strings - especially when it comes to regular expressions. Use Unicode strings instead. Regards, Martin
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4