A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from http://mail.python.org/pipermail/python-list/2005-September/350925.html below:

regexps with unicode-aware characterclasses?

regexps with unicode-aware characterclasses?"Martin v. Löwis" martin at v.loewis.de
Wed Sep 14 02:09:03 EDT 2005
Stefan Rank wrote:
> <wishful thinking>
> 
>   re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

This would (almost) work, but it would be terribly inefficient (time
linear to the number of alternatives). You can realistically do

uppers = [u'[']
for i in range(sys.maxunicode):
  c = unichr(i)
  if c.isupper(): uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.

(syntax note: this only works because none of the characters special
to a RE class (]-^\) is an uppercase letter; otherwise, escaping might
be needed)

> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.

Regards,
Martin

More information about the Python-list mailing list

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4