Showing content from http://mail.python.org/pipermail/python-list/2005-September/348551.html below:

RE + UTF-8

cepl@surfbest.net ceplma at gmail.com
Sat Sep 24 19:48:11 EDT 2005
I am working on an extension of the genericwiki.py plugin for PyBlosxom
and I have problems with UTF-8 and RE. Given this wiki line, it breaks
the URL too early:

[http://en.wikipedia.org/wiki/Petr_Chelcický Petr Chelcický's]
work(s) into English.

and creates

[<a
href="http://en.wikipedia.org/wiki/Petr_Chel">http://en.wikipedia.org/wiki/Petr_Chel</a>cický
Petr Chelcický's]

The RE genericwiki uses for parsing this:

# WikiName pattern used in your wiki
wikinamepattern = r'\b(([A-Z]\w+){2,})\b' # original
mailurlpattern = r'mailto\:[\"\-\_\.\w]+\@[\-\_\.\w]+\w'
newsurlpattern = r'news\:(?:\w+\.){1,}\w+'
fileurlpattern = r'(?:http|https|file|ftp):[/-_.\w-]+[\/\w][?&+=%\w/-_.#]*'

[...]

    # Turn '[xxx:address label]' into labeled link
    body = re.sub(r'\[(' +
           fileurlpattern + '|' +
           mailurlpattern + '|' +
           newsurlpattern + ')\s+(.+?)\]',
           r'<a href="\1">\2</a>', body,re.U)
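
For comparison, a minimal Python 3 sketch of the same labeled-link
substitution. It rests on two assumptions that are not in the post: the
text is a Unicode str rather than a UTF-8 byte string, and the flag is
given to re.compile instead of being passed as the fourth positional
argument of re.sub, which is count rather than flags. The names
labeled_link and link_labels are made up for illustration.

```python
import re

# Patterns copied unchanged from the post.
mailurlpattern = r'mailto\:[\"\-\_\.\w]+\@[\-\_\.\w]+\w'
newsurlpattern = r'news\:(?:\w+\.){1,}\w+'
fileurlpattern = r'(?:http|https|file|ftp):[/-_.\w-]+[\/\w][?&+=%\w/-_.#]*'

# Hypothetical name: the combined '[xxx:address label]' pattern,
# compiled once with the flag given at compile time.
labeled_link = re.compile(
    r'\[(' + fileurlpattern + '|' + mailurlpattern + '|' + newsurlpattern +
    r')\s+(.+?)\]',
    re.UNICODE,
)


def link_labels(body):
    """Turn '[xxx:address label]' into a labeled link (body must be str)."""
    return labeled_link.sub(r'<a href="\1">\2</a>', body)


print(link_labels(
    "[http://en.wikipedia.org/wiki/Petr_Chelcický Petr Chelcický's]"
))
```

With decoded text, \w covers the accented letter, so the URL is no
longer cut before the end of the name.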

I have tried to test RE and UTF-8 in Python generally and the results
are even more confusing (done with locale cs_CZ.UTF-8 in konsole):

>>> locale.getpreferredencoding()
'UTF-8'
>>> print re.sub("(\w*)","X","[Chelcický]",re.L)
X[X?Xý]
>>> print re.sub("(\w*)","X","[Chelcický]",re.UNICODE)
X[X?X?X]X
>>>

I would expect both print commands to give just a plain X, but
apparently Python doesn't understand that. What's the problem?
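
One possible reading of these results, sketched in Python 3: re.L and
re.UNICODE are integer-valued flags (4 and 32), and in the fourth
position re.sub takes them as count, so no flag is applied at all. The
sketch uses \w+ instead of \w* to avoid empty matches, and it compares
decoded text with raw UTF-8 bytes. The variable names are illustrative
only.

```python
import re

raw = "[Chelcický]".encode("utf-8")  # UTF-8 bytes, as read from a file or terminal
text = raw.decode("utf-8")           # decoded Unicode text

# The flags are plain integers underneath, so as a fourth positional
# argument they would be read as "replace at most N matches".
flag_values = (int(re.LOCALE), int(re.UNICODE))

# Flag supplied at compile time, pattern applied to decoded text.
replaced = re.compile(r"\w+", re.UNICODE).sub("X", text)

# Byte pattern: \w only covers ASCII, so the two bytes of 'ý' are left over.
ascii_only = re.sub(rb"\w+", b"X", raw)

print(flag_values)
print(replaced)
print(ascii_only)
```

The square brackets are not word characters, so they survive in every
variant; on decoded text the result is [X] rather than a plain X.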

Thanks for any reply,

Matej

