Finn Bock wrote:
>
> On Sat, 13 May 2000 14:56:41 +0200, you wrote:
>
> >in the current 're' engine, a newline is chr(10) and nothing
> >else.
> >
> >however, in the new unicode aware engine, I used the new
> >LINEBREAK predicate instead, but it turned out to break one
> >of the tests in the current test suite:
> >
> >    sre.match('a\rb', 'a.b') => None
> >
> >(unicode adds chr(13), chr(28), chr(29), chr(30), and also
> >unichr(133), unichr(8232), and unichr(8233) to the list of
> >line breaking codes)
>
> >what's the best way to deal with this?  I see three alter-
> >natives:
> >
> >a) stick to the old definition, and use chr(10) also for
> >   unicode strings
>
> In the ORO matcher that comes with jpython, the dot matches all but
> chr(10). But that is bad IMO. Unicode should use the LINEBREAK
> predicate.

+1 on that one... just like \s should use Py_UNICODE_ISSPACE()
and \d Py_UNICODE_ISDECIMAL().

BTW, how have you implemented the locale aware \w and \W
for Unicode ? Unicode doesn't have any locales, but quite a
lot more alphanumeric characters (or equivalents) and there
currently is no Py_UNICODE_ISALPHA() in the core.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
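
For illustration only, here is a rough sketch of the two dot definitions discussed above, written against the present-day Python `re` module rather than the sre engine under development in the thread. The names `LINEBREAK_CODES`, `LINEBREAK_DOT`, `old_dot_matches` and `linebreak_dot_matches` are hypothetical; the code point list is chr(10) plus the codes quoted in the message, and the negated character class is only an assumed stand-in for a LINEBREAK-aware dot, not how any engine implements it.

```python
import re

# chr(10) plus the line breaking codes listed in the quoted message.
LINEBREAK_CODES = [10, 13, 28, 29, 30, 133, 8232, 8233]

# Hypothetical stand-in for a LINEBREAK-aware dot: a negated character
# class built from regex \uXXXX escapes, one per code point.
LINEBREAK_DOT = "[^" + "".join("\\u%04x" % cp for cp in LINEBREAK_CODES) + "]"


def old_dot_matches(text):
    """Old definition: '.' matches everything except chr(10)."""
    return re.match("a.b", text) is not None


def linebreak_dot_matches(text):
    """Sketch of the alternative: the dot rejects every listed line break."""
    return re.match("a" + LINEBREAK_DOT + "b", text) is not None


if __name__ == "__main__":
    for sample in ["axb", "a\nb", "a\rb", "a\u2028b"]:
        print(repr(sample), old_dot_matches(sample), linebreak_dot_matches(sample))
```

Under these assumptions the two definitions agree on 'a\nb' (neither matches) and differ on 'a\rb', which is the kind of difference the failing test in the quoted message points at.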