M.-A. Lemburg wrote:
...
> > background: in the current implementation, this decision has to
> > be made at compile time, and a compiled expression can be used
> > with either 8-bit strings or 16-bit strings.
...
> For Unicode objects you should really default to using the
> Py_UNICODE_ISLINEBREAK() macro which defines all line break
> characters (note that CRLF should be interpreted as a
> single line break; see PyUnicode_Splitlines()). The reason
> here is that Unicode defines how to handle line breaks
> and we should try to stick to the standard as close as possible.
> All other possibilities could still be made available via new
> flags.
>
> For 8-bit strings I'd suggest sticking to the re definition.

I guess my background description wasn't clear:

Once a pattern has been compiled, it will always handle line
endings in the same way. The parser doesn't really care if the
pattern is a unicode string or an 8-bit string (unicode strings
can contain "wide" characters, but that's the only difference).

At the other end, the same compiled pattern can be applied
to either 8-bit or unicode strings. It's all just characters to
the engine...

Now, I can of course change the engine so that it always uses
chr(10) on 8-bit strings and LINEBREAK on 16-bit strings, but
the result is that

    pattern.match(widestring)

won't necessarily match the same thing as

    pattern.match(str(widestring))

even if the wide string only contains plain ASCII.

(another alternative is to recompile the pattern for each target
string type, but that will hurt performance...)

</F>

<project name="sre" complete="97.1%" />
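
The concern above (plain-ASCII text behaving differently depending on string type) can be sketched with present-day Python 3. This is only an illustration, not the sre engine under discussion: it assumes that str.splitlines() follows the Unicode line-break set, that bytes.splitlines() recognizes only "\n", "\r" and "\r\n", and that re's "$" under re.MULTILINE treats only "\n" as a line end. The function names and the sample string are made up for the example.

```python
import re

# Form feed (\x0c) is plain ASCII, but it is a line break under the
# Unicode definition and not under the chr(10)-only definition.
SAMPLE = "first\x0csecond\nthird"


def lines_as_text(text):
    """Split using the Unicode-aware rules of str.splitlines()."""
    return text.splitlines()


def lines_as_bytes(text):
    """Split the same ASCII data as an 8-bit string."""
    return text.encode("ascii").splitlines()


def words_at_line_end(text):
    """Words that re considers to sit at the end of a line ('\\n' only)."""
    return re.findall(r"\w+$", text, re.MULTILINE)


if __name__ == "__main__":
    print(lines_as_text(SAMPLE))      # three lines
    print(lines_as_bytes(SAMPLE))     # two lines
    print(words_at_line_end(SAMPLE))  # "first" is not at a line end for re
```

The same ASCII-only input yields three lines as text and two as bytes, which is the kind of mismatch between pattern.match(widestring) and pattern.match(str(widestring)) that a per-type line-break rule would introduce.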