Greg Ward wrote:
> On 22 October 2002, Martin v. Loewis said:
>
> OK, then it's an implementation problem rather than a "you can't get
> there from here" problem.  Good.  The reason I need a list of
> "whitespace chars" is to convert all whitespace to spaces; I use
> string.maketrans() and s.translate() to do this efficiently:

Use the trick Fredrik posted:

    u' '.join(x.split())

(.split() defaults to splitting on whitespace, Unicode whitespace
if x is Unicode).

> Ahh, OK, I'm starting to see the problem: there's nothing wrong with the
> translate() method of strings or unicode strings, but string.maketrans()
> doesn't generate a mapping that u''.translate() likes.  Hmmmm.

Unicode uses a different API for this since it wouldn't make
sense to pass a sys.maxunicode character Unicode string to
translate just to map a few characters.

> The other bit of ASCII/English prejudice hardcoded into textwrap.py is
> this regex:
>
>     sentence_end_re = re.compile(r'[%s]'              # lowercase letter
>                                  r'[\.\!\?]'          # sentence-ending punct.
>                                  r'[\"\']?'           # optional end-of-quote
>                                  % string.lowercase)
>
> You may recall this from the kerfuffle over whether there should be two
> spaces after a sentence in fixed-width fonts.  The feature is there, and
> off by default, in TextWrapper.  I'm not so concerned about this -- I
> mean, this doesn't even work with German or French, never mind Hebrew or
> Chinese or Hindi.  Apart from the narrow definition of "lowercase
> letter", it has English punctuation conventions hardcoded into it.  But
> still, it seems *awfully* dumb in this day and age to hardcode
> string.lowercase into a regex that's meant to detect "lowercase
> letters".  But I couldn't find a better way to do it when I wrote this
> code last spring.  Is there one?

There are far too many lowercase characters in Unicode to make
this approach usable. It would be better if there were a way
to use Unicode character categories in the re sets.
Since that's not available, why not search for all potential
sentence ends and then try all of them using .islower() in a
for-loop?!

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/
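
The suggestions in this message can be sketched roughly as follows. This
is only an illustration, not the code in textwrap.py: it assumes Python 3
(where every str is Unicode, so no u'' prefix is needed), and the names
normalize_whitespace, whitespace_to_spaces, sentence_ends,
WHITESPACE_CHARS and candidate_end_re are made up for the example. The
short list of whitespace characters is a sample, not a complete set.

```python
import re

# Hypothetical sample of whitespace characters; not an exhaustive list.
WHITESPACE_CHARS = '\t\n\x0b\x0c\r\u2003'

# Candidate sentence ends: any one character, then sentence-ending
# punctuation, then an optional closing quote. The leading "." is a
# deliberately loose stand-in for "lowercase letter"; the real check is
# done afterwards with .islower().
candidate_end_re = re.compile(r'(.)[.!?]["\']?')


def normalize_whitespace(text):
    """Collapse each run of (Unicode) whitespace into one space.

    Note that split() with no argument also drops leading and trailing
    whitespace.
    """
    return ' '.join(text.split())


def whitespace_to_spaces(text):
    """Map each listed whitespace character to a space, one for one.

    Unicode translate() takes a mapping keyed by code point rather than
    a full translation table, so only the characters of interest appear.
    """
    table = {ord(ch): ' ' for ch in WHITESPACE_CHARS}
    return text.translate(table)


def sentence_ends(text):
    """Return end offsets of candidates preceded by a lowercase character."""
    ends = []
    for match in candidate_end_re.finditer(text):
        # .islower() uses Unicode character properties, so accented and
        # non-Latin lowercase letters count as well.
        if match.group(1).islower():
            ends.append(match.end())
    return ends
```

For example, sentence_ends('\u00e9t\u00e9. OK! \u00e0?') accepts the
candidates after the two accented lowercase letters and skips the one
after "OK". The punctuation set is still the English one from the quoted
regex, so this addresses only the "lowercase letter" part of the problem.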