Greg Ward <gward@python.net> writes:

>     if self.replace_whitespace:
>         text = text.translate(self.whitespace_trans)
>     return text
>
> (The rationale: having tabs and newlines in a paragraph about to be
> wrapped doesn't make any sense to me.)

Is it really necessary to replace each of these characters
individually, or would it be acceptable to replace whole sequences of
them, as Fredrik proposes (.split/.join)?

>     for c in string.whitespace:
>         unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')

Unicode conceptually has about five times as many space characters
(NO-BREAK SPACE, THREE-PER-EM SPACE, OGHAM SPACE MARK, and whatnot),
but it is probably safe to ignore them.  The complete fragment would
read

    for c in range(sys.maxunicode):
        if unichr(c).isspace():
            unicode_whitespace_trans[c] = u' '

(which is somewhat time-consuming, so you could hard-code a larger
list if you wanted to).

> That's ugly as hell, but it works.  Is there a cleaner way?

You may want to time re.sub; perhaps you will find that the slowdown
is acceptable:

    space = re.compile(r"\s")
    text = space.sub(" ", text)

> The other bit of ASCII/English prejudice hardcoded into textwrap.py is
> this regex:
>
>     sentence_end_re = re.compile(r'[%s]'        # lowercase letter
>                                  r'[\.\!\?]'    # sentence-ending punct.
>                                  r'[\"\']?'     # optional end-of-quote
>                                  % string.lowercase)

For the issue at hand: this code does "work" with Unicode, right?
I.e. it will give some result, even when confronted with funny
characters?  If so, I think you can ignore it for the moment.

> But I couldn't find a better way to do it when I wrote this code
> last spring.  Is there one?

I believe the right approach is to support more character classes in
SRE.  This one would be covered if there were a [:lower:] class.

Regards,
Martin
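
A minimal timing sketch of the three alternatives discussed above (the
translate table, re.sub, and Fredrik's split/join idea); Python 2 of
the period is assumed, and the sample text, function names, and repeat
count are illustrative rather than taken from the original mail:

    import re
    import string
    import time

    # Map every ASCII whitespace character to a plain space, the same
    # idea as textwrap's translate table.
    whitespace_trans = string.maketrans(string.whitespace,
                                        ' ' * len(string.whitespace))
    space = re.compile(r"\s")

    text = "hello\tworld,\nthis is a\rsample paragraph " * 1000

    def by_translate(s):
        return s.translate(whitespace_trans)

    def by_re_sub(s):
        return space.sub(" ", s)

    def by_split_join(s):
        # Fredrik's proposal.  Note that this also collapses runs of
        # whitespace into a single space, which the translate table
        # does not; whether that matters depends on how the wrapper
        # treats runs of spaces later on.
        return " ".join(s.split())

    for fn in (by_translate, by_re_sub, by_split_join):
        start = time.clock()
        for i in range(100):
            fn(text)
        print fn.__name__, time.clock() - start

Which approach wins will depend on input size and the Python version,
so timing on representative paragraphs is the only way to settle the
question.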