Greg Ward <gward@python.net> writes:

>     if self.replace_whitespace:
>         text = text.translate(self.whitespace_trans)
>     return text
>
> (The rationale: having tabs and newlines in a paragraph about to be
> wrapped doesn't make any sense to me.)

Is it really necessary to replace each of these characters
individually, or would it be acceptable to replace whole sequences of
them, as Fredrik proposes (.split/.join)?

>     for c in string.whitespace:
>         unicode_whitespace_trans[ord(unicode(c))] = ord(u' ')

Unicode conceptually has about five times as many space characters
(NO-BREAK SPACE, THREE-PER-EM SPACE, OGHAM SPACE MARK, and whatnot),
but it is probably safe to ignore them.  The complete fragment would
read

    for c in range(sys.maxunicode):
        if unichr(c).isspace():
            unicode_whitespace_trans[c] = u' '

(which is somewhat time-consuming, so you could hard-code a larger
list if you wanted to).

> That's ugly as hell, but it works.  Is there a cleaner way?

You may want to time re.sub; perhaps you will find that the slowdown
is acceptable:

    space = re.compile(r"\s")
    text = space.sub(" ", text)

> The other bit of ASCII/English prejudice hardcoded into textwrap.py is
> this regex:
>
>     sentence_end_re = re.compile(r'[%s]'        # lowercase letter
>                                  r'[\.\!\?]'    # sentence-ending punct.
>                                  r'[\"\']?'     # optional end-of-quote
>                                  % string.lowercase)

For the issue at hand: this code does "work" with Unicode, right?
I.e. it will give some result, even when confronted with funny
characters?  If so, I think you can ignore it for the moment.

> But I couldn't find a better way to do it when I wrote this code
> last spring.  Is there one?

I believe the right approach is to support more character classes in
SRE.  This one would be covered if there were a [:lower:] class.

Regards,
Martin
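
A minimal timing sketch of the three alternatives discussed above (the
translate table, re.sub, and Fredrik's split/join idea); Python 2 of
the period is assumed, and the sample text, function names, and repeat
count are illustrative rather than taken from the original mail:

    import re
    import string
    import time

    # Map every ASCII whitespace character to a plain space, the same
    # idea as textwrap's translate table.
    whitespace_trans = string.maketrans(string.whitespace,
                                        ' ' * len(string.whitespace))
    space = re.compile(r"\s")

    text = "hello\tworld,\nthis is a\rsample paragraph " * 1000

    def by_translate(s):
        return s.translate(whitespace_trans)

    def by_re_sub(s):
        return space.sub(" ", s)

    def by_split_join(s):
        # Fredrik's proposal.  Note that this also collapses runs of
        # whitespace into a single space, which the translate table
        # does not; whether that matters depends on how the wrapper
        # treats runs of spaces later on.
        return " ".join(s.split())

    for fn in (by_translate, by_re_sub, by_split_join):
        start = time.clock()
        for i in range(100):
            fn(text)
        print fn.__name__, time.clock() - start

Which approach wins will depend on input size and the Python version,
so timing on representative paragraphs is the only way to settle the
question.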