On 16/09/2010 23:05, Antoine Pitrou wrote:
> On Thu, 16 Sep 2010 16:51:58 -0400
> "R. David Murray"<rdmurray at bitdance.com> wrote:
>> What do we store in the model? We could say that the model is always
>> text. But then we lose information about the original bytes message,
>> and we can't reproduce it. For various reasons (mailman being a big one),
>> this is not acceptable. So we could say that the model is always bytes.
>> But we want access to (for example) the header values as text, so header
>> lookup should take string keys and return string values[2].
> Why can't you have both in a single class? If you create the class
> using a bytes source (a raw message sent by SMTP, for example), the
> class automatically parses and decodes it to unicode strings; if you
> create the class using a unicode source (the text body of the e-mail
> message and the list of recipients, for example), the class
> automatically creates the bytes representation.
>
I think something like this would be great for WSGI. Rather than focus
on whether bytes *or* text should be used, use a higher level object
that provides a bytes view, and (where possible/appropriate) a unicode
view too.

Michael

> (of course all processing can be done lazily for performance reasons)
>
>> What about email files on disk? They could be bytes, or they could be,
>> effectively, text (for example, utf-8 encoded).
> Such a file can be two things:
> - the raw encoding of a whole message (including headers, etc.), then
>   it should be fed as a bytes object
> - the single text body of a hypothetical message, then it should be fed
>   as a unicode object
>
> I don't see any possible middle-ground.
>
>> On disk, using utf-8,
>> one might store the text representation of the message, rather than
>> the wire-format (ASCII encoded) version. We might want to write such
>> messages from scratch.
> But then the user knows the encoding (by "user" I mean what/whoever
> calls the email API) and mentions it to the email package.
>
> What I'm having an issue with is that you are talking about a bytes
> representation and a unicode representation of a message. But they
> aren't representations of the same things:
> - if it's a bytes representation, it will be the whole, raw message
>   including envelope / headers (also, MIME sections etc.)
> - if it's a unicode representation, it will only be a section of the
>   message decodable as such (a text/plain MIME section, for example;
>   or a decoded header value; or even a single e-mail address part of a
>   decoded header)
>
> So, there doesn't seem to be any reason for having both a BytesMessage
> and a UnicodeMessage at the same abstraction level. They are both
> representing different things at different abstraction levels. I don't
> see any potential for confusion: raw assembled e-mail message = bytes;
> decoded text section of a message = unicode.
>
> As for the problem of potential "bogus" raw e-mail data
> (e.g., undecodable headers), well, I guess the library has to make a
> choice between purity and practicality, or perhaps let the user choose
> themselves. For example, through a `strict` flag. If `strict` is true,
> raise an error as soon as a non-decodable byte appears in a header; if
> `strict` is false, decode it through a default (encoding, errors)
> convention which can be overridden by the user (a sensible possibility
> being "utf-8, surrogateescape" to allow for lossless round-tripping).
>
>> As I said above, we could insist that files on
>> disk be in wire-format, and for many applications that would work fine,
>> but I think people would get mad at us if we didn't support text files[3].
> Again, this simply seems to be two different abstraction levels:
> pre-generated raw email messages including headers, or a single text
> waiting to be embedded in an actual e-mail.
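
For illustration, the `strict` flag idea quoted above could look roughly
like the sketch below. The helper name `decode_header_bytes` and its
parameters are hypothetical; nothing like it is claimed to exist in the
email package. It only uses the standard `bytes.decode` error handlers.

```python
def decode_header_bytes(raw, strict=False,
                        encoding="utf-8", errors="surrogateescape"):
    """Decode raw header bytes to str.

    Hypothetical helper, only illustrating the `strict` flag idea.
    """
    if strict:
        # Strict mode: the first undecodable byte raises UnicodeDecodeError.
        return raw.decode(encoding, "strict")
    # Lenient mode: fall back to the (encoding, errors) convention, which
    # the caller may override.
    return raw.decode(encoding, errors)


raw = b"From: bogus\xFFname"

# Lenient decoding keeps the bad byte as a lone surrogate...
text = decode_header_bytes(raw)
# ...so encoding with the same convention gives the original bytes back.
round_tripped = text.encode("utf-8", "surrogateescape")

# Strict decoding refuses the same input.
try:
    decode_header_bytes(raw, strict=True)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With the default convention the undecodable byte survives as a lone
surrogate, which is what makes the lossless round trip possible.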
>
>> Anyway, what polymorphism means in email is that if you put in bytes,
>> you get a BytesMessage, if you put in strings you get a StringMessage,
>> and if you want the other one you convert.
> And then you have two separate worlds while ultimately the same
> concepts are underlying. A library accepting BytesMessage will crash
> when a program wants to give a StringMessage and vice-versa. That
> doesn't sound very practical.
>
>> [1] Now that surrogateescape exists, one might suppose that strings
>> could be used as an 8bit channel, but that only works if you don't need
>> to *parse* the non-ASCII data, just transmit it.
> Well, you can parse it, precisely. Not only that, but it round-trips if
> you unparse it again:
>
> >>> header_bytes = b"From: bogus\xFFname<someone at python.com>"
> >>> name, value = header_bytes.decode("utf-8", "surrogateescape").split(":")
> >>> name
> 'From'
> >>> value
> ' bogus\udcffname<someone at python.com>'
> >>> "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")
> b'From: bogus\xffname<someone at python.com>'
>
>
> In the end, what I would call a polymorphic best practice is "try to
> avoid bytes/str polymorphism if your domain is well-defined
> enough" (which I admit URLs aren't necessarily; but there's no
> question a single text/XXX e-mail section is text, and a whole
> assembled e-mail message is bytes).
>
> Regards
>
> Antoine.
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk

-- 
http://www.ironpythoninaction.com/
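
As a rough illustration of the "higher level object that provides a
bytes view and a unicode view" suggested earlier in this message, here
is a minimal sketch. The class name `DualView` and its attributes are
made up for the example, and it deliberately simplifies: it handles a
single header-like value with one (encoding, errors) convention, not a
whole MIME message, and it ignores RFC 2047 encoded words.

```python
class DualView:
    """Hypothetical object exposing both a bytes and a text view.

    Whichever form it is created from is kept as-is; the other form is
    computed lazily on first access.
    """

    def __init__(self, source, encoding="utf-8", errors="surrogateescape"):
        self._encoding = encoding
        self._errors = errors
        if isinstance(source, bytes):
            self._raw, self._text = source, None   # decode on demand
        elif isinstance(source, str):
            self._raw, self._text = None, source   # encode on demand
        else:
            raise TypeError("source must be bytes or str")

    @property
    def raw(self):
        """Bytes view, encoded lazily if the object was built from text."""
        if self._raw is None:
            self._raw = self._text.encode(self._encoding, self._errors)
        return self._raw

    @property
    def text(self):
        """Text view, decoded lazily if the object was built from bytes."""
        if self._text is None:
            self._text = self._raw.decode(self._encoding, self._errors)
        return self._text


from_wire = DualView(b"Subject: caf\xc3\xa9")
from_user = DualView("Subject: caf\u00e9")
bogus = DualView(b"From: bogus\xFFname")
```

Built from bytes or from text, both objects expose the same two views,
and the original bytes are always kept unchanged when they were the
source, so even undecodable input can be reproduced exactly.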