On Apr 9, 2009, at 10:38 PM, Barry Warsaw wrote:

> So, what I'm really asking is this. Let's say you agree that there
> are use cases for accessing a header value as either the raw encoded
> bytes or the decoded unicode.

As I said in the thread having nearly the same exact discussion on
web-sig, except about WSGI headers...

> What should this return:
>
> >>> message['Subject']
>
> The raw bytes or the decoded unicode?

Until you write a parser for every header, you simply cannot decode to
unicode. The only sane choices are:

1) raw bytes
2) parsed structured data

There's no "decoded to unicode but not parsed" option: that's doing
things in the wrong order. If you RFC2047-decode the header before
doing tokenization and parsing, you will just have a *broken*
implementation.

Here's an example where it matters. If you decode the RFC2047 part
before parsing, you'd decide that there's two recipients to the
message. There aren't. "<broken@example.com>, " is the display-name of
"actual@example.com", not a second recipient.

To: =?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>

Here's a quote from RFC2047:

> NOTE: Decoding and display of encoded-words occurs *after* a
> structured field body is parsed into tokens. It is therefore
> possible to hide 'special' characters in encoded-words which, when
> displayed, will be indistinguishable from 'special' characters in
> the surrounding text. For this and other reasons, it is NOT
> generally possible to translate a message header containing
> 'encoded-word's to an unencoded form which can be parsed by an
> RFC 822 mail reader.

And another quote for good measure:

> (2) Any header field not defined as '*text' should be parsed
> according to the syntax rules for that header field. However, any
> 'word' that appears within a 'phrase' should be treated as an
> 'encoded-word' if it meets the syntax rules in section 2. Otherwise
> it should be treated as an ordinary 'word'.
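
The To: example above can be tried out with a rough sketch like the following. It assumes Python 3's standard library email.header.decode_header and email.utils.getaddresses; the helper names (decode_words, decode_then_parse, parse_then_decode) are made up for illustration and are not part of any proposed API.

```python
from email.header import decode_header
from email.utils import getaddresses

# The To: header from the example, with the address written out in full.
RAW_TO = "=?UTF-8?B?PGJyb2tlbkBleGFtcGxlLmNvbT4sIA==?= <actual@example.com>"


def decode_words(value):
    """Decode any RFC 2047 encoded-words in value and return a str.

    Hypothetical helper: decode_header() yields (chunk, charset) pairs,
    where chunk is bytes if the value contained encoded-words.
    """
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


def decode_then_parse(raw):
    """Wrong order: the decoded '<', '>' and ',' get treated as syntax."""
    return getaddresses([decode_words(raw)])


def parse_then_decode(raw):
    """Right order: tokenize the raw header, then decode each display-name."""
    return [(decode_words(name), addr) for name, addr in getaddresses([raw])]


wrong = decode_then_parse(RAW_TO)
right = parse_then_decode(RAW_TO)

print([addr for _, addr in wrong])  # addresses seen when decoding first
print(right)                        # (display-name, address) pairs
```

Decoding first should report two addresses, while parsing first should report a single recipient whose display-name merely looks like an address.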

Now, I suppose there's also a third possibility:

3) US-ASCII-only strings, unmolested except for doing a .decode('ascii').

That'll give you a string all right, but it's really just cheating.
It's not actually a text string in any meaningful sense.

(in all this I'm assuming your question is not about the "Subject"
header in particular; that is of course just unstructured text so the
parse step doesn't actually do anything...).

James