Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA-encoding the hostname and replacing the non-ASCII characters with dashes? I guess in practice IDNA is the right decision.

Another part I wasn't clear on is whether urllib.quote() knows if it's working on URIs, URLs, arbitrary strings, or what. From the documentation, it looks like it expects to work only on the path component of a URL. If so, it doesn't need to know what to do when the IRI contains a hostname.

The other somewhat under-specified part of all this seems to be how urllib.unquote() should work. If it sees non-ASCII octets after percent-decoding, should it try to decode them as UTF-8 and, if that fails, leave them as is?

On May 7, 2008, at 11:55 AM, Robert Brewer wrote:

> "Martin v. Löwis" wrote:
>> The proper way to implement this would be IRIs (RFC 3987),
>> in particular section 3.1. This is not as simple as just
>> encoding it as UTF-8, as you might have to apply IDNA to
>> the host part.
>>
>> Code doing so just hasn't been contributed yet.
>
> But if someone wanted to do so, it's pretty simple:
>
> >>> u'www.\u212bngstr\xf6m.com'.encode("idna")
> 'www.xn--ngstrm-hua5l.com'
>
>
> Robert Brewer
> fumanchu at aminus.org
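
For what it's worth, here is a rough sketch of the two behaviours discussed above: IDNA for the host with UTF-8 percent-encoding for the rest, and an unquote that tries UTF-8 and otherwise leaves the octets alone. It is written against present-day Python 3 (urllib.parse rather than the urllib.quote()/urllib.unquote() of this thread), and the names iri_to_uri and unquote_text are made up for illustration. It is only one reading of RFC 3987 section 3.1, not a complete implementation: userinfo and IPv6 literals are ignored.

```python
from urllib.parse import quote, unquote_to_bytes, urlsplit, urlunsplit

# Characters left alone when escaping: RFC 3986 reserved and unreserved
# characters plus "%", so escapes already present are not encoded twice.
_SAFE = "/?#[]@!$&'()*+,;=:%~-._"


def iri_to_uri(iri):
    """Hypothetical helper: map an IRI to an ASCII-only URI."""
    parts = urlsplit(iri)
    # Host part: the built-in "idna" codec (IDNA 2003), as in the quoted reply.
    netloc = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port is not None:
        netloc = "%s:%d" % (netloc, parts.port)
    # Everything else: encode as UTF-8, then percent-escape the octets.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_SAFE),
        quote(parts.query, safe=_SAFE),
        quote(parts.fragment, safe=_SAFE),
    ))


def unquote_text(s):
    """Hypothetical helper: percent-decode, then try UTF-8.

    Returns str when the octets are valid UTF-8; otherwise returns the
    raw bytes unchanged ("leave them as is").
    """
    raw = unquote_to_bytes(s)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw
```

Returning two different types from unquote_text is awkward for callers, which is probably part of why the question is under-specified; an alternative would be to always return bytes and let the caller pick the decoding.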