
Sat Aug 23 10:21:57 CEST 2014 · https://mail.python.org/pipermail/python-dev/2014-August/135963.html

"Stephen J. Turnbull" <stephen at xemacs.org>:

> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points

HTML and XML are interesting examples since their encoding is initially
unknown:

  <?xml version="1.0"?>
                      ^
                      +--- Now I know it is UTF-8

  <?xml version="1.0" encoding="UTF-16"?>
                                      ^
                                      +--- Now I know it was UTF-16
                                           all along!
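
A minimal sketch of that sniffing step, assuming the first bytes of the
document are already in hand. The helper name sniff_xml_encoding is
hypothetical, and the ASCII regex only sees declarations written in an
ASCII-compatible encoding, so a byte order mark is checked first:

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the data. This only works when the declaration bytes are ASCII-compatible.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding from the first bytes of an XML document.

    Hypothetical helper for illustration; falls back to UTF-8 when
    neither a BOM nor an encoding declaration is found.
    """
    # A UTF-16 BOM settles the question before any parsing.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Skip a UTF-8 BOM so the declaration is at offset 0.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    m = _DECL.match(head)
    return m.group(1).decode("ascii").lower() if m else "utf-8"
```

Either way the caller has to hold on to the raw bytes until the
declaration has been seen, and only then start decoding.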

Then we have:

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=ISO-8859-1

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the
character encoding is UTF-16.
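
A rough sketch of how far the bytes have to be scanned, assuming the
whole response is available as one bytes object. The function name
charset_candidates and the 1024-byte scan window are assumptions for
illustration. It only reports both declarations and leaves the choice
between them to the caller:

```python
import re

# charset parameter inside a Content-Type header line
_HEADER_CHARSET = re.compile(rb'charset\s*=\s*"?([A-Za-z0-9._-]+)', re.I)
# charset mentioned anywhere inside a <meta ...> tag
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I
)

def charset_candidates(raw: bytes):
    """Return (header_charset, meta_charset); either may be None.

    Hypothetical helper: scans the raw bytes as ASCII, so it cannot see a
    <meta> tag in a body that really is encoded as UTF-16.
    """
    headers, _, body = raw.partition(b"\r\n\r\n")
    header = None
    for line in headers.split(b"\r\n"):
        if line.lower().startswith(b"content-type:"):
            m = _HEADER_CHARSET.search(line)
            if m:
                header = m.group(1).decode("ascii").lower()
    # Only the start of the body is scanned (assumed window size).
    m = _META_CHARSET.search(body[:1024])
    meta = m.group(1).decode("ascii").lower() if m else None
    return header, meta
```

For the response above this yields one candidate from the header and a
different one from the <meta> tag, which is exactly the problem.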

Marko

[Python-Dev] Bytes path support