"Stephen J. Turnbull" <stephen at xemacs.org>: > Just read as bytes and decode piecewise in one way or another. For > Oleg's HTML case, there's a well-understood structure that can be used > to determine retry points HTML and XML are interesting examples since their encoding is initially unknown: <?xml version="1.0"?> ^ +--- Now I know it is UTF-8 <?xml version="1.0" encoding="UTF-16"?> ^ +--- Now I know it was UTF-16 all along! Then we have: HTTP/1.1 200 OK Content-Type: text/html; charset=ISO-8859-1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-16"> See how deep you have to parse the TCP stream before you realize the content encoding is UTF-16. Marko