Ian Bicking: > I think the use case everyone has in mind here is where > you get a URL from one of these sources, and you want to handle it. I have > a hard time imagining the sequence of events that would lead to mojibake. > Naive parsing of a document in bytes couldn't do it, because if you have a > non-ASCII-compatible document your ASCII-based parsing will also fail (e.g., > looking for b'href="(.*?)"'). It depends on what the particular ASCII-based parsing is doing. For example, the set of trail bytes in Shift-JIS includes the same bytes as some of the punctuation characters in ASCII as well as all the letters. A search or split on '@' or '|' may find the trail byte in a two-byte character rather than a true occurrence of that character so the operation 'succeeds' but produces an incorrect result. Over time, the set of trail bytes used has expanded - in GB18030 digits are possible although many of the most important characters for parsing such as ''' "#%&.?/''' are still safe as they may not be trail bytes in the common double-byte character sets. Neil
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4