M.-A. Lemburg wrote:
> Perhaps Hye-Shik Chang could join you in the effort, since he's
> the author of the KoreanCodecs package which has somewhat
> similar problem scope (that of stateful encodings with a huge
> number of mappings) ?

I believe (without checking in detail) that the "statefulness" is
also an issue in these codecs. Many of the CJK encodings aren't
stateful beyond being multi-byte (except for the iso-2022 ones).
IOW, there is a non-trivial state only if you process the input
byte-for-byte: you have to know whether you are at the first or
second byte (and what the first byte was if you are at the second
byte).

AFAICT, both Japanese codecs assume that you can always look at the
second byte when you get the first byte. Of course, this assumption
is wrong if you operate in a stream mode, and read the data in, say,
chunks of 1024 bytes: such a chunk may split exactly between a first
and second byte (*). In these cases, I believe, both codecs would
give incorrect results.

Please correct me if I'm wrong.

Regards,
Martin

(*) The situation is worse for GB 18030, which also has 4-byte
encodings.
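
The chunk-boundary problem described above can be sketched in a few lines. This is only an illustration, not the code of either Japanese codec package discussed in this thread: it assumes a current Python 3 with the standard library's `euc_jp` codec and `codecs.getincrementaldecoder`, and the sample string and split point are made up for the example.

```python
import codecs

# Sample text (an assumption for this sketch); every character here
# occupies two bytes in EUC-JP.
text = "日本語のテキスト"
data = text.encode("euc_jp")

# Split at an odd offset, so the cut falls between the first and the
# second byte of a character, as a fixed-size read from a stream might.
chunks = [data[:3], data[3:]]

# Naive approach: decode every chunk on its own. The first chunk ends
# with a dangling lead byte, so a strict decode cannot succeed.
try:
    naive = "".join(chunk.decode("euc_jp") for chunk in chunks)
except UnicodeDecodeError:
    naive = None

# Incremental approach: the decoder object remembers the dangling lead
# byte between calls and completes the character on the next chunk.
decoder = codecs.getincrementaldecoder("euc_jp")()
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))  # flush; errors if bytes remain
streamed = "".join(parts)
```

The only state carried between calls is the pending byte (or bytes, for an encoding such as GB 18030), which is exactly the "non-trivial state" that appears when input is processed piecewise.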