> From: martin@v.loewis.de
> Jeff Hobbs writes:
>
> > Can someone explain to me why moving to UCS-4 is a good thing?
>
> Because it simplifies processing of non-BMP characters, as it restores
> the property that you get one Unicode character per string index.

Right, fair enough, that's all well understood - when you have to deal
with characters between U+10000 and U+10FFFF. It was only recently that
such characters existed in more than a sprinkling.

> > A Tcl_UniChar is 32-bits and TCL_UTF_MAX is 6 (normally it is 3),
> > which represents the number of utf-8 bytes that are valid in sequence.
>
> Is that current code, or future code? How can I select a UCS-4 build
> during configuration? In what way is the supported mechanism different
> from the one that Redhat uses?

There is no "supported" UCS-4 mode for Tcl. You have to hand-twiddle
the sources, knowing where to poke. I can make the changes for 8.5
that allow for an easy configuration option to compile in UCS-4 mode.
I suppose I could also back-port it to 8.4.4. That won't address the
fact that we've never validated non-BMP support.

> I couldn't find definitive numbers on distribution over planes, but I
> found the following numbers:
> - Unicode 3.0 has 49194 assigned characters
>   (http://www.unicode.org/versions/Unicode3.0.html)
> - Unicode 4.0 has 96248 graphic characters
>   (http://www.unicode.org/versions/Unicode4.0.0/)

Right, and Unicode 4.0 is fresh out of diapers. You can't even get
the regular code charts yet, you have to view the 4.0 beta ones. With
4.0 the non-BMP finally gets a notable number of characters, but they
are fairly weird ones that I'd be surprised to find a public font for.
You can see them at:
        http://www.unicode.org/charts/u40-beta.html
They are the Linear B Syllabary on down.

> > The bigger issue is that in changing the basic Tcl_UniChar size, you
> > break the binary compatibility rules.
> > RH9 is the only
> > version/distro to use 32-bit Tcl_UniChar, which breaks compatibility
> > with extensions built on other versions/distros.
>
> Indeed. Python has added explicit mechanisms to detect such breakage,
> by renaming all API functions depending on the width of a Unicode
> character. That, at least, allows to detect the breakage at import
> time (missing symbols).

Tcl could do this, but we were very much taken by surprise that it
was pushed to use UCS-4 at all.

> > Checking on a rebuild now, it does appear that Tk operates just
> > fine. However, it does consume a lot more memory.
>
> When I tested it, I found that it would break very easily. I was using
> the Redhat procedure, though, so I might have made something wrong.

Can you feed me some sample scripts offline to test with?

> > I finally found the source RPMs for Tcl that RH9 uses and checked
> > out their patch. It's not even correct. You have to modify
> > tcl/generic/regcustom.h as well to account for Tcl_UniChar being
> > 32-bits.
>
> What is the specific change that one has to make? "You have to edit
> multiple files to activate a feature" is a strange way of supporting
> it...

Ha ha ... well, I did say it was never properly supported. That no one
bothered to ask how to do it correctly when that was clear is not a
good thing.

What you have to do is:

  1. modify generic/tcl.h to set TCL_UTF_MAX to 6,
  2. typedef Tcl_UniChar as unsigned int (or wchar_t is what RH used),
  3. modify the bottom of generic/regcustom.h, where you will see
     3 lines that need mods for the change in size of CHR (which is
     Tcl_UniChar for the RE).

Of course, that's what I think is needed. It should probably then get
extended tests for more characters and further expectations. We should
probably add a tcl_platform(unicharSize) var or something so that users
at the Tcl level know this as well.

Again, this is only something that I have tinkered with - not
extensively tested.
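
To make the sizes above concrete, here is a minimal sketch in C. It is
not Tcl source: the typedefs UniChar16/UniChar32 and the functions
utf8_len and fits_in_16 are made-up names for illustration only. It
assumes that a TCL_UTF_MAX of 6 corresponds to the original UTF-8
definition, which allowed sequences of up to 6 bytes for 31-bit values,
and it shows why 3 bytes are enough as long as you stay inside the BMP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two Tcl_UniChar widths discussed above.
 * These are illustrative names, not the definitions found in tcl.h. */
typedef uint16_t UniChar16;   /* 16-bit unit: can only hold the BMP    */
typedef uint32_t UniChar32;   /* 32-bit unit: one unit per code point  */

/* Number of bytes needed to encode code point cp under the original
 * UTF-8 scheme (assumption: up to 6 bytes, covering 31-bit values). */
static int utf8_len(UniChar32 cp)
{
    assert(cp <= 0x7FFFFFFFu);       /* original UTF-8 stops at 31 bits */
    if (cp < 0x80u)      return 1;
    if (cp < 0x800u)     return 2;
    if (cp < 0x10000u)   return 3;   /* the whole BMP fits in 3 bytes   */
    if (cp < 0x200000u)  return 4;   /* U+10000..U+10FFFF land here     */
    if (cp < 0x4000000u) return 5;
    return 6;
}

/* Returns 1 if cp survives a round trip through a 16-bit unit, else 0. */
static int fits_in_16(UniChar32 cp)
{
    UniChar16 u = (UniChar16)cp;     /* high bits are silently dropped  */
    return (UniChar32)u == cp;
}
```

With these definitions, utf8_len(0xFFFF) is 3 and utf8_len(0x10000) is 4,
while fits_in_16(0x10000) is 0: the first Linear B character already
falls outside what a 16-bit unit can carry as a single value.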

Regards,

  Jeff Hobbs                     The Tcl Guy
  Senior Developer               http://www.ActiveState.com/