Florian Weimer, 28.01.2011 15:27:
> * Stefan Behnel:
>
>> The nice thing about Py_UNICODE is that it basically gives you native
>> Unicode code points directly, without needing to decode UTF-8 byte
>> runs and the like. In Cython, it allows you to do things like this:
>>
>>     def test_for_those_characters(unicode s):
>>         for c in s:
>>             # warning: randomly chosen Unicode escapes ahead
>>             if c in u"\u0356\u1012\u3359\u4567":
>>                 return True
>>         else:
>>             return False
>>
>> The loop runs in plain C, using the somewhat obvious implementation
>> with a loop over Py_UNICODE characters and a switch statement for the
>> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
>> strings.
>
> Not really, because UTF-8 is quite search-friendly. (The if would
> have to invoke a memmem()-like primitive.) Random subscripts are
> problematic.
>
> However, why would one want to write loops like the above? Don't you
> have to take combining characters (comprising multiple codepoints)
> into account most of the time when you look at individual characters?
> Then UTF-32 does not offer much of a simplification.

Hmm, I think this discussion is pointless. Regardless of the memory layout,
you can always go down to the byte level and use an efficient
(multi-)substring search algorithm. (which is obviously helped if you know
the layout at compile time *wink*)

Bad example, I guess.

Stefan
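
For illustration, the byte-level search mentioned above could look roughly like the following in plain Python. This is only a sketch, not what Cython generates: the function name contains_any_utf8 and the TARGETS constant are made up for this example, and the naive per-needle `in` test stands in for a real multi-substring search algorithm. It relies on UTF-8 being self-synchronizing, so the complete encoding of one code point can only match at a character boundary.

```python
# Hypothetical helper, for illustration only.
TARGETS = "\u0356\u1012\u3359\u4567"  # same randomly chosen escapes as in the quoted example


def contains_any_utf8(data: bytes, chars: str = TARGETS) -> bool:
    """Return True if any code point in `chars` occurs in the UTF-8 encoded `data`."""
    # Encode every code point on its own, giving one byte-string needle each.
    needles = [c.encode("utf-8") for c in chars]
    # bytes.__contains__ does a substring search on raw bytes, much like
    # a memmem()-style primitive would in C. No decoding of `data` is needed.
    return any(needle in data for needle in needles)
```

A call such as contains_any_utf8(text.encode("utf-8")) then answers the same question as the loop in the quoted Cython function, without ever looking at individual code points.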