On Tuesday, 29 March 2011 at 22:45 +0200, Lennart Regebro wrote:
> On Tue, Mar 29, 2011 at 22:40, Lennart Regebro <regebro at gmail.com> wrote:
> > The lesson here seems to be "if you have to use blacklists, and you
> > use unicode strings for those blacklists, also make sure the string
> > you compare with doesn't have surrogates".
> >
> > For that matter, what happens with combining characters?
>
> '\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}' != '\N{LATIN SMALL
> LETTER O WITH DIAERESIS}'
>
> I guess the filesystem shouldn't treat these as the same (even though
> they are), but what if some webservice does?

Mac OS X does normalize filenames to a variant of the D (decomposed) form.
http://www.haypocalc.com/tmp/unicode-2011-03-25/html/operating_systems.html#mac-os-x

> I suspect you should normalize both strings before comparing them in any blacklist,

Yes, but a blacklist is not safe: use a whitelist.

> and what happens with surrogates when you normalize?

Surrogates are left unchanged by normalization forms C, D, KC and KD.

>>> unicodedata.normalize('NFC', '\uDC80') == unicodedata.normalize('NFD', '\uDC80') == unicodedata.normalize('NFKC', '\uDC80') == unicodedata.normalize('NFKD', '\uDC80') == '\uDC80'
True

Victor
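
The advice above (normalize both sides before comparing, and prefer a whitelist) could be sketched roughly as follows. This is only an illustration: the helper names normalized_equal and is_allowed, and the choice of NFC as the default form, are assumptions made for the example and do not come from the thread.

```python
import unicodedata

def normalized_equal(a, b, form="NFC"):
    """Compare two strings after applying the same normalization form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def is_allowed(name, allowed, form="NFC"):
    """Whitelist check (hypothetical helper): normalize the candidate and
    every allowed entry before testing membership."""
    allowed_normalized = {unicodedata.normalize(form, item) for item in allowed}
    return unicodedata.normalize(form, name) in allowed_normalized

decomposed = "\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}"
precomposed = "\N{LATIN SMALL LETTER O WITH DIAERESIS}"

# The raw strings differ, but they compare equal once normalized.
raw_equal = decomposed == precomposed
norm_equal = normalized_equal(decomposed, precomposed)

# A lone surrogate passes through all four forms unchanged.
lone = "\udc80"
surrogate_unchanged = all(
    unicodedata.normalize(form, lone) == lone
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(raw_equal, norm_equal, surrogate_unchanged)
```

Since normalization does not remove a lone surrogate, a check like is_allowed only rejects such a string because it is absent from the whitelist, not because normalization cleaned it up.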