> '\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}' != '\N{LATIN SMALL > LETTER O WITH DIAERESIS}' > > I guess the filesystem shouldn't treat these as the same (even though > they are), but what if some webservice does? I suspect you should > normalize both strings before comparing them in any blacklist, and > what happens with surrogates when you normalize? I think the whole blacklist example is artificial. The string in the blacklist is actually a Chinese "hello" greeting, so it surely isn't the string being blacklisted. For proper blacklisting, you would likely use substring searches, case-insensitivity, transliterations, and perhaps even regular expressions and word stemming. If you consider all these things, proper or alternative encodings of the same text are just another issue to consider. Regards, Martin
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4