Fredrik Lundh wrote:
>
> mal wrote:
>
> > Just for compares: would you mind running the search
> > routines in mxTextTools on the same machine ?
>
> > > searching for "spam" in a string padded with "spaz" (1000 bytes on
> > > each side of the target):
> > >
> > > string.find     0.112 ms
>
> texttools.find    0.080 ms
>
> > > sre8.search     0.059
> > > pre.search      0.122
> > >
> > > unicode.find    0.130
> > > sre16.search    0.065
> > >
> > > same test, without any false matches (padded with "-"):
> > >
> > > string.find     0.035 ms
>
> texttools.find    0.083 ms
>
> > > sre8.search     0.050
> > > pre.search      0.116
> > >
> > > unicode.find    0.031
> > > sre16.search    0.055
> >
> > Those results are probably due to the fact that string.find
> > does a brute force search. If it would do a last match char
> > first search or even Boyer-Moore (this only pays off for long
> > search targets) then it should be a lot faster than [s|p]re.
>
> does the TextTools algorithm work with arbitrary character
> set sizes, btw?

The find function creates a Boyer-Moore search object
for the search string (on every call). It compares either 1-1
or through a translation table, which is applied to the
searched text before comparing it to the search string
(this enables things like case-insensitive search and
character sets, but is about 45% slower). Real-life usage
would be to create the search objects once per process
and then reuse them. The Boyer-Moore table calculation takes
some time...

But to answer your question: mxTextTools is 8-bit throughout.
A Unicode-aware version will follow by the end of this year.

Thanks for checking,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
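
For illustration, the create-once-and-reuse pattern described in the message
could be sketched in Python roughly as follows. This is not the mxTextTools
implementation or its API: the class name HorspoolSearcher and its arguments
are hypothetical, it uses the simpler Horspool variant of Boyer-Moore
(bad-character shifts only), and it translates the whole searched text up
front, which is an assumption made to keep the sketch short.

```python
class HorspoolSearcher:
    """Hypothetical 8-bit search object; NOT the mxTextTools API."""

    def __init__(self, pattern, translate=None):
        if not pattern:
            raise ValueError("pattern must be non-empty")
        self.pattern = pattern
        # Optional 256-byte table in the form bytes.translate() expects.
        # It is applied to the searched text only, so the pattern must
        # already be in translated form (e.g. lowercase).
        self.translate = translate
        m = len(pattern)
        # Bad-character table, built once: how far the window may slide
        # for each byte value seen under the window's last position.
        self.shift = [m] * 256
        for i, byte in enumerate(pattern[:-1]):
            self.shift[byte] = m - 1 - i

    def find(self, text, start=0):
        """Return the index of the first match at or after start, or -1."""
        if self.translate is not None:
            text = text.translate(self.translate)
        pattern, shift = self.pattern, self.shift
        m, n = len(pattern), len(text)
        pos = start
        while pos + m <= n:
            if text[pos:pos + m] == pattern:
                return pos
            # Slide by the shift of the byte aligned with the pattern's end.
            pos += shift[text[pos + m - 1]]
        return -1


# Build the search object once, then reuse it for many searches.
searcher = HorspoolSearcher(b"spam")
padded = b"spaz" * 250 + b"spam" + b"spaz" * 250
print(searcher.find(padded))        # 1000
print(searcher.find(b"-" * 2000))   # -1

# Case-insensitive variant: lowercase pattern plus a lowercasing table.
lower_table = bytes(range(256)).lower()
ci_searcher = HorspoolSearcher(b"spam", translate=lower_table)
print(ci_searcher.find(b"xxSPAMxx"))  # 2
```

The shift table is computed in the constructor, so repeated find() calls on
the same object do not pay for it again; only the translation step, when a
table is given, adds per-call work.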