Tim Peters wrote:

> [Walter Dörwald]
>> I'm working on it; however, I discovered that unicode.join()
>> doesn't optimize this special case:
>>
>>     s = "foo"
>>     assert "".join([s]) is s
>>
>>     u = u"foo"
>>     assert u"".join([s]) is s
>>
>> The second assertion fails.
>
> Well, in that example it *has* to fail, because the input (s) wasn't a
> unicode string to begin with, but u"".join() must return a unicode
> string.  Maybe you intended to say that
>
>     assert u"".join([u]) is u
>
> fails

Argl, you're right.

> (which is also true today, but doesn't need to be true tomorrow).

I've removed the test today, so it won't fail tomorrow. ;)

>> I'd say that this test (joining a one-item sequence returns
>> the item itself) should be removed, because it tests an
>> implementation detail.
>
> Nevertheless, it's an important pragmatic detail.  We should never throw
> away a test just because rearrangement makes a test less convenient.

So, should I put the test back in (in test_str.py)?

>> I'm not sure whether the optimization should be added to
>> unicode.find().
>
> Believing you mean join(), yes.

Unfortunately the implementations of str.join and unicode.join look
completely different: str.join does a PySequence_Fast() and then tests
whether the sequence length is 0 or 1, while unicode.join iterates over
its argument via PyObject_GetIter()/PyIter_Next().  Adding the
optimization might require a complete rewrite of PyUnicode_Join().

> Doing common endcases efficiently in C code is an important
> quality-of-implementation concern, lest people need to add reams of
> optimization test-&-branch guesses in their own Python code.  For
> example, the SpamBayes tokenizer has many passes that split input
> strings on magical separators of one kind or another, pasting the
> remaining pieces together again via string.join().
> It's explicitly noted in the code that special-casing the snot out of
> "separator wasn't found" in Python is a lot slower than letting
> string.join(single_element_list) just return the list element, so that
> simple, uniform Python code works well in all cases.  It's expected
> that *most* of these SB passes won't find the separator they're
> looking for, and it's important not to make endless copies of
> unboundedly large strings in the expected case.  The more heavily used
> unicode strings become, the more important that they treat users
> kindly in such cases too.

Seems like we have to rewrite PyUnicode_Join().

Bye,
   Walter Dörwald
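To make the discussion concrete, here is a minimal Python sketch of the fast path being debated.  The helper `join_fast` is hypothetical (CPython implements this in C inside str.join/PyUnicode_Join, not like this); it only illustrates the idea that joining a one-element sequence of the separator's own type can return the element itself, which is what makes the SpamBayes-style split-then-rejoin pattern cheap when the separator is usually absent:

```python
def join_fast(sep, seq):
    # Hypothetical illustration of the single-element fast path,
    # not CPython's actual implementation.
    seq = list(seq)            # analogous to PySequence_Fast()
    if not seq:
        return sep[:0]         # empty result of the separator's type
    if len(seq) == 1 and type(seq[0]) is type(sep):
        return seq[0]          # identity: no copy of a possibly huge string
    return sep.join(seq)

# The SpamBayes-style pattern: split on a separator that is usually
# absent, then paste the pieces back together.  In the common
# "separator not found" case, split() yields a one-element list and
# the fast path avoids copying the (possibly very large) string.
big = "x" * 1000000
pieces = big.split("\x00")     # separator absent: one-element list
result = join_fast("", pieces)
assert result is pieces[0]     # the element came back uncopied
```

Whether `"".join([s]) is s` holds for the built-in join is exactly the implementation detail the thread is arguing about, so portable code should not rely on it; the sketch only shows why the optimization matters for performance.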