[Raymond Hettinger, on using Knuth's recurrence for variance]
> I found it.
>
> There is a suggestion that the formula would be better off
> if the recurrence sums are accumulated along with a cumulative
> error term that tracks the loss of significance when adding
> a small adjustment to a larger sum.  The textbook exercise
> indicates that that technique is the most useful in the presence
> of crummy rounding.  Do you think there is any value in using
> the error term tracking?

The tension between speed and accuracy in floating-point numerics bites on every operation.  Tricks like Kahan summation can be valuable, but they can also so grossly complicate the code as to introduce subtle bugs of their own, and can lull one into believing they deliver more than they do.  For a discussion of numeric *problems* with Kahan's gimmick, and yet another layer of complication to overcome them (well, to reduce them to a milder level of weakness):

    http://www.owlnet.rice.edu/~caam420/lectures/fpsum.html

The numeric problems for mean and variance are worst when the inputs vary in sign and span a large dynamic range.  But that's unusual in real-life data.  For example, computing the mean height of American adults involves all-positive numbers that don't even span a factor of 10.

As a bad example, consider this vector:

    x = [1e10]*10 + [21] + [-1e10]*10

There are 21 values in total; the first and last 10 cancel out, so the true mean is 21/21 = 1.  Here are two ways to compute the mean:

    def mean1(x):
        return sum(x) / float(len(x))

    def mean2(x):
        m = 0.0
        for i, elt in enumerate(x):
            m += (elt - m) / (i + 1.0)
        return m

Then:

    >>> mean1(x)
    1.0
    >>> mean2(x)
    0.99999988079071045
    >>>

mean2 is relatively atrocious, and you'll recognize it now as Knuth's recurrence for the mean.  sum(x) is actually exact in 754 double arithmetic (we're not even close to using up 53 bits), and the 9 orders of magnitude spanned by the input elements is much larger than seen in most real-life sources of data.
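For reference, here's a minimal sketch of the Kahan (compensated) summation trick mentioned above; the function name is mine, not from the original discussion:

    def kahan_sum(xs):
        # Carry a correction term c that captures the low-order bits
        # lost when adding a small term to a large running total.
        total = 0.0
        c = 0.0  # running compensation for lost low-order bits
        for x in xs:
            y = x - c            # apply the correction from the last step
            t = total + y        # big + small: low bits of y may be lost
            c = (t - total) - y  # algebraically 0; recovers the lost bits
            total = t
        return total

For instance, plain sum([1e16, 1.0, 1.0, -1e16]) returns 0.0 (each 1.0 vanishes when added to 1e16), while kahan_sum on the same list recovers the true 2.0.  But, as the link above discusses, even this isn't bulletproof.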
The problem with mean2 here is subtler than just that.  So the one thing I absolutely recommend with Knuth's method is keeping a running sum, computing the mean afresh on each iteration, instead of using the mean recurrence.  If that's done, Knuth's recurrence for the sum of (x_i - mean)**2 works very well, even for the "bad example" above.

> ...
> Without the error term tracking, the inner loop simplifies to:
>
>     k += 1
>     newm = m + (x-m)/k
>     s += (x-m)*(x-newm)
>     m = newm

I hope you worked out the derivation for s!  It's astonishing that such a complicated update could collapse to such a simple expression.  The suggestion above amounts to replacing

    newm = m + (x-m)/k

by

    sum += x
    newm = sum / k
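Putting the recommendation together, a sketch of the full loop (the function name is mine; the recurrence for s is the one quoted above, with the mean recomputed from a running sum):

    def sum_sq_dev(xs):
        # Knuth's recurrence for s = sum((x_i - mean)**2), but with the
        # mean recomputed each step from a running sum instead of the
        # mean recurrence -- the variant recommended above.
        total = 0.0  # running sum of the inputs
        m = 0.0      # mean of the inputs seen so far
        s = 0.0      # running sum of squared deviations from the mean
        for k, x in enumerate(xs, 1):
            total += x
            newm = total / k  # fresh mean from the running sum
            s += (x - m) * (x - newm)
            m = newm
        return s

Divide the result by len(xs) or len(xs)-1 as appropriate for population or sample variance.  On the "bad example" vector, this variant's result agrees with the exact sum of squared deviations to near machine precision.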