> > * Is there a way to compute the standard deviation without multiple > > passes over the data (one to compute the mean and a second to sum the > > squares of the differences from the mean? [Tim] > The textbook formula does it in one pass, by accumulating running totals > of > the elements and of their squares. It's numerically atrocious, though, > and > should never be used (it can even end up delivering a negative result!). > Knuth gives a much better recurrence formula for this, doing one-pass > computation of variance in a numerically stable way. I found it. There is a suggestion that the formula would be better off if the recurrence sums are accumulated along with a cumulative error term that tracks the loss of significance when adding a small adjustment to a larger sum. The textbook exercise indicates that that technique is the most useful in the presence of crummy rounding. Do you think there is any value in using the error term tracking? def stddev(data): # Knuth -- Seminumerical Algorithms 4.2.2 p.216 it = iter(data) m = float(it.next()) # new running mean s = 0.0 # running sum((x-mean)**2) k = 1 # number of items dm = ds = 0 # carried forward error term for x in it: k += 1 adjm = (x-m)/k - dm newm = m + adjm dm = (newm - m) - adjm adjs = (x-m)*(x-newm) - ds news = s + adjs ds = (news - s) - adjs s = news m = newm return (s / (k-1)) ** 0.5 # sample standard deviation Without the error term tracking, the inner loop simplifies to: k += 1 newm = m+(x-m)/k s += (x-m)*(x-newm) m = newm Raymond Hettinger
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4