Stable standard deviation

Posted by David Zaslavsky on November 3, 2013 11:40 PM

— Comments

To kick off National Blog Writing Month, here’s something I’ve been saving up for my more mathematically inclined readers — and also for anyone who does scientific computing. It’s a quick computational trick to come out of all the stuff I’ve been doing lately that is not writing blog posts.

Sometimes you have to calculate the standard deviation of a set of numbers. Although there are various definitions of the standard deviation, the usual way to do this in physics is

$$\sigma = \sqrt{\frac{1}{N}\sum (x_k - \expect{x})^2} = \sqrt{\frac{1}{N}\sum x_k^2 - \biggl(\frac{1}{N}\sum x_k\biggr)^2} = \sqrt{\expect{x^2} - \expect{x}^2}$$

where $N$ is the number of data points and $x_k$ are the values. But this can be pretty inaccurate when the values themselves are much larger than the variance — you wind up taking a difference of two very large numbers, the two terms under the square root, and expecting to come out with a small result. That’s bad. Sometimes the difference even winds up being negative!

There’s a better method, described on John D. Cook’s website: instead of computing two large numbers and subtracting them at the end, it incrementally incorporates the deviation of each value from the running mean. Those individual deviations are small, and so it keeps the numerical error down. Here’s my modified description of the algorithm:

Initialize two variables $m$ and $s$ to $m_0 = 0$ and $s_0 = 0$ respectively
Update the variables with each value using the recurrence relations

$
\begin{align}m = m_k &= m_{k-1} + \frac{x_k - m_{k-1}}{k}\\ s = s_k &= s_{k-1} + (x_k - m_{k-1})(x_k - m_k)\end{align}
$
After each value $x_k$ is incorporated, $m_k$ is the mean and $\sqrt{\frac{s_k}{k}}$ is the standard deviation.

How do we know it works? Proof by induction. For the mean, it’s easy:

The case $k = 1$ can be verified by inspection.

$m_1 = m_0 + \frac{x_1 - m_0}{1} = x_1$
For $k > 1$, assume $m_{k-1}$ is the mean of the first $k-1$ values. Then the total so far is $(k - 1) m_{k-1}$. When you add in the $k$th value, the new total is $x_k + (k - 1)m_{k-1}$, and the new mean is

$m_k = \frac{x_k + (k - 1)m_{k-1}}{k} = m_{k-1} + \frac{x_k - m_{k-1}}{k}$

For the standard deviation, it’s just a little more mathy. If $\frac{s_k}{k}$ is supposed to be the variance, then $s_k$ should equal $\sum (x_j - m_k)^2$.

We can verify that this holds for $k = 1$ by inspection: using the recurrence relation,

$s_1 = s_0 + (x_1 - m_0)(x_1 - m_1) = 0$

because $s_0 = 0$, and $m_1 = x_1$. This makes sense; the standard deviation of one value is $\frac{1}{1}(x_1 - m_1)^2 = 0$.
For $k > 1$, assume we’ve already shown that the recurrence relation is accurate for the first $k - 1$ elements. Then

$
\begin{align}s_{k-1} &= \sum_{j=1}^{k-1}(x_j - m_{k-1})^2\\ &= \sum_{j=1}^{k-1}\bigl[(x_j - m_k)^2 + (m_k - m_{k-1})^2 + 2(x_j - m_k)(m_k - m_{k-1})\bigr]\\ &= \sum_{j=1}^{k-1}(x_j - m_k)^2 + (k-1)(m_k - m_{k-1})^2 + 2\biggl(\sum_{j=1}^{k-1}x_j - (k-1)m_k\biggr)(m_k - m_{k-1})\end{align}
$

But $\sum_{j=1}^{k-1}x_j = (k-1)m_{k-1}$, so that’s equal to

$
\begin{align}s_{k-1} &= \sum_{j=1}^{k-1}(x_j - m_k)^2 + (k-1)(m_k - m_{k-1})^2 - 2(k-1)(m_k - m_{k-1})^2 \\ &= \sum_{j=1}^{k-1}(x_j - m_k)^2 - (k - 1)(m_k - m_{k-1})^2\end{align}
$

Now, the term we’re adding to $s_{k-1}$ to get $s_k$ is $(x_k - m_{k-1})(x_k - m_k)$. This can be written as

$(x_k - m_{k-1})(x_k - m_k) = (x_k - m_k)^2 + (m_k - m_{k-1})(x_k - m_k)$

Using the recurrence relation for $m_k$, we find that $x_k - m_k = (k - 1)(m_k - m_{k-1})$, which means the added term can also be written

$(x_k - m_{k-1})(x_k - m_k) = (x_k - m_k)^2 + (k - 1)(m_k - m_{k-1})^2$

Adding that to $s_{k-1}$ from before, the last term cancels out, and the first term gets incorporated into the sum, giving

$s_k = \sum_{j=1}^{k}(x_j - m_k)^2$

which is just what we wanted.