Hydra Chronicles, Part IV: Reformulation of Statistics

When Computing You Are Allowed to Cheat

Throughout the Hydra Chronicles we have explored why group statistics are problematic in R and how to optimize their calculation. R calls are slow, but normally this is not a problem because R will quickly hand off to native routines to do the heavily lifting. With group statistics there must be at least one R call per group. When there are many groups as in the 10MM row and ~1MM group data set we’ve been using, the R-level overhead adds up.

Both data.table and dplyr resolve the problem by providing native routines that can compute per-group statistics completely in native code for a limited number of statistics. This relies on the group expression being in the form statistic(variable) so that the packages may recognize the attempt to compute known group statistics and substitute the native code. There is a more complete discussion of this “Gforce” optimization in an earlier post. We can see it an action here:

library(data.table)
setDTthreads(1)            # for more stable timings

sys.time({                 # wrapper around sytem.time; see appendix
  DT <- data.table(x, grp)
  var.0 <- DT[, .(var=var(x)), keyby=grp]
})
   user  system elapsed
  1.890   0.024   1.923

sys.time is a wrapper around system.time defined in the appendix.

var.0
            grp    var
     1:       1 0.0793
     2:       2 0.0478
    ---               
999952:  999999 0.1711
999953: 1000000 0.0668

But what if we wanted the plain second central moment instead of the unbiased estimate produced by var:

\[E\left[\left(X - E[X]\right)^2\right]\]

Let’s start with the obvious method1:

sys.time({
  DT <- data.table(x, grp)
  var.1 <- DT[, .(var=mean((x - mean(x)) ^ 2)), keyby=grp]
})
   user  system elapsed 
  4.663   0.049   4.728 
var.1
            grp    var
     1:       1 0.0721
     2:       2 0.0409
    ---               
999952:  999999 0.1466
999953: 1000000 0.0594

“Gforce” does not work with more complex expressions, so this ends up substantially slower. We can with some care break up the calculation into simple steps so that “Gforce” can be used, but this requires a group-join-group dance. There is an alternative: as Michael Chirico points out, sometimes we can reformulate a statistic into a different easier to compute form.

With the algebraic expansion:

\[(x - y)^2 = x^2 - 2 x y + y^2\]

And $c$ a constant (scalar), $X$ a variable (vector), $E[X]$ the expected value (mean) of $X$, and the following identities:

\[ \begin{align*} E[c] &= c\\ E\left[E[X]\right] &= E[X]\\ E[c X] &= c E[X]\\ E[X + Y] &= E[X] + E[Y] \end{align*} \]

We can reformulate the variance calculation:

\[ \begin{align*} Var(X) = &E\left[(X - E\left[X\right])^2\right]\\ = &E\left[X^2 - 2X E\left[X\right] + E\left[X\right]^2\right] \\ = &E\left[X^2\right] - 2 E\left[X\right]E\left[X\right] + E\left[X\right]^2 \\ = &E\left[X^2\right] - 2 E\left[X\right]^2 + E\left[X\right]^2 \\ = &E\left[X^2\right] - E\left[X\right]^2 \\ \end{align*} \]

The critical aspect of this reformulation is that there are no interactions between vectors and grouped values (statistics) so we can compute all the component statistics in a single pass. Let’s try it:

sys.time({
  DT <- data.table(x, grp)
  DT[, x2:=x^2]
  var.2 <- DT[,
    .(ux2=mean(x2), ux=mean(x)), keyby=grp  # grouping step
  ][,
    .(grp, var=ux2 - ux^2)                  # final calculation
  ]
})
   user  system elapsed 
  1.159   0.114   1.277 
all.equal(var.1, var.2)
[1] TRUE

Wow, that’s even faster than the “Gforce” optimized “var”. There is only one grouping step, and in that step all expressions are of the form statistic(variable) so they can be “Gforce” optimized.

These concepts apply equally to dplyr.

Beyond Variance

While it’s utterly fascinating (to some) that we can reformulate the variance computation, it’s of little value since both data.table and dplyr have compiled code that obviates the need for it. Fortunately this method is generally applicable to any expression that can be expanded and decomposed as we did with the variance. For example with a little elbow grease we can reformulate the third moment:

\[ \begin{align*} Skew(X) = &E\left[\left(X - E[X]\right)^3\right]\\ = & E\left[X^3 - 3 X^2 E\left[X\right] + 3 X E\left[X\right]^2 - E\left[X\right]^3\right]\\ = & E\left[X^3\right] - 3 E\left[X^2\right] E\left[X\right] + 3 E\left[X\right] E\left[X\right]^2 - E\left[X\right]^3\\ = & E\left[X^3\right] - 3 E\left[X^2\right] - 2E\left[X\right]^3 \end{align*} \]

And to confirm:

X <- runif(100)
all.equal(
  mean((X - mean(X))^3),
  mean(X^3) - 3 * mean(X^2) * mean(X) + 2 * mean(X) ^3
)
[1] TRUE

Similarly, and what set me off on this blogpost, it is possible to reformulate the slope of a single variable linear regression slope. What Michael Chirico saw that was not immediately obvious to me is that:

\[ \begin{align*} Slope(X,Y) = &\frac{Cov(X,Y)}{Var(X)}\\ = &\frac{ E\left[ \left(X - E\left[X\right]\right) \left(Y - E\left[Y\right]\right) \right] }{E\left[\left(X - E\left[X\right]\right)^2\right]}\\ = &\frac{ E\left[XY\right] - E\left[X\right]E\left[Y\right]}{ E\left[X^2\right] - E\left[X\right]^2} \end{align*} \]

In data.table-speak:

DT <- data.table(x, y, xy=x*y, x2=x^2, grp)
slope.dt.re <- DT[,
  .(ux=mean(x), uy=mean(y), uxy=mean(xy), ux2=mean(x2)),
  keyby=grp
][,
  .(grp, slope=(uxy - ux*uy)/(ux2 - ux^2))
]
   user  system elapsed
  1.377   0.126   1.507

As with all the reformulations we’ve seen the key feature of this one is there is no interaction between grouped statistics and ungrouped ones, allowing a single grouping step. With the reformulation data.table is able to beat the cumsum based base-R method we studied in an earlier post:

“Original” in this case is the single-pass cumsum method for base and the group-join-group method for data.table.

More generally we should be able to apply this expansion to any nested statistics involving sum or mean and possibly other functions, which will cover a broad range of interesting statistics. In the case of sum, keep in mind that:

\[ \sum_i^n c = n c \]

And similarly:

\[ \sum_i^n \left(c + X\right) = n c + \sum_i^n X \]

Slow Down Cowboy

Before we run off and reformulate all the things, it is worth noting that the wikipedia page I took the reformulation of the variance from also provides a scary warning:

This equation should not be used for computations using floating point arithmetic because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude.

Uh-oh. Let’s see:

quantile(((var.2 - var.3) / var.2), na.rm=TRUE)
       0%       25%       50%       75%      100%
-2.09e-07 -5.91e-16  0.00e+00  5.94e-16  2.91e-09

We did lose some precision, but certainly nothing catastrophic. So under what circumstances do we actually need to worry? Let’s cook up an example that actually causes a problem, as with these values at the edge of precision of double precision numbers.

X <- c(2^52 + 1, 2^52 - 1)
mean(X)
[1] 4.5e+15
mean((X - mean(X))^2)            # "normal" calc
[1] 1
mean(X^2) - mean(X)^2            # "reformulated" calc
[1] 0

If we carry out the algebraic expansion of the first element of $X^2$ we can see that problems crop up well before the subtraction:

\[ \begin{align*} (X_1)^2 = &(2^{52} + 2^0)^2\\ = &(2^{52})^2 + 2 \times 2^{52} \times 2^0 + (2^0)^2\\ = &2^{104} + 2^{53} + 2^0 \end{align*} \]

Double precision floating point numbers only have 53 bits of precision in the fraction, which means the difference between the highest power of two and the lowest one contained in a number cannot be greater than 52 without loss of precision. Yet, here we are trying to add $2^0$ to a number that contains $2^{104}$! This is completely out of the realm of what double precision numbers can handle:

identical(
  2^104 + 2^53,
  2^104 + 2^53 + 2^0
)
[1] TRUE

One might think, “oh, that’s just the last term, it doesn’t matter”, but it does:

\[ \require{cancel} \begin{align*} E[X^2] = &\frac{1}{N}\sum X^2\\ = &\frac{1}{2}\left(\left(2^{52} + 2^0\right)^2 + \left(2^{52} - 2^0\right)^2\right)\\ = &\frac{1}{2} \left(\left(\left(2^{104} \cancel{+ 2^{53}} + 2^0\right)\right) + \left((2^{104} \cancel{- 2^{53}} + 2^0)\right)\right)\\ = &2^{104} + 2^0 \end{align*} \]

The contributions from the middle terms cancel out, leaving only the last term as the difference between $E[X^2]$ and $E[X]^2$.

In essence squaring of $X$ values is causing meaningful precision to overflow out of the double precision values. By the time we get to the subtraction in $E[X^2] - E[X]^2$ the precision has already been lost. While it is true that $E[X^2]$ and $E[X]^2$ are close in the cases of precision loss, it is not the subtraction that is the problem.

Generally we need to worry when $\left|E[X]\right| \gg Var(X)$. The point at which catastrophic precision loss happens will be around $\left|E[X]\right| / Var(X) \approx 26$, as that is when squaring $X$ values will cause most or all the variance precision to overflow. We can illustrate with a sample with $n = 100$, $Var(X) = 1$, and varying ratios of $\left|E[X]\right| / Var(X)$, repeating the experiment 100 times for each ratio:

The flat lines that start forming around $2^{26}$ correspond to when overflow causes the reformulation to “round down” to zero. The more dramatic errors correspond to the last bit(s) rounding up instead of down2. As the last bits are further and further in magnitude from the true value, the errors get dramatic.

If we know what the true $\left|E[X]\right| / Var(X)$ ratio is, we can estimate the likely loss of precision. Unfortunately we cannot get a correct measure of this value from the reformulated calculations. For example, if the true ratio is $2^{50}$, due to the possibility for gross overestimation the measured ratio might be as low as $2^2$:

Infinite values (i.e. zero measured variance) are removed from the plot.

Another way to look at the precision loss is to examine the underlying binary representation of the floating point numbers. We do so here for each of the 100 samples of each of the attempted $\left|E[X]\right| / Var(X)$ ratios:

Binary representation of reformulated variance value.

The fraction of the numbers is shown in green, and the exponent in blue. Dark squares are ones and dark ones zeroes. The precision loss shows up as trailing zeroes in those numbers (light green). This view makes it obvious that there is essentially no precision by the time we get to $\left|E[X]\right| / Var(X) \approx 26$, which is why the errors in the reformulation start getting so wild. We can estimate a lower bound on the precision by inspecting the bit pattern, but since there may be legitimate trailing zeroes this will likely be an under-estimate. See the appendix for a method that checks a number has at least n bits of precision.

Each reformulation will bring its own precision challenges. The skew calculation will exhibit the same overflow problems as the variance, except further exacerbated by the cube term. The slope reformulation introduces another wrinkle due to the division of the covariance by the variance. We can illustrate with variables $X$ and $Y$ each with true variance 1, and means $2^{26}$ and $2^4$ respectively. We sampled from these 100 values, 100 times, and plot the resulting bit patterns for the resulting finites slopes:

Binary representation of the reformulated slope and its components.

The precision of the overall calculation should be limited by that of the variance, but the interaction between the variance and covariance in all cases causes the bit pattern to overstate the available precision. In some cases, it even gives the illusion that all 53 bits are in use.

Unlike with our cumsum based group sum, I have not found a good way to estimate a bound on precision loss. The bit pattern method is likely our best bet, yet it feels… shaky. Not to mention that for large enough group counts it will produce false positives as some groups will randomly have statistics with many legitimate trailing zeroes.

A saving grace might be that, e.g. for the variance, data for which $\left|E[X]\right| \gg Var(X)$ should be reasonably rare. In cases where that might hold true, we will often have a rough order-of-magnitude estimate of $E[X]$ which we can subtract from the data prior to computation, though this is only useful if the mean is roughly the same across all groups3.

Conclusions

Reformulations is a powerful tool for dealing with group statistics, but the precision issues are worrying because they are difficult to measure, and may take new forms with each different reformulation.

Appendix

Acknowledgments

Special thanks to Michael Chirico for making me see what should have been obvious from the get go. The best optimization is always a better algorithm, and I was too far removed from my econ classes to remember the expected value identities. So far removed in fact that my first reaction on seeing that his calculations actually produced the same result was “how is that even possible”…

Also thank you to:

Code

dplyr Reformulate

We can apply the reformulation to dplyr, but the slow grouping holds it back.

sys.time({
  var.2a <- tibble(x, grp) %>%
    mutate(x2=x^2) %>%
    group_by(grp) %>%
    summarise(ux2=mean(x2), ux=mean(x)) %>%
    mutate(var=ux2 - ux^2) %>%
    select(grp, var)
})
   user  system elapsed 
 11.171   0.299  11.660 
all.equal(as.data.frame(var.2), as.data.frame(var.2a))
[1] TRUE

Data

RNGversion("3.5.2"); set.seed(42)
n     <- 1e7
n.grp <- 1e6
grp   <- sample(n.grp, n, replace=TRUE)
noise <- rep(c(.001, -.001), n/2)  # more on this later
x     <- runif(n) + noise
y     <- runif(n) + noise          # we'll use this later

sys.time

# Run `system.time` `reps` times and return the timing closest to the median
# timing that is faster than the median.

sys.time <- function(exp, reps=11) {
  res <- matrix(0, reps, 5)
  time.call <- quote(system.time({NULL}))
  time.call[[2]][[2]] <- substitute(exp)
  gc()
  for(i in seq_len(reps)) {
    res[i,] <- eval(time.call, parent.frame())
  }
  structure(res, class='proc_time2')
}
print.proc_time2 <- function(x, ...) {
  print(
    structure(
      x[order(x[,3]),][floor(nrow(x)/2),],
      names=c("user.self", "sys.self", "elapsed", "user.child", "sys.child"),
      class='proc_time'
) ) }

Check Bit Precision

The following function will check whether doubles have at least bits precision as implied by their underlying binary encoding. The precision is defined by the least significant “one” present in the fraction component of the number. There is no guarantee that the least significant “one” correctly represents the precision of the number. There may be trailing zeroes that are significant, or a number may gain random trailing noise that is not a measure of precision.

While this implementation is reasonably fast, it is memory inefficient. Improving it to return the position of the last “one” bit will likely make it even more memory inefficient. I have only very lightly tested it so consider it more a proof of concept than a correct or robust implementation.

# requires double `x` of length > 0, and positive scalar `bits`
# assumes little endian byte order.  Not tested with NA or infinite values.

has_precision_bits <- function(x, bits) {
  n <- 53 - min(c(bits-1, 53))
  mask.full <- n %/% 8
  mask.extra <- n %% 8
  mask <- as.raw(c(
    rep(255, mask.full),
    if(mask.extra) 2^(mask.extra) - 1,
    rep(0, 8 - (mask.full + (mask.extra > 0)))
  ) )
  colSums(matrix(as.logical(writeBin(x, raw()) & mask), nrow=8)) > 0
}

  1. By “obvious” I mean the most obvious that will also advance the narrative in the direction I want it to go in. The most obvious one would be to multiply the result of var(...) by .N/(.N-1) to undo the bias correction.

  2. I have no support for this other than it makes sense. I cannot explain why there seem to be more than one “rounding up” other than it may be an artifact of the specific values that were drawn randomly for any given level of $E[X]/Var(X)$. Normally this is something I would chase down and pin down, but honestly I’m kinda running out of patience with this endless post series.

  3. In the case where each group has means that are very different from the others this is no longer a useful correction as matching each group to its mean-estimate is costly. At that point and we might as well revert to a two pass solution.

Brodie Gaslam Written by:

Brodie Gaslam is a hobbyist programmer based on the US East Coast.