Skewing Around, Part 2


\( \DeclareMathOperator*{\argmax}{\mathbf{arg\,max}} \)

Where did November go? Or the first half of December for that matter?

In Part 1, I began describing a project to estimate the skew of many integer streams in real time. I explained how the second frequency moment, also called the surprise number, can be used for this purpose, and demonstrated using the AMS algorithm to estimate it with minimal compute and memory resources. However, the post concluded with the observation that it is difficult to compare second frequency moments between streams. In this post, I'll explain how I solved that problem.

Coming up with an approach wasn't that hard once I recalled two facts about surprise numbers,

  1. They attain their minimum when the distribution is uniform.
  2. They increase monotonically with the distribution's skew.

I reasoned that dividing the estimated surprise number by the theoretical minimum should yield a scaled skew measure comparable between streams,

\begin{equation*} R = \frac{\hat{S}}{S_{min}} \end{equation*}

where \(\hat{S}\) and \(S_{min}\) are the estimated and minimum surprise numbers, respectively.

The minimum surprise number is easily calculated,

\begin{align*} S_{min} &= \sum_{i=1}^{n} \left( \frac{O}{n} \right)^2\\ &= n \, \left( \frac{O}{n} \right)^2\\ &= \frac{O^2}{n} \end{align*}

where \(O\) and \(n\) are the number of observations and unique values respectively. Now, \(O\) was easy as it just required counting the number of values observed. Determining \(n\) was another story.
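
Once \(\hat{S}\), \(O\), and \(n\) are in hand, the ratio itself is trivial to compute. Here's a minimal sketch in Julia (the function name is mine, purely for illustration):

# Scaled skew: estimated surprise number over its theoretical minimum O^2/n.
#   S_hat - estimated surprise number (e.g. from the AMS sketch in Part 1)
#   O     - total number of observations
#   n     - number of unique values
scaledSkew(S_hat, O, n) = S_hat / (O^2 / n)

scaledSkew(500.0, 100, 20)   # 1.0 - a uniform stream sits at the minimum
scaledSkew(1000.0, 100, 20)  # 2.0 - more skew, larger ratio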

In Part 1, we noted that explicitly tracking unique values wasn't an option due to the many possible values and large number of streams. The AMS algorithm allowed estimating surprise numbers without having to do that. I needed a similar way to estimate \(n\). Enter the HyperLogLog algorithm.

HyperLogLog is a probabilistic algorithm for estimating cardinality, the number of distinct elements in a set or stream. It was introduced in a paper by Flajolet et al. in 2007. In 2013, a paper by some fine folks at Google described an improved version, called HyperLogLog++, that yielded better accuracy using fewer resources. Other cardinality estimation algorithms exist, but HyperLogLog appears to be a popular choice.

HyperLogLog and its variants rely on the magic of hashing and stochastic averaging. Each stream value is hashed and the result is inspected to see how many leading zeroes it has. For example, if the stream value 25 is hashed to the 32-bit value 0x0000ffff, there are 16 leading zeroes. The algorithm keeps track of the maximum number of leading zeroes over all of the hashed stream values. Now, this is the tricky bit. Assuming that the hash function is uniform, the probability of a hash result having at least \(r\) leading zeroes is \(2^{-r}\). With a bit of tricky math, it can be shown that the estimated cardinality is \(2^{r_{max}}\), where \(r_{max}\) is the largest number of leading zeroes observed. Intuitively this makes sense, as observing an event with probability \(2^{-r}\) on average takes \(2^r\) trials. So, tracking the rarest value provides an estimate of the number of distinct hash values and, assuming no collisions, the stream's cardinality. That's the hashing part of the magic. The stochastic averaging part involves partitioning the hash values into buckets, tracking the maximum number of leading zeroes in each, and averaging the results, which tames the high variance of relying on a single maximum.

This may sound complicated, but it's actually easy to implement. Here's a quick version in Julia.

mutable struct hll
  p::Int64               # number of index bits
  m::Int64               # number of buckets (2^p)
  alpha::Float64         # bias correction constant
  buckets::Vector{Int8}  # max rank seen per bucket (-1 = empty)
end

function hll(p::Int)
  m = 2^p

  # bias correction constants from the HyperLogLog paper
  if m <= 16
    alpha = 0.673
  elseif m <= 32
    alpha = 0.697
  elseif m <= 64
    alpha = 0.709
  else
    alpha = 0.7213/(1 + 1.079/m)
  end

  hll(p, m, alpha, fill(Int8(-1), m))
end

function bucketIdx(v::UInt64, p::Integer)
  # the low p bits of the hash select the bucket
  mask = (UInt64(1) << p) - 1
  (v & mask) + 1
end

function clz(v::UInt64, p::Int)
  # count leading zeroes in the bits above the p index bits
  mask  = UInt64(1) << 63
  limit = UInt64(1) << (p - 1)
  c     = 0

  while ((v & mask) == 0) && (mask > limit)
    mask >>= 1
    c += 1
  end
  c
end

function update(h::hll, v)
  hv   = hash(v)
  bidx = bucketIdx(hv, h.p)
  nlz  = clz(hv, h.p)

  # keep the largest rank (leading zeroes + 1) seen for this bucket
  h.buckets[bidx] = max(h.buckets[bidx], nlz + 1)
end

function estimate(h::hll)
  bsum   = 0.0
  nempty = 0
  for b in h.buckets
    if b >= 0
      bsum += 1.0/(2^b)
    else
      nempty += 1
    end
  end

  # NB - omitting the empirical bias correction
  #      described in the Google paper for
  #      simplicity
  eraw = h.alpha * (h.m^2)/bsum
  if nempty > 0
    # fall back to linear counting while some buckets remain empty
    est = h.m * log(h.m / nempty)
  else
    est = eraw
  end

  convert(Int, round(est))
end
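
As a quick sanity check, here's some hypothetical usage of the type defined above on a stream of 1,000 distinct values (the exact estimate will vary with the hash):

h = hll(10)        # 2^10 = 1024 buckets
for v in 1:1000
  update(h, v)
end
estimate(h)        # typically within a few percent of 1000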

With a bit more code, we can test how the algorithm's accuracy changes with two parameters: \(N\), the number of unique values, and \(P\), which sets the number of buckets used for averaging to \(2^P\).

using DataFrames, Statistics

function test(ps, ns, r)
  df = DataFrame(N=Int64[], P=Int64[], MAPE=Float32[])
  for n in ns
    for p in ps
      perrors = Float64[]
      for i = 1:r
        h = hll(p)
        # n random 64-bit values are distinct with overwhelming probability
        for v in rand(UInt64, n)
          update(h, v)
        end
        est = estimate(h)
        err = abs(n - est)/n
        push!(perrors, err)
      end

      mape = mean(perrors)   # mean absolute percentage error
      push!(df, (n, p, mape))
    end
  end

  df
end

df = test([4, 8, 10], [10_000, 100_000, 1_000_000], 3)

The results clearly show that accuracy improves with \(P\), i.e., with the number of buckets. For a given \(P\), there's a bit of odd behavior as \(N\) increases, but it appears to be less of an issue at higher values of \(P\). The errors are also roughly in line with the \(1.04/\sqrt{2^P}\) relative error reported in the original paper.

N        P=4   P=8   P=10
10000    0.14  0.03  0.03
100000   0.26  0.06  0.02
1000000  0.29  0.03  0.01

Using HyperLogLog, I could now calculate a scaled skew metric that could be compared between streams. I was ready to declare "success", but my friend wanted to know more, in particular a Pareto of occurrences versus the number of unique values, for example, 80% of the occurrences coming from 20% of the values. I'll explain how I approached that in the next post.

Topics:

MachineLearning Streams Julia