December 12, 2014

Skewing Around, Part 2

Where did November go? Or the first half of December for that matter?

In Part 1, I began describing a project to estimate the skew of many integer streams in real time. I explained how the second frequency moment, also called the surprise number, can be used for this purpose and demonstrated using the AMS algorithm to estimate it with minimal compute and memory resources. However, that post concluded with the observation that it is difficult to compare second frequency moments between streams. In this post, I’ll explain how I solved that problem.

Coming up with an approach wasn’t that hard once I recalled two facts about surprise numbers,

  1. They are at their minimum when the distribution is uniform.
  2. Their value monotonically increases with the distribution’s skew.

I reasoned that dividing the estimated surprise number by the theoretical minimum should yield a scaled skew measure comparable between streams,

$$ R = \frac{\hat{S}}{S_{min}} $$

where \(\hat{S}\) and \(S_{min}\) are the estimated and minimum surprise numbers respectively.

The minimum surprise number is easily calculated,

$$ \begin{aligned} S_{min} &= \sum_{i=1}^{n} \left( \frac{O}{n} \right)^2\\ &= n \, \left( \frac{O}{n} \right)^2\\ &= \frac{O^2}{n} \end{aligned} $$

where \(O\) and \(n\) are the number of observations and unique values respectively. Now, \(O\) was easy as it just required counting the number of values observed. Determining \(n\) was another story.
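
Setting \(n\) aside for a moment, here’s what the full calculation looks like in Julia once both quantities are in hand. This is just the formulas above; the function and argument names are mine, purely for illustration.

    # Minimum surprise number for o observations spread over n unique values
    smin(o, n) = o^2 / n

    # Scaled skew ratio R = S_hat / S_min
    scaledskew(s_hat, o, n) = s_hat / smin(o, n)

    # A perfectly uniform stream of 1,000 observations over 100 unique values
    # has surprise number 100 * 10^2 = 10,000 = S_min, so R = 1.
    scaledskew(10000, 1000, 100)    # => 1.0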

In Part 1, we noted that explicitly tracking unique values wasn’t an option due to the many possible values and large number of streams. The AMS algorithm allowed estimating surprise numbers without having to do that. I needed a similar way to estimate \(n\). Enter the HyperLogLog algorithm.

HyperLogLog is a probabilistic algorithm for estimating cardinality, the number of distinct elements in a set or stream. It was introduced in a 2007 paper by Flajolet et al. In 2013, a paper by some fine folks at Google described an improved version, called HyperLogLog++, that yielded better accuracy using fewer resources. Other cardinality estimation algorithms exist, but HyperLogLog appears to be a popular choice.

HyperLogLog and its variants rely on the magic of hashing and stochastic averaging. Each stream value is hashed and the result inspected to see how many leading zeroes it has. For example, if the stream value 25 is hashed to the 32-bit value 0x0000ffff, there are 16 leading zeroes. The algorithm keeps track of the maximum number of leading zeroes over all of the hashed stream values. Now, this is the tricky bit. Assuming that the hash function is uniform, the probability of a hash result having at least \(r\) leading zeroes is \(2^{-r}\). With a bit of tricky math, it can be shown that the estimated cardinality is roughly \(2^{r_{max}}\), where \(r_{max}\) is the maximum number of leading zeroes observed. Intuitively this makes sense, as observing an event with probability \(2^{-r}\) takes \(2^r\) trials on average. So, tracking the rarest hash value provides an estimate of the number of distinct hash values and, assuming no collisions, the stream’s cardinality. That’s the hashing part of the magic.

The stochastic averaging part addresses the fact that a single maximum is easily thrown off by one unlucky hash. The hash values are partitioned into buckets, the maximum number of leading zeroes is tracked in each, and the results are averaged, which smooths out the variance of any single bucket.
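
To convince yourself of the leading-zero probabilities, a quick sanity check along these lines works well. It’s a throwaway sketch using Julia’s built-in hash and leading_zeros functions; the observed fractions should track \(2^{-r}\).

    # Hash a batch of random values and check how often the results have at
    # least r leading zeroes; with a uniform hash the fraction should be
    # close to 2^-r.
    vals   = rand(Uint, 100000)
    hashes = map(hash, vals)
    for r in 1:4
      frac = count(h -> leading_zeros(h) >= r, hashes) / length(hashes)
      println("P(at least $r leading zeroes) = $frac, expected $(2.0^-r)")
    end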

This may sound complicated, but it’s actually easy to implement. Here’s a quick version in Julia.

    # HyperLogLog state: p determines the number of buckets (m = 2^p), alpha is
    # the bias-correction constant, and buckets holds the maximum rank (leading
    # zero count + 1) seen in each bucket, with -1 marking an empty bucket.
    type hll
      p::Int64
      m::Int64
      alpha::FloatingPoint
      buckets::Array{Int8,1}
    end
    
    # Outer constructor: derive the bucket count from p and pick the standard
    # bias-correction constant for that bucket count.
    function hll(p::Int) 
     local m = 2^p;
     local alpha;
    
     if (m <= 16)
      alpha = 0.673
     elseif (m <= 32) 
      alpha = 0.697
     elseif (m <= 64)
      alpha = 0.709
     else
      alpha = 0.7213/(1 + 1.079/m);
     end
    
     hll(p, m, alpha, fill(-1,m));
    end
    
    # Pick a bucket using the low p bits of the hash value (1-based index).
    function bucketIdx(v::Uint64, p::Integer)
      local lv    = convert(Uint64,v);
      local mask  = convert(Uint64,2^p)-1;
      (lv &  mask) + 1
    end
    
    # Count the leading zeroes among the bits above the p bucket-index bits.
    function clz(v::Uint64,p::Int)
      local lv    = convert(Uint64,v);
      local mask  = convert(Uint64,1) << 63;    # 2^63 would overflow Int64, so shift instead
      local limit = convert(Uint64,1) << (p-1);
      local c     = 0;
    
      while (((lv & mask) == 0) && (mask > limit))
        mask = mask >> 1
        c +=1
      end
      c
    end
    
    # Hash a stream value, route it to a bucket, and keep the bucket's maximum
    # rank (leading zero count + 1).
    function update(h::hll, v)
      local hv   = hash(v)
      local bidx = bucketIdx(hv, h.p)
      local nlz  = clz(hv, h.p)
    
      h.buckets[bidx] = max(h.buckets[bidx], nlz+1)
    end
    
    # Harmonic-mean estimate over the buckets, falling back to linear counting
    # while some buckets are still empty.
    function estimate(h::hll)
     local bsum   = 0;
     local nempty = 0;
     for b in h.buckets
       if (b >= 0)
         bsum += 1.0/(2^b)
       else
        nempty += 1
       end
     end
    
     # NB - omitting the empirical bias correction
     #      described in the Google paper for 
     #      simplicity
     eraw = h.alpha * (h.m^2)/bsum  
     if (nempty > 0)
       est = h.m * log(h.m / nempty)
     else
       est = eraw
     end
    
     convert(Int, round(est))
    end
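
A quick smoke test of the implementation might look something like the following; the exact estimate will vary from run to run with the hash values.

    h = hll(10)            # 2^10 = 1024 buckets
    for v in 1:50000
      update(h, v)
    end
    estimate(h)            # should land in the neighborhood of 50000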

With a bit more code, we can test how the algorithm’s accuracy changes with two parameters: \(N\), the number of unique values, and \(P\), where the algorithm uses \(2^P\) buckets for the averaging.

    using DataFrames
    
    # For each combination of P and N, run r trials and record the mean
    # absolute percentage error (MAPE) of the cardinality estimate.
    function test(ps,ns,r)
     local df = DataFrame(N=Int64[], P=Int64[], MAPE=Float32[])
     for n in ns
       for p in ps
         perrors = Float64[]
         for i = 1:r
           local h = hll(p)
           for v in rand(Uint, n)
             update(h,v)
           end
           est = estimate(h)
           err = abs(n-est)/n
           push!(perrors, err)
         end
    
         mape = mean(perrors)
         push!(df, [n p mape])
       end
     end
    
     df
    end
    
    df = test([4 8 10], [10000 100000 1000000],3)

The results clearly show that accuracy improves with \(P\), the number of buckets. For a given \(P\), there’s a bit of odd behavior as \(N\) increases, but it appears to be less of an issue at higher values of \(P\).

| N       | P=4  | P=8  | P=10 |
|---------|------|------|------|
| 10000   | 0.14 | 0.03 | 0.03 |
| 100000  | 0.26 | 0.06 | 0.02 |
| 1000000 | 0.29 | 0.03 | 0.01 |

Using HyperLogLog, I could now calculate a scaled skew metric that could be compared between streams. I was ready to declare “success”, but my friend wanted to know more, in particular a Pareto of occurrences versus the number of unique values - for example, 80% of the occurrences coming from 20% of the values. I’ll explain how I approached that in the next post.

Tags:  MachineLearning , Streams , Julia