#### Skewing Around, Part 3

In parts 1 and 2 , I described how I used the AMS and HyperLogLog online algorithms to estimate the frequency distribution of values in multiple integer sequences. I was doing this for a friend that needed a compute and space efficient way to analyze thousands of such streams.

Although the solution I came up with in Part 2 worked, it didn’t express the “skew” in human understandable terms like “20% of the values accounted for 80% of the occurrences”. Challenge accepted.

In the last post, I explained that an integer stream of length $$O$$ with an equal number of occurrences of $$N$$ values will have a minimum surprise number of,

\begin{aligned} S_{min} &= \frac{O^2}{N} \end{aligned}

And that this can be combined with the estimated surprise number to create a scaled “skew” metric $$R$$ that is comparable between different streams,

$$R = \frac{\hat{S}}{S_{min}}$$\/
Now, consider a stream where proportion \$$q\$$ of the values account for proportion \$$p\$$ of the occurrences. With some simplifying assumptions, the surprise number for this stream will be,
\begin{aligned} S_{pq} &= \sum_{i}^{qN} \left(\frac{pO}{qN}\right)^2 + \sum_{j}^{(1-q)N} \left(\frac{(1-p)O}{(1-q)N}\right)^2\\ &= \frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}\\ \end{aligned}

The resulting $$R$$ metric will be,

\begin{aligned} R &= \frac{S_{pq}}{S_{min}}\\ &= \frac{\frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}}{S_{min}}\\ &= \frac{\frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}}{\frac{O^2}{N}}\\ &= \frac{p^2}{q} + \frac{(1-p)^2}{(1-q)}\\ &= \frac{p^2(1-q)}{q(1-q)} + \frac{q(1-)^2)}{q(1-q)}\\ &= \frac{p^2(1-q) + q(1 - 2p + p^2)}{q(1-q)}\\ &= \frac{p^2 - p^2q + q - 2pq + p^2q}{q(1-q)}\\ &= \frac{p^2 + q - 2pq}{q(1-q)}\\ &= \frac{p^2 + q - 2pq + (q^2 - q^2)}{q(1-q)}\\ &= \frac{(p^2 - 2pq + q^2) + q - q^2}{q(1-q)}\\ &= \frac{(p-q)^2 + q(1-q)}{q(1-q)}\\ &= \frac{(p-q)^2}{q(1-q)} + 1\\ \end{aligned}

Expressing this in terms of the estimated ratio described in the last post leads to,

\begin{aligned} R &= \frac{\hat{S}}{S_{min}} \approx \frac{(p-q)^2}{q(1-q)} + 1\\ \end{aligned}

Now, lets substitute $$R$$ with a new metric $$R^{\prime} = \tilde{R} - 1$$,

\begin{aligned} R^{\prime}_s &\approx \frac{(p-q)^2}{q(1-q)}\\ R^{\prime} q(1-q) &\approx (p-q)^2\\ \sqrt{R^{\prime} q(1-q)} &\approx p-q\\ p &\approx q + \sqrt{R^{\prime} q(1-q)}\\ \end{aligned}

Since this equation has two unknowns, there are many solutions of $$p$$ and $$q$$ for a given $$R^{\prime}$$. For example, $$\tilde{R} = 3.25$$ has an infinite number of possible solutions including: $${p=.9, q=.25},{p=.8, q=.2}, {p=.6,q=.15}$$. The following figure illustrates this,

However, as the figure suggests, an upper bound on $$q$$ can be found by setting $$p <= 1$$.

\begin{aligned} q + \sqrt{R^{\prime} q(1-q)} &<= 1\\ \sqrt{R^{\prime} q(1-q)} &<= (1-q)\\ R^{\prime} q(1-q) &<= (1-q)^2\\ R^{\prime} q &<= (1-q)\\ R^{\prime} &<= \frac{(1-q)}{q}\\ R^{\prime} &<= \frac{1}{q} - 1\\ R^{\prime} + 1 &<= \frac{1}{q}\\ \end{aligned}

Recalling the definition of $$R^{\prime}$$,

\begin{aligned} R - 1 + 1 &<= \frac{1}{q}\\ R &<= \frac{1}{q}\\ q &<= \frac{1}{R}\\ \end{aligned}

This means that the upper bound on the percentage of skewed values is the reciprocal of the estimated $$R$$. Cool!

Unfortunately, with both $$p$$ and $$q$$ unknown and only one equation, this was as far as I could take this. However, the upper bound and family of curves for given values of $$p$$ were enough for my friend to do what he wanted.

This was a fun project. I got to see first hand that although probabilistic online algorithms sacrifice accuracy for efficiency, they can still perform well enough to get the job done!