In parts 1 and 2 , I described how I used the AMS and HyperLogLog online algorithms to estimate the frequency distribution of values in multiple integer sequences. I was doing this for a friend that needed a compute and space efficient way to analyze thousands of such streams.

Although the solution I came up with in Part 2 worked, it didn’t express the “skew” in human understandable terms like “20% of the values accounted for 80% of the occurrences”. Challenge accepted.

In the last post, I explained that an integer stream of length \(O\) with an equal number of occurrences of \(N\) values will have a minimum surprise number of,

$$
\begin{aligned}
S_{min} &= \frac{O^2}{N}
\end{aligned}
$$

And that this can be combined with the estimated surprise number to create a scaled “skew” metric \(R\) that is comparable between different streams,

$$
R = \frac{\hat{S}}{S_{min}}
$$\/

Now, consider a stream where proportion \\(q\\) of the values account for
proportion \\(p\\) of the occurrences. With some simplifying assumptions,
the surprise number for this stream will be,

$$
\begin{aligned}
S_{pq} &= \sum_{i}^{qN} \left(\frac{pO}{qN}\right)^2 + \sum_{j}^{(1-q)N} \left(\frac{(1-p)O}{(1-q)N}\right)^2\\
&= \frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}\\
\end{aligned}
$$

The resulting \(R\) metric will be,

$$
\begin{aligned}
R &= \frac{S_{pq}}{S_{min}}\\
&= \frac{\frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}}{S_{min}}\\
&= \frac{\frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}}{\frac{O^2}{N}}\\
&= \frac{p^2}{q} + \frac{(1-p)^2}{(1-q)}\\
&= \frac{p^2(1-q)}{q(1-q)} + \frac{q(1-)^2)}{q(1-q)}\\
&= \frac{p^2(1-q) + q(1 - 2p + p^2)}{q(1-q)}\\
&= \frac{p^2 - p^2q + q - 2pq + p^2q}{q(1-q)}\\
&= \frac{p^2 + q - 2pq}{q(1-q)}\\
&= \frac{p^2 + q - 2pq + (q^2 - q^2)}{q(1-q)}\\
&= \frac{(p^2 - 2pq + q^2) + q - q^2}{q(1-q)}\\
&= \frac{(p-q)^2 + q(1-q)}{q(1-q)}\\
&= \frac{(p-q)^2}{q(1-q)} + 1\\
\end{aligned}
$$

Expressing this in terms of the estimated ratio described in the last post leads to,

$$
\begin{aligned}
R &= \frac{\hat{S}}{S_{min}} \approx \frac{(p-q)^2}{q(1-q)} + 1\\
\end{aligned}
$$

Now, lets substitute \(R\) with a new metric \(R^{\prime} = \tilde{R} - 1\),

$$
\begin{aligned}
R^{\prime}_s &\approx \frac{(p-q)^2}{q(1-q)}\\
R^{\prime} q(1-q) &\approx (p-q)^2\\
\sqrt{R^{\prime} q(1-q)} &\approx p-q\\
p &\approx q + \sqrt{R^{\prime} q(1-q)}\\
\end{aligned}
$$

Since this equation has two unknowns, there are many solutions of \(p\) and \(q\) for a given \(R^{\prime}\). For example, \(\tilde{R} = 3.25\) has an infinite number of possible solutions including: \({p=.9, q=.25},{p=.8, q=.2}, {p=.6,q=.15}\). The following figure illustrates this,

However, as the figure suggests, an upper bound on \(q\) can be found by setting \(p <= 1\).

$$
\begin{aligned}
q + \sqrt{R^{\prime} q(1-q)} &<= 1\\
\sqrt{R^{\prime} q(1-q)} &<= (1-q)\\
R^{\prime} q(1-q) &<= (1-q)^2\\
R^{\prime} q &<= (1-q)\\
R^{\prime} &<= \frac{(1-q)}{q}\\
R^{\prime} &<= \frac{1}{q} - 1\\
R^{\prime} + 1 &<= \frac{1}{q}\\
\end{aligned}
$$

Recalling the definition of \(R^{\prime}\),

$$
\begin{aligned}
R - 1 + 1 &<= \frac{1}{q}\\
R &<= \frac{1}{q}\\
q &<= \frac{1}{R}\\
\end{aligned}
$$

This means that the upper bound on the percentage of skewed values is the reciprocal of the estimated \(R\). Cool!

Unfortunately, with both \(p\) and \(q\) unknown and only one equation, this was as far as I could take this. However, the upper bound and family of curves for given values of \(p\) were enough for my friend to do what he wanted.

This was a fun project. I got to see first hand that although probabilistic online algorithms sacrifice accuracy for efficiency, they can still perform well enough to get the job done!

Tags: MachineLearning , Streams