This past holiday season, I once again spent my "time off" working on a solution to a Kaggle holiday challenge. This year, Santa needed help after his magic sleigh was stolen!

Instead of delivering all the presents in one trip, Santa was forced to make multiple trips using his non-magical sleigh. The sleigh had a weight limit and the poor reindeer were in danger of exhaustion from all the trips. Santa needed a plan for delivering the presents that minimized reindeer weariness. Poor little guys. You can find out more about the challenge in this post by the designer.

This was essentially a capacitated vehicle routing problem. I approached it by clustering presents based on geographic distance. To avoid an \(O(N^2)\) computation, I needed a way to efficiently determine a present's closest neighbors. I ended up using a form of spherical quad-tree (SQT), or quaternary triangular mesh, based on a coordinate transformation method described in this paper.

Given latitude and longitude locations, the SQT maps each of them to an octahedron face that serves as an (initially poor) approximation of the globe. The octahedron is then tessellated and the points mapped to the appropriate triangles. When the tessellation is done, locations close to each other get mapped to the same triangle. This substantially reduces the number of locations to consider when clustering. My SQT implementation is extremely simple - all points must be added before the tessellation is done and it doesn't support hierarchical querying. But I think the implementation can be extended to be a general purpose SQT without much effort.

To make sure I got the SQT math right, I wrote a small prototype in Processing. The following animation visualizes the initial octahedron, two tessellations, and the successive mapping of a test point. Debugging the math visually made getting the SQT right much, much easier.

In prior Kaggle competitions, I ran into run-time issues with solutions implemented in R, Python, and Julia. This time I decided to try Go. I was very happy with this choice. Execution time was very short and I liked Go's minimalist syntax. In fact, I suspect it will become my new "system" programming language of choice.

I've posted the Go and Processing source code to my GitHub account.

My ego wishes I had done better on the leader board. But, as is usually the case, I learned a lot from working on this challenge. One of the things I like best about data science is that it offers opportunities to learn new things like SQTs. It's hard to put into words how much I enjoy that.

Now, onto the next competition…

For the Whenology project, I needed to visually describe key concepts related to leaf color change during Fall. The obvious answer was to simply draw some trees! But how?

I knew that D3js could import SVG files, so the first step was to draw a tree. I started with a public domain picture of a leaf and used Adobe Illustrator's Image Trace feature to turn it into an SVG outline.

I also used Illustrator to draw a tree trunk with branches,

Over this, I copied-and-pasted the leaf outline many times to create a tree with full foliage and saved the image as an SVG.

At this point, I could use D3js to render this exact tree on a webpage. That was nice, but what I really needed was a way to draw trees with different amounts of leaves and degrees of brownness. The brute force approach would have been to draw different trees in Illustrator but that seemed "inelegant". What I really wanted to do was modify the tree SVG dynamically with D3js.

This turned out to be pretty easy to do. The essential trick was to edit the tree SVG file and add a class attribute to each leaf path. Basically, change this

... <path fill="#8CC63F" stroke="#000000" ...../> <path fill="#8CC63F" stroke="#000000" ...../> <path fill="#8CC63F" stroke="#000000" ...../> <path fill="#8CC63F" stroke="#000000" ...../> ...

to this,

... <path class="leaf" fill="#8CC63F" stroke="#000000" ...../> <path class="leaf" fill="#8CC63F" stroke="#000000" ...../> <path class="leaf" fill="#8CC63F" stroke="#000000" ...../> <path class="leaf" fill="#8CC63F" stroke="#000000" ...../> ...

With that small change, the color and number of leaves can be changed interactively using D3js. To see this in action, play around with the sliders in this visualization,

This approach isn't limited to visualizing one tree, a whole forest can be made by repeatedly cloning the SVG object. This was exactly what I wanted. I was really happy with the visualizations I created using this approach.

Below is example D3js and JavaScript code for loading, rendering, and interactively updating the tree. This gist shows how I created the sliders based on Mike Bostock's example.

var leafColor = d3.scale.quantize() .domain([0,1]) .range([d3.rgb("#8da43b"),d3.rgb("#5c5b32"),d3.rgb("#814725")]); function drawTree(x, y, pBrown, pMissing, scale) { var treeG = svg.append("g") .attr("class","tree") .attr("transform", "translate(" + (x-(treeWidth/2)*scale) + "," + (y-treeHeight*scale) + ")scale(" + scale + ")") .each(function (d,i) { this.appendChild(treeNode.cloneNode(true)); }); treeG.selectAll(".leaf") .style("fill", function(d) { return leafColor(pBrown + Math.random()) ;}) //<= pbrown ? "brown" : "green"; }) .style("opacity", function(d) { return Math.random() <= pMissing ? 0 : 1; }); } function update(pBrown, pMissing) { svg.select(".tree") .selectAll(".leaf") .transition() .style("fill", function(d) { return leafColor(pBrown + Math.random()) ;}) .style("opacity", function(d) { return Math.random() <= pMissing ? 0 : 1; }); } var svg, width, treeNode, treeHeight, treeWidth; d3.xml("/data/tree.svg", function (e, svgTree) { var margin = {top: 20, right: 40, bottom: 20, left: 40}, width = 600 - margin.left - margin.right, height = 350 - margin.top - margin.bottom; svg = d3.select("#viz").append("svg") .attr("height", height + margin.top + margin.bottom) .attr("width", width + margin.left + margin.right) .append("g") .attr("transform", "translate(" + margin.left + "," + margin.top + ")"); treeNode = document.importNode(svgTree.documentElement, true); treeHeight = parseFloat(treeNode.getAttribute("height")); treeWidth = parseFloat(treeNode.getAttribute("width")); treeX = width-treeWidth/2; treeY = treeHeight; var pBrown = -1; // Range -1 to 1 to force all green or brown // regardles of random number var pMissing = 0; // Range 0 to 1 var pScale = 1; drawTree(treeX, treeY, pBrown, pMissing, pScale); }

I like to solve hard problems. So much so that when I get a
particularly good one it's often difficult to do **anything** else.
This personality quirk has been great for my engineering career
but not so good for consistently blogging. Maybe you've noticed.

The good news is that I'm finally coming up for air and have lots to blog about.

Since February, I've been collaborating with scientists from Acadia National Park, the Earthwatch Institute, and the Schoodic Institute to study how climate change may be altering the life-cycles of and interactions between species at the park. This area of science is called Phenology but for fun we decided to name the project Whenology - a play on words based on the important of life-cycle timing.

To do this, we're combining NOAA weather data, NASA MODIS vegetation index data, and field observation data collected the citizen science organizations HawkWatch, eBird, and USA National Phenology Network. This may be the first time these data have been combined in this way. It's been very exciting to help explore this relatively new area of research.

The team just completed a pilot feasibility project focused on raptor migrations in the Acadia region. My part included writing code in R, Python, and Spark to transform, combine, analyze, and visualize these data. I integrated this work into an RShiny application to enable non-technical team members to also explore the data. I also used d3js to create dynamic visualizations for the project's first website focused on education.

In (near?) future posts, I plan to share tips and tricks that I "discovered" while working on the pilot project. Work on the next project phase is about to begin. I hope to be less obsessed this time around but it's going to be hard :-).

Be sure to checkout the project website!

\( \DeclareMathOperator*{\argmax}{\mathbf{arg\,max}} \)

In parts 1 and 2 , I described how I used the AMS and HyperLogLog online algorithms to estimate the frequency distribution of values in multiple integer sequences. I was doing this for a friend that needed a compute and space efficient way to analyze thousands of such streams.

Although the solution I came up with in Part 2 worked, it didn't express the "skew" in human understandable terms like "20% of the values accounted for 80% of the occurrences". Challenge accepted.

In the last post, I explained that an integer stream of length \(O\) with an equal number of occurrences of \(N\) values will have a minimum surprise number of,

\begin{align*} S_{min} &= \frac{O^2}{N} \end{align*}And that this can be combined with the estimated surprise number to create a scaled "skew" metric \(R\) that is comparable between different streams,

\begin{equation*} R = \frac{\hat{S}}{S_{min}} \end{equation*}Now, consider a stream where proportion \(q\) of the values account for proportion \(p\) of the occurrences. With some simplifying assumptions, the surprise number for this stream will be,

\begin{align*} S_{pq} &= \sum_{i}^{qN} \left(\frac{pO}{qN}\right)^2 + \sum_{j}^{(1-q)N} \left(\frac{(1-p)O}{(1-q)N}\right)^2\\ &= \frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}\\ \end{align*}The resulting \(R\) metric will be,

\begin{align*} R &= \frac{S_{pq}}{S_{min}}\\ &= \frac{\frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}}{S_{min}}\\ &= \frac{\frac{(pO)^2}{qN} + \frac{((1-p)O)^2}{(1-q)N}}{\frac{O^2}{N}}\\ &= \frac{p^2}{q} + \frac{(1-p)^2}{(1-q)}\\ &= \frac{p^2(1-q)}{q(1-q)} + \frac{q(1-)^2)}{q(1-q)}\\ &= \frac{p^2(1-q) + q(1 - 2p + p^2)}{q(1-q)}\\ &= \frac{p^2 - p^2q + q - 2pq + p^2q}{q(1-q)}\\ &= \frac{p^2 + q - 2pq}{q(1-q)}\\ &= \frac{p^2 + q - 2pq + (q^2 - q^2)}{q(1-q)}\\ &= \frac{(p^2 - 2pq + q^2) + q - q^2}{q(1-q)}\\ &= \frac{(p-q)^2 + q(1-q)}{q(1-q)}\\ &= \frac{(p-q)^2}{q(1-q)} + 1\\ \end{align*}Expressing this in terms of the estimated ratio described in the last post leads to,

\begin{align*} R &= \frac{\hat{S}}{S_{min}} \approx \frac{(p-q)^2}{q(1-q)} + 1\\ \end{align*}Now, lets substitute \(R\) with a new metric \(R^{\prime} = \tilde{R} - 1\),

\begin{align*} R^{\prime}_s &\approx \frac{(p-q)^2}{q(1-q)}\\ R^{\prime} q(1-q) &\approx (p-q)^2\\ \sqrt{R^{\prime} q(1-q)} &\approx p-q\\ p &\approx q + \sqrt{R^{\prime} q(1-q)}\\ \end{align*}Since this equation has two unknowns, there are many solutions of \(p\) and \(q\) for a given \(R^{\prime}\). For example, \(\tilde{R} = 3.25\) has an infinite number of possible solutions including: \(\{p=.9, q=.25\},\{p=.8, q=.2\}, \{p=.6,q=.15\}\). The following figure illustrates this,

However, as the figure suggests, an upper bound on \(q\) can be found by setting \(p <= 1\).

\begin{align*} q + \sqrt{R^{\prime} q(1-q)} &<= 1\\ \sqrt{R^{\prime} q(1-q)} &<= (1-q)\\ R^{\prime} q(1-q) &<= (1-q)^2\\ R^{\prime} q &<= (1-q)\\ R^{\prime} &<= \frac{(1-q)}{q}\\ R^{\prime} &<= \frac{1}{q} - 1\\ R^{\prime} + 1 &<= \frac{1}{q}\\ \end{align*}Recalling the definition of \(R^{\prime}\),

\begin{align*} R - 1 + 1 &<= \frac{1}{q}\\ R &<= \frac{1}{q}\\ q &<= \frac{1}{R}\\ \end{align*}This means that the upper bound on the percentage of skewed values is the reciprocal of the estimated \(R\). Cool!

Unfortunately, with both \(p\) and \(q\) unknown and only one equation, this was as far as I could take this. However, the upper bound and family of curves for given values of \(p\) were enough for my friend to do what he wanted.

This was a fun project. I got to see first hand that although probabilistic online algorithms sacrifice accuracy for efficiency, they can still perform well enough to get the job done!

\( \DeclareMathOperator*{\argmax}{\mathbf{arg\,max}} \)

Where did November go? Or the first half of December for that matter?

In Part 1 , I began describing a project to estimate the skew of many integer streams in real-time. I explained how the second frequency moment, also called the surprise number, can be used for this purpose and demonstrated using the AMS to estimate it with minimal compute and memory resources. However, the post concluded with the observation that it is difficult to compare second frequency moments between streams. In this post, I'll explain how I solved that problem.

Coming up with an approach wasn't that hard once I recalled two facts about surprise numbers,

- They are at minimum when the distribution is uniform.
- Their value monotonically increases with the distribution's skew.

I reasoned that dividing the estimated surprise number by the theoretical minimum should yield a scaled skew measure comparable between streams,

\begin{equation*} R = \frac{\hat{S}}{S_{min}} \end{equation*}Where \(\hat{S}\) and \(S_{min}\) are the estimated and minimum surprise numbers respectively.

The minimum surprise number is easily calculated,

\begin{align*} S_{min} &= \sum_{i=1}^{n} \left( \frac{O}{n} \right)^2\\ &= n \, \left( \frac{O}{n} \right)^2\\ &= \frac{O^2}{n} \end{align*}where \(O\) and \(n\) are the number of observations and unique values respectively. Now, \(O\) was easy as it just required counting the number of values observed. Determining \(n\) was another story.

In Part 1, we noted that explicitly tracking unique values wasn't an option due to the many possible values and large number of streams. The AMS algorithm allowed estimating surprise numbers without having to do that. I needed a similar way to estimate \(n\). Enter the HyperLogLog algorithm.

HyperLogLog is a probabilistic algorithm to estimate cardinality, the number of distinct elements in a set or stream. It was introduced in a paper by by Flojet et al. in 2007. In 2013, a paper by some fine folks at Google described an improved version, called HyperLogLog++, that yielded better accuracy using less resources. Other cardinality estimation algorithms exist, but HyperLogLog appears to be a popular choice.

HyperLogLog and its variants rely on the magic of hashing and stochastic averaging. Each stream value is hashed and the result inspected to see how many leading zeroes it has. For example, if the stream value 25 is hashed to the 32 bit value 0x0000ffff, there are 16 leading zeroes. The algorithm keeps track of the maximum number of leading zeroes over all of the hashed stream values. Now, this is the tricky bit. Assuming that the hash function is uniform, the probability of a hash result having \(r\) leading zeroes is \(2^{-r}\). With a bit of tricky math, it can be shown that the estimated cardinality is \(2^{\argmax \,r}\). Intuitively this makes sense as observing an event with probability \(2^{-r}\) on average takes \(2^r\) trials. So, tracking the rarest value provides an estimate of the number of hash values and, assuming no collisions, the stream's cardinality. That's the hashing part of the magic. The stochastic averaging part involves partitioning the hash values into buckets, tracking the maximum number of leading zeroes in each, and averaging the results.

This may sound complicated but its actually easy to implement. Here's a quick version in Julia.

type hll p::Int64 m::Int64 alpha::FloatingPoint buckets::Array{Int8,1} end function hll(p::Int) local m = 2^p; local alpha; if (m <= 16) alpha = 0.673 elseif (m <= 32) alpha = 0.697 elseif (m <= 64) alpha = 0.709 else alpha = 0.7213/(1 + 1.079/m); end hll(p, m, alpha, fill(-1,m)); end function bucketIdx(v::Uint64, p::Integer) local lv = convert(Uint64,v); local mask = convert(Uint64,2^p)-1; (lv & mask) + 1 end function clz(v::Uint64,p::Int) local lv = convert(Uint64,v); local mask = convert(Uint64,2^63); local limit = convert(Uint64,2^(p-1)); local c = 0; while (((lv & mask) == 0) && (mask > limit)) mask = mask >> 1 c +=1 end c end function update(h::hll, v) local hv = hash(v) local bidx = bucketIdx(hv, h.p) local nlz = clz(hv, h.p) h.buckets[bidx] = max(h.buckets[bidx], nlz+1) end function estimate(h::hll) local bsum = 0; local nempty = 0; for b in h.buckets if (b >= 0) bsum += 1.0/(2^b) else nempty += 1 end end # NB - omitting the emperical bias correction # described in the Google paper for # simplicity eraw = h.alpha * (h.m^2)/bsum if (nempty > 0) est = h.m * log(h.m / nempty) else est = eraw end convert(Int, round(est)) end

With a bit more code, we can test how the algorithm's accuracy changes
with the parameters `N`

, number of unique values, and `P`

, which uses
\(2^P\) buckets for the averaging.

using DataFrames function test(ps,ns,r) local df = DataFrame(N=Int64[], P=Int64[], MAPE=Float32[]) for n in ns for p in ps perrors = Float64[] for i = 1:r local h = hll(p) for v in rand(Uint, n) update(h,v) end est = estimate(h) err = abs(n-est)/n push!(perrors, err) end mape = mean(perrors) push!(df, [n p mape]) end end df end df = test([4 8 10], [10000 100000 1000000],3)

The results clearly show that accuracy improves with \(P\), the number of buckets. For a given \(P\), there's a bit of odd behavior as \(N\) increases but appears to be less of an issue at higher values of \(P\).

N | P=4 | P=8 | P=10 |
---|---|---|---|

10000 | 0.14 | 0.03 | 0.03 |

100000 | 0.26 | 0.06 | 0.02 |

1000000 | 0.29 | 0.03 | 0.01 |

Using HyperLogLog, I could now calculate a scaled skew metric that could be compared between streams. I was ready to declare "success" but my friend wanted to know more, in particular a Pareto of occurrences versus the number of unique values - for example 80% of the occurrences were by 20% of the values. I'll explain how I approached that in the next post.

A while ago, a friend of mine asked for help analyzing a continuous integer stream of values spread across a large range. He wanted a way to determine if a subset of values appeared more frequently than the rest. He didn't necessary want to know what those values were, just the degree to which the value occurrence frequencies were similar. Now, my friend didn't have just one stream to analyze, there were hundreds, perhaps thousands. So, the solution had to use minimal compute and memory resources. He also wanted the skew estimated in real-time as the streams were received.

Lucky for him, I had just completed Stanford's Mining Massive Datasets class (CS246) that covered, among other things, using probabilistic online algorithms to estimate a data stream's frequency moments.

Determining a population's diversity is a well studied problem that has applications in many fields including ecology, and demography. Many quantitative measures exist for representing diversity including one called the second frequency moment which is related to sample variance. For a sequence of integers \(A = (a_1, a_2, \ldots, a_m)\) where \(a_i \in (1,2,\ldots,n)\) and \(m_i\) is the number of times the value \(i\) appears in the stream, the second freqency moment is defined as,

\begin{equation*} F_2 = \sum_{i=1}^{n} m_{i}^{2} \end{equation*}To see how this works, consider a stream of 20 integers with 4 unique values. When each value appears an equal number of times, \(F_2\) is,

\begin{equation*} F_2 = \sum_{i}^{4} 5^{2} = 4 * 25 = 100 \end{equation*}In the maximum skew case with appearance counts of 15, 2, 1, and 1, \(F_2\) will be,

\begin{equation*} F_2 = 15^2 + 2^2 + 2*1^2 = 225 + 4 + 2 = 231 \end{equation*}As this simple example illustrates, the second frequency moment is minimized when the distribution is even and maximized when the distribution is the most skewed. In essence, it reflects how "surprising" a stream's values are. For this reason, the second frequency moment is sometimes called the "surprise number".

The surprise number is easy to calculate but requires keeping a count for each distinct value in the stream. This can lead to excessive memory consumption for streams with many, many unique values - exactly like my friend's situation.

In a 1996 paper, Alon, Matias, and Szegedy (AMS) presented a space efficient, on-line algorithm for estimating frequency moments. The algorithm is deceptively simple. For now, assume that the integer sequence \(A\) from earlier has a known length. A subset of \(P\) stream positions are chosen a-priori with uniform probability. At each of the selected positions in stream, the integer value is recorded along with a counter initialized to 1. From that point onward, the counter is incremented each time the recorded value appears in the stream. At the end of the stream, there are \(P\) counters that can be used to estimate the surprise number by calculating,

\begin{equation*} \hat{S} = \frac{|A|}{P} \sum_{i}^{P} (2 p_i - 1) \end{equation*}For example, consider the integer stream ( 6, 6, 1, 5, 4, 4, 10, 3, 3, 2, 10, 10, 9, 2, 10) and selected positions 1, 5, 13, and 14. At the end of the stream, the four AMS counters are

Position | Value | Count |
---|---|---|

1 | 6 | 2 |

5 | 4 | 2 |

13 | 9 | 1 |

14 | 2 | 1 |

The actual second frequency moment is,

\begin{align*} S &= (1)^2 + (2)^2 + (2)^2 + (2)^2 + (1)^2 + (2)^2 + (1)^2 + (4)^2\\ &= 1 + 4 + 4 + 4 + 1 + 4 + 1 + 16\\ &= 35\\ \end{align*}and the AMS algorithm's estimate is,

\begin{align*} \hat{S} &= \frac{15}{4} \left( (2*2-1) + (2*2-1) + (2*1-1) + (2*1-1) \right)\\ &= \frac{15}{4} \left(3 + 3 + 1 +1 \right)\\ &= \frac{15 * 8}{4}\\ &= 30\\ \end{align*}For this simple example the AMS estimate has a relative error of 16% but, in general, the algorithm achieves better accuracy with more counters. To illustrate this, the following chart plots the AMS algorithm's accuracy for simulated streams of 100000 integers with 1000 unique values (the R code for the simulation is included at the end of this post),

The dots and vertical lines respectively represent the mean average percent error (MAPE) and 95% confidence interval across ten trials for each case. This simple simulation shows how the relative error decreases as the number of AMS counters increases and generally converges to within 5%. It also shows the algorithm's sensitivity to the number of unique values in the stream.

The observant reader may notice that it's possible for multiple AMS counters to track the same value. This is OK, each counter is basically an independent estimate of the second frequency moment so these redundant counters can simply be averaged together.

This is pretty cool. But what about infinite streams? AMS relies on the uniform selection of stream positions to sample. How can that be done if the stream is continuous? The intuitive solution is to periodically replace an existing counter with a new one later in the stream. The trick, is deciding when to do the replacements to maintain the uniform selection probability.

This turns out to be pretty simple. To start, create AMS counters for the first \(P\) elements. For each subsequent stream element, replace an existing counter with probability \(P/(i+1)\), where \(i\) is the stream's present length. The actual counter replaced is chosen with equal probability. Mathematically, this all works out such that the counter sample positions have equal probability at each point in the stream.

The AMS algorithm can be used to estimated higher order moments as well. For a thorough discussion of the algorithm and associated mathematical proofs, see Section 4.5 of the book Mining of Massive Datasets.

Getting back to my friend's problem, the AMS algorithm provided a way to estimate a stream's skew. However, comparing the skew of different streams is difficult as the second frequency moment's absolute value can vary greatly based on the stream length, etc. Therefore, I needed a more reliable means for comparing streams which will be the subject of the next post.

Here is the AMS simulation R code.

library(Hmisc) library(ggplot2) library(scales) library(ggthemes) set.seed(1234567) stream.max <- 100000 num.unique <- c(1000) stream.lengths <- c(100000) ams.counters <- c(10, 100, 200) test.cases <- expand.grid(l=stream.lengths, c=ams.counters, n=num.unique) test.cases <- test.cases[rep(1:nrow(test.cases), 10),] ams.test <- function(l,c,n) { values <- sample.int(stream.max, n) stream <- sample(values, l, replace=TRUE) indxs <- sort(sample(1:l, c)) counts <- sapply(indxs, function(i) { sum(stream[i:length(stream)] == stream[i]) }) actual <- sum(table(stream)^2) estimated <- l/c * sum(2*counts -1) data.frame(l=l,c=c, n=n, actual=actual, AMS=estimated) } results <- Reduce(rbind, apply(test.cases,1, function(t) { ams.test(t["l"],t["c"], t["n"])})) results$l <- as.factor(format(results$l, scientific=FALSE)) results$c <- as.factor(results$c) results$n <- factor(results$n, levels=num.unique, labels=paste(num.unique, "Unique Values")) results$rerror <- abs(results$actual - results$AMS) / results$actual ggplot(results, aes(c, rerror,group=n)) + stat_summary(fun.data="mean_cl_boot", color="gray") + stat_summary(fun.y=mean, geom="point", size=3, color=ggthemes_data$few$medium["blue"]) + labs(title="MAPE vs Number of AMS Counters", x="",y="") + scale_y_continuous(labels = percent) + theme_few()