I finished my first Kaggle competition a couple of weeks ago, the Amazon Employee Access challenge. I started way too late, with only the last two weeks of the competition remaining, but had a lot of fun trying. I spent every spare moment examining the data, trying different models, and refreshing my knowledge on different topics. The experience alone made the time well spent.
The problem proved to be harder than expected. The goal was to predict if an employee would be given access to a “resource” based on a training set of prior approvals and rejections. Each training set example consisted of an approval or rejection outcome (ACTION), the RESOURCE requested, and eight attributes describing the employee's role.
Before I dug into the data, I had two hypotheses,
Therefore, I expected the training set to have one or more of the following attributes,
I reasoned that such a dataset would provide enough information to infer the underlying role-resource relationships. But examining the data revealed a different story.
> train <- read.table("data/train.csv",
+ header=TRUE,
+ colClasses=c("integer"),
+ sep=",")
> dim(train)
[1] 32769 10
> train[1,]
ACTION RESOURCE MGR_ID ROLE_ROLLUP_1 ROLE_ROLLUP_2 ROLE_DEPTNAME ROLE_TITLE
1 1 39353 85475 117961 118300 123472 117905
ROLE_FAMILY_DESC ROLE_FAMILY ROLE_CODE
1 117906 290919 117908
As expected, the training set contained 32,769 observations, each with the ACTION outcome, the RESOURCE requested, and eight employee role attributes.
To get a feel for the data, I assigned a PID identifier to each unique combination of role attributes,
> library(plyr)
> feature.colnames <- (colnames(train))[-c(1:2)]
> id.num <- 0
> train.pids <- ddply(train, feature.colnames,
+ function(r) {
+ id.num <<- id.num + 1
+ cbind(PID=id.num, r)
+ })
Then I examined the number of unique resources,
> length(unique(train.pids$RESOURCE))
[1] 7518
> nrow(train)/7518
[1] 4.358739
On average there were only about 4 observations per resource, much lower than expected. The distribution of observations per resource was even more interesting,
> stem(ddply(train.pids, .(RESOURCE), nrow)$V1, width=40)
The decimal point is 2 digit(s) to the right of the |
0 | 0000000000000000000000000000+7397
0 | 5555555555555555555666666667+11
1 | 00112333444
1 | 666799
2 | 0244
2 | 5
3 | 000
3 |
4 | 011
4 | 8
5 |
5 |
6 |
6 |
7 |
7 |
8 | 4
> sum(ddply(train.pids, .(RESOURCE), nrow)$V1 ==1)
[1] 3766
> 3766/7518
[1] 0.5009311
Half of the resources had only a single observation! Only a relatively small number of resources had more than 10 observations.
The role attribute combinations (profiles) were similarly surprising,
> length(unique(train.pids$PID))
[1] 9561
> stem(ddply(train.pids, .(PID), nrow)$V1, width=40)
The decimal point is at the |
0 | 0000000000000000000000000000+3296
2 | 0000000000000000000000000000+3086
4 | 0000000000000000000000000000+1279
6 | 0000000000000000000000000000+742
8 | 0000000000000000000000000000+408
10 | 0000000000000000000000000000+186
12 | 0000000000000000000000000000+103
14 | 0000000000000000000000000000+20
16 | 0000000000000000000000000000+5
18 | 0000000000000000000000000000
20 | 00000000000000000000
22 | 000000
24 | 0000000
26 | 0000000
28 | 0
30 | 00
32 | 0
34 |
36 | 0
> sum(ddply(train.pids, .(PID), nrow)$V1==1)
[1] 3336
> 3336/9561
[1] 0.3489175
>
> all(ddply(train.pids, .(RESOURCE),
+ function(x) {max(table(x$PID))})$V1==1)
[1] TRUE
A third of the role profiles had only a single observation. Also, each role profile appeared in at most one example per resource.
Finally, I looked at the outcomes,
> table(train.pids$ACTION)
0 1
1897 30872
>
> sum(ddply(train.pids, .(RESOURCE),
+ function(x) {sum(x$ACTION)/nrow(x)})$V1==1)
[1] 6384
> 6384/7518
[1] 0.849162
> sum(ddply(train.pids, .(RESOURCE),
+ function(x) {sum(x$ACTION)/nrow(x)})$V1==0)
[1] 292
>
> sum(ddply(train.pids, .(PID),
+ function(x) {sum(x$ACTION)/nrow(x)})$V1==1)
[1] 8602
> 8602/9561
[1] 0.8996967
> sum(ddply(train.pids, .(PID),
+ function(x) {sum(x$ACTION)/nrow(x)})$V1==0)
[1] 263
>
These showed that roughly 94% of the outcomes were approvals, 85% of the resources had only approval outcomes, and 90% of the role profiles had only approval outcomes.
At this point, I concluded that the training data was much sparser than I had hoped: most resources and role profiles appeared only a handful of times, and rejections were rare.
Based on this, I had three initial ideas for models: a similarity-based model over the role attributes (k-means with a Jaccard-style distance), a K-partite graph model of the role-resource relationships, and an association rule model.
I spent a lot of time on the first idea. In particular, I wrote a gradient descent model that tried to identify the lower-dimensional role attribute feature space that best predicted the outcome. It wasn't easy creating a gradient descent model for Jaccard Similarity, but I think I came up with something reasonable. This model worked OK but not well enough. One challenge was the limited number of rejections: test cases for resources with only a single positive example would always be closest to the positive outcome. I tried to overcome that challenge by imputing rejection examples using role profiles and resources that didn't appear with each other in the training set, but I eventually decided the approach was suspect since the absence of a role-resource training example didn't indicate that such a request would result in a rejection.
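To give a flavor of the similarity part, here is a simplified sketch rather than the actual model: weighted.jaccard and score.profile are hypothetical helpers, and the fixed weights w stand in for what the gradient descent was meant to learn.

# Weighted Jaccard-style similarity between two role profiles (named vectors
# of the eight role attribute values). Treating each (attribute, value) pair
# as a set element, matching attributes contribute their weight to the
# intersection, and both profiles contribute to the union.
weighted.jaccard <- function(a, b, w = rep(1, length(a))) {
  matched <- a == b
  intersection <- sum(w[matched])
  union <- 2 * sum(w) - intersection   # |A| + |B| - |A intersect B|
  intersection / union
}

# Score a test profile against the labelled training examples for one resource:
# similarity to the closest approval minus similarity to the closest rejection.
score.profile <- function(test.profile, examples, w) {
  sims <- apply(examples[, names(test.profile)], 1,
                weighted.jaccard, b = test.profile, w = w)
  max(c(sims[examples$ACTION == 1], 0)) - max(c(sims[examples$ACTION == 0], 0))
}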
I briefly investigated the K-partite graph approach but didn’t feel certain that I could disambiguate the strong and weak relationships. For example, a common role attribute would likely have a high centrality measure in the partite network but might actually convey very little information about the role-resource relationship. In the end, I wasn’t sure if the graph approach would yield materially different results than the k-means model since both were based on a distance metric.
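For what it's worth, a minimal sketch of the centrality concern, not code from the competition: it assumes the igraph package and uses a single role attribute for brevity.

library(igraph)

# Bipartite graph between department values and the resources approved for them
approved <- train[train$ACTION == 1, ]
edges <- data.frame(from = paste0("dept:", approved$ROLE_DEPTNAME),
                    to   = paste0("res:",  approved$RESOURCE))
g <- graph_from_data_frame(edges, directed = FALSE)

# A department that appears everywhere gets a high degree centrality even
# though it says little about any particular role-resource relationship.
head(sort(degree(g), decreasing = TRUE))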
At this point, there were only two days left in the competition so I switched to the association rule approach. The model was fairly straightforward. For each test case, I computed the Confidence of the matching Role -> Resource rule separately for approvals and rejections, took the spread between the two, and passed it through a sigmoid function.
Since Confidence has a maximal value of 1 for perfect implications, the spread between the approval and rejection Confidences ranged between -1 and 1. The sigmoid function transformed this into a value between 0 and 1 that was used as a proxy for the predicted ACTION approval probability.
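In rough outline, the scoring looked something like the sketch below. This is my reconstruction, not the competition code: the helper names are mine, and it assumes the rule antecedent is the full eight-attribute role profile.

role.cols <- colnames(train)[-c(1:2)]              # the eight role attributes
profile.key <- function(d) do.call(paste, c(d[role.cols], sep = "|"))
train$PROFILE <- profile.key(train)

confidence.spread <- function(test.row, train) {
  # Training rows whose role profile matches the test case (the rule antecedent)
  matches <- train[train$PROFILE == profile.key(test.row), ]
  if (nrow(matches) == 0) return(0)
  same.resource <- matches$RESOURCE == test.row$RESOURCE
  # Confidence(Role -> Resource approved) minus Confidence(Role -> Resource rejected)
  conf.approve <- sum(same.resource & matches$ACTION == 1) / nrow(matches)
  conf.reject  <- sum(same.resource & matches$ACTION == 0) / nrow(matches)
  conf.approve - conf.reject                       # spread in [-1, 1]
}

sigmoid <- function(x, scale = 5) 1 / (1 + exp(-scale * x))
# sigmoid(confidence.spread(test[1, ], train)) gives a predicted approval probability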
The model achieved high precision and recall scores but failed to get a high enough area under the ROC curve (AUC) score, the evaluation criterion used by the competition. I played around with different sigmoid thresholds and scaling factors but nothing changed the AUC. At first I was a little confused by this, as the different sigmoid parameters did affect the precision and recall results. Then I realized that while the sigmoid parameters changed the number of True Positives, False Positives, etc., they didn't change the relative order in which those predictions were made. For example, lowering the threshold would cause some False Negatives to become True Positives but would also cause True Negatives with a higher predicted ACTION value to become False Positives. Since AUC evaluates discrimination, simply changing the sigmoid parameters had no effect, because it did nothing to improve how well the model separated the approval and rejection outcomes. That was disappointing but also a very valuable lesson learned.
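A quick way to see this, as a toy check of my own rather than anything from the competition:

# AUC via the Mann-Whitney rank formula: it depends only on the rank order of
# the scores, so any monotone transform (like a sigmoid with different scale
# and threshold parameters) leaves it unchanged.
auc <- function(scores, labels) {
  r <- rank(scores)
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

set.seed(1)
labels  <- rbinom(1000, 1, 0.94)                 # mostly approvals, like the data
spreads <- runif(1000, -1, 1) + 0.5 * labels     # toy confidence spreads
sigmoid <- function(x, scale, shift) 1 / (1 + exp(-scale * (x - shift)))

auc(spreads, labels)
auc(sigmoid(spreads, scale = 1,  shift = 0),   labels)
auc(sigmoid(spreads, scale = 10, shift = 0.3), labels)
# All three calls return exactly the same AUC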
Here again, the lack of negative examples proved a challenge. For example, consider a role that required a lot of resources, and assume that its particular combination of role attributes appeared multiple times in the training set, once for each resource. The Confidence of any one of its role-resource rules would be low because of the other resource examples. In this case, the resource examples incorrectly reduce the significance of each other.
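A toy illustration of the dilution, with made-up numbers:

# One role profile requested, and was approved for, five different resources
toy <- data.frame(PROFILE = "p1", RESOURCE = c(101, 102, 103, 104, 105), ACTION = 1)

# Confidence(p1 -> resource 101 approved) is only 1/5, even though every
# request this profile ever made was approved
sum(toy$RESOURCE == 101 & toy$ACTION == 1) / nrow(toy)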
Unfortunately, I ran out of time before trying out a couple of ideas. One was to use Resource -> Role Confidences instead of Role -> Resource. Another was to test sub-sets of matching columns to find the highest confidence rule.
In the end, none of my models did better than the starter Logistic Regression example model using indicator variables for each role attribute and resource (thousands!). My guess is that there was enough separation in the role-resource sub-spaces to allow the Logistic Regression model to fit non-overlapping sets of coefficients. It essentially found the sub-spaces I worked hard to identify through my models. That was humbling but also increased my appreciation for simple models.
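For reference, the indicator-variable idea looks roughly like the sketch below. This is my own R approximation, not the actual starter code, which may have used a different library; sparse.model.matrix and glmnet are just one convenient way to handle the very wide design matrix.

# Every distinct resource and role attribute value becomes a 0/1 indicator
# column, giving a wide but sparse design matrix; a regularized logistic
# regression (glmnet here) fits it without trouble.
library(Matrix)
library(glmnet)

train.f <- data.frame(lapply(train[, -1], factor))     # treat everything as categorical
X <- sparse.model.matrix(~ . - 1, data = train.f)      # indicator (dummy) columns
y <- train$ACTION

fit <- cv.glmnet(X, y, family = "binomial")
# predict(fit, newx = X.test, type = "response") would give approval probabilities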
Overall, I really enjoyed competing in the Amazon challenge. I definitely plan to compete in other Kaggle competitions as they are a great way to get hands-on experience with challenging data sets and problems. My takeaway lessons from this effort were,
I uploaded the R code for my final model to GitHub for posterity.