August 18, 2013

Kaggle Amazon Employee Access Challenge

I finished my first Kaggle competition a couple of weeks ago, the Amazon Employee Access challenge. I started way too late, in the final two weeks, but had a lot of fun trying. I spent every spare moment examining the data, trying different models, and refreshing my knowledge of different topics. The experience alone made the time well spent.

The problem proved to be harder than expected. The goal was to predict if an employee would be given access to a “resource” based on a training set of prior approvals and rejections. Each training set example consisted of,

  • Eight nominal categorical attributes describing an employee’s role.
  • The requested resource, also nominal categorical.
  • The binary approval/rejection outcome of the request.

Before I dug into the data, I had two hypotheses,

  • approvals/rejections were based on relationships between roles and resources.
  • the relationships varied for each resource or “class” of resources.

Therefore, I expected the training set to have one or more of the following attributes,

  • a high ratio of role examples per resource.
  • a high ratio of resource examples per role.
  • a good balance of approval and rejection outcomes.

I reasoned that such a dataset would provide enough information to infer the underlying role-resource relationships. But examining the data revealed a different story.

    > train <- read.table("data/train.csv", 
    +                     header=TRUE,
    +                     colClasses=c("integer"), 
    +                     sep=",")
    > dim(train)
    [1] 32769    10
    > train[1,]
      ACTION RESOURCE MGR_ID ROLE_ROLLUP_1 ROLE_ROLLUP_2 ROLE_DEPTNAME ROLE_TITLE
    1      1    39353  85475        117961        118300        123472     117905
      ROLE_FAMILY_DESC ROLE_FAMILY ROLE_CODE
    1           117906      290919    117908

As expected, the training set contained 32769 observations, each with the ACTION outcome, the requested RESOURCE, and eight employee role attributes.

To get a feel for the data, I assigned a PID identifier to each unique combination of role attributes,

    > library(plyr)
    > feature.colnames <- (colnames(train))[-c(1:2)]
    > id.num <- 0
    > train.pids <- ddply(train, feature.colnames,
    +                   function(r) {
    +                     id.num <<- id.num + 1
    +                     cbind(PID=id.num, r)
    +                   })

Then I examined the number of unique resources,

    > length(unique(train.pids$RESOURCE))
    [1] 7518
    > nrow(train)/7518
    [1] 4.358739

On average there were only about 4 observations per resource, much lower than expected. The distribution of observations per resource was even more interesting,

    > stem(ddply(train.pids, .(RESOURCE), nrow)$V1, width=40)
    
      The decimal point is 2 digit(s) to the right of the |
    
      0 | 0000000000000000000000000000+7397
      0 | 5555555555555555555666666667+11
      1 | 00112333444
      1 | 666799
      2 | 0244
      2 | 5
      3 | 000
      3 | 
      4 | 011
      4 | 8
      5 | 
      5 | 
      6 | 
      6 | 
      7 | 
      7 | 
      8 | 4
    
    > sum(ddply(train.pids, .(RESOURCE), nrow)$V1 ==1)
    [1] 3766
    > 3766/7518
    [1] 0.5009311

Half of the resources had only a single observation! Only a relatively small number of resources had more than 10 observations.

The role attribute combinations (profiles) were similarly surprising,

    > length(unique(train.pids$PID))
    [1] 9561
    > stem(ddply(train.pids, .(PID), nrow)$V1, width=40)
    
      The decimal point is at the |
    
       0 | 0000000000000000000000000000+3296
       2 | 0000000000000000000000000000+3086
       4 | 0000000000000000000000000000+1279
       6 | 0000000000000000000000000000+742
       8 | 0000000000000000000000000000+408
      10 | 0000000000000000000000000000+186
      12 | 0000000000000000000000000000+103
      14 | 0000000000000000000000000000+20
      16 | 0000000000000000000000000000+5
      18 | 0000000000000000000000000000
      20 | 00000000000000000000
      22 | 000000
      24 | 0000000
      26 | 0000000
      28 | 0
      30 | 00
      32 | 0
      34 | 
      36 | 0
    
    > sum(ddply(train.pids, .(PID), nrow)$V1==1)
    [1] 3336
    > 3336/9561
    [1] 0.3489175
    >
    > all(ddply(train.pids, .(RESOURCE), 
    +     function(x) {max(table(x$PID))})$V1==1)
    [1] TRUE

A third of the role profiles had only a single observation. Also, each role profile appeared at most once per resource.

Finally, looking at the outcomes,

    > table(train.pids$ACTION)
    
        0     1 
     1897 30872 
    > 
    > sum(ddply(train.pids, .(RESOURCE), 
    +     function(x) {sum(x$ACTION)/nrow(x)})$V1==1)
    [1] 6384
    > 6384/7518
    [1] 0.849162
    > sum(ddply(train.pids, .(RESOURCE), 
    +      function(x) {sum(x$ACTION)/nrow(x)})$V1==0)
    [1] 292
    >
    > sum(ddply(train.pids, .(PID), 
    +     function(x) {sum(x$ACTION)/nrow(x)})$V1==1)
    [1] 8602
    > 8602/9561
    [1] 0.8996967
    > sum(ddply(train.pids, .(PID), 
    +     function(x) {sum(x$ACTION)/nrow(x)})$V1==0)
    [1] 263
    >

I found that 94% of the outcomes were approvals, 85% of the resources had only approval outcomes, and 90% of the role profiles had only approval outcomes.

At this point, I concluded that,

  • The training set didn’t meet my initial expectations of many role examples per resource or many resource examples per role.
  • There was likely a lower dimensional role attribute feature space that better represented the role-resource relationship.
  • There was likely a lower dimensional resource feature space that allowed examples from different infrequent resources to predict each other.
  • The nominal resource and role attribute encoding suggested a categorical similarity model of some kind.
  • The model had to account for the very few rejection training examples.

Based on this, I had three initial ideas for models,

  1. A k-means model based on Jaccard Similarity.
  2. A K-partite graph model using a distance measure.
  3. An association rule (market basket) model.

I spent a lot of time on the first idea. In particular, I wrote a gradient descent model that tried to identify the lower-dimensional role attribute feature space that best predicted the outcome. It wasn’t easy creating a gradient descent model for Jaccard Similarity but I think I came up with something reasonable. This model worked OK but not well enough. One challenge was the limited number of rejections. Test cases for resources with only a single positive example would always be closest to the positive outcome. I tried to overcome that challenge by imputing rejection examples using role profiles and resources that didn’t appear with each other in the training set. But I eventually decided the approach was suspect since the absence of a role-resource training example didn’t indicate that such a request would result in a rejection.
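
To make the distance measure behind that first idea concrete, here is a minimal sketch of Jaccard Similarity over role profiles, treating each attribute=value pair as a set element. The jaccard function name is mine, and this is just the plain, unweighted similarity, not the weighted gradient descent model described above.

    # Jaccard Similarity between two role profiles, where each attribute=value
    # pair is treated as a set element. This is only the plain similarity, not
    # the weighted gradient descent variant described in the text.
    jaccard <- function(a, b) {
      a.set <- paste(names(a), a, sep = "=")
      b.set <- paste(names(b), b, sep = "=")
      length(intersect(a.set, b.set)) / length(union(a.set, b.set))
    }

    # Compare the first two role profiles in the training set.
    p1 <- sapply(train.pids[1, feature.colnames], as.character)
    p2 <- sapply(train.pids[2, feature.colnames], as.character)
    jaccard(p1, p2)   # 1 for identical profiles, 0 for completely disjoint ones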

I briefly investigated the K-partite graph approach but wasn’t confident that I could distinguish the strong relationships from the weak ones. For example, a common role attribute value would likely have a high centrality measure in the graph but might actually convey very little information about any particular role-resource relationship. In the end, I wasn’t sure if the graph approach would yield materially different results than the k-means model since both were based on a distance metric.
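
For what it’s worth, here is a rough sketch, assuming the igraph package, of the kind of bipartite role-value/resource graph I had in mind; it also shows the centrality concern, since the highest-degree nodes tend to be the most common attribute values.

    # A rough sketch (assuming the igraph package) of a bipartite graph linking
    # role attribute values to the resources they were approved for.
    library(igraph)

    approved <- train[train$ACTION == 1, ]
    edges <- do.call(rbind, lapply(feature.colnames, function(col) {
      data.frame(from = paste(col, approved[[col]], sep = "="),
                 to   = paste("RESOURCE", approved$RESOURCE, sep = "="))
    }))

    g <- graph_from_data_frame(edges, directed = FALSE)
    # The most "central" nodes are often just common attribute values that say
    # little about any particular role-resource relationship.
    head(sort(degree(g), decreasing = TRUE))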

At this point, there were only two days left in the competition so I switched to the association rule approach. The model was fairly straightforward. For each test case, I

  • Found the best matching approval and rejection training examples.
  • Identified the matching columns.
  • Removed any columns that matched both an approval and rejection training example.
  • Calculated the Confidence of the remaining matching columns within the approval and rejection training examples.
  • Used a sigmoid transformation of the difference between the approval and rejection confidences as the final approval probability.

Since Confidence has a maximum value of 1 for perfect implications, the spread between the approval and rejection Confidences ranged between -1 and 1. The sigmoid function transformed this into a value between 0 and 1 that was used as a proxy for the predicted ACTION approval probability.
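
To make the scoring step concrete, here is a simplified sketch of the Confidence calculation and the sigmoid mapping. The function names and the scale parameter are illustrative choices of mine, not the original code, which matched columns per test case as described above.

    # conf(LHS -> outcome) = #(rows matching the LHS with the outcome) / #(rows matching the LHS)
    rule.confidence <- function(lhs.match, action, outcome) {
      sum(lhs.match & action == outcome) / sum(lhs.match)
    }

    # Map the approval/rejection Confidence spread in [-1, 1] to (0, 1).
    score <- function(conf.approve, conf.reject, scale = 5) {
      spread <- conf.approve - conf.reject      # in [-1, 1]
      1 / (1 + exp(-scale * spread))            # sigmoid, in (0, 1)
    }

    # Toy usage: four training rows, three of which match a rule's LHS.
    action <- c(1, 1, 0, 1)
    match  <- c(TRUE, TRUE, TRUE, FALSE)
    rule.confidence(match, action, outcome = 1)   # 2/3
    score(conf.approve = 1.0, conf.reject = 0.4)  # close to 1: predict approval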

The model achieved high precision and recall scores but failed to get a high enough area under the ROC curve (AUC) score, the evaluation criterion used by the competition. I played around with different sigmoid thresholds and scaling factors but nothing changed the AUC. At first I was a little confused by this since the different sigmoid parameters did affect the precision and recall results. Then I realized that while the sigmoid parameters changed the number of True Positives, False Positives, etc., they didn’t change the relative order in which the predictions were ranked. For example, lowering the threshold would cause some False Negatives to become True Positives but would also cause True Negatives with a higher predicted ACTION value to become False Positives. Since AUC evaluates discrimination, simply changing the sigmoid parameters had no effect because it didn’t improve the discrimination between the approval and rejection outcomes. That was disappointing but also a very valuable lesson learned.
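
A small experiment, not from my competition code, makes the lesson concrete: any strictly increasing transform of the scores leaves the rank-based AUC unchanged.

    # Rank-based AUC (the Mann-Whitney statistic); ties receive average ranks.
    auc <- function(scores, labels) {
      r <- rank(scores)
      n.pos <- sum(labels == 1)
      n.neg <- sum(labels == 0)
      (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
    }

    set.seed(1)
    labels <- rbinom(200, 1, 0.9)                    # mostly approvals, like the data
    spread <- runif(200, -1, 1)                      # confidence spreads in [-1, 1]
    p.shallow <- 1 / (1 + exp(-2  * spread))         # one sigmoid scaling
    p.steep   <- 1 / (1 + exp(-10 * spread))         # a much steeper one
    auc(p.shallow, labels) == auc(p.steep, labels)   # TRUE: same ordering, same AUC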

Here again, the lack of negative examples proved a challenge. For example, consider a role that required a lot of resources. Assume that its particular combination of role attributes appeared multiple times in the training set, once for each resource. The Confidence of any one of its role-resource relationships would be low because of the other resource examples; a profile approved for ten different resources yields a Role -> Resource Confidence of only 0.1 for each of them, even though every request was approved. In this case, the resource examples incorrectly reduce the significance of each other.

Unfortunately, I ran out of time before trying out a couple of ideas. One was to use Resource -> Role Confidences instead of Role -> Resource. Another was to test subsets of the matching columns to find the highest-confidence rule.

In the end, none of my models did better than the starter Logistic Regression example model, which used indicator variables for every role attribute and resource value (thousands of them!). My guess is that there was enough separation in the role-resource sub-spaces to allow the Logistic Regression model to fit non-overlapping sets of coefficients. It essentially found the sub-spaces I worked hard to identify through my models. That was humbling but also increased my appreciation for simple models.
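
For comparison, here is roughly what that baseline looks like. The starter used plain logistic regression; this sketch swaps in a regularized fit via the glmnet and Matrix packages to keep the thousands of sparse indicator columns manageable, so it is an approximation rather than the starter code itself.

    library(Matrix)
    library(glmnet)

    # Every ID column becomes a factor so each distinct value gets its own indicator.
    role.factors <- as.data.frame(lapply(train[, -1], factor))
    x <- sparse.model.matrix(~ . - 1, data = role.factors)
    y <- train$ACTION

    # Ridge-penalized logistic regression with a cross-validated penalty.
    fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)
    p <- predict(fit, newx = x, s = "lambda.min", type = "response")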

Overall, I really enjoyed competing in the Amazon challenge. I definitely plan to compete in other Kaggle competitions as they are a great way to get hands-on experience with challenging data sets and problems. My takeaway lessons from this effort were,

  • Start early.
  • Create a good cross-validation framework; the two-submissions-per-day limit is not enough to iteratively develop models (a minimal sketch follows this list).
  • Avoid premature optimization: start with simple models and add complexity only when necessary.
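
On that second point, even a minimal stratified split would have let me estimate AUC locally instead of burning leaderboard submissions. Something like,

    # A minimal stratified k-fold split (my own sketch, not code from the
    # competition) so AUC can be estimated locally.
    k <- 5
    set.seed(42)
    folds <- integer(nrow(train))
    for (a in unique(train$ACTION)) {   # stratify on ACTION so rejections appear in every fold
      idx <- which(train$ACTION == a)
      folds[idx] <- sample(rep(1:k, length.out = length(idx)))
    }
    # For each fold i: fit on train[folds != i, ], score train[folds == i, ],
    # compute AUC on the held-out fold, and average across folds.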

I uploaded the R code for my final model to GitHub for posterity.

Tags: Kaggle, R