May 14, 2013

A Simple R Framework for Testing Classifiers

While solving a problem, I like to use an iterative style: try a little something, examine the results, try a little something else, and so on. When I work this way, I find it easier to enter a flow state and stay engaged with the problem for a long time. I’ve always programmed in this manner. Now I find myself solving data science problems in the same way.

Achieving this state requires the right tools. Thankfully, R’s dynamic, interactive nature fits well with this style. However, to stay in the flow state, I’ve learned that I need to automate any repetitive tasks. This lets me focus on the insights from each iteration instead of the steps needed to produce it.

To this end, I created a simple R framework, PredTest, that lets me quickly test classification models. I initially wrote it while working on Kaggle’s Titanic challenge. Since then, I’ve re-used it on a couple of other projects and it’s been handy. So, I uploaded it to GitHub for others to use as well.

Here’s an excerpt from one of the examples included in the GitHub repository.

    magic <- read.table("magic04.data", header = FALSE, sep = ",")
    
    # Create a model function 
    magic.logit <- function(model,
                         train.df, test.df) {
    
      dep.col     <- model$dep.col
      indep.cols  <- model$indep.cols
      good.levels <- model$good.levels
      bad.levels  <- model$bad.levels
    
      train.df$Good.Class <- factor(ifelse(train.df[,dep.col] %in% good.levels, 
                                    "Yes", "No"), levels=c("No","Yes"))
      test.df$Good.Class  <- factor(ifelse(test.df[,dep.col] %in% good.levels, 
                                    "Yes", "No"), levels=c("No","Yes"))
    
      fit.formula <- as.formula(paste0("Good.Class ~ ", 
                                       paste( "poly(",
                                             indep.cols,
                                             ", degree=2)",
                                             collapse="+")))
    
      fit <- glm(fit.formula, data=train.df, family=binomial(link="logit"))
    
      predicted <- ifelse(predict(fit, test.df, type='response') >= 0.5, 1, 0)
      actual    <- ifelse(test.df[,dep.col] %in% good.levels, 1, 0)
      perf.df   <- pt.performance(actual,
                                  predicted)
    
      perf.df   
    }
    
    # Define the data frame columns holding the dependent
    # and independent features. Also define the "good" and
    # "bad" levels for the dependent feature.
    dep.col     <- colnames(magic)[ncol(magic)]
    indep.cols  <- colnames(magic)[-ncol(magic)]
    good.levels <- c("h")
    bad.levels  <- c("g")
    
    # Specify models and parameters to run and evaluate
    models <- list(list(name        = "Logit",
                        model.fn    = "magic.logit",
                        run         = TRUE,
                        avg.results = FALSE,
                        dep.col     = dep.col,
                        indep.cols  = indep.cols, 
                        good.levels = good.levels,
                        bad.levels  = bad.levels, 
                        balanced    = TRUE,
                        kfolds      = 10))
    
    # Run and Test models
    results <- pt.test.models(magic, models)

To use the framework, you implement model functions that conform to an API and provide a list of lists that specifies the models to run and their parameters. The framework takes care of subsetting the data, doing cross-validation, balancing the data set, and aggregating or averaging the results. To test different features, add new entries to the model list; the same model function can be run multiple times with different parameters.
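To make that division of labor concrete, here is a minimal sketch of the cross-validation part only. It is not PredTest's actual code: `kfold.sketch`, `majority.model`, and the toy model list are hypothetical names I made up for illustration, and the sketch skips balancing and averaging entirely. It assumes a model function takes `(model, train.df, test.df)` and returns a one-row data frame, as in the example above.

```r
# Hypothetical helper, not part of PredTest: run one model entry over k folds.
kfold.sketch <- function(df, model, k = model$kfolds, seed = 1) {
  set.seed(seed)
  # Randomly assign each row to one of k roughly equal folds
  fold.id  <- sample(rep(seq_len(k), length.out = nrow(df)))
  # Accept either a function or the name of one, as in the model list
  model.fn <- match.fun(model$model.fn)

  per.fold <- lapply(seq_len(k), function(i) {
    train.df <- df[fold.id != i, , drop = FALSE]
    test.df  <- df[fold.id == i, , drop = FALSE]
    # Each call returns a one-row data frame of metrics for fold i
    cbind(Model = model$name, model.fn(model, train.df, test.df))
  })
  do.call(rbind, per.fold)
}

# Toy model function (an assumption for this sketch): always predict the
# most common label seen in the training fold.
majority.model <- function(model, train.df, test.df) {
  dep.col <- model$dep.col
  guess   <- names(which.max(table(train.df[, dep.col])))
  data.frame(n.test   = nrow(test.df),
             accuracy = mean(as.character(test.df[, dep.col]) == guess))
}

# Try it on R's built-in mtcars data, treating "am" as the label
toy.model   <- list(name = "Majority", model.fn = "majority.model",
                    dep.col = "am", kfolds = 4)
toy.results <- kfold.sketch(mtcars, toy.model)
```

Each row of `toy.results` corresponds to one fold, which mirrors the one-row-per-fold shape you get when `avg.results` is `FALSE`.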

The result is a data frame with various evaluation metrics, one row per cross-validation fold:

    > pt.test.models(magic, models)
       Model  tp  fp   tn  fn       tpr        fpr   errrate    recall precision
    1  Logit 520 101 1118 163 0.7613470 0.08285480 0.1388013 0.7613470 0.8373591
    2  Logit 461 107 1130 204 0.6932331 0.08649960 0.1635121 0.6932331 0.8116197
    3  Logit 437  97 1182 186 0.7014446 0.07584050 0.1487907 0.7014446 0.8183521
    4  Logit 458  94 1152 198 0.6981707 0.07544141 0.1535226 0.6981707 0.8297101
    5  Logit 498  72 1110 222 0.6916667 0.06091371 0.1545741 0.6916667 0.8736842
    6  Logit 484 110 1104 204 0.7034884 0.09060956 0.1650894 0.7034884 0.8148148
    7  Logit 477  98 1150 177 0.7293578 0.07852564 0.1445846 0.7293578 0.8295652
    8  Logit 487 102 1143 170 0.7412481 0.08192771 0.1430074 0.7412481 0.8268251
    9  Logit 451 105 1143 203 0.6896024 0.08413462 0.1619348 0.6896024 0.8111511
    10 Logit 477 105 1109 211 0.6933140 0.08649094 0.1661409 0.6933140 0.8195876
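
If you want to see how the rate columns relate to the raw counts, here is a small sketch using the standard confusion-matrix definitions. `confusion.metrics` is a hypothetical helper written for this post, not PredTest's `pt.performance`, though the numbers it produces for the first fold agree with the first row of the table to the displayed precision.

```r
# Hypothetical helper, not PredTest's pt.performance: derive the rate
# columns from confusion-matrix counts using the standard definitions.
confusion.metrics <- function(tp, fp, tn, fn) {
  data.frame(
    tpr       = tp / (tp + fn),                   # true positive rate
    fpr       = fp / (fp + tn),                   # false positive rate
    errrate   = (fp + fn) / (tp + fp + tn + fn),  # misclassified share
    recall    = tp / (tp + fn),                   # same quantity as tpr
    precision = tp / (tp + fp)
  )
}

# Counts taken from the first row of the table above
fold1 <- confusion.metrics(tp = 520, fp = 101, tn = 1118, fn = 163)
print(fold1)
```

Note that `recall` and `tpr` are the same quantity under two names, which is why those two columns are identical in every row.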

There are more complete and robust R packages that do similar things. But sometimes a simple tool is good enough.

Tags: R, Classification