While solving a problem, I like to use an iterative style - try a little something, examine the results, try a little something else, etc. When I work this way, I find it easier to enter into a flow state and remain engaged with the problem for a long period of time. I’ve always programmed in this manner. Now I find myself solving data science problems in the same way.
Achieving this state requires the right tools. Thankfully, R’s dynamic, interactive nature fits well with this style. However, to stay in the flow state, I’ve learned that I need to automate any repetitive tasks. This lets me focus on the insights from each iteration instead of the steps needed to carry it out.
To this end, I created a simple R framework, PredTest, that lets me quickly test classification models. I initially wrote it while working on Kaggle’s Titanic challenge. Since then, I’ve re-used it on a couple of other projects and it’s been handy. So, I uploaded it to GitHub for others to use as well.
Here’s an excerpt from one of the examples included in the GitHub repository.
# Load the MAGIC gamma telescope data; the last column holds the class label
magic <- read.table("magic04.data", header = FALSE, sep = ",")

# Create a model function: fits a logistic regression with quadratic terms
# and returns its performance on the test fold
magic.logit <- function(model, train.df, test.df) {
  dep.col     <- model$dep.col
  indep.cols  <- model$indep.cols
  good.levels <- model$good.levels
  bad.levels  <- model$bad.levels

  # Recode the dependent feature as a binary "Yes"/"No" factor
  train.df$Good.Class <- factor(ifelse(train.df[, dep.col] %in% good.levels,
                                       "Yes", "No"), levels = c("No", "Yes"))
  test.df$Good.Class  <- factor(ifelse(test.df[, dep.col] %in% good.levels,
                                       "Yes", "No"), levels = c("No", "Yes"))

  # Build the formula: a degree-2 polynomial term for each independent feature
  fit.formula <- as.formula(paste0("Good.Class ~ ",
                                   paste("poly(", indep.cols, ", degree=2)",
                                         collapse = "+")))

  # Fit the model, predict on the test fold, and score the predictions
  fit       <- glm(fit.formula, data = train.df, family = binomial(link = "logit"))
  predicted <- ifelse(predict(fit, test.df, type = "response") >= 0.5, 1, 0)
  actual    <- ifelse(test.df[, dep.col] %in% good.levels, 1, 0)

  pt.performance(actual, predicted)
}
# Define the data frame columns holding the dependent
# and independent features. Also define the "good" and
# "bad" levels for the depedent feature.
dep.col <- colnames(magic)[ncol(magic)]
indep.cols <- colnames(magic)[-ncol(magic)]
good.levels <- c("h")
bad.levels <- c("g")
# Specify models and parameters to run and evaluate
models <- list(list(name        = "Logit",
                    model.fn    = "magic.logit",
                    run         = TRUE,
                    avg.results = FALSE,
                    dep.col     = dep.col,
                    indep.cols  = indep.cols,
                    good.levels = good.levels,
                    bad.levels  = bad.levels,
                    balanced    = TRUE,
                    kfolds      = 10))
# Run and Test models
results <- pt.test.models(magic, models)
To use the framework, you implement model functions that conform to a simple API and provide a list of lists specifying the models to run and their parameters. The framework takes care of subsetting the data, doing cross-validation, balancing the data set, and aggregating or averaging the results. The same model function can be run multiple times with different parameters to test different features, simply by adding new entries to the model list, and other learners can be plugged in the same way, as sketched below.
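The API is just the calling convention visible in magic.logit above: a model function receives its parameter list plus a training and a test data frame, and returns the performance data frame produced by pt.performance. As a rough sketch (not part of the repository, and the magic.tree name is my own), a second model function wrapping a decision tree from the rpart package might look like this:

library(rpart)

# Hypothetical second model function following the same API as magic.logit
magic.tree <- function(model, train.df, test.df) {
  dep.col     <- model$dep.col
  indep.cols  <- model$indep.cols
  good.levels <- model$good.levels

  # Same binary target construction as in magic.logit
  train.df$Good.Class <- factor(ifelse(train.df[, dep.col] %in% good.levels,
                                       "Yes", "No"), levels = c("No", "Yes"))

  # Plain additive formula; the tree handles non-linearity on its own
  fit.formula <- as.formula(paste0("Good.Class ~ ",
                                   paste(indep.cols, collapse = " + ")))
  fit <- rpart(fit.formula, data = train.df, method = "class")

  predicted <- ifelse(predict(fit, test.df, type = "class") == "Yes", 1, 0)
  actual    <- ifelse(test.df[, dep.col] %in% good.levels, 1, 0)

  # pt.performance comes from the framework, as in magic.logit
  pt.performance(actual, predicted)
}

It would then be registered by appending another list(name = "Tree", model.fn = "magic.tree", ...) entry to the models list.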
The result is a data frame with one row per cross-validation fold and a set of evaluation metrics (true/false positive and negative counts, rates, error rate, recall, and precision):
> pt.test.models(magic, models)
Model tp fp tn fn tpr fpr errrate recall precision
1 Logit 520 101 1118 163 0.7613470 0.08285480 0.1388013 0.7613470 0.8373591
2 Logit 461 107 1130 204 0.6932331 0.08649960 0.1635121 0.6932331 0.8116197
3 Logit 437 97 1182 186 0.7014446 0.07584050 0.1487907 0.7014446 0.8183521
4 Logit 458 94 1152 198 0.6981707 0.07544141 0.1535226 0.6981707 0.8297101
5 Logit 498 72 1110 222 0.6916667 0.06091371 0.1545741 0.6916667 0.8736842
6 Logit 484 110 1104 204 0.7034884 0.09060956 0.1650894 0.7034884 0.8148148
7 Logit 477 98 1150 177 0.7293578 0.07852564 0.1445846 0.7293578 0.8295652
8 Logit 487 102 1143 170 0.7412481 0.08192771 0.1430074 0.7412481 0.8268251
9 Logit 451 105 1143 203 0.6896024 0.08413462 0.1619348 0.6896024 0.8111511
10 Logit 477 105 1109 211 0.6933140 0.08649094 0.1661409 0.6933140 0.8195876
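Each row above is the performance on one of the ten cross-validation folds. The framework can average these for you (presumably what the avg.results flag enables), but it is also easy to summarize the returned data frame directly in base R, for example:

# Mean of each numeric metric across the ten folds
aggregate(results[, -1], by = list(Model = results$Model), FUN = mean)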
There are more complete and robust R packages that do similar things, but sometimes a simple tool is good enough.
Tags: R , Classification