A Feature Preprocessing Workflow
How I deal with wide datasets when building a predictive model
Introduction
In this post, I will describe a preprocessing workflow that I use whenever I have a lot of variables (wide data) and need to build a predictive model quickly.
The workflow has three stages:
- Univariate feature selection using the {Information} package
- Feature engineering using the {vtreat} package
- Removal of redundant features using the {caret} package
The overall goal of the approach described here is to provide a reasonable number of highly relevent and non-redundant inputs to tree-based classification algorithms, such as random forests or gradient boosting machines.
To show how it works, let’s start by loading the necessary packages, and then get some example data.
# load packages
library(dplyr)
library(Information)
library(rsample)
library(caret)
library(tidyselect)
library(vtreat)
library(stringr)
Example data
I will use a dataset from the {Information} package to illustrate the
workflow (actually two datasets, one called train
and the other called
valid
). The data represent a marketing campaign with a treat-control
design.
If we limit the dataset to the treat group, that gives us >10,000
records and 70 variables. The response variable purchase
is 1 or 0
depending on whether the customer made a purchase or not. The predictors
are mainly credit bureau variables.
Since the dataset is clean (all numeric), I’m going to dirty it up a bit
by making unique_id
a character variable, and grouping the d_region
indicators into one character variable with 4 values.
# get example datasets
df1 <- Information::train
df2 <- Information::valid
# combine and dirty up
df <- df1 %>%
bind_rows(df2) %>%
rename_with(~str_to_lower(.)) %>%
filter(treatment == 1) %>%
select(-treatment) %>%
mutate(
unique_id = as.character(unique_id),
d_region = case_when(
d_region_a == 1 ~ "a",
d_region_b == 1 ~"b",
d_region_c == 1 ~ "c",
TRUE ~ "d"
)
) %>%
select(-c("d_region_a", "d_region_b", "d_region_c"))
rm(list = c("df1", "df2"))
Data partitioning
Avoid using the same data for feature preprocessing and model training as this could result in nested model bias. Instead, do a three-way split. For example, I will use 60% of the example data for model training, 20% for feature preprocessing, and 20% for testing. See this article for more information about nested model bias and how to avoid it.
set.seed(12345)
# split train vs. the rest
split1 <- initial_split(df, 0.6, strata = purchase)
df_train <- training(split1)
df_split2 <- testing(split1)
# split preprocessing vs. test
split2 <- initial_split(df_split2, 0.5, strata = purchase)
df_pre <- training(split2)
df_test <- testing(split2)
rm(list = c("df", "df_split2", "split1", "split2"))
Check to make the sure the split worked properly by seeing if the response variable mean is the same between samples.
tibble(
pre = mean(df_pre$purchase),
train = mean(df_train$purchase),
test = mean(df_test$purchase)
)
## # A tibble: 1 x 3
## pre train test
## <dbl> <dbl> <dbl>
## 1 0.201 0.201 0.201
Information value
Information value (IV) is a highly flexible approach that lets you measure the strength of association betweeen the response and each predictor. It’s a good way to filter out irrelevant variables prior to building a model.
There are several advantages of IV over other filtering methods.
- IV can detect linear and non-linear relationships
- IV scores allow you to directly compare continuous and categorical variables
- IV can handle missing data without imputation and assess the predictive power of NAs
It is good practice to split the preprocessing dataset prior to estimating IV. This allows you to adjust the IV estimates using cross-validation to prevent weak predictors from getting past the filter by chance. See the {Information} package vignette for more details.
set.seed(666)
# split preprocessing data
iv_split <- initial_split(df_pre, 0.5, strata = "purchase")
df_iv_train <- training(iv_split)
df_iv_test <- testing(iv_split)
# calculate IV
iv <- create_infotables(
data = df_iv_train,
valid = df_iv_test,
y = "purchase"
)
Note that the unique_id
variable was ignored because it has too many
levels. This is a handy feature of the {Information} package when
dealing with large datasets. It automatically ignores “junk” variables,
like customer IDs and zip codes. Any feature that’s non-numeric with
more than 1,000 levels gets excluded.
The create_infotables()
function will create a data frame (accessible
via iv$Summary
) with an IV estimate for each predictor, along with a
cross-validation penalty, and the adjusted IV score.
Once you have the IV estimates, you will need to pick a threshold for excluding variables based on adjusted IV. This is subjective. But in general, the rule of thumb is:
You don’t want to be too restrictive at this stage, especially if
you are using a modeling approach that has a built-in feature selection
process, as is the case with tree-based algorithms. Typically, I would
drop all variables with adjusted IV <0.02. However, if most of the
variables have relatively low IV scores, I would take the top_n()
and
hope for the best.
# get top predictors
top_iv <- iv$Summary %>%
filter(AdjIV > 0.02)
# save predictor names for filtering
top_nm <- as.character(top_iv$Variable)
top_iv
## Variable IV PENALTY AdjIV
## 1 n_open_rev_acts 0.78808585 0.071311360 0.71677449
## 2 tot_hi_crdt_crdt_lmt 0.83245714 0.117402471 0.71505467
## 3 ratio_bal_to_hi_crdt 0.65154998 0.116854610 0.53469537
## 4 m_snc_oldst_retail_act_opn 0.57731695 0.089622002 0.48769495
## 5 d_na_m_snc_mst_rcnt_act_opn 0.41328900 0.024885205 0.38840379
## 6 m_snc_mst_rcnt_act_opn 0.50501697 0.166593270 0.33842370
## 7 hi_retail_crdt_lmt 0.41075125 0.102026928 0.30872432
## 8 avg_bal_all_prm_bc_acts 0.35052146 0.063045819 0.28747564
## 9 ratio_retail_bal2hi_crdt 0.34148718 0.055783357 0.28570382
## 10 n_of_satisfy_fnc_rev_acts 0.30889636 0.041741804 0.26715456
## 11 d_na_avg_bal_all_fnc_rev_acts 0.25496525 0.021389665 0.23357559
## 12 prcnt_of_acts_never_dlqnt 0.33429544 0.103424077 0.23087136
## 13 avg_bal_all_fnc_rev_acts 0.25673030 0.033986232 0.22274407
## 14 n_fnc_acts_vrfy_in_12m 0.22964383 0.063106470 0.16653736
## 15 n_fnc_instlacts 0.16350030 0.040834858 0.12266544
## 16 student_hi_cred_range 0.12048557 0.002198665 0.11828690
## 17 d_region 0.15013706 0.035344772 0.11479229
## 18 n_bc_acts_opn_in_24m 0.13566094 0.032034754 0.10362618
## 19 m_snc_mstrec_instl_trd_opn 0.15369658 0.058175939 0.09552064
## 20 d_na_m_snc_oldst_mrtg_act_opn 0.10629282 0.027184149 0.07910867
## 21 m_snc_oldst_mrtg_act_opn 0.11175094 0.033558786 0.07819215
## 22 n_bc_acts_opn_in_12m 0.10175870 0.033978431 0.06778027
## 23 tot_othrfin_hicrdt_crdtlmt 0.07942734 0.011989605 0.06743773
## 24 n_pub_rec_act_line_derogs 0.10838429 0.047075686 0.06130861
## 25 agrgt_bal_all_xcld_mrtg 0.11818141 0.061217651 0.05696376
## 26 ratio_prsnl_fnc_bal2hicrdt 0.06122417 0.012688651 0.04853552
## 27 m_snc_mstrcnt_mrtg_act_upd 0.05714766 0.011182831 0.04596483
## 28 n_satisfy_prsnl_fnc_acts 0.04345530 0.002040935 0.04141436
## 29 n_bank_instlacts 0.09145097 0.053917161 0.03753381
## 30 n_30d_and_60d_ratings 0.04249370 0.007780171 0.03471353
## 31 tot_instl_hi_crdt_crdt_lmt 0.04437595 0.012048230 0.03232772
## 32 n_disputed_acts 0.04227185 0.010734430 0.03153742
## 33 d_na_m_sncoldst_bnkinstl_actopn 0.04751751 0.024308098 0.02320941
## 34 n_30d_ratings 0.05123535 0.028309813 0.02292553
## 35 d_na_ratio_prsnl_fnc_bal2hicrdt 0.02369415 0.001193081 0.02250107
## 36 n_satisfy_instl_acts 0.06611851 0.044733739 0.02138477
## 37 n30d_orwrs_rtng_mrtg_acts 0.02693511 0.005762157 0.02117296
## 38 n_inquiries 0.03130803 0.010250884 0.02105715
## 39 n_retail_acts_opn_in_24m 0.07007429 0.049037889 0.02103640
## 40 n_fnc_acts_opn_in_12m 0.03202549 0.011258903 0.02076659
## 41 n_satisfy_oil_nationl_acts 0.03283909 0.012552947 0.02028614
As you can see, filtering by adjusted IV reduces the number of predictors in our example dataset to 41. That implies that 37% of the original 65 predictors where probably “useless.”
Feature engineering
{vtreat} is my go-to R package for common feature engineering tasks. For a formal description of the package read this paper. The package is also described in the excellent book Practical Data Science with R by the package authors Nina Zumel and John Mount.
The {vtreat} package has functions that will automatically:
- Replace NAs with the column mean value (numeric) or majority class (non-numeric)
- Create missing-indicator variables
- Dummy code all non-numeric variables with frequency >2% (rare levels get grouped together)
- Truncate numeric distributions to mitigate outliers
- Create derived versions of non-numeric variables using prevalance coding and impact coding
Prevalence coding replaces the levels of a categorical variable with the proportion each level is observed in the dataset. Impact coding uses the marginal effect from a single-variable logistic regression as a replacement for each level in a categorical variable. Both derived variables are numeric.
Create treatment plan
Use the designTreatmentsC()
function to create a variable treatment
plan for classification models. There are a lot of arguments for this
function, so check the documentation. Save the treatment plan
(vtreat_plan
) as an .RDS object so you can apply it to non-training
data prior to generating model predictions.
# filter preprocessing data by IV
df_vtreat <- df_pre %>%
select(all_of(top_nm), purchase)
# create plan
vtreat_plan <- designTreatmentsC(
dframe = df_vtreat,
varlist = top_nm,
outcomename = "purchase",
outcometarget = 1,
collarProb = .025
)
## [1] "vtreat 1.6.0 inspecting inputs Thu Jul 9 16:31:02 2020"
## [1] "designing treatments Thu Jul 9 16:31:02 2020"
## [1] " have initial level statistics Thu Jul 9 16:31:02 2020"
## [1] " scoring treatments Thu Jul 9 16:31:02 2020"
## [1] "have treatment plan Thu Jul 9 16:31:02 2020"
## [1] "rescoring complex variables Thu Jul 9 16:31:02 2020"
## [1] "done rescoring complex variables Thu Jul 9 16:31:03 2020"
Prepare training data
After you have the treatment plan object, you can apply it to a new
dataset using prepare()
. This creates a new data frame with treated
variables based on the codeRestriction
argument. Read
this vignette
for a description of the different {vtreat} variable types.
# create treated data frame
df_train2 <- prepare(
treatmentplan = vtreat_plan,
dframe = df_train,
codeRestriction = c("clean", "lev", "catB", "catP", "isBAD"),
doCollar = TRUE
)
Explore derived variables
# check the d_region variable
df_train2 %>%
select(contains("d_region")) %>%
head()
## d_region_catP d_region_catB d_region_lev_x_a d_region_lev_x_b
## 1 0.3105683 -0.53748855 1 0
## 2 0.2592223 0.53943640 0 0
## 3 0.3105683 -0.53748855 1 0
## 4 0.3105683 -0.53748855 1 0
## 5 0.3105683 -0.53748855 1 0
## 6 0.1884347 -0.08657372 0 1
## d_region_lev_x_c d_region_lev_x_d
## 1 0 0
## 2 0 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Notice that the original d_region
character variable has been
transformed into 4 dummy indicators, plus 2 derived variables based on
prevalence (“catP”) and impact coding (“catB”). The prepare()
function
will do this automatically for every non-numeric variable in the
dataset.
Remove redundant variables
Redundant variables are predictors that are highly correlated with one or more other predictors in the dataset.
From a predictive accuracy standpoint, it is not strictly necessary to remove redundant variables prior to model fitting. This is one of the many distinctions between predictive and explanatory modeling.
However, when using tree-based algorithms, it is helpful to remove redundant predictors in order to get more accurate variable importance rankings. To find the most redudant features in a dataset, I use the findCorrelation() function from the {Caret} package.
# get names of redundant predictors
corr_vars <- findCorrelation(
cor(
df_train2,
method = "spearman"
),
cutoff = 0.9,
names = TRUE,
exact = TRUE
)
corr_vars
## [1] "tot_instl_hi_crdt_crdt_lmt" "n_bank_instlacts"
## [3] "n_30d_ratings"
And voila…
# filter out redundant predictors
df_train3 <- df_train2 %>% select(-all_of(corr_vars))
str(df_train3)
## 'data.frame': 6020 obs. of 44 variables:
## $ n_open_rev_acts : num 2 1 6 1 0 18 0 6 1 1 ...
## $ tot_hi_crdt_crdt_lmt : num 24300 11500 33600 200 0 ...
## $ ratio_bal_to_hi_crdt : num 5 0 0.4 0 36.1 ...
## $ m_snc_oldst_retail_act_opn : num 164 164 71 367 164 ...
## $ d_na_m_snc_mst_rcnt_act_opn : num 0 0 0 0 1 0 0 0 0 0 ...
## $ m_snc_mst_rcnt_act_opn : num 92 23 9 161.8 29.9 ...
## $ hi_retail_crdt_lmt : num 0 0 600 200 0 400 0 3000 400 0 ...
## $ avg_bal_all_prm_bc_acts : num 607 0 61 2494 2494 ...
## $ ratio_retail_bal2hi_crdt : num 11.5 11.5 0 0 11.5 ...
## $ n_of_satisfy_fnc_rev_acts : num 0 0 0 0 0 3 0 1 0 0 ...
## $ d_na_avg_bal_all_fnc_rev_acts : num 1 1 1 1 1 0 1 1 1 1 ...
## $ prcnt_of_acts_never_dlqnt : num 100 100 100 100 80.8 ...
## $ avg_bal_all_fnc_rev_acts : num 1767 1767 1767 1767 1767 ...
## $ n_fnc_acts_vrfy_in_12m : num 0 0 0 0 0 3 0 0 0 3 ...
## $ n_fnc_instlacts : num 1 0 0 0 0 2 0 1 0 5 ...
## $ student_hi_cred_range : num 0 0 0 0 0 0 0 0 0 0 ...
## $ d_region_catP : num 0.311 0.259 0.311 0.311 0.311 ...
## $ d_region_catB : num -0.537 0.539 -0.537 -0.537 -0.537 ...
## $ n_bc_acts_opn_in_24m : num 0 1 1 0 0 2 0 1 0 1 ...
## $ m_snc_mstrec_instl_trd_opn : num 126.9 41.8 41.8 41.8 41.8 ...
## $ d_na_m_snc_oldst_mrtg_act_opn : num 1 1 1 1 1 0 1 1 1 0 ...
## $ m_snc_oldst_mrtg_act_opn : num 139 139 139 139 139 ...
## $ n_bc_acts_opn_in_12m : num 0 0 1 0 0 1 0 0 0 1 ...
## $ tot_othrfin_hicrdt_crdtlmt : num 0 0 0 0 0 ...
## $ n_pub_rec_act_line_derogs : num 0 0 0 0 2 0 2 0 0 5 ...
## $ agrgt_bal_all_xcld_mrtg : num 1214 0 122 0 0 ...
## $ ratio_prsnl_fnc_bal2hicrdt : num 51.7 51.7 51.7 51.7 51.7 ...
## $ m_snc_mstrcnt_mrtg_act_upd : num 0.665 0.665 0.665 0.665 0.665 ...
## $ n_satisfy_prsnl_fnc_acts : num 0 0 0 0 0 1 0 0 0 1 ...
## $ n_30d_and_60d_ratings : num 0 0 0 0 0 5 0 0 0 16 ...
## $ n_disputed_acts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ d_na_m_sncoldst_bnkinstl_actopn: num 1 1 1 1 1 1 1 1 1 1 ...
## $ d_na_ratio_prsnl_fnc_bal2hicrdt: num 1 1 1 1 1 0 1 1 1 0 ...
## $ n_satisfy_instl_acts : num 0 0 0 0 0 0 0 0 0 1 ...
## $ n30d_orwrs_rtng_mrtg_acts : num 0 0 0 0 0 0 0 0 0 8 ...
## $ n_inquiries : num 0 0 1 0 4 1 2 0 0 16 ...
## $ n_retail_acts_opn_in_24m : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_fnc_acts_opn_in_12m : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_satisfy_oil_nationl_acts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ d_region_lev_x_a : num 1 0 1 1 1 0 1 1 0 1 ...
## $ d_region_lev_x_b : num 0 0 0 0 0 1 0 0 0 0 ...
## $ d_region_lev_x_c : num 0 0 0 0 0 0 0 0 0 0 ...
## $ d_region_lev_x_d : num 0 1 0 0 0 0 0 0 1 0 ...
## $ purchase : num 0 0 0 0 0 0 0 0 0 0 ...
After filtering by IV, treating with {vtreat}, and removing 3 redundant variables, the final dataset has 43 predictors that are all numeric and ready for model training.
Conclusion
The preprocessing workflow described here works well for the sorts of modeling projects I work on because it’s basically an excercie in data mining. But if you have a relatively small set of well-understood predictors, the methods for variable selection and data reduction described in Frank Harrell’s book Regression Modeling Strategies might be more appropriate. Also, to take your preprocessing workflow to the next level, consider using the tidymodels package {recipes}.
Questions or comments?
Feel free to reach out to me at any of the social links below.
For more R content, please visit R-bloggers and RWeekly.org.