Package funModeling

Introduction

This package covers common aspects in predictive modeling:

  1. Data Cleaning
  2. Variable importance analysis
  3. Assessing model performance

The main purpose of this package is to teach predictive modeling through a practical toolbox of functions and concepts, aimed at people who are starting out in data science, whether working with small or big data, with a special focus on understanding results and analysis.

Part 1: Data cleaning

Overview: The quantity of zeros, NAs, and unique values, as well as the data type, can make the difference between a good and a bad model. This section covers the very first step in data modeling.

## Loading needed libraries
library(funModeling)
data(heart_disease)

Checking NA, zeros, data type and unique values

my_data_status=df_status(heart_disease)
##                  variable q_zeros p_zeros q_na p_na    type unique
## 1                     age       0    0.00    0 0.00 integer     41
## 2                  gender       0    0.00    0 0.00  factor      2
## 3              chest_pain       0    0.00    0 0.00  factor      4
## 4  resting_blood_pressure       0    0.00    0 0.00 integer     50
## 5       serum_cholestoral       0    0.00    0 0.00 integer    152
## 6     fasting_blood_sugar     258   85.15    0 0.00  factor      2
## 7         resting_electro     151   49.83    0 0.00  factor      3
## 8          max_heart_rate       0    0.00    0 0.00 integer     91
## 9             exer_angina     204   67.33    0 0.00 integer      2
## 10                oldpeak      99   32.67    0 0.00 numeric     40
## 11                  slope       0    0.00    0 0.00 integer      3
## 12      num_vessels_flour     176   58.09    4 1.32 integer      4
## 13                   thal       0    0.00    2 0.66  factor      3
## 14 heart_disease_severity     164   54.13    0 0.00 integer      5
## 15           exter_angina     204   67.33    0 0.00  factor      2
## 16      has_heart_disease       0    0.00    0 0.00  factor      2

Why are these metrics important?

Filtering unwanted cases

The df_status function takes a data frame and returns a status table that makes it easy to quickly remove unwanted cases.

Removing variables with high number of NA/zeros

# Removing variables with more than 60% of zero values
vars_to_remove=subset(my_data_status, my_data_status$p_zeros > 60)
vars_to_remove["variable"]
##               variable
## 6  fasting_blood_sugar
## 9          exer_angina
## 15        exter_angina
## Keeping all except vars_to_remove 
heart_disease_2=heart_disease[, !(names(heart_disease) %in% vars_to_remove[,"variable"])]
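The same subset idea applies to variables with a high percentage of NA values. A minimal base-R sketch on a toy data frame (the column names and the 2% threshold are illustrative, not part of the heart_disease example):

```r
# Toy data frame; column 'a' has ~33% NA values
df = data.frame(a = c(1, NA, 3), b = c(0, 0, 0), c = 1:3)

# Percentage of NA per column (what df_status reports as p_na)
p_na = colMeans(is.na(df)) * 100

# Keep only the columns at or below the threshold
df_clean = df[, p_na <= 2, drop = FALSE]
names(df_clean)  # "b" "c"
```

With the df_status output, the equivalent filter is subset(my_data_status, my_data_status$p_na > 2), mirroring the p_zeros example above.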

Ordering data by percentage of zeros

my_data_status[order(-my_data_status$p_zeros),]
##                  variable q_zeros p_zeros q_na p_na    type unique
## 6     fasting_blood_sugar     258   85.15    0 0.00  factor      2
## 9             exer_angina     204   67.33    0 0.00 integer      2
## 15           exter_angina     204   67.33    0 0.00  factor      2
## 12      num_vessels_flour     176   58.09    4 1.32 integer      4
## 14 heart_disease_severity     164   54.13    0 0.00 integer      5
## 7         resting_electro     151   49.83    0 0.00  factor      3
## 10                oldpeak      99   32.67    0 0.00 numeric     40
## 1                     age       0    0.00    0 0.00 integer     41
## 2                  gender       0    0.00    0 0.00  factor      2
## 3              chest_pain       0    0.00    0 0.00  factor      4
## 4  resting_blood_pressure       0    0.00    0 0.00 integer     50
## 5       serum_cholestoral       0    0.00    0 0.00 integer    152
## 8          max_heart_rate       0    0.00    0 0.00 integer     91
## 11                  slope       0    0.00    0 0.00 integer      3
## 13                   thal       0    0.00    2 0.66  factor      3
## 16      has_heart_disease       0    0.00    0 0.00  factor      2

Part 2: Variable importance with cross_plot

Constraint: Target variable must have only 2 values. If it has NA values, they will be removed.
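That constraint can be sketched on a toy target vector (the values below are illustrative):

```r
target = c("yes", "no", "yes", NA, "no")

# NA values in the target are removed
target_clean = target[!is.na(target)]

# The remaining target must have exactly 2 distinct values
length(unique(target_clean))  # 2
```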

Note: There are many ways to select the best variables for building a model; presented here is one more, based on visual analysis.

Example 2.1: Is gender correlated with heart disease?

cross_gender=cross_plot(heart_disease, str_input="gender", str_target="has_heart_disease")

[plot of chunk variable_importance1]

The last two plots share the same data source and show the distribution of has_heart_disease in terms of gender. The plot on the left shows percentages, while the one on the right shows absolute values.

How to extract conclusions from the plots? (Short version)

The gender variable seems to be a good predictor, since the likelihood of having heart disease differs between the female and male groups. It gives an order to the data.

How to extract conclusions from the plots? (Long version)

From the 1st plot (%):
  1. The likelihood of having heart disease for males is 55.3%, while for females it is 25.8%.
  2. The heart disease rate for males is more than double the rate for females (55.3% vs. 25.8%, respectively).
From the 2nd plot (count):
  1. There are a total of 97 females: 25 with heart disease and 72 without.
  2. There are a total of 206 males: 114 with heart disease and 92 without.
  3. Total cases: summing the values of the four bars: 25+72+114+92=303.
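The four counts above can be cross-checked with a plain contingency table. A sketch using the quoted figures (with the data loaded, table(heart_disease$gender, heart_disease$has_heart_disease) would produce the same matrix):

```r
counts = matrix(c(72, 25, 92, 114), nrow = 2, byrow = TRUE,
                dimnames = list(gender = c("female", "male"),
                                has_heart_disease = c("no", "yes")))

sum(counts)  # 303 total cases

# Row-wise proportions are the rates shown in the first plot
round(prop.table(counts, margin = 1) * 100, 1)  # 25.8% vs. 55.3% with disease
```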

Note: What would have happened if, instead of rates of 25.8% vs. 55.3% (female vs. male), they had been more similar, such as 30.2% vs. 30.6%? In that case the gender variable would have been much less relevant, since it would not separate the has_heart_disease event.

Example 2.2: Crossing with numerical variables

Numerical variables should be binned in order to plot them in a histogram; otherwise, the plot does not show useful information, as can be seen here:

Equal frequency binning

The package includes a function (inherited from the Hmisc package) called equal_freq, which returns the bins/buckets based on the equal-frequency criterion, which has, or tries to have, the same quantity of rows per bin.

For numerical variables, cross_plot has auto_binning=T by default, which automatically calls the equal_freq function with n_bins=10 (or the closest possible number).
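What equal-frequency binning does can be sketched in base R with quantile and cut (equal_freq itself also handles edge cases such as repeated values; the toy vector and 5 bins below are illustrative):

```r
x = c(5, 1, 9, 3, 7, 2, 8, 4, 6, 10)

# Cut points at evenly spaced quantiles produce 5 bins
breaks = unique(quantile(x, probs = seq(0, 1, length.out = 6)))
bins = cut(x, breaks = breaks, include.lowest = TRUE)

table(bins)  # each bin holds 2 of the 10 values
```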

cross_plot(heart_disease, str_input="max_heart_rate", str_target="has_heart_disease")

[plot of chunk variable_importance2]

Example 2.3: Manual binning

If you don't want the automatic binning, set auto_binning=F in the cross_plot function.

For example, creating oldpeak_2 based on equal frequency, with 3 buckets:

heart_disease$oldpeak_2=equal_freq(var=heart_disease$oldpeak, n_bins = 3)
summary(heart_disease$oldpeak_2)
## [0.0,0.2) [0.2,1.5) [1.5,6.2] 
##       106       107        90

Plotting the binned variable (auto_binning = F):

cross_oldpeak_2=cross_plot(heart_disease, str_input="oldpeak_2", str_target="has_heart_disease", auto_binning = F)

[plot of chunk variable_importance4]

Conclusion

This new plot based on oldpeak_2 clearly shows that the likelihood of having heart disease increases as oldpeak_2 increases. Again, it gives an order to the data.

Example 2.4: Noise reduction

Converting the variable max_heart_rate into one of 10 bins:

heart_disease$max_heart_rate_2=equal_freq(var=heart_disease$max_heart_rate, n_bins = 10)
cross_plot(heart_disease, str_input="max_heart_rate_2", str_target="has_heart_disease")

[plot of chunk variable_importance5]

At first glance, max_heart_rate_2 shows a negative, linear relationship; however, some buckets add noise to it. For example, the bucket (141, 146] has a higher heart disease rate than the previous bucket, when a lower one was expected. This could be noise in the data.

Key note: One way to reduce the noise (at the cost of losing some information) is to split into fewer bins:

heart_disease$max_heart_rate_3=equal_freq(var=heart_disease$max_heart_rate, n_bins = 5)
cross_plot(heart_disease, str_input="max_heart_rate_3", str_target="has_heart_disease")

[plot of chunk variable_importance6]

Conclusion: As can be seen, the relationship is now much cleaner and clearer. Each bucket 'N' has a higher rate than bucket 'N+1', which implies a negative correlation.

How about saving the cross_plot result into a folder? Just set the parameter path_out to the folder you want (it creates a new one if it doesn't exist).

cross_plot(heart_disease, str_input="max_heart_rate_3", str_target="has_heart_disease", path_out="my_plots")

It creates the folder my_plots in the working directory.

Example 2.5: cross_plot on multiple variables

Imagine you want to run cross_plot for several variables at the same time. To achieve this, just define a vector containing the variable names.

If you want to analyze these 3 variables:

vars_to_analyze=c("age", "oldpeak", "max_heart_rate")
cross_plot(data=heart_disease, str_target="has_heart_disease", str_input=vars_to_analyze)

Final notes:

Part 3: Assessing model performance

Overview: Once the predictive model has been developed with training data, it should be evaluated against test data (which the model has not seen before). This section presents a wrapper for the ROC curve and AUC (area under the ROC curve), as well as the KS (Kolmogorov-Smirnov) statistic.

Creating the model

## Training and test data. The default percentage of training cases is 80%.
index_sample=get_sample(data=heart_disease, percentage_tr_rows=0.8)

## Generating the samples
data_tr=heart_disease[index_sample,] 
data_ts=heart_disease[-index_sample,]
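For reference, the split that get_sample performs is essentially base-R row sampling. A sketch with an illustrative seed and a fixed n standing in for nrow(heart_disease):

```r
set.seed(999)  # illustrative seed for reproducibility

n = 303  # stands in for nrow(heart_disease)
index_sample_sketch = sample(seq_len(n), size = floor(0.8 * n))

length(index_sample_sketch)  # 242 training rows; the remaining 61 form the test set
```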


## Creating the model only with training data
fit_glm=glm(has_heart_disease ~ age + oldpeak, data=data_tr, family = binomial)

ROC, AUC and KS performance metrics

## Performance metrics for Training Data
model_performance(fit=fit_glm, data = data_tr, target_var = "has_heart_disease")

[plot of chunk model_perfomance2]

## 
## -----------
##  AUC   KS  
## ----- -----
## 0.759 0.406
## -----------
## Performance metrics for Test Data
model_performance(fit=fit_glm, data = data_ts, target_var = "has_heart_disease")

[plot of chunk model_perfomance2]

## 
## -----------
##  AUC   KS  
## ----- -----
## 0.748 0.456
## -----------
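To make the two numbers concrete, both metrics can be computed from a vector of scores in base R (the toy scores and labels below are illustrative; model_performance derives the scores from the fitted model):

```r
score = c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label = c(1,   1,   0,   1,   0,   1,   0,   0)

pos = score[label == 1]
neg = score[label == 0]

# AUC: probability that a random positive case scores above a random negative
auc = mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# KS: maximum gap between the empirical score distributions of the two classes
grid = sort(unique(score))
ks = max(abs(ecdf(pos)(grid) - ecdf(neg)(grid)))

c(AUC = auc, KS = ks)  # 0.8125 and 0.5 for this toy example
```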

Key notes

Final comments