This package covers common aspects of predictive modeling.

The main purpose of this package is to teach predictive modeling through a practical toolbox of functions and concepts, aimed at people who are starting out in data science, with small data or big data, and with a special focus on understanding results and analysis.

Overview: The quantity of zeros, NAs, and unique values, as well as the data type, can lead to a good or a bad model. Here is an approach to cover this very first step in data modeling.
## Loading needed libraries
library(funModeling)
data(heart_disease)
my_data_status=df_status(heart_disease)
## variable q_zeros p_zeros q_na p_na type unique
## 1 age 0 0.00 0 0.00 integer 41
## 2 gender 0 0.00 0 0.00 factor 2
## 3 chest_pain 0 0.00 0 0.00 factor 4
## 4 resting_blood_pressure 0 0.00 0 0.00 integer 50
## 5 serum_cholestoral 0 0.00 0 0.00 integer 152
## 6 fasting_blood_sugar 258 85.15 0 0.00 factor 2
## 7 resting_electro 151 49.83 0 0.00 factor 3
## 8 max_heart_rate 0 0.00 0 0.00 integer 91
## 9 exer_angina 204 67.33 0 0.00 integer 2
## 10 oldpeak 99 32.67 0 0.00 numeric 40
## 11 slope 0 0.00 0 0.00 integer 3
## 12 num_vessels_flour 176 58.09 4 1.32 integer 4
## 13 thal 0 0.00 2 0.66 factor 3
## 14 heart_disease_severity 164 54.13 0 0.00 integer 5
## 15 exter_angina 204 67.33 0 0.00 factor 2
## 16 has_heart_disease 0 0.00 0 0.00 factor 2
- `q_zeros`: quantity of zeros (`p_zeros`: in percentage)
- `q_na`: quantity of NA (`p_na`: in percentage)
- `type`: factor or numeric
- `unique`: quantity of unique values

The function `df_status` takes a data frame and returns a status table that helps to quickly remove unwanted variables.
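As an illustration of what these metrics mean, the same counts can be sketched in base R on a toy data frame (`toy`, its columns, and the `status` name below are made up for this example; `df_status` itself does more):

```r
# Toy data frame, invented for this example
toy = data.frame(
  x = c(0, 0, 1, 2, NA),
  y = factor(c("a", "b", "a", "a", "b"))
)

# Per-column counts, mirroring the q_zeros / q_na / unique columns
status = data.frame(
  variable = names(toy),
  q_zeros  = sapply(toy, function(v) sum(v == 0, na.rm = TRUE)),
  q_na     = sapply(toy, function(v) sum(is.na(v))),
  unique   = sapply(toy, function(v) length(unique(na.omit(v)))),
  row.names = NULL
)

# Percentages, as in p_zeros / p_na
status$p_zeros = round(100 * status$q_zeros / nrow(toy), 2)
status$p_na    = round(100 * status$q_na / nrow(toy), 2)
status
```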
## Removing variables with a high number of NAs or zeros

# Removing variables with more than 60% of zero values
vars_to_remove=subset(my_data_status, my_data_status$p_zeros > 60)
vars_to_remove["variable"]
## variable
## 6 fasting_blood_sugar
## 9 exer_angina
## 15 exter_angina
## Keeping all except vars_to_remove
heart_disease_2=heart_disease[, !(names(heart_disease) %in% vars_to_remove[,"variable"])]
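The same pattern can be applied to NA-heavy variables; the 1% threshold and the `heart_disease_3` name below are only illustrative:

```r
# Removing variables whose percentage of NA exceeds an (illustrative) threshold
vars_high_na=subset(my_data_status, my_data_status$p_na > 1)
heart_disease_3=heart_disease[, !(names(heart_disease) %in% vars_high_na[,"variable"])]
```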
## Ordering data by percentage of zeros
my_data_status[order(-my_data_status$p_zeros),]
## variable q_zeros p_zeros q_na p_na type unique
## 6 fasting_blood_sugar 258 85.15 0 0.00 factor 2
## 9 exer_angina 204 67.33 0 0.00 integer 2
## 15 exter_angina 204 67.33 0 0.00 factor 2
## 12 num_vessels_flour 176 58.09 4 1.32 integer 4
## 14 heart_disease_severity 164 54.13 0 0.00 integer 5
## 7 resting_electro 151 49.83 0 0.00 factor 3
## 10 oldpeak 99 32.67 0 0.00 numeric 40
## 1 age 0 0.00 0 0.00 integer 41
## 2 gender 0 0.00 0 0.00 factor 2
## 3 chest_pain 0 0.00 0 0.00 factor 4
## 4 resting_blood_pressure 0 0.00 0 0.00 integer 50
## 5 serum_cholestoral 0 0.00 0 0.00 integer 152
## 8 max_heart_rate 0 0.00 0 0.00 integer 91
## 11 slope 0 0.00 0 0.00 integer 3
## 13 thal 0 0.00 2 0.66 factor 3
## 16 has_heart_disease 0 0.00 0 0.00 factor 2
Constraint: The target variable must have only two values. If it contains NA values, they will be removed.

Note: There are many ways to select the best variables for building a model; presented here is one more, based on visual analysis.
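If your target has more than two values, a binary version can be derived first. The following sketch collapses heart_disease_severity (0 to 4) into a hypothetical two-level factor named `severity_flag` (the dataset already ships has_heart_disease, so this is only to show the idea):

```r
# Hypothetical binary target derived from a multi-level variable
heart_disease$severity_flag=factor(
  ifelse(heart_disease$heart_disease_severity > 0, "yes", "no")
)
table(heart_disease$severity_flag)
```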
cross_gender=cross_plot(heart_disease, str_input="gender", str_target="has_heart_disease")
The last two plots share the same data source, showing the distribution of has_heart_disease in terms of gender. The plot on the left shows values in percentages, while the one on the right shows absolute values.

The gender variable seems to be a good predictor, since the likelihood of having heart disease differs between the female and male groups. It gives an order to the data.

There are 97 females and 206 males in total. Total cases: summing the values of the four bars, 25 + 72 + 114 + 92 = 303.

Note: What would have happened if, instead of rates of 25.8% vs. 55.3% (female vs. male), they had been more similar, like 30.2% vs. 30.6%? In that case the variable gender would have been much less relevant, since it would not separate the has_heart_disease event.
Numerical variables should be binned in order to plot them with a histogram; otherwise the plot shows no useful information, as can be seen here:

There is a function included in the package (inherited from the Hmisc package): equal_freq, which returns the bins/buckets based on the equal-frequency criterion, which has (or tries to have) the same quantity of rows per bin.

For numerical variables, cross_plot has auto_binning=T by default, which automatically calls the equal_freq function with n_bins=10 (or the closest possible number).
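As a rough idea of what equal-frequency binning does, a minimal base-R approximation can be built from quantile breaks (`equal_freq_sketch` is a made-up name; equal_freq itself handles ties and labels more carefully):

```r
# Minimal sketch of equal-frequency binning: cut at quantile breaks
equal_freq_sketch=function(x, n_bins) {
  breaks=unique(quantile(x, probs=seq(0, 1, length.out=n_bins + 1), na.rm=TRUE))
  cut(x, breaks=breaks, include.lowest=TRUE)
}

# Each of the 3 buckets gets roughly the same number of rows
table(equal_freq_sketch(mtcars$mpg, 3))
```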
cross_plot(heart_disease, str_input="max_heart_rate", str_target="has_heart_disease")
If you don't want automatic binning, set auto_binning=F in the cross_plot function.

For example, creating oldpeak_2 based on equal frequency, with 3 buckets:
heart_disease$oldpeak_2=equal_freq(var=heart_disease$oldpeak, n_bins = 3)
summary(heart_disease$oldpeak_2)
## [0.0,0.2) [0.2,1.5) [1.5,6.2]
## 106 107 90
Plotting the binned variable (auto_binning = F):
cross_oldpeak_2=cross_plot(heart_disease, str_input="oldpeak_2", str_target="has_heart_disease", auto_binning = F)
This new plot based on oldpeak_2 clearly shows how the likelihood of having heart disease increases as oldpeak_2 increases. Again, it gives an order to the data.

Converting the variable max_heart_rate into one of 10 bins:
heart_disease$max_heart_rate_2=equal_freq(var=heart_disease$max_heart_rate, n_bins = 10)
cross_plot(heart_disease, str_input="max_heart_rate_2", str_target="has_heart_disease")
At first glance, max_heart_rate_2 shows a negative and linear relationship; however, some buckets add noise to the relationship. For example, the bucket (141, 146] has a higher heart disease rate than the previous bucket, when a lower one was expected. This could be noise in the data.

Key note: One way to reduce the noise (at the cost of losing some information) is to split into fewer bins:
heart_disease$max_heart_rate_3=equal_freq(var=heart_disease$max_heart_rate, n_bins = 5)
cross_plot(heart_disease, str_input="max_heart_rate_3", str_target="has_heart_disease")
Conclusion: As can be seen, the relationship is now much cleaner and clearer. Bucket 'N' has a higher rate than bucket 'N+1', which implies a negative correlation.

How about saving the cross_plot result into a folder? Just set the parameter path_out to the folder you want (it creates a new one if it doesn't exist).
cross_plot(heart_disease, str_input="max_heart_rate_3", str_target="has_heart_disease", path_out="my_plots")
This creates the folder my_plots in the working directory.

## cross_plot on multiple variables

Imagine you want to run cross_plot for several variables at the same time. To achieve this, just define a vector containing the variable names.

If you want to analyze these 3 variables:
vars_to_analyze=c("age", "oldpeak", "max_heart_rate")
cross_plot(data=heart_disease, str_target="has_heart_disease", str_input=vars_to_analyze)
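The same call scales to every predictor at once; for instance (a sketch, assuming the target column is the only one to exclude from the inputs):

```r
# Profiling all variables against the target in one call
all_inputs=setdiff(names(heart_disease), "has_heart_disease")
cross_plot(data=heart_disease, str_target="has_heart_disease", str_input=all_inputs)
```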
cross_plot is good for visualizing linear relationships, and it also gives a hint about non-linear ones.

## Model performance

Overview: Once the predictive model is developed with training data, it should be compared with test data (which wasn't seen by the model before). Presented here is a wrapper for the ROC curve and AUC (area under the ROC curve), and the KS (Kolmogorov-Smirnov) statistic.
## Training and test data. Percentage of training cases default value=80%.
index_sample=get_sample(data=heart_disease, percentage_tr_rows=0.8)
## Generating the samples
data_tr=heart_disease[index_sample,]
data_ts=heart_disease[-index_sample,]
## Creating the model only with training data
fit_glm=glm(has_heart_disease ~ age + oldpeak, data=data_tr, family = binomial)
## Performance metrics for Training Data
model_performance(fit=fit_glm, data = data_tr, target_var = "has_heart_disease")
##
## -----------
## AUC KS
## ----- -----
## 0.759 0.406
## -----------
## Performance metrics for Test Data
model_performance(fit=fit_glm, data = data_ts, target_var = "has_heart_disease")
##
## -----------
## AUC KS
## ----- -----
## 0.748 0.456
## -----------
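To build intuition for the KS numbers above: KS is the maximum distance between the cumulative score distributions of the two classes, and it can be cross-checked in base R on simulated scores (the scores below are made up, not taken from the model above):

```r
# KS statistic on simulated class scores
set.seed(1)
scores_pos=rbeta(100, 4, 2)  # hypothetical scores for the positive class
scores_neg=rbeta(100, 2, 4)  # hypothetical scores for the negative class

# ks.test returns the maximum distance between the two empirical CDFs
ks=unname(ks.test(scores_pos, scores_neg)$statistic)
round(ks, 3)
```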
## Key notes

## Final comments