Cross-Validation on Regression Models

Charles Pryor
Published in The Startup
Jul 15, 2020


Cross-validation is an essential tool for evaluating the accuracy of a classification model. Logistic Regression, Random Forest, and SVM each have their own advantages and drawbacks. This is where cross-validation comes in: you split the data into a number of different train and test sets, evaluate the model on each split, and average the results into a single score. As a practice I typically fit Logistic Regression, Random Forest, and SVM to the same data, but what if I get a similar score for each model? Cross-validation can tell you which one of your models is the most accurate. There are several methods of cross-validation, but today I will focus on three: simple validation, the “leave one out” method, and K-Fold. To understand the need for cross-validation and when to use which particular method, we must first understand the models and their drawbacks.
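To make that workflow concrete, here is a minimal sketch (my own illustration, not code from the article) that fits all three models and compares them with scikit-learn’s cross_val_score on a synthetic data set:

```python
# A minimal sketch: comparing three classifiers with 5-fold cross-validation.
# The data set here is synthetic (make_classification), purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5 train/test splits
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```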

Logistic Regression

This is a pure binomial (two-class) classification model. It is generally the go-to model when you have just two classes, such as in medical identification: you either have cancer or not, you either have congestive heart failure or not. The advantage of this model is that it is a simple regression that works well for what it does, which is, again, binomial classification. However, this model is prone to overfitting. As a review, overfitting is when a model learns and adjusts to all of its data points, including its outliers. That means the model has fit the training data so closely that it no longer generalizes to new data.
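A minimal sketch of fitting a logistic regression to a two-class medical data set; the built-in breast-cancer data set is my choice of example, not the author’s:

```python
# Sketch: a binary (two-class) problem fit with logistic regression.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # labels: malignant vs. benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=5000)  # higher max_iter so the solver converges
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```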

Random Forest

The major advantage of Random Forest is that it can handle a huge amount of data with thousands of features. It resamples the data when training each tree, which is called bootstrapping. As a review, bootstrapping is sampling with replacement: after a data point is drawn, it is thrown back into the original data set and can be drawn again. Also, remember that roughly a third of the data never makes it into a given tree’s bootstrap sample. These are what are known as out-of-bag (OOB) samples, and you typically want to estimate your model’s performance with the out-of-bag error. The problem with Random Forest is that, like Logistic Regression, it is prone to overfitting if your data set is noisy.
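Here is a short sketch of bootstrapping and the OOB error in scikit-learn (again on an assumed synthetic data set):

```python
# Sketch: Random Forest with bootstrapping and the out-of-bag (OOB) score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,   # each tree is trained on a bootstrap resample
    oob_score=True,   # score each tree on the samples it never saw
    random_state=0,
)
forest.fit(X, y)
print("OOB accuracy:", forest.oob_score_)  # 1 minus the OOB error
```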

Support Vector Machines (SVM)

These are great when you need a precisely accurate fit to the data. SVMs work well when the features outnumber the actual samples (say, a data set with 1000 columns and 800 rows) and, more generally, when there is a high number of features (here features are called dimensions). They also work well when class separation is clean, with no grey area, and they are less prone to overfitting. Their drawback is that they do not handle huge data sets well: training takes a long time, and the final model is difficult to understand and interpret.
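As a sketch of that “wide” setting, here is an SVM fit to synthetic data with more features than samples, mirroring the 1000-column, 800-row example above (the scaling step is my own standard-practice assumption):

```python
# Sketch: an SVM on a data set where features outnumber samples (1000 x 800).
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=800, n_features=1000, n_informative=50, random_state=0
)

# Scaling features first is standard practice for SVMs.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X, y)
print("Training accuracy:", svm.score(X, y))
```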

Those are the three main classifiers. But remember, we are scientists: once we fit and train all three models, we must check them with a cross-validation method.

Simple Validation

You train the model on 50% of the data and test it on the other 50%.
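A one-line sketch of that 50/50 split with scikit-learn’s train_test_split (the data set and model are assumed for illustration):

```python
# Sketch: simple validation with a 50/50 train/test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1  # hold out half the data for testing
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_test, y_test))
```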

Leave One out Cross Validation

This method takes one data point out of the data set and trains the model on the rest. It then iterates through the whole data set, leaving out one data point each time, and averages the results to form its score. This method takes some time and a lot of computing resources.
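A sketch using scikit-learn’s LeaveOneOut splitter; the small synthetic data set is an assumption, chosen because leave-one-out requires one model fit per data point:

```python
# Sketch: leave-one-out cross-validation. Each iteration trains on n - 1 points
# and tests on the single held-out point; the scores are then averaged.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, random_state=2)  # kept small: LOOCV is slow

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("Number of fits:", len(scores))  # one fit per data point
print("Mean accuracy:", scores.mean())
```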

K-Fold

Lastly there is K-Fold. K-Fold cross-validation takes in a number, k, that tells you how to split your data. Let’s use k = 5. The data is split 5 ways: the first 20% is treated as the test data, with the remaining 80% as the train data. It then takes the next 20% as the test data and the remaining 80% as the train data, and so on and so forth. It then averages the scores from all 5 folds and returns a single score.
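Here is a short sketch of 5-fold cross-validation with scikit-learn’s KFold (model and data are assumed for illustration):

```python
# Sketch: 5-fold cross-validation. The data is split into 5 parts; each part
# serves once as the test set while the other 80% trains the model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=3)

kfold = KFold(n_splits=5, shuffle=True, random_state=3)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("Fold scores:", scores.round(3))
print("Mean score:", scores.mean().round(3))
```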

These are the common cross-validation methods for regression models.
