Cross Validation - THE Tutorial How To Use it

In this tutorial we will see how to simply use Cross Validation with Scikit-Learn and how to use it for prediction.

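Cross Validation is a way to get a more reliable estimate of how well our Machine Learning model performs, by testing it on several different parts of the dataset instead of just one.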

There are only 4 steps to perform a Cross Validation:

1. create 5 subgroups of our dataset
2. train a model on 4 subgroups
3. evaluate the model on the last subgroup
4. repeat steps 2 and 3 so that all subgroups are evaluated
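
To make these steps concrete, here is a minimal sketch of the loop written by hand. It is only an illustration under a few assumptions: it uses Scikit-Learn's bundled load_wine dataset as stand-in data (not the wine quality file used later in this tutorial), and the names X_demo, y_demo, models and fold_scores are made up for the example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled wine dataset,
# not the winequality-white.csv file used later in this tutorial.
X_demo, y_demo = load_wine(return_X_y=True)

# Step 1: create 5 subgroups (folds) of the dataset.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

models, fold_scores = [], []
for train_idx, test_idx in kfold.split(X_demo):
    # Step 2: train a model on 4 subgroups.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Step 3: evaluate the model on the remaining subgroup.
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
    models.append(model)
    # Step 4: the loop repeats steps 2 and 3 for every subgroup.

print(len(models), "models trained")
print(fold_scores)
```

In practice we rarely write this loop ourselves: Scikit-Learn's helper functions do the same work in a single call.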

Here, the Cross Validation will give us, at the end of the workflow, 5 different Machine Learning models.

This multiplicity of models will allow us to have a diversity in the final predictions.

In effect, Cross Validation gives us the opinion of 5 experts (5 models) instead of only one.

You can choose the number of subgroups created during Cross Validation, be it 2, 3, 5 or 40. The only constraint is to have enough data in each subgroup to get a robust model.
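
As a rough illustration of that choice, the sketch below runs the same model with several values of the number of subgroups. It assumes Scikit-Learn's bundled load_wine dataset as stand-in data, and the names X_demo, y_demo and mean_by_k are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): a small dataset bundled with scikit-learn.
X_demo, y_demo = load_wine(return_X_y=True)

mean_by_k = {}
for k in (2, 3, 5, 10):
    # cv=k asks for k subgroups, so k models are trained and k scores returned.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=k)
    mean_by_k[k] = scores.mean()
    print(f"{k} subgroups -> {len(scores)} scores, mean accuracy {scores.mean():.3f}")
```

With more subgroups, each model trains on more data but is evaluated on fewer samples, which is why each subgroup still needs enough data.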

Once we have all these opinions, we’ll have to decide which expert to follow. This is what we will see in this article.

Let’s start by loading our data! 🔥

Data

This tutorial follows on from our detailed article on learning Machine Learning.

But of course, you can follow this tutorial without having followed the previous one. You only have to download the dataset winequality-white.csv from this Github address.

Our dataset ranks wines according to their quality. The objective is to predict the quality level of wines from their features (acidity, alcohol level, pH, etc).

Once the dataset is loaded in your working environment, open it with the Pandas library:

import pandas as pd
df = pd.read_csv("winequality-white.csv", sep=";")
df.head(3)

Cross Validation (CV) divides our dataset into subgroups.

To make sure that these subgroups have a fair distribution, we first shuffle the dataset with the sample(frac=1) function:

df = df.sample(frac=1).reset_index(drop=True)

reset_index(drop=True) renumbers the rows from 0 after the shuffling and discards the old index.
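
If you want the same shuffle every time you run the code, sample() also accepts a random_state argument. Here is a small sketch on a toy DataFrame; the values and the names toy and shuffled are invented for the example.

```python
import pandas as pd

# Toy data (hypothetical values), only to show what the shuffle does.
toy = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1, 9.9], "quality": [6, 6, 6, 5]})

# frac=1 keeps 100% of the rows, in random order.
# random_state fixes the random seed so the order is reproducible.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

The rows are the same as before, only their order changes, and the index starts again at 0.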

Next, we prepare our features (X) and label (y) for the Cross Validation:

X = df.drop(columns='quality')
y = df['quality']

Note: Here we don’t need train and test data. Indeed in Cross Validation, each subgroup is used once for testing and N-1 times for training. It is, therefore, not necessary to indicate train and test set because all subgroups go through these stages.
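
We can check this claim with a small sketch that counts how many times each sample lands in a test subgroup and in a training subgroup. It assumes 20 dummy samples and 5 subgroups; the names dummy_X, test_counts and train_counts are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 20, 5
dummy_X = np.zeros((n_samples, 1))  # placeholder features, only the row count matters

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

for train_idx, test_idx in KFold(n_splits=n_splits).split(dummy_X):
    train_counts[train_idx] += 1  # sample was used for training in this round
    test_counts[test_idx] += 1    # sample was used for testing in this round

print(test_counts)   # how often each sample was tested
print(train_counts)  # how often each sample was trained on
```

Every sample is tested exactly once and used for training N-1 times, here 4 times.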

Cross Validation score

Let’s load the best performing model from our article for learning Machine Learning: Decision Tree.

from sklearn import tree
decisionTree = tree.DecisionTreeClassifier()

With this model we had obtained an accuracy of 60%.

Can we do better?

We can check that directly with Scikit-Learn's cross_val_score function:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(decisionTree, X, y, cv=10)

For this evaluation we’ve chosen to perform a Cross Validation on 10 subgroups by indicating cv=10.

This allows us to train 10 different Decision Tree models.

Let’s display the result of these 10 models:

scores

Output: array([0.63265306, 0.57959184, 0.64693878, 0.6122449 , 0.65510204, 0.62040816, 0.59183673, 0.63265306, 0.63599182, 0.58282209])
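
Ten numbers are hard to compare with a single accuracy, so a common next step is to summarize them. The sketch below simply copies the scores printed above into an array and computes their mean and standard deviation; the names mean_score and std_score are made up for the example.

```python
import numpy as np

# The 10 scores printed above, copied by hand.
scores = np.array([0.63265306, 0.57959184, 0.64693878, 0.6122449, 0.65510204,
                   0.62040816, 0.59183673, 0.63265306, 0.63599182, 0.58282209])

mean_score = scores.mean()  # average accuracy over the 10 models
std_score = scores.std()    # how much the accuracy varies between subgroups

print(f"mean accuracy: {mean_score:.3f} (+/- {std_score:.3f})")
```

The mean comes out at about 0.619, to be compared with the 60% we had before.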

Training models with Cross Validation

Now that we know Cross Validation will improve our model, we can get down to business!

First, I suggest dividing our dataset in two:

Data for Cross Validation, which we will call train_test
Data for testing the final models, which we will call gtest for global test

To separate our dataset we use the train_test_split function (gtest will be composed of 10% of our dataset):

from sklearn.model_selection import train_test_split
X_train_test, X_gtest, y_train_test, y_gtest = train_test_split(X, y, test_size=0.10)

Then let’s initialize our classifier:

from sklearn import tree
decisionTree = tree.DecisionTreeClassifier()

And now we can implement the REAL Cross Validation.

For this, it’s simple, we use the cross_validate function.

This function returns a dictionary with several pieces of information:

fit_time – training time for the N models
test_score – accuracy of the N models
score_time – scoring time for the N models
estimator (when return_estimator=True) – the N trained models
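
To see what this looks like, here is a minimal sketch that prints each entry of the returned dictionary. It assumes Scikit-Learn's bundled load_wine dataset as stand-in data and 5 subgroups; X_demo, y_demo and demo_results are hypothetical names.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): a small dataset bundled with scikit-learn.
X_demo, y_demo = load_wine(return_X_y=True)

demo_results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=5,
    return_estimator=True,  # also keep the 5 trained models
)

# Each entry holds one value per subgroup.
for key, value in demo_results.items():
    print(key, len(value))
```

Without return_estimator=True, the trained models are thrown away and only the timings and scores are returned.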

We run Cross Validation with 10 subgroups:

from sklearn.model_selection import cross_validate
cv_results = cross_validate(decisionTree, X_train_test, y_train_test, cv=10, return_estimator=True)

We can now display the score for each of the 10 trained models:

cv_results['test_score']

Output:
array([0.6031746 , 0.59183673, 0.61904762, 0.59183673, 0.60997732,
0.63038549, 0.59637188, 0.58276644, 0.61136364, 0.65227273])

And calculate the overall average by calling mean() on these scores.
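
A minimal sketch of that average, using the scores printed above copied by hand (in the real workflow you would call mean() on cv_results['test_score'] directly); the names test_scores and average_score are made up for the example.

```python
import numpy as np

# The 10 scores printed above, copied by hand.
test_scores = np.array([0.6031746, 0.59183673, 0.61904762, 0.59183673, 0.60997732,
                        0.63038549, 0.59637188, 0.58276644, 0.61136364, 0.65227273])

# Average accuracy over the 10 trained models.
average_score = test_scores.mean()

print(round(average_score, 4))
```

Your own numbers will differ from run to run, because both the shuffle and the split are random.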

Predicting with Cross Validation

How to use CV models for predictions?

There are different approaches depending on the practitioner:

Take the best of the N models and use it directly
Take the best of the N models and re-train it on the whole data set
Keep the N models and rely on the opinion of the majority

I want to make it clear that there is no wrong way to do this. Each method is worthwhile and will be useful depending on your particular context. The best option is to test by yourself!

After reading our article to learn Machine Learning, you should be able to use the first two options.
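
As a reminder, the first two options could look like the sketch below. It assumes Scikit-Learn's bundled load_wine dataset as stand-in data, and the names demo_results, best_model and retrained_model are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): a small dataset bundled with scikit-learn.
X_demo, y_demo = load_wine(return_X_y=True)

demo_results = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5, return_estimator=True
)

# Option 1: keep the already-trained model with the highest score.
best_index = int(np.argmax(demo_results["test_score"]))
best_model = demo_results["estimator"][best_index]

# Option 2: clone() copies the hyperparameters into a fresh, untrained model,
# which we then re-train on the whole dataset.
retrained_model = clone(best_model).fit(X_demo, y_demo)

print(best_index, retrained_model.predict(X_demo[:1]))
```

Note that with a single model type and fixed hyperparameters, as here, the "best" model differs from the others only by the data it was trained on.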

Here I will detail the third option, which is the most complex, especially since it splits into two techniques.

In the following parts, we’ll predict the result for the first wine of X_gtest.

Compute prediction for raw results

Scikit-Learn offers two options to perform prediction:

predict() – the raw results, in our case the quality of wine: 3, 4, 5, 6, 7, 8 or 9
predict_proba() – the results as probabilities

In this part, we use the predict() option.

We predict, for each of the 10 models, the quality of the first wine of our X_gtest data:

result = []
for model in cv_results['estimator']:
    # predict() returns an array with a single value: take it and store it as an int
    result.append(int(model.predict(X_gtest.iloc[:1])[0]))

Each of these results is stored in a list, which can be displayed:

result

Output: [5, 5, 5, 5, 5, 6, 6, 6, 5, 5]

The objective now is to take the prediction that has appeared the most often.

Here, we see that seven of our models conclude that the wine is of quality 5, while three of them predict 6.

We extract the most frequently predicted value…

max(set(result), key=result.count)

Output: 5

… which we can compare with the real value:

y_gtest.iloc[0]

Output: 5

Here the real value is well predicted! The majority was right!
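
If you also want to know how strong the majority is, one possible alternative is the Counter class from Python's standard library. This is only a sketch using the list of predictions shown above; the names votes, majority_quality and n_votes are made up for the example.

```python
from collections import Counter

# The 10 predictions shown above, copied by hand.
result = [5, 5, 5, 5, 5, 6, 6, 6, 5, 5]

# Count how many models voted for each quality.
votes = Counter(result)

# most_common(1) returns a list with the single most frequent (value, count) pair.
majority_quality, n_votes = votes.most_common(1)[0]

print(majority_quality, n_votes)  # 5 7
```

Knowing that 7 models out of 10 agree tells us more than the winning value alone. When two qualities receive the same number of votes, you will have to decide on a tie-breaking rule yourself.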

Compute prediction for probabilities

Finally, I’d like to use the predict_proba() option which is the most complex of all.

For our Machine Learning model, 7 levels of wine quality are possible: 3, 4, 5, 6, 7, 8 or 9.

With predict_proba() we get the probability that our wine is of each quality. For example: 20% that the wine is of quality 3, 8% for quality 4, 58% for quality 5, etc.

With our Cross Validation, we’ll obtain 10 lists of probabilities.

To calculate the prediction of the Cross Validation, we’ll sum all these probabilities together and divide the result by the number of subgroups, 10.

In other words, we average all our probabilities to determine the quality with the highest overall probability.

First, we sum the probabilities together:

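import numpy as np

# start from the probabilities given by the first model...
result_proba = cv_results['estimator'][0].predict_proba(X_gtest.iloc[:1])
# ...then add the probabilities given by each of the remaining models
for i in range(1, len(cv_results['estimator'])):
    result_proba = np.add(result_proba, cv_results['estimator'][i].predict_proba(X_gtest.iloc[:1]))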

Then we calculate the average:

result_proba = result_proba/10

We extract the index with the highest probability:

np.argmax(result_proba)

Output: 2

Here, the index is 2, which indicates a quality of 5.

Indeed, if we take our list of possible results [3, 4, 5, 6, 7, 8, 9], index 0 corresponds to quality 3, index 1 to quality 4, and index 2 to quality 5:

wine_quality = [3, 4, 5, 6, 7, 8, 9]
wine_quality[np.argmax(result_proba)]

Output: 5

Here, the end result for the raw prediction and the probabilistic prediction remains the same, but keep in mind that this is not always the case.
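
The same averaging can be written more compactly, and the mapping from index to label can be read from the model itself through its classes_ attribute instead of a hand-written list. The sketch below is one possible variant, not the only way to do it. It assumes Scikit-Learn's bundled load_wine dataset as stand-in data (3 classes instead of 7 quality levels), and the names X_cv, X_holdout, estimators and mean_proba are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): a small dataset bundled with scikit-learn.
X_demo, y_demo = load_wine(return_X_y=True)
X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0, stratify=y_demo
)

demo_results = cross_validate(
    DecisionTreeClassifier(random_state=0), X_cv, y_cv, cv=5, return_estimator=True
)
estimators = demo_results["estimator"]

# Averaging only makes sense if every model knows the same classes in the same order.
classes = estimators[0].classes_
assert all(np.array_equal(model.classes_, classes) for model in estimators)

# One probability row per model for the first held-out sample, then the mean over models.
all_proba = np.stack([model.predict_proba(X_holdout[:1]) for model in estimators])
mean_proba = all_proba.mean(axis=0)

# Map the index of the highest average probability back to its label.
predicted_label = classes[np.argmax(mean_proba, axis=1)][0]

print(mean_proba, predicted_label, y_holdout[0])
```

The check on classes_ matters on real data: if a rare class is missing from one training subgroup, that model returns fewer probability columns and the positions no longer line up.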

Conclusion

In this article, we learned how to improve the accuracy of our Machine Learning model by 1.8% and how to use Cross Validation for prediction.

Other methods exist to improve a Machine Learning model like:

Normalizing the data
Tuning the model's hyperparameters
Data Augmentation
Ensemble methods

One last thing: Cross Validation is not to be taken lightly. It is a technique used in 2022 by the best experts to push Machine Learning models to their maximum performance.

Cross Validation is even used for Deep Learning!

Soon, an article will be published on the subject.

In the meantime, if you want to stay informed, don’t hesitate to subscribe to our newsletter 😉
