Cross Validation - THE Tutorial How To Use it - sklearn (2024)

In this tutorial we will see how to simply use Cross Validation with Scikit-Learn and how to use it for prediction.

Cross Validation is a way to measure how well our Machine Learning model performs on data it has not seen, and so to get the best out of it.

There are only 4 steps to perform a Cross Validation:

  1. create 5 subgroups of our dataset
  2. train a model on 4 subgroups
  3. evaluate the model on the last subgroup
  4. repeat steps 2 and 3 so that all subgroups are evaluated
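The four steps above can be sketched with scikit-learn's KFold splitter. This is only a minimal illustration on toy data: the array X_toy and the random_state value are assumptions made for the demo, not part of the wine dataset used later.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumption): 10 samples, so each of the 5 subgroups holds 2 samples.
X_toy = np.arange(20).reshape(10, 2)

# Step 1: create 5 subgroups (shuffled, with a fixed seed for reproducibility).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

folds = []
for train_idx, test_idx in kfold.split(X_toy):
    # Steps 2 and 3: 4 subgroups (8 samples) would train the model,
    # the remaining subgroup (2 samples) would evaluate it.
    folds.append((train_idx, test_idx))
    print("train:", train_idx, "test:", test_idx)

# Step 4: after the loop, every sample has been in the evaluation subgroup once.
all_test_idx = np.sort(np.concatenate([test for _, test in folds]))
```

Each pass of the loop corresponds to one of the 5 models mentioned in this tutorial.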

Here, the Cross Validation will give us, at the end of the workflow, 5 different Machine Learning models.

This multiplicity of models will allow us to have a diversity in the final predictions.

In effect, Cross Validation gives us the opinion of 5 experts (5 models) instead of only one.

You can choose the number of subgroups created during Cross Validation, be it 2, 3, 5 or 40. The only constraint is to have enough data in each subgroup to get a robust model.

Once we have all these opinions, we’ll have to decide which expert to follow. This is what we will see in this article.

Let’s start by loading our data! 🔥

Data

This tutorial follows our detailed article on learning Machine Learning.

But of course, you can follow this tutorial without having read the previous one. You only have to download the dataset winequality-white.csv from this GitHub address.

Our dataset ranks wines according to their quality. The objective is to predict the quality level of wines from their features (acidity, alcohol level, pH, etc).

Once the dataset is loaded in your working environment, open it with the Pandas library:

import pandas as pd

df = pd.read_csv("winequality-white.csv", sep=";")
df.head(3)

Cross Validation (CV) divides our dataset into subgroups.

To make sure that these subgroups have a fair distribution, we first shuffle the dataset with the sample(frac=1) method:

df = df.sample(frac=1).reset_index(drop=True)

reset_index(drop=True) resets the row index after the shuffling.

Next, we prepare our features (X) and label (y) for the Cross Validation:

X = df.drop(columns='quality')
y = df['quality']

Note: Here we don’t need train and test data. Indeed in Cross Validation, each subgroup is used once for testing and N-1 times for training. It is, therefore, not necessary to indicate train and test set because all subgroups go through these stages.

Cross Validation score

Let’s load the best performing model from our article for learning Machine Learning: Decision Tree.

from sklearn import tree

decisionTree = tree.DecisionTreeClassifier()

With this model we had obtained an accuracy of 60%.

Can we do better?

We can check that directly with sklearn's cross_val_score function:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(decisionTree, X, y, cv=10)

For this evaluation we’ve chosen to perform a Cross Validation on 10 subgroups by indicating cv=10.

This allows us to train 10 different Decision Tree models.

Let’s display the result of these 10 models:

scores

Output: array([0.63265306, 0.57959184, 0.64693878, 0.6122449 , 0.65510204, 0.62040816, 0.59183673, 0.63265306, 0.63599182, 0.58282209])

Most of the models have an accuracy above 60%. This is a very good signal!

Let’s calculate the mean to know the real potential of this Cross Validation:

scores.mean()

Output: 0.619
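The mean alone hides how much the 10 scores differ from one another. As a small sketch, we can recompute the mean from the ten scores printed above and add the standard deviation next to it. The names mean_acc and std_acc are ours, chosen for this example.

```python
import numpy as np

# The ten fold accuracies printed above, copied by hand.
scores = np.array([0.63265306, 0.57959184, 0.64693878, 0.6122449,
                   0.65510204, 0.62040816, 0.59183673, 0.63265306,
                   0.63599182, 0.58282209])

mean_acc = scores.mean()  # average accuracy over the 10 folds
std_acc = scores.std()    # spread of the accuracy from fold to fold

print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

A small spread suggests the score does not depend much on which subgroup was held out.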

61.9% accuracy, that’s 1.9 percentage points more than the score obtained in the first tutorial.

The problem is that cross_val_score does not recover the trained models.

This function only tests Cross Validation on our dataset and our model.

Actually, cross_val_score enables Data Scientists and Machine Learning Engineers to know if it is worth implementing Cross Validation.
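A note on what cv=10 does behind the scenes: for a classifier, scikit-learn turns an integer cv into a StratifiedKFold splitter without shuffling, so each subgroup keeps roughly the same class proportions. The sketch below checks this on synthetic data. make_classification is only a stand-in for the wine dataset, and the names X_demo, y_demo and model are assumptions made for this demo.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the wine data (assumption, keeps the demo self-contained).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# A fixed random_state makes the tree deterministic, so both runs are comparable.
model = tree.DecisionTreeClassifier(random_state=0)

# Integer cv: scikit-learn picks the splitter for us.
scores_int = cross_val_score(model, X_demo, y_demo, cv=10)

# The same splitter, written out explicitly.
scores_explicit = cross_val_score(model, X_demo, y_demo,
                                  cv=StratifiedKFold(n_splits=10))

print(np.allclose(scores_int, scores_explicit))
```

Passing the splitter explicitly is handy if you want to control it yourself, for example with shuffle=True and a random_state.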

Training models with Cross Validation

Now that we know Cross Validation will improve our model, we can get down to business!

First, I suggest dividing our dataset in two:

  • Data for Cross Validation, which we will call train_test
  • Data for testing the final models, which we will call gtest for global test

To separate our dataset we use the train_test_split function (gtest will be composed of 10% of our dataset):

from sklearn.model_selection import train_test_split

X_train_test, X_gtest, y_train_test, y_gtest = train_test_split(X, y, test_size=0.10)
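The split above is random, so the scores will change slightly from one run to the next. If you want a reproducible split, one possible variant (not the one used for the scores in this tutorial) fixes random_state and asks for a stratified split. The sketch uses synthetic data as a stand-in for the wine dataset, and the value 42 is arbitrary. Note that stratify needs at least two rows in every class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption for this demo).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# random_state fixes the shuffle; stratify keeps class proportions similar
# in the Cross Validation data and in the global test data.
X_train_test, X_gtest, y_train_test, y_gtest = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=42, stratify=y_demo
)

# Share of each class in the full data and in the global test set.
overall_share = np.bincount(y_demo) / len(y_demo)
gtest_share = np.bincount(y_gtest, minlength=3) / len(y_gtest)
print(overall_share, gtest_share)
```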

Then let’s initialize our classifier:

from sklearn import tree

decisionTree = tree.DecisionTreeClassifier()

And now we can implement the REAL Cross Validation.

For this, it’s simple, we use the cross_validate function.

This function returns a dictionary with several pieces of information:

  • fit_time – training time for the N models
  • test_score – accuracy of the N models
  • score_time – scoring time for the N models
  • estimator (when return_estimator=True) – the N trained models
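To see these four entries for yourself, here is a small sketch on synthetic data with only 5 subgroups. The names X_demo, y_demo and demo_results are assumptions made for this demo; the real run on the wine data comes right after.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the wine data (assumption for this demo).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

demo_results = cross_validate(tree.DecisionTreeClassifier(random_state=0),
                              X_demo, y_demo, cv=5, return_estimator=True)

# Each entry holds one value per subgroup (here 5).
for key, value in demo_results.items():
    print(key, type(value).__name__, len(value))
```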

We run Cross Validation with 10 subgroups:

from sklearn.model_selection import cross_validate

cv_results = cross_validate(decisionTree, X_train_test, y_train_test, cv=10, return_estimator=True)

We can now display the score for each of the 10 trained models:

cv_results['test_score']

Output:
array([0.6031746 , 0.59183673, 0.61904762, 0.59183673, 0.60997732,
0.63038549, 0.59637188, 0.58276644, 0.61136364, 0.65227273])

And calculate the overall average:

cv_results['test_score'].mean()

Output: 0.608

We’ve gained 0.8 percentage points on the test data. It’s not much, but it’s an acceptable score.

What about the global test data that the model has never seen?

To measure our Cross Validation, we will go through each of our models (stored in cv_results['estimator']) and calculate its score on X_gtest and y_gtest:

gtest_score = []
for i in range(len(cv_results['estimator'])):
    # score each of the 10 trained models on the global test data
    gtest_score.append(cv_results['estimator'][i].score(X_gtest, y_gtest))

Here is the final score of the Cross Validation:

sum(gtest_score) / len(gtest_score)

Output: 0.618

We gain 1.8 percentage points of accuracy compared to our basic model! This is huge! 🎉

A 1.8-point improvement in accuracy may not seem like much to a novice in Machine Learning, but any expert knows it is a huge improvement!

Indeed, Machine Learning competitions are sometimes played with only 0.001% difference in accuracy.

Predicting with Cross Validation

How to use CV models for predictions?

There are different approaches depending on the practitioner:

  • Take the best of the N models and use it directly
  • Take the best of the N models and re-train it on the whole data set
  • Keep the N models and rely on the opinion of the majority

I want to make it clear that there is no wrong way to do this. Each method is worthwhile and will be useful depending on your particular context. The best option is to test by yourself!

After reading our article to learn Machine Learning, you should be able to use the first two options.
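For reference, the first two options might look like the sketch below. It runs on synthetic data so that it is self-contained. The names demo_results, best_idx, best_model and refit_model are assumptions made for this demo, and picking the "best" model by its test_score is one reasonable reading of these options, not the only one.

```python
import numpy as np
from sklearn import tree
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the wine data (assumption for this demo).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

demo_results = cross_validate(tree.DecisionTreeClassifier(random_state=0),
                              X_demo, y_demo, cv=5, return_estimator=True)

# Option 1: keep the trained model with the highest test score and use it as is.
best_idx = int(np.argmax(demo_results["test_score"]))
best_model = demo_results["estimator"][best_idx]

# Option 2: copy its hyperparameters into a fresh, untrained model
# and fit that copy on the whole dataset.
refit_model = clone(best_model).fit(X_demo, y_demo)

print(best_idx, refit_model.predict(X_demo[:1]))
```

clone() copies only the settings of the model, not what it has learned, which is why the copy has to be fitted again.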

I will detail the third option, which is the most complex, especially since it splits into two techniques.

In the following parts, we’ll predict the result for the first wine of X_gtest.

Compute prediction for raw results

Scikit-Learn offers two options to perform prediction:

  • predict() – the raw results, in our case the quality of wine: 3, 4, 5, 6, 7, 8 or 9
  • predict_proba() – the results as probabilities

In this part, we use the predict() option.

We predict, for each of the 10 models, the quality of the first wine of our X_gtest data:

result = []
for i in range(len(cv_results['estimator'])):
    # predict() returns an array with one element, so we take element 0
    result.append(int(cv_results['estimator'][i].predict(X_gtest.iloc[:1])[0]))

Each of these results is stored in a list, which can be displayed:

result

Output: [5, 5, 5, 5, 5, 6, 6, 6, 5, 5]

The objective now is to take the prediction that has appeared the most often.

Here, we see that most of our models (seven of them) conclude that the wine is of quality 5, while three of them predicted 6.

We extract the most frequently predicted value…

max(set(result), key=result.count)

Output: 5

… which we can compare with the real value:

y_gtest.iloc[0]

Output: 5

Here the real value is well predicted! The majority was right!
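An equivalent way to count the votes, which also tells us how strong the majority is, uses collections.Counter from the standard library. This is just an alternative sketch, and the names votes, majority and n_votes are ours.

```python
from collections import Counter

# The ten predictions shown above, copied by hand.
result = [5, 5, 5, 5, 5, 6, 6, 6, 5, 5]

# Count how many models voted for each quality.
votes = Counter(result)

# most_common(1) returns a list with the single most frequent (value, count) pair.
majority, n_votes = votes.most_common(1)[0]

print(majority, n_votes, dict(votes))
```

In case of a tie, most_common() returns the value that was encountered first, so you may want to decide explicitly how ties should be handled.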

Compute prediction for probabilities

Finally, I’d like to use the predict_proba() option which is the most complex of all.

For our Machine Learning model, 7 levels of wine quality are possible: 3, 4, 5, 6, 7, 8 or 9.

With predict_proba() we get the probability that our wine is of each quality. For example: 20% that the wine is of quality 3, 8% for quality 4, 58% for quality 5, etc.

With our Cross Validation, we’ll obtain 10 lists of probabilities.

To calculate the prediction of the Cross Validation, we’ll sum all these probabilities together and divide the result by the number of subgroups, 10.

In other words, we average all our probabilities to determine the quality with the highest overall probability.

First, we sum the probabilities together:

import numpy as np

# start from the probabilities of the first model, then add the 9 others
result_proba = cv_results['estimator'][0].predict_proba(X_gtest.iloc[:1])
for i in range(1, len(cv_results['estimator'])):
    result_proba = np.add(result_proba, cv_results['estimator'][i].predict_proba(X_gtest.iloc[:1]))

Then we calculate the average:

result_proba = result_proba/10

We extract the index with the highest probability:

np.argmax(result_proba)

Output: 2

Here, the index is 2, which indicates a quality of 5.

Indeed, if we take our list of possible results [3, 4, 5, 6, 7, 8, 9], the first index is 0 (quality 3), so index 2 corresponds to quality 5:

wine_quality = [3, 4, 5, 6, 7, 8, 9]
wine_quality[np.argmax(result_proba)]

Output: 5

Here, the end result for the raw prediction and the probabilistic prediction remains the same, but keep in mind that this is not always the case.
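One caveat with the probability approach: the columns of predict_proba() follow the classes_ attribute of each model, so the average only makes sense if every model was trained on the same set of classes. The sketch below, on synthetic data, checks this and reads the label from classes_ instead of a hand-written list. The names models, sample, all_proba, mean_proba and predicted are assumptions made for this demo.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the wine data (assumption for this demo).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# Train on the first 180 rows, keep the rest aside.
demo_results = cross_validate(tree.DecisionTreeClassifier(random_state=0),
                              X_demo[:180], y_demo[:180], cv=5, return_estimator=True)
models = demo_results["estimator"]
sample = X_demo[180:181]  # one row the models have never seen

# Averaging column by column is only valid if all models share the same classes.
assert all(np.array_equal(m.classes_, models[0].classes_) for m in models)

# Shape (n_models, n_classes): one row of probabilities per model.
all_proba = np.vstack([m.predict_proba(sample) for m in models])
mean_proba = all_proba.mean(axis=0)

# Map the winning column back to its label through classes_.
predicted = models[0].classes_[np.argmax(mean_proba)]
print(mean_proba, predicted)
```

With a rare class, such as a wine quality that appears only a few times, it is worth running this check on your own models before averaging.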

Conclusion

In this article, we learned how to improve the accuracy of our Machine Learning model by 1.8% and how to use Cross Validation for prediction.

Other methods exist to improve a Machine Learning model like:

  • Normalizing the data
  • Tuning the model’s hyperparameters
  • Data augmentation
  • Ensemble methods

One last thing: Cross Validation is not to be taken lightly. It is a technique used in 2022 by the best experts to push Machine Learning models to their maximum performance.

Cross Validation is even used for Deep Learning!

Soon, an article will be published on the subject.

Article information

Author: Pres. Carey Rath
