Model evaluation using cross-validation

In this notebook, we still use numerical features only.

Here we discuss the practical aspects of assessing the generalization performance of our model via cross-validation instead of a single train-test split.

Data preparation

First, let’s load the full adult census dataset.

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We now drop the target from the data we will use to train our predictive model.

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

Then, we select only the numerical columns, as seen in the previous notebook.
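The selection code itself is not repeated here; a minimal sketch, assuming the same four numerical columns as in the previous notebook:

numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]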

We can now create a model using the make_pipeline tool to chain the preprocessing and the estimator in every iteration of the cross-validation.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())

The need for cross-validation

In the previous notebook, we split the original data into a training set and a testing set. The score of a model in general depends on the way we make such a split. One downside of doing a single split is that it does not give any information about this variability. Another downside, in a setting where the amount of data is small, is that the data available for training and testing would be even smaller after splitting.

Instead, we can use cross-validation. Cross-validation consists of repeating the procedure such that the training and testing sets are different each time. Generalization performance metrics are collected for each repetition and then aggregated. As a result we can assess the variability of our measure of the model’s generalization performance.

Note that there exist several cross-validation strategies, each of which defines how to repeat the fit/score procedure. In this section, we use the K-fold strategy: the entire dataset is split into K partitions. The fit/score procedure is repeated K times, where at each iteration K - 1 partitions are used to fit the model and 1 partition is used to score. The figure below illustrates this K-fold strategy.

[Figure: the K-fold cross-validation strategy]

Note

This figure shows the particular case of the K-fold cross-validation strategy. For each cross-validation split, the procedure trains a clone of the model on all the red samples and evaluates the score of the model on the blue samples. As mentioned earlier, there is a variety of different cross-validation strategies. Some of these aspects will be covered in more detail in future notebooks.

Cross-validation is therefore computationally intensive because it requirestraining several models instead of one.
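To make the fit/score repetition concrete, here is a small sketch that prints the index partitions produced by a 5-fold split on a toy array of 10 samples (the toy_data array is ours, purely for illustration):

import numpy as np
from sklearn.model_selection import KFold

toy_data = np.arange(10)  # 10 samples, only to show how indices are partitioned
kfold = KFold(n_splits=5)
for fold_idx, (train_indices, test_indices) in enumerate(kfold.split(toy_data)):
    print(f"Fold {fold_idx}: train={train_indices}, test={test_indices}")

Each fold uses 8 samples to fit and 2 samples to score, mirroring the K - 1 versus 1 partition split described above.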

In scikit-learn, the function cross_validate allows us to perform cross-validation; you need to pass it the model, the data, and the target. Since several cross-validation strategies exist, cross_validate takes a parameter cv which defines the splitting strategy.

%%time
from sklearn.model_selection import cross_validate

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, data_numeric, target, cv=5)
cv_result
CPU times: user 475 ms, sys: 251 ms, total: 726 ms
Wall time: 411 ms
{'fit_time': array([0.05962992, 0.05806112, 0.05891657, 0.05685925, 0.05641007]),
 'score_time': array([0.01352239, 0.0138371 , 0.01350832, 0.01330352, 0.01314974]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])}

The output of cross_validate is a Python dictionary, which by default contains three entries (a convenient way to inspect them is shown after the list):

  • (i) the time to train the model on the training data for each fold, fit_time

  • (ii) the time to predict with the model on the testing data for each fold, score_time

  • (iii) the default score on the testing data for each fold, test_score.
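Since each entry is an array with one value per fold, wrapping the dictionary in a DataFrame is a convenient way to inspect it, for example:

import pandas as pd

cv_df = pd.DataFrame(cv_result)  # one row per fold: fit_time, score_time, test_score
print(cv_df)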

Setting cv=5 created 5 distinct splits to get 5 variations for the training and testing sets. Each training set is used to fit one model which is then scored on the matching test set. The default strategy when setting cv=int is K-fold cross-validation, where K corresponds to the (integer) number of splits. Setting cv=5 or cv=10 is a common practice, as it is a good trade-off between computation time and stability of the estimated variability.
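Equivalently, cv accepts a splitter object instead of an integer, which makes options such as shuffling explicit. A sketch (the shuffle and random_state values are our own choice, not part of the original notebook):

from sklearn.model_selection import KFold

cv_splitter = KFold(n_splits=5, shuffle=True, random_state=0)
cv_result_shuffled = cross_validate(model, data_numeric, target, cv=cv_splitter)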

Note that by default the cross_validate function discards the K models that were trained on the different overlapping subsets of the dataset. The goal of cross-validation is not to train a model, but rather to estimate approximately the generalization performance of a model that would have been trained on the full training set, along with an estimate of the variability (uncertainty on the generalization accuracy).

You can pass additional parameters to sklearn.model_selection.cross_validate to collect additional information, such as the training scores of the models obtained on each round, or even return the models themselves instead of discarding them. These features will be covered in a future notebook.
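For instance, cross_validate accepts the return_train_score and return_estimator parameters; a minimal sketch:

cv_result_extra = cross_validate(
    model,
    data_numeric,
    target,
    cv=5,
    return_train_score=True,  # adds a "train_score" array, one entry per fold
    return_estimator=True,    # adds an "estimator" entry holding the fitted pipelines
)
print(cv_result_extra.keys())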

Let’s extract the scores computed on the test fold of each cross-validation round from the cv_result dictionary and compute the mean accuracy and the variation of the accuracy across folds.

scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)
The mean cross-validation accuracy is: 0.800 ± 0.003

Note that by computing the standard deviation of the cross-validation scores, we can estimate the uncertainty of our model’s generalization performance. This is the main advantage of cross-validation and can be crucial in practice, for example when comparing different models to figure out whether one is better than the other or whether our measures of the generalization performance of each model are within the error bars of one another.
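As an illustration of such a comparison, one could cross-validate a second model and check whether the score intervals overlap. A sketch using a DummyClassifier baseline (our own choice of comparison model, not part of the original notebook):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_validate(baseline, data_numeric, target, cv=5)["test_score"]
print(f"model:    {scores.mean():.3f} ± {scores.std():.3f}")
print(f"baseline: {baseline_scores.mean():.3f} ± {baseline_scores.std():.3f}")

If the two intervals do not overlap, the difference between the models is likely meaningful rather than an artifact of the particular splits.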

In this particular case, only the first 2 decimals seem to be trustworthy. If you go up in this notebook, you can check that the performance we get with cross-validation is compatible with the one from a single train-test split.

Notebook recap

In this notebook we assessed the generalization performance of our model via cross-validation.
