In this notebook, we still use numerical features only.
Here we discuss the practical aspects of assessing the generalization performance of our model via cross-validation instead of a single train-test split.
Data preparation#
First, let’s load the full adult census dataset.
```python
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
```
We now drop the target from the data we will use to train our predictive model.
```python
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)
```
Then, we select only the numerical columns, as seen in the previous notebook.
```python
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]
```
We can now create a model using the `make_pipeline` tool to chain the preprocessing and the estimator in every iteration of the cross-validation.
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
```
The need for cross-validation#
In the previous notebook, we split the original data into a training set and a testing set. The score of a model in general depends on the way we make such a split. One downside of doing a single split is that it does not give any information about the variability of the score. Another downside, in a setting where the amount of data is small, is that the data available for training and testing would be even smaller after splitting.
Instead, we can use cross-validation. Cross-validation consists of repeating the procedure such that the training and testing sets are different each time. Generalization performance metrics are collected for each repetition and then aggregated. As a result we can assess the variability of our measure of the model's generalization performance.
Note that there exist several cross-validation strategies, each of which defines how to repeat the `fit`/`score` procedure. In this section, we use the K-fold strategy: the entire dataset is split into `K` partitions. The `fit`/`score` procedure is repeated `K` times, where at each iteration `K - 1` partitions are used to fit the model and `1` partition is used to score. The figure below illustrates this K-fold strategy.
Note
This figure shows the particular case of the K-fold cross-validation strategy. For each cross-validation split, the procedure trains a clone of the model on all the red samples and evaluates its score on the blue samples. As mentioned earlier, there is a variety of different cross-validation strategies. Some of these aspects will be covered in more detail in future notebooks.
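As a small illustration (not part of the original notebook), the index bookkeeping behind the K-fold strategy can be made explicit with scikit-learn's `KFold` splitter on a tiny toy array:

```python
import numpy as np
from sklearn.model_selection import KFold

# A toy array of 10 samples so the fold indices are easy to read.
X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
for fold_idx, (train_indices, test_indices) in enumerate(kfold.split(X)):
    # At each iteration, K - 1 = 4 partitions (8 samples) are used for
    # fitting and the remaining partition (2 samples) for scoring.
    print(f"Fold {fold_idx}: train={train_indices} test={test_indices}")
```

Each sample appears in exactly one test partition across the 5 folds.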
Cross-validation is therefore computationally intensive because it requirestraining several models instead of one.
In scikit-learn, the function `cross_validate` allows us to perform cross-validation; we need to pass it the model, the data, and the target. Since there exist several cross-validation strategies, `cross_validate` takes a parameter `cv` which defines the splitting strategy.
```python
%%time
from sklearn.model_selection import cross_validate

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, data_numeric, target, cv=5)
cv_result
```
```
CPU times: user 475 ms, sys: 251 ms, total: 726 ms
Wall time: 411 ms
```

```
{'fit_time': array([0.05962992, 0.05806112, 0.05891657, 0.05685925, 0.05641007]),
 'score_time': array([0.01352239, 0.0138371 , 0.01350832, 0.01330352, 0.01314974]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])}
```
The output of `cross_validate` is a Python dictionary, which by default contains three entries:

(i) the time to train the model on the training data for each fold, `fit_time`
(ii) the time to predict with the model on the testing data for each fold, `score_time`
(iii) the default score on the testing data for each fold, `test_score`.
Setting `cv=5` created 5 distinct splits to get 5 variations for the training and testing sets. Each training set is used to fit one model, which is then scored on the matching test set. The default strategy when setting `cv=int` is the K-fold cross-validation where `K` corresponds to the (integer) number of splits. Setting `cv=5` or `cv=10` is a common practice, as it is a good trade-off between computation time and stability of the estimated variability.
Note that by default the `cross_validate` function discards the `K` models that were trained on the different overlapping subsets of the dataset. The goal of cross-validation is not to train a model, but rather to estimate approximately the generalization performance of a model that would have been trained on the full training set, along with an estimate of the variability (uncertainty on the generalization accuracy).
You can pass additional parameters to `sklearn.model_selection.cross_validate` to collect additional information, such as the training scores of the models obtained on each round, or even return the models themselves instead of discarding them. These features will be covered in a future notebook.
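For reference, the two options just mentioned are `return_train_score` and `return_estimator`; here is a sketch on a synthetic dataset rather than the census data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data, for illustration only.
X, y = make_classification(n_samples=500, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())

cv_result = cross_validate(
    model, X, y, cv=5,
    return_train_score=True,  # adds a "train_score" entry to the dict
    return_estimator=True,    # adds the 5 fitted pipelines under "estimator"
)
print(sorted(cv_result))
```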
Let’s extract the scores computed on the test fold of each cross-validation round from the `cv_result` dictionary and compute the mean accuracy and the variation of the accuracy across folds.
```python
scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)
```
```
The mean cross-validation accuracy is: 0.800 ± 0.003
```
Note that by computing the standard deviation of the cross-validation scores, we can estimate the uncertainty of our model's generalization performance. This is the main advantage of cross-validation and can be crucial in practice, for example when comparing different models to figure out whether one is better than the other or whether our measures of the generalization performance of each model are within the error bars of one another.
In this particular case, only the first 2 decimals seem to be trustworthy. Ifyou go up in this notebook, you can check that the performance we get withcross-validation is compatible with the one from a single train-test split.
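To make the model-comparison idea concrete (a sketch on a synthetic dataset, not part of the original notebook), one can cross-validate two models and check whether their mean scores differ by more than the spread across folds:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data, for illustration only.
X, y = make_classification(n_samples=500, random_state=0)

models = {
    "dummy baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
}
mean_scores = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5)["test_score"]
    mean_scores[name] = scores.mean()
    # Report mean ± std so the two models can be compared fold-to-fold.
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

If the intervals `mean ± std` of two models overlap, the cross-validation estimate alone does not tell them apart.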
Notebook recap#
In this notebook we assessed the generalization performance of our model viacross-validation.