Understanding Cross Validation in Scikit-Learn with cross_validate

Posted on: 13 April 2023
Updated on: 28 April 2023
Written by: Pierian Training

Cross-validation is a powerful technique for assessing the performance of machine learning models. It gives you a more reliable estimate of how a model will perform on unseen data by training and evaluating it on different subsets of the data. In this blog post, we’ll dive deep into the cross_validate function in the Scikit-Learn library, which allows for efficient cross-validation in Python. We’ll cover the following topics:

Introduction to Cross-Validation
Getting Started with Scikit-Learn and cross_validate
Customizing the cross_validate Function
Working with Different Types of Models
Handling Imbalanced Data with cross_validate
Nested Cross-Validation for Model Selection
Conclusion

1. Introduction to Cross-Validation

Cross-validation is a statistical method for evaluating the performance of machine learning models. It involves splitting the dataset into two parts, a training set and a validation set, and repeating this split several times. For each split, the model is trained on the training set, and its performance is evaluated on the validation set.

It is not recommended to learn the parameters of a prediction function and then test it on the same data. This is because a model that simply repeats the labels of the samples it has seen before would have a perfect score, but it would not be able to predict anything useful on new data. This is called overfitting. To prevent this, it is standard practice in supervised machine learning experiments to reserve a portion of available data as a test set (X_test, y_test). It’s worth noting that the term “experiment” here does not only apply to academic settings since even commercial machine learning typically begins experimentally. A typical cross-validation workflow in model training involves finding the best parameters through grid search techniques.
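As a minimal sketch of the hold-out idea described above, the snippet below reserves part of the Iris data as a test set with train_test_split and compares training and test accuracy. The 80/20 split ratio, the random seed, and the choice of logistic regression are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples as a test set (the ratio is an arbitrary choice).
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training part only
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score on data the model has seen and on data it has not seen
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy:     {test_acc:.3f}")
```

A single split like this depends on which samples happen to land in the test set, which is the limitation that K-fold cross-validation addresses.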

The most common form of cross-validation is K-fold cross-validation. The basic idea is to split the dataset into K parts of (approximately) equal size, where K is an integer of at least 2. Then, we train the model on K-1 parts and test it on the remaining one. This process is repeated K times, with each of the K parts serving as the testing set exactly once.

The steps for implementing K-fold cross-validation are as follows:

Split the dataset into K equally sized partitions or “folds”.
For each of the K folds, train the model on the other K-1 folds and evaluate it on the held-out fold.
Record the evaluation metric (such as accuracy, precision, or recall) for each fold.
Compute the average performance across all K folds.
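The four steps above can be sketched by hand with Scikit-Learn's KFold splitter. This is only an illustration of the procedure: K=5, shuffling with a fixed seed, logistic regression, and accuracy as the metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Step 1: split the data into K folds (shuffled, because Iris is sorted by class)
k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Step 2: train on K-1 folds and predict on the held-out fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])

    # Step 3: record the evaluation metric for this fold
    fold_scores.append(accuracy_score(y[test_idx], y_pred))

# Step 4: average the metric across all K folds
mean_score = float(np.mean(fold_scores))
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", round(mean_score, 3))
```

The cross_validate function automates this loop and also records timing information for each fold.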

The main advantage of K-fold cross-validation is that it allows us to obtain a more reliable estimate of a model’s performance, as it ensures that each data point in the dataset is used for both training and testing. This is particularly useful when the dataset is small, as it allows us to make the most of the available data. Additionally, K-fold cross-validation can help detect overfitting by providing a more representative estimate of the model’s performance on new, unseen data.

We can see the process in the diagram below:

[Figure: diagram of the K-fold cross-validation process]

2. Getting Started with Scikit-Learn and cross_validate

Scikit-Learn is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. The cross_validate function is part of the model_selection module and allows you to perform k-fold cross-validation with ease. Let’s start by importing the necessary libraries and loading a sample dataset:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    from sklearn.linear_model import LogisticRegression

    # Load the Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Create a logistic regression model
    model = LogisticRegression(max_iter=1000)

Now we can use the cross_validate function to perform 5-fold cross-validation on our dataset:

    # Perform 5-fold cross-validation
    cv_results = cross_validate(model, X, y, cv=5)

    # Print the results
    print(cv_results)

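The cross_validate function returns a dictionary containing the validation scores for each fold (under the key test_score), as well as the fit and score times. Training scores are included only when you ask for them with return_train_score=True. For example, the output might look like this: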

{'fit_time': array([0.035, 0.031, 0.028, 0.027, 0.027]), 'score_time': array([0.001, 0.001, 0.001, 0.001, 0.001]), 'test_score': array([0.967, 1. , 0.933, 0.967, 1. ])}
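Because every value in the returned dictionary is an array with one entry per fold, one convenient way to inspect it is to load it into a pandas DataFrame. The sketch below assumes the same Iris data and logistic regression model as above; the variable names results_df, mean_acc, and std_acc are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv_results = cross_validate(model, X, y, cv=5)

# One row per fold, one column per key in the returned dictionary
results_df = pd.DataFrame(cv_results)
print(results_df)

# Summarize the validation scores across folds
mean_acc = results_df["test_score"].mean()
std_acc = results_df["test_score"].std()
print(f"Accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the spread alongside the mean gives a rough sense of how much the score varies from fold to fold.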

3. Customizing the cross_validate Function

The cross_validate function offers many options for customization, including the ability to specify the scoring metric, return the training scores, and use different cross-validation strategies.

3.1 Specifying the Scoring Metric

By default, the cross_validate function uses the default scoring metric for the estimator (e.g., accuracy for classification models). You can specify one or more custom scoring metrics using the scoring parameter. Here’s an example using precision, recall, and F1-score:

    from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

    # Define custom scoring metrics
    scoring = {
        'precision': make_scorer(precision_score, average='weighted'),
        'recall': make_scorer(recall_score, average='weighted'),
        'f1_score': make_scorer(f1_score, average='weighted')
    }

    # Perform 5-fold cross-validation with custom scoring metrics
    cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

    # Print the results
    print(cv_results)
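For common metrics, Scikit-Learn also accepts predefined scorer names as strings, which avoids building scorers by hand. The sketch below uses the weighted variants to mirror the example above. Note that with multiple metrics the result keys take the form test_<name> rather than test_score.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Predefined scorer names; each one produces a "test_<name>" key in the results
scoring = ["precision_weighted", "recall_weighted", "f1_weighted"]
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

# Print the mean of each metric across the five folds
for name in scoring:
    scores = cv_results[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f}")
```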

3.2 Returning Training Scores

By default, the cross_validate function only returns the validation scores. You can also return the training scores by setting the return_train_score parameter to True:

    cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)
    print(cv_results)
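One common use of the training scores is to compare them with the validation scores: a large gap suggests the model is overfitting. The sketch below uses an unpruned decision tree purely as an illustration, since such a model tends to fit its training folds very closely. The estimator choice and the variable names are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned tree (illustrative choice) can memorize the training folds
tree = DecisionTreeClassifier(random_state=0)
cv_results = cross_validate(tree, X, y, cv=5, return_train_score=True)

# Compare the average score on training folds with the average on validation folds
train_mean = cv_results["train_score"].mean()
test_mean = cv_results["test_score"].mean()
gap = train_mean - test_mean

print(f"Mean training accuracy:   {train_mean:.3f}")
print(f"Mean validation accuracy: {test_mean:.3f}")
print(f"Gap:                      {gap:.3f}")
```

Computing training scores costs extra time, which is presumably why it is turned off by default.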

4. Working with Different Types of Models

The cross_validate function works with any estimator that implements a fit method, plus a score method when no scoring parameter is given, which includes most models in Scikit-Learn. Here’s an example using a support vector machine (SVM) and a random forest classifier:

    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier

    # Create an SVM model and a random forest model
    svm = SVC(kernel='linear', C=1, random_state=42)
    rf = RandomForestClassifier(n_estimators=100, random_state=42)

    # Perform 5-fold cross-validation for both models
    cv_results_svm = cross_validate(svm, X, y, cv=5)
    cv_results_rf = cross_validate(rf, X, y, cv=5)

    # Print the results
    print("SVM:", cv_results_svm)
    print("Random Forest:", cv_results_rf)
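Since the interface is the same for every estimator, several models can be compared in a single loop. The following sketch collects the mean and standard deviation of the validation accuracy for each model; the models dictionary and the summary variable are names introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate models, all evaluated with the same 5-fold procedure
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear", C=1, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

summary = {}
for name, estimator in models.items():
    scores = cross_validate(estimator, X, y, cv=5)["test_score"]
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Using the same cv setting for every model means each one is scored on identical folds, which makes the comparison fairer.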

5. Handling Imbalanced Data with cross_validate

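When dealing with imbalanced datasets, it’s important to use cross-validation strategies that maintain the class distribution in each fold. Scikit-Learn provides the StratifiedKFold class for this purpose. When cv is an integer and the estimator is a classifier, cross_validate already uses stratified folds (without shuffling). Passing a StratifiedKFold object explicitly lets you control shuffling and the random seed. Here’s an example: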

    from sklearn.model_selection import StratifiedKFold

    # Create a stratified k-fold cross-validator
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Perform 5-fold stratified cross-validation
    cv_results = cross_validate(model, X, y, cv=stratified_cv)

    # Print the results
    print(cv_results)
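The Iris dataset is balanced, so the effect of stratification is easier to see on imbalanced data. The sketch below builds a synthetic dataset with roughly a 90/10 class split using make_classification (an assumption for illustration, not data from this post) and checks that each validation fold keeps close to the overall minority-class rate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic binary dataset with roughly 10% positive samples (illustrative only)
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fraction of positive samples in the whole dataset
overall_rate = y_imb.mean()

# Fraction of positive samples in each validation fold
fold_rates = [
    y_imb[val_idx].mean() for _, val_idx in stratified_cv.split(X_imb, y_imb)
]

print(f"Overall positive rate: {overall_rate:.3f}")
for i, rate in enumerate(fold_rates, start=1):
    print(f"Fold {i} positive rate: {rate:.3f}")
```

On data like this, accuracy can look high even for a model that ignores the minority class, so it may be worth pairing stratified folds with a scoring option such as 'f1' or 'balanced_accuracy'.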

6. Nested Cross-Validation for Model Selection

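Nested cross-validation is a technique for model selection and hyperparameter tuning. It uses two loops: an outer loop splits the data into training and validation folds to estimate generalization performance, and an inner loop runs cross-validation on each outer training fold to choose the hyperparameters. Because the outer validation fold is never used for tuning, this helps to avoid an optimistically biased performance estimate. You can use the cross_validate function in a nested loop to perform nested cross-validation. Here’s an example using different values of the C parameter in a logistic regression model: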

    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score

    # Define the outer and inner cross-validation strategies
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

    # Define the parameter grid
    C_values = [0.001, 0.01, 0.1, 1, 10, 100]

    # Nested cross-validation
    outer_scores = []
    for train_index, val_index in outer_cv.split(X, y):
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        # Inner loop: pick the C value with the best mean inner-fold accuracy
        best_score = -np.inf  # any real score beats this, so best_C is always set
        best_C = None
        for C in C_values:
            model = LogisticRegression(C=C, max_iter=1000)
            inner_scores = cross_validate(model, X_train, y_train, cv=inner_cv, scoring='accuracy')['test_score']
            score = np.mean(inner_scores)
            if score > best_score:
                best_score = score
                best_C = C

        # Train the model with the best C value on the outer training set
        model = LogisticRegression(C=best_C, max_iter=1000)
        model.fit(X_train, y_train)
        val_score = accuracy_score(y_val, model.predict(X_val))
        outer_scores.append(val_score)

    # Print the average accuracy across the outer folds
    print("Average accuracy:", np.mean(outer_scores))
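A more compact way to express the same idea, offered here as an alternative sketch rather than the approach used above, is to wrap the inner search in GridSearchCV and pass that object to cross_validate as the estimator. The outer loop then refits the whole search on each outer training fold. The names search, nested_results, nested_scores, and chosen_C are introduced for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X, y = load_iris(return_X_y=True)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: grid search over C using the inner folds
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each outer fold refits the grid search on its own training data.
# return_estimator=True keeps the fitted search objects for inspection.
nested_results = cross_validate(
    search, X, y, cv=outer_cv, scoring="accuracy", return_estimator=True
)

nested_scores = nested_results["test_score"]
chosen_C = [est.best_params_["C"] for est in nested_results["estimator"]]

print("Outer-fold accuracy:", nested_scores.round(3))
print("C chosen in each outer fold:", chosen_C)
print("Average accuracy:", round(nested_scores.mean(), 3))
```

The selected C may differ between outer folds. The averaged outer score estimates the performance of the whole tuning procedure, not of one fixed hyperparameter value.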

7. Conclusion

In this blog post, we explored the cross_validate function in Scikit-Learn for performing cross-validation in Python. We covered how to use the function with different types of models, customize the scoring metrics, handle imbalanced data, and perform nested cross-validation for model selection. The cross_validate function is a powerful tool for assessing the performance of machine learning models and should be an essential part of your data science toolkit.
