Understanding Cross Validation in Scikit-Learn with cross_validate - Pierian Training (2024)

  • Updated on: 28 April 2023
  • Written by: Pierian Training

Cross-validation is a powerful technique for assessing the performance of machine learning models. It gives you a more reliable estimate of how a model will perform on unseen data by training and evaluating it on different subsets of the data. In this blog post, we’ll dive deep into the cross_validate function in the Scikit-Learn library, which allows for efficient cross-validation in Python. We’ll cover the following topics:

Table of Contents

  1. Introduction to Cross-Validation
  2. Getting Started with Scikit-Learn and cross_validate
  3. Customizing the cross_validate Function
  4. Working with Different Types of Models
  5. Handling Imbalanced Data with cross_validate
  6. Nested Cross-Validation for Model Selection
  7. Conclusion

1. Introduction to Cross-Validation

Cross-validation is a statistical method for evaluating the performance of machine learning models. In its simplest form, it involves splitting the dataset into two parts: a training set and a validation set. The model is trained on the training set, and its performance is evaluated on the validation set.

It is not recommended to learn the parameters of a prediction function and then test it on the same data. This is because a model that simply repeats the labels of the samples it has seen before would have a perfect score, but it would not be able to predict anything useful on new data. This is called overfitting. To prevent this, it is standard practice in supervised machine learning experiments to reserve a portion of available data as a test set (X_test, y_test). It’s worth noting that the term “experiment” here does not only apply to academic settings since even commercial machine learning typically begins experimentally. A typical cross-validation workflow in model training involves finding the best parameters through grid search techniques.
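As a rough illustration of reserving a test set, the sketch below uses Scikit-Learn's train_test_split on the Iris data. The 20% test size, the random seed, and the choice of logistic regression are assumptions made for this example, not recommendations from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples as a test set (an arbitrary, illustrative choice).
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The score on the training data tends to be optimistic ...
train_accuracy = model.score(X_train, y_train)
# ... so the held-out score is the one that says something about new data.
test_accuracy = model.score(X_test, y_test)

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

A single split like this depends on which samples happen to land in the test set, which is the limitation that cross-validation addresses.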

The most common form of cross-validation is k-fold cross-validation. The basic idea behind K-fold cross-validation is to split the dataset into K equal parts, where K is a positive integer. Then, we train the model on K-1 parts and test it on the remaining one. This process is repeated K times, with each of the K parts serving as the testing set exactly once.

The steps for implementing K-fold cross-validation are as follows:

  1. Split the dataset into K equally sized partitions or “folds”.
  2. For each of the K folds, train the model on the other K-1 folds and evaluate it on the held-out fold.
  3. Record the evaluation metric (such as accuracy, precision, or recall) for each fold.
  4. Compute the average performance across all K folds.
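The four steps above can be sketched by hand with the KFold splitter. This is only an illustration of the procedure, assuming the Iris dataset, a logistic regression model, K=5, and accuracy as the metric. In practice cross_validate does this work for you.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Step 1: split the data into K folds (shuffled, because Iris is sorted by class).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Step 2: fit a fresh copy of the model on K-1 folds ...
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # Step 3: ... and record its accuracy on the remaining fold.
    fold_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Step 4: average the metric across all K folds.
mean_score = np.mean(fold_scores)
print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", round(float(mean_score), 3))
```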

The main advantage of K-fold cross-validation is that it gives a more reliable estimate of a model’s performance than a single train/test split, because every data point in the dataset is used for both training and testing. This is particularly useful when the dataset is small, as it allows us to make the most of the available data. Additionally, K-fold cross-validation can help detect overfitting by providing a more representative estimate of the model’s performance on new, unseen data.

We can see the process in the diagram below:

[Diagram: K-fold cross-validation process]

2. Getting Started with Scikit-Learn and cross_validate

Scikit-Learn is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. The cross_validate function is part of the model_selection module and allows you to perform k-fold cross-validation with ease. Let’s start by importing the necessary libraries and loading a sample dataset:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    from sklearn.linear_model import LogisticRegression

    # Load the Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Create a logistic regression model
    model = LogisticRegression(max_iter=1000)

Now we can use the cross_validate function to perform 5-fold cross-validation on our dataset:

    # Perform 5-fold cross-validation
    cv_results = cross_validate(model, X, y, cv=5)

    # Print the results
    print(cv_results)

The cross_validate function returns a dictionary containing the validation score for each fold (under the key test_score), as well as the fit and score times. Training scores are not included unless you ask for them. For example, the output might look like this:

    {'fit_time': array([0.035, 0.031, 0.028, 0.027, 0.027]),
     'score_time': array([0.001, 0.001, 0.001, 0.001, 0.001]),
     'test_score': array([0.967, 1. , 0.933, 0.967, 1. ])}
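The per-fold arrays are usually condensed into a mean and a spread. A minimal sketch, assuming the same Iris data and logistic regression model as above (the exact numbers will vary with your machine and library version):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5)

# Each value in the dictionary is a NumPy array with one entry per fold.
test_scores = cv_results["test_score"]
mean_acc = test_scores.mean()
std_acc = test_scores.std()

print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(test_scores)} folds")
print(f"total fit time: {cv_results['fit_time'].sum():.3f} s")
```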

3. Customizing the cross_validate Function

The cross_validate function offers many options for customization, including the ability to specify the scoring metric, return the training scores, and use different cross-validation strategies.

3.1 Specifying the Scoring Metric

By default, the cross_validate function uses the default scoring metric for the estimator (e.g., accuracy for classification models). You can specify one or more custom scoring metrics using the scoring parameter. Here’s an example using precision, recall, and F1-score:

    from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

    # Define custom scoring metrics
    scoring = {
        'precision': make_scorer(precision_score, average='weighted'),
        'recall': make_scorer(recall_score, average='weighted'),
        'f1_score': make_scorer(f1_score, average='weighted')
    }

    # Perform 5-fold cross-validation with custom scoring metrics
    cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

    # Print the results
    print(cv_results)
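For common metrics, Scikit-Learn also accepts predefined scorer names as strings, which avoids building scorers by hand. The sketch below is one possible shorthand for the weighted metrics, again assuming the Iris data and a logistic regression model. With several metrics, the result keys take the form test_<name>.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Predefined scorer names; each one becomes a "test_<name>" key in the results.
scoring = ["precision_weighted", "recall_weighted", "f1_weighted"]
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

for name in scoring:
    scores = cv_results[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f}")
```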

3.2 Returning Training Scores

By default, the cross_validate function only returns the validation scores. You can also return the training scores by setting the return_train_score parameter to True:

    cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)
    print(cv_results)
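One common use of the training scores is to compare them with the validation scores: a large gap between the two is a sign of overfitting. The sketch below computes that gap, assuming the same data and model as before. What counts as a "large" gap depends on the problem, so no threshold is suggested here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)

train_mean = cv_results["train_score"].mean()
test_mean = cv_results["test_score"].mean()

# A training score far above the validation score suggests overfitting.
gap = train_mean - test_mean
print(f"train: {train_mean:.3f}  validation: {test_mean:.3f}  gap: {gap:.3f}")
```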

4. Working with Different Types of Models

The cross_validate function works with any estimator that implements fit and score methods, which includes most models in Scikit-Learn. Here’s an example using a support vector machine (SVM) and a random forest classifier:

    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier

    # Create an SVM model and a random forest model
    svm = SVC(kernel='linear', C=1, random_state=42)
    rf = RandomForestClassifier(n_estimators=100, random_state=42)

    # Perform 5-fold cross-validation for both models
    cv_results_svm = cross_validate(svm, X, y, cv=5)
    cv_results_rf = cross_validate(rf, X, y, cv=5)

    # Print the results
    print("SVM:", cv_results_svm)
    print("Random Forest:", cv_results_rf)
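Printing the raw dictionaries makes the models hard to compare at a glance. One way to line them up, sketched here with a plain loop over a dictionary of estimators (the dictionary and its labels are just a convenience for this example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Linear SVM": SVC(kernel="linear", C=1, random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Run the same 5-fold evaluation for every model and keep mean and spread.
summary = {}
for name, estimator in models.items():
    scores = cross_validate(estimator, X, y, cv=5)["test_score"]
    summary[name] = (scores.mean(), scores.std())

for name, (mean, std) in summary.items():
    print(f"{name:20s} accuracy: {mean:.3f} +/- {std:.3f}")
```

Because cv=5 with a classifier produces the same unshuffled stratified folds on every call, all three models are scored on identical splits.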

5. Handling Imbalanced Data with cross_validate

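When dealing with imbalanced datasets, it’s important to use cross-validation strategies that maintain the class distribution in each fold. Scikit-Learn provides the StratifiedKFold class for this purpose. (When cv is an integer and the estimator is a classifier, cross_validate already uses stratified folds, but without shuffling. Passing a StratifiedKFold instance gives you explicit control over shuffling and the random seed.) Here’s an example: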

    from sklearn.model_selection import StratifiedKFold

    # Create a stratified k-fold cross-validator
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Perform 5-fold stratified cross-validation
    cv_results = cross_validate(model, X, y, cv=stratified_cv)

    # Print the results
    print(cv_results)
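Iris is perfectly balanced, so the effect of stratification is easier to see on skewed labels. The sketch below uses made-up labels (90 samples of class 0 and 10 of class 1) and counts how many minority samples land in each test fold under plain KFold and under StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels, invented for this illustration: 90 zeros and 10 ones.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not matter for inspecting splits

plain_cv = KFold(n_splits=5, shuffle=True, random_state=42)
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Number of minority-class samples in each test fold.
plain_counts = [int(y[test_idx].sum()) for _, test_idx in plain_cv.split(X, y)]
stratified_counts = [
    int(y[test_idx].sum()) for _, test_idx in stratified_cv.split(X, y)
]

print("KFold minority counts:          ", plain_counts)
print("StratifiedKFold minority counts:", stratified_counts)
```

With stratification, each test fold holds 2 of the 10 minority samples, while the plain KFold counts depend on the shuffle and can be uneven.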

6. Nested Cross-Validation for Model Selection

Nested cross-validation is a technique for combining hyperparameter tuning with a less biased estimate of model performance. An inner cross-validation loop, run only on each outer training fold, selects the hyperparameters. The outer loop then evaluates the selected model on data that played no part in that selection, which helps to avoid an optimistic score caused by overfitting and selection bias. You can use the cross_validate function in a nested loop to perform nested cross-validation. Here’s an example using different values of the C parameter in a logistic regression model:

    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score

    # Define the outer and inner cross-validation strategies
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

    # Define the parameter grid
    C_values = [0.001, 0.01, 0.1, 1, 10, 100]

    # Nested cross-validation
    outer_scores = []
    for train_index, val_index in outer_cv.split(X, y):
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        # Start below any possible score so that best_C is always set
        best_score = float("-inf")
        best_C = None

        # Inner loop: pick the C with the best mean accuracy on the outer training set
        for C in C_values:
            model = LogisticRegression(C=C, max_iter=1000)
            inner_scores = cross_validate(model, X_train, y_train, cv=inner_cv, scoring='accuracy')['test_score']
            score = np.mean(inner_scores)
            if score > best_score:
                best_score = score
                best_C = C

        # Train the model with the best C value on the outer training set
        model = LogisticRegression(C=best_C, max_iter=1000)
        model.fit(X_train, y_train)
        val_score = accuracy_score(y_val, model.predict(X_val))
        outer_scores.append(val_score)

    # Print the average accuracy across the outer folds
    print("Average accuracy:", np.mean(outer_scores))
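The same nested scheme can also be written more compactly by wrapping the inner search in GridSearchCV and passing that object to cross_validate as the estimator. This is an alternative sketch rather than the approach used above. It assumes the same grid of C values and the same fold settings, and it uses return_estimator=True to inspect which C each outer fold selected.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X, y = load_iris(return_X_y=True)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}

# Inner loop: GridSearchCV tunes C using only the data it is fitted on.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: cross_validate refits the whole search on each outer training
# fold and scores the refitted best model on the matching held-out fold.
nested = cross_validate(
    search, X, y, cv=outer_cv, scoring="accuracy", return_estimator=True
)

nested_scores = nested["test_score"]
chosen_C = [est.best_params_["C"] for est in nested["estimator"]]

print("Outer-fold accuracy:", nested_scores.round(3))
print("C chosen per outer fold:", chosen_C)
print("Average accuracy:", nested_scores.mean().round(3))
```

If the chosen C differs a lot between outer folds, that suggests the tuning itself is unstable on this amount of data.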

7. Conclusion

In this blog post, we explored the cross_validate function in Scikit-Learn for performing cross-validation in Python. We covered how to use the function with different types of models, customize the scoring metrics, handle imbalanced data, and perform nested cross-validation for model selection. The cross_validate function is a powerful tool for assessing the performance of machine learning models and should be an essential part of your data science toolkit.
