- Machine Learning, Tutorials
- Updated on: 28 April 2023
- Written by: Pierian Training
Cross-validation is a powerful technique for assessing the performance of machine learning models. It gives a more reliable estimate of how a model will perform on new data by training and evaluating it on different subsets of the data. In this blog post, we’ll dive deep into the cross_validate function in the Scikit-Learn library, which allows for efficient cross-validation in Python. We’ll cover the following topics:
Table of Contents
- Introduction to Cross-Validation
- Getting Started with Scikit-Learn and cross_validate
- Customizing the cross_validate Function
- Working with Different Types of Models
- Handling Imbalanced Data with cross_validate
- Nested Cross-Validation for Model Selection
- Conclusion
1. Introduction to Cross-Validation
Cross-validation is a statistical method for evaluating the performance of machine learning models. In its simplest form, it involves splitting the dataset into two parts: a training set and a validation set. The model is trained on the training set, and its performance is evaluated on the validation set.
It is not recommended to learn the parameters of a prediction function and then test it on the same data. This is because a model that simply repeats the labels of the samples it has seen before would have a perfect score, but it would not be able to predict anything useful on new data. This is called overfitting. To prevent this, it is standard practice in supervised machine learning experiments to reserve a portion of available data as a test set (X_test, y_test). It’s worth noting that the term “experiment” here does not only apply to academic settings since even commercial machine learning typically begins experimentally. A typical cross-validation workflow in model training involves finding the best parameters through grid search techniques.
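As a minimal sketch of the hold-out idea described above, the snippet below reserves 20% of a dataset as a test set and compares training and test accuracy. The Iris dataset, the 80/20 split, and the logistic regression model are assumptions chosen for illustration (they match what the later sections use), not a prescribed setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples; stratify=y keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training part only, then score on both parts
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A test accuracy far below the training accuracy is the usual sign of overfitting. A single split like this depends heavily on which samples happen to land in the test set, which is the weakness that cross-validation addresses.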
The most common form of cross-validation is K-fold cross-validation. The basic idea is to split the dataset into K parts of roughly equal size, where K is an integer of at least 2. Then, we train the model on K-1 parts and test it on the remaining one. This process is repeated K times, with each of the K parts serving as the testing set exactly once.
The steps for implementing K-fold cross-validation are as follows:
- Split the dataset into K partitions or “folds” of roughly equal size.
- For each of the K folds, train the model on the other K-1 folds and evaluate it on the remaining fold.
- Record the evaluation metric (such as accuracy, precision, or recall) for each fold.
- Compute the average performance across all K folds.
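The steps above can be sketched by hand with Scikit-Learn’s KFold splitter. This is only an illustration of the mechanics: the Iris dataset, K=5, accuracy as the metric, and the logistic regression model are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Step 1: split the data into K folds (shuffled, because Iris is sorted by class)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
tested_indices = []
for train_idx, test_idx in kf.split(X):
    # Step 2: train on the other K-1 folds, evaluate on the held-out fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    # Step 3: record the metric (accuracy here) for this fold
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    tested_indices.extend(test_idx.tolist())

# Step 4: average the metric across all K folds
mean_score = np.mean(fold_scores)
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", round(mean_score, 3))
```

Collecting tested_indices makes it easy to check that every sample is used for testing exactly once.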
The main advantage of K-fold cross-validation is that it gives a more reliable estimate of a model’s performance, because every data point is used for both training and testing (though never in the same round). This is particularly useful when the dataset is small, as it allows us to make the most of the available data. K-fold cross-validation does not prevent overfitting by itself, but it helps to detect it by providing a more representative estimate of the model’s performance on new, unseen data.
We can see the process in the diagram below:
2. Getting Started with Scikit-Learn and cross_validate
Scikit-Learn is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. The cross_validate function is part of the model_selection module and allows you to perform k-fold cross-validation with ease. Let’s start by importing the necessary libraries and loading a sample dataset:
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a logistic regression model
model = LogisticRegression(max_iter=1000)
```
Now we can use the cross_validate function to perform 5-fold cross-validation on our dataset:
```python
# Perform 5-fold cross-validation
cv_results = cross_validate(model, X, y, cv=5)

# Print the results
print(cv_results)
```
The cross_validate function returns a dictionary containing the validation score for each fold (under the key test_score), as well as the fit and score times. Training scores are not included unless you ask for them. For example, the output might look like this:
```
{'fit_time': array([0.035, 0.031, 0.028, 0.027, 0.027]),
 'score_time': array([0.001, 0.001, 0.001, 0.001, 0.001]),
 'test_score': array([0.967, 1.   , 0.933, 0.967, 1.   ])}
```
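In practice the per-fold scores are usually condensed into a mean and a standard deviation. The sketch below shows one way to do that; the variable names mean_acc and std_acc are illustrative, and the exact numbers will depend on your Scikit-Learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv_results = cross_validate(model, X, y, cv=5)

# Each value in the dictionary is a NumPy array with one entry per fold
test_scores = cv_results["test_score"]
mean_acc = test_scores.mean()
std_acc = test_scores.std()

print(f"Accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
print(f"Total fit time: {cv_results['fit_time'].sum():.3f} s")
```

The standard deviation gives a rough sense of how much the score varies from fold to fold, which a single average hides.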
3. Customizing the cross_validate Function
The cross_validate function offers many options for customization, including the ability to specify the scoring metric, return the training scores, and use different cross-validation strategies.
3.1 Specifying the Scoring Metric
By default, the cross_validate function uses the estimator’s own score method (e.g., accuracy for classification models). You can specify one or more custom scoring metrics using the scoring parameter. Here’s an example using precision, recall, and F1-score:
```python
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

# Define custom scoring metrics
scoring = {
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1_score': make_scorer(f1_score, average='weighted')
}

# Perform 5-fold cross-validation with custom scoring metrics
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

# Print the results
print(cv_results)
```
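For common metrics, the scoring parameter also accepts Scikit-Learn’s predefined scorer names as strings, which avoids building scorers with make_scorer. The sketch below uses the weighted variants to mirror the example above; with several metrics, each one appears in the result under a test_ prefix followed by its name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Predefined scorer names, passed as a list of strings
scoring = ["precision_weighted", "recall_weighted", "f1_weighted"]
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

# One result array per metric, keyed as "test_<scorer name>"
for name in scoring:
    scores = cv_results[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f}")
```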
3.2 Returning Training Scores
By default, the cross_validate function only returns the validation scores. You can also return the training scores by setting the return_train_score parameter to True:
```python
cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)
print(cv_results)
```
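One common use of the training scores is to compare them with the validation scores: a large gap suggests the model fits the training folds much better than unseen data. The snippet below is a minimal sketch of that comparison; the name gap is illustrative, and what counts as a large gap depends on the problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)

# Average training and validation accuracy across the folds
mean_train = cv_results["train_score"].mean()
mean_test = cv_results["test_score"].mean()
gap = mean_train - mean_test

print(f"mean train accuracy: {mean_train:.3f}")
print(f"mean test accuracy:  {mean_test:.3f}")
print(f"gap:                 {gap:.3f}")
```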
4. Working with Different Types of Models
The cross_validate function works with any estimator that implements a fit and score method, which includes most models in Scikit-Learn. Here’s an example using a support vector machine (SVM) and a random forest classifier:
```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Create an SVM model and a random forest model
svm = SVC(kernel='linear', C=1, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Perform 5-fold cross-validation for both models
cv_results_svm = cross_validate(svm, X, y, cv=5)
cv_results_rf = cross_validate(rf, X, y, cv=5)

# Print the results
print("SVM:", cv_results_svm)
print("Random Forest:", cv_results_rf)
```
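Because a Scikit-Learn Pipeline also implements fit and score, it can be passed to cross_validate like any other model. The sketch below adds a StandardScaler step in front of the SVM; the scaling step is an assumption added for illustration and is not part of the example above. Wrapping preprocessing in the pipeline means it is refit on the training folds of each split, so the held-out fold does not influence it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling and the classifier are treated as a single estimator
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))

cv_results_pipe = cross_validate(scaled_svm, X, y, cv=5)
pipe_scores = cv_results_pipe["test_score"]
print("Scaled SVM accuracy per fold:", pipe_scores.round(3))
```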
5. Handling Imbalanced Data with cross_validate
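When dealing with imbalanced datasets, it’s important to use cross-validation strategies that maintain the class distribution in each fold. Scikit-Learn provides the StratifiedKFold class for this purpose. When cv is an integer and the estimator is a classifier, cross_validate already uses stratified folds; passing a StratifiedKFold object explicitly lets you control shuffling and the random seed. Here’s an example: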
```python
from sklearn.model_selection import StratifiedKFold

# Create a stratified k-fold cross-validator
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform 5-fold stratified cross-validation
cv_results = cross_validate(model, X, y, cv=stratified_cv)

# Print the results
print(cv_results)
```
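The Iris dataset is balanced, so the effect of stratification is hard to see there. The sketch below uses a small synthetic label array (90 samples of class 0 and 10 of class 1, a made-up toy dataset) and counts how many minority samples end up in each test fold with plain KFold versus StratifiedKFold. The helper minority_counts is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 majority samples, 10 minority samples
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

def minority_counts(splitter):
    """Return the number of minority-class samples in each test fold."""
    return [
        int(y_toy[test_idx].sum())
        for _, test_idx in splitter.split(X_toy, y_toy)
    ]

plain_counts = minority_counts(KFold(n_splits=5, shuffle=True, random_state=42))
stratified_counts = minority_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold minority samples per fold:          ", plain_counts)
print("StratifiedKFold minority samples per fold:", stratified_counts)
```

With stratification, each of the five test folds receives two of the ten minority samples, whereas plain KFold distributes them by chance.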
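6. Nested Cross-Validation for Model Selection
Nested cross-validation is a technique for model selection and hyperparameter tuning. It uses two loops: an inner cross-validation, run only on the training portion of each outer fold, selects the hyperparameters, and an outer cross-validation evaluates the tuned model on data that played no part in that selection. This helps to avoid the optimistic selection bias that arises when the same data is used both to tune and to evaluate a model. You can use the cross_validate function in a nested loop to perform nested cross-validation. Here’s an example using different values of the C parameter in a logistic regression model: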
```python
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# Define the outer and inner cross-validation strategies
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define the parameter grid
C_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Nested cross-validation
outer_scores = []
for train_index, val_index in outer_cv.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    best_score = 0
    best_C = None
    for C in C_values:
        model = LogisticRegression(C=C, max_iter=1000)
        inner_scores = cross_validate(model, X_train, y_train, cv=inner_cv, scoring='accuracy')['test_score']
        score = np.mean(inner_scores)
        if score > best_score:
            best_score = score
            best_C = C

    # Train the model with the best C value on the outer training set
    model = LogisticRegression(C=best_C, max_iter=1000)
    model.fit(X_train, y_train)
    val_score = accuracy_score(y_val, model.predict(X_val))
    outer_scores.append(val_score)

# Print the average accuracy across the outer folds
print("Average accuracy:", np.mean(outer_scores))
```
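A more compact alternative, sketched here rather than taken from the example above, is to let GridSearchCV handle the inner loop and pass the search object itself to cross_validate as the estimator. The parameter grid and fold settings simply mirror the manual version; return_estimator=True is used so the value of C chosen in each outer fold can be inspected.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: grid search over C, run on each outer training set
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: evaluate the whole tuning procedure on held-out folds
nested_results = cross_validate(
    search, X, y, cv=outer_cv, scoring="accuracy", return_estimator=True
)

nested_scores = nested_results["test_score"]
chosen_C = [est.best_params_["C"] for est in nested_results["estimator"]]

print("Outer-fold accuracy:", nested_scores.round(3))
print("C chosen in each outer fold:", chosen_C)
print("Average accuracy:", round(nested_scores.mean(), 3))
```

The chosen C may differ between outer folds. The averaged score estimates how well the tuning procedure as a whole generalizes, not the performance of one fixed C value.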
7. Conclusion
In this blog post, we explored the cross_validate function in Scikit-Learn for performing cross-validation in Python. We covered how to use the function with different types of models, customize the scoring metrics, handle imbalanced data, and perform nested cross-validation for model selection. The cross_validate function is a powerful tool for assessing the performance of machine learning models and should be an essential part of your data science toolkit.