Cross Validation Explained: Evaluating estimator performance. (2024)


The ultimate goal of a machine learning engineer or data scientist is to develop a model that makes predictions on new data or forecasts future events from unseen data. A good model is not one that gives accurate predictions only on the known (training) data, but one that gives good predictions on new data and avoids both overfitting and underfitting.

After completing this tutorial, you will know:

  • Why cross validation is used: it is a procedure for estimating the skill of a model on new data.
  • That there are common tactics you can use to select the value of k for your dataset.
  • That there are commonly used variations on cross-validation, such as stratified and LOOCV, that are available in scikit-learn.
  • A practical implementation of k-fold cross validation in Python.

To derive a solution, we should first understand the problem. Before we proceed to cross validation, let us first understand overfitting and underfitting.

Understanding Underfitting and Overfitting:

Overfit Model: Overfitting occurs when a statistical model or machine learning algorithm captures the noise in the data rather than the underlying pattern. Intuitively, overfitting occurs when the model or the algorithm fits the training data too well.

An overfit model achieves good accuracy on the training data set but poor results on new data sets. Such a model is of little use in the real world because it cannot predict outcomes for new cases.

Underfit Model: Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Underfitting is often the result of an excessively simple model. It can also follow from poor data preparation: missing data that is not handled properly, outliers that are left untreated, or irrelevant features (those that contribute little to predicting the target variable) that are not removed.
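The difference between the two failure modes shows up when training accuracy is compared with accuracy on held-out data. The following is a minimal sketch, not part of the original tutorial: it assumes a synthetic noisy dataset from make_classification and uses decision trees of two different depths as stand-ins for a "too simple" and a "too flexible" model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly assigned labels (an assumption for illustration)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for name, depth in [("shallow tree (max_depth=1)", 1),
                    ("unrestricted tree (max_depth=None)", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each model
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.2f}, test={results[name][1]:.2f}")
```

The unrestricted tree memorizes the training set, noise included, so its training accuracy is much higher than its test accuracy. The one-split tree scores modestly on both sets, which is the typical signature of an underfit model.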


How to Tackle the Problem of Overfitting:

The answer is cross validation.

A key challenge with overfitting, and with machine learning in general, is that we can’t know how well our model will perform on new data until we actually test it.

To address this, we can split our initial dataset into separate training and test subsets.
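A single train/test split can be sketched as follows. This is an illustrative example rather than the tutorial's own code: it assumes the copy of the Iris dataset bundled with scikit-learn, an SVC classifier with default settings, and an 80/20 split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Iris data bundled with scikit-learn (assumed here for a self-contained example)
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as a test set; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = SVC()
model.fit(X_train, y_train)

# Accuracy on data the model was not trained on
test_accuracy = model.score(X_test, y_test)
print("Hold-out accuracy:", test_accuracy)
```

A single split yields only one estimate, and that estimate depends on which rows happened to land in the test set. Cross validation repeats the split several times to reduce this dependence.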

There are different types of cross validation techniques, but the overall concept remains the same:

1. Partition the data into a number of subsets

2. Hold out one subset at a time and train the model on the remaining subsets

3. Test the model on the held-out subset

4. Repeat the process for each subset of the dataset


Types of Cross Validation:

  • k-Fold Cross Validation

  • Stratified k-Fold Cross Validation

  • Leave One Out Cross Validation

Let’s understand each type one by one

k-Fold Cross Validation:


The procedure has a single parameter called k that refers to the number of groups (folds) that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the name of the procedure, so that k=10 becomes 10-fold cross-validation.

If k=5, the dataset is divided into 5 equal parts and the process below runs 5 times, each time with a different holdout set.

1. Take one group as the holdout or test data set

2. Take the remaining groups as a training data set

3. Fit a model on the training set and evaluate it on the test set

4. Retain the evaluation score and discard the model

At the end of the above process, summarize the skill of the model using the sample of model evaluation scores, for example by taking their mean.
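The four steps above can be written out as an explicit loop. The sketch below is illustrative only: it assumes the Iris data bundled with scikit-learn, an SVC classifier, k=5, and shuffled folds (the bundled Iris rows are ordered by class, so unshuffled folds would not be representative).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# k=5 folds; shuffling is an assumption made because the rows are sorted by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
base_model = SVC()

scores = []
for train_idx, test_idx in kfold.split(X):
    # Steps 1 and 2: one group is the test set, the remaining groups are the training set
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Step 3: fit a fresh, unfitted copy of the model and evaluate it on the held-out group
    model = clone(base_model)
    model.fit(X_train, y_train)

    # Step 4: retain the score; the fitted model is discarded on the next iteration
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
print("Fold scores:", scores)
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score varies from fold to fold.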

How to decide the value of k?

The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.

A value of k=10 is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.

If a value for k is chosen that does not evenly split the data sample, then one group will contain a remainder of the examples. It is preferable to split the data sample into k groups with the same number of samples, so that the model skill scores are all computed on test sets of the same size and are directly comparable.
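The effect of a k that does not divide the sample evenly can be checked directly. This small sketch assumes a toy array of 10 samples and uses scikit-learn's KFold, which distributes the remainder over the first folds rather than leaving it all in one group.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset of 10 samples (hypothetical, only the row count matters here)
X = np.arange(10).reshape(-1, 1)

# k=3 does not divide 10 evenly, so the test folds differ in size
uneven_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=3).split(X)]
print("k=3 fold sizes:", uneven_sizes)  # [4, 3, 3]

# k=5 divides 10 evenly, so every test fold has the same size
even_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=5).split(X)]
print("k=5 fold sizes:", even_sizes)  # [2, 2, 2, 2, 2]
```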

Stratified k-Fold Cross Validation:

Stratified k-fold works the same way as k-fold cross validation, with one difference in how the folds are formed.

The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.

For example, if the folds are stratified on Gender (M or F), every fold contains M and F observations in the same proportion as the full dataset.
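A stratified split on a two-valued label such as Gender can be sketched with scikit-learn's StratifiedKFold. The label array below is hypothetical (80 "M" and 20 "F" observations), and the feature matrix is a placeholder, since only the labels determine how the folds are stratified.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 "M" and 20 "F"
y = np.array(["M"] * 80 + ["F"] * 20)
# Placeholder features; stratification only looks at y
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for _, test_idx in skf.split(X, y):
    # Count how many observations of each label fall in this test fold
    labels, counts = np.unique(y[test_idx], return_counts=True)
    fold_counts.append(dict(zip(labels.tolist(), counts.tolist())))

for i, counts in enumerate(fold_counts, start=1):
    print(f"Fold {i}: {counts}")
```

Each test fold holds 16 "M" and 4 "F" observations, preserving the 80/20 ratio of the full sample. A plain k-fold split offers no such guarantee, which matters most when one class is rare.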

Leave One Out Cross Validation (LOOCV):

This approach leaves 1 data point out of the training data, i.e. if there are n data points in the original sample, then n-1 points are used to train the model and the remaining 1 point is used as the validation set. This is repeated for every way in which the original sample can be separated like this, and the error is then averaged over all trials to give the overall effectiveness.

The number of possible combinations is equal to the number of data points in the original sample, n.
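LOOCV is available in scikit-learn as LeaveOneOut. The sketch below is an illustration under assumptions (the bundled Iris data and an SVC classifier); it confirms that the number of splits equals the number of data points and that each trial scores a single held-out point.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
# One split per data point: n splits for n samples
n_splits = loo.get_n_splits(X)
print("Number of splits:", n_splits, "Number of samples:", len(X))

# Each score is 0.0 or 1.0 because every test set contains exactly one point
scores = cross_val_score(SVC(), X, y, cv=loo, scoring="accuracy")
print("LOOCV accuracy:", scores.mean())
```

Because the model is refit n times, LOOCV becomes expensive on large datasets; here n is only 150.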


Cross Validation is a very useful technique for assessing the effectiveness of your model, particularly in cases where you need to mitigate over-fitting.

Implementation of Cross Validation In Python:

We do not need to call the fit method separately when using cross validation: the cross_val_score function fits the model itself on each training split while running the cross-validation. Below is an example of k-fold cross validation.

import pandas as pd
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Read the CSV file
data = pd.read_csv("D://RAhil//Kaggle//Data//Iris.csv")

# Create the independent (X) and dependent (y) datasets
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]
y = data['Species']

# cross_val_score fits the model on each training split internally
model = svm.SVC()
accuracy = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print(accuracy)

# Get the mean accuracy across the folds
print("Accuracy of Model with Cross Validation is:", accuracy.mean() * 100)


The Accuracy of the model is the average of the accuracy of each fold.
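The averaging step can be reproduced without a local CSV file. The variant below is a sketch, not the tutorial's own code: it assumes the copy of Iris bundled with scikit-learn, whose first three columns correspond to sepal length, sepal width and petal length.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()
# First three columns: sepal length, sepal width, petal length
X = iris.data[:, :3]
y = iris.target

# With an integer cv and a classifier, cross_val_score uses stratified folds
scores = cross_val_score(SVC(), X, y, scoring="accuracy", cv=10)

# The reported accuracy is the mean of the ten per-fold accuracies
mean_accuracy = scores.mean()
print("Per-fold accuracy:", scores)
print("Mean accuracy (%):", mean_accuracy * 100)
print("Standard deviation (%):", scores.std() * 100)
```

Printing the standard deviation next to the mean shows how much the estimate varies across folds, which a single averaged number hides.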

In this tutorial, you discovered why we need cross validation, met the different types of cross validation techniques, and worked through a practical example of the k-fold cross validation procedure for estimating the skill of machine learning models.

Specifically, you learned:

  • That cross validation is a procedure used to detect overfitting and estimate the skill of the model on new data.
  • That there are common tactics you can use to select the value of k for your dataset.
  • That there are commonly used variations on cross-validation, such as stratified and LOOCV, that are available in scikit-learn.

If you liked this blog, give it some CLAPS and SHARE it with your friends. Stay tuned for more interesting techniques and concepts of Machine Learning.
