Complete tutorial on Cross Validation with Implementation in python using Scikit learn (2024)



In this article we will look at the theory behind cross validation, its different types, and finally its practical implementation using Python and sklearn.

But before that, why do we need cross validation? Let's understand.

Before building any ML model, we split the dataset into a train set and a test set. The split ratio depends on how much data is available; typically the test set is 20-30% of the data and the train set is 70-80%. The accuracy of the model is then checked on the test set. But the data points that go into each set are selected at random, so the measured accuracy fluctuates from one split to the next. This is controlled by assigning a fixed value to the random_state parameter.

random_state seeds the shuffling that decides which data points go into the train set and which into the test set. Using a fixed integer (any non-negative integer works) ensures that the same split, and therefore the same result, is reproduced on every run. A different random_state gives a different split and hence a different accuracy, so a single score depends on an arbitrary choice of seed.
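To make this fluctuation concrete, here is a minimal sketch that repeats the same 70/30 split with several seeds. It assumes scikit-learn's bundled breast cancer dataset (load_breast_cancer) as a stand-in for the CSV file used later in this article, and the helper name holdout_accuracy is made up for illustration. The tree's own random_state is fixed so that only the split changes between runs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the article's CSV file.
X, y = load_breast_cancer(return_X_y=True)


def holdout_accuracy(seed):
    """Accuracy of one 70/30 hold-out split made with the given seed."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    # The tree's seed is fixed so that only the split varies between calls.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


scores = {seed: holdout_accuracy(seed) for seed in range(5)}
for seed, acc in scores.items():
    print(f"random_state={seed}: accuracy={acc:.3f}")
```

Calling holdout_accuracy twice with the same seed returns the same number, while different seeds will generally give somewhat different ones.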

We need a more reliable estimate of the accuracy of our ML model, and this is where cross validation comes in. It is a technique for evaluating ML models by training and testing a model on different subsets of the data over a number of iterations. The final score is the average over all iterations. Because every score is computed on data the model did not see during training, it also helps to detect overfitting.
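The idea of "train on one subset, test on another, then average" can be sketched by hand. The snippet below is an illustration only: it assumes the bundled load_breast_cancer dataset instead of the article's CSV, picks 5 shuffled folds arbitrarily, and compares the hand-written loop with sklearn's cross_val_score helper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the article's CSV file.
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_model = clone(model)                   # fresh, unfitted copy per iteration
    fold_model.fit(X[train_idx], y[train_idx])  # train on one subset
    # score on the subset that was held out in this iteration
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(model, X, y, cv=cv)  # same procedure in one call

print("per-iteration accuracy:", np.round(manual_scores, 3))
print("mean accuracy:", manual_scores.mean())
```

With the seeds fixed as above, both approaches should report the same per-iteration scores; the helper is simply the shorter way to write the loop.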

There are six commonly used CV methods:

  1. Hold Out Method
  2. Leave One out Cross Validation
  3. K Fold Cross Validation
  4. Stratified K Fold Cross Validation
  5. Time Series Cross Validation
  6. Repeated Random Test-Train Splits or Monte Carlo cross-validation

Let's go through them one by one:

1. Hold Out Method

This is simply splitting the data into a training set and a test set, with the training set being the larger of the two. The model is trained on the training set and the remaining test set is used for error estimation.

Disadvantages: The estimate can have high variance, because the score depends on which data points, and which patterns, happen to land in the test set. Since the model is validated on this single test set, one unlucky split can give a misleading picture of accuracy and generalization.

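2. Leave One Out Cross Validation

In this method, one data point is held out as test data and all the rest are used as training data. So for n data points we perform n iterations, and each data point is used as the test point exactly once.

Leave-P-Out Cross Validation generalizes this idea: p data points are held out for testing and the remaining n-p are used for training, for every possible choice of p points. LOOCV is the special case p = 1.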
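A small sketch of how many iterations these two schemes need, using sklearn's LeaveOneOut and LeavePOut on a made-up toy array of 5 samples:

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(10).reshape(5, 2)  # toy data: n = 5 samples, 2 features

loo = LeaveOneOut()
lpo = LeavePOut(p=2)

n_loo = loo.get_n_splits(X)  # one iteration per sample
n_lpo = lpo.get_n_splits(X)  # one iteration per possible pair of samples

for train_idx, test_idx in loo.split(X):
    print("train:", train_idx, "test:", test_idx)

print("LOOCV iterations:", n_loo)
print("Leave-2-Out iterations:", n_lpo, "= C(5, 2) =", comb(5, 2))
```

The number of Leave-P-Out iterations grows combinatorially with n, which is why it is rarely practical beyond very small datasets.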


Shortcomings:

i) High computing power is required, since the model has to be trained n times, which is expensive for large datasets.

ii) Since n-1 data points are used for training in every iteration, the estimate has low bias. But the n training sets are almost identical and each test set is a single point, so the estimate can have high variance. Because of this and the computing cost, K-fold is usually preferred nowadays, and LOOCV is mostly limited to small datasets.

3. K Fold Cross Validation

In this method, the whole dataset of n points is divided into k parts (folds) of size p = n/k. In each iteration one fold of p points is taken as test data and the other k-1 folds as training data; the next iteration uses the next fold, and so on for k iterations. For example, with 20 data points and 5-fold cross validation, 20/5 = 4, so each fold holds 4 data points. The test set is different in every iteration. Each data point is used exactly once for testing and k-1 times for training, which is what makes the method effective. Each fold gives its own accuracy, and the final accuracy is the average of these 5 accuracies. We also get the minimum and maximum accuracy of the model across folds.
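The 20-point, 5-fold example can be reproduced with sklearn's KFold. This is only a sketch on made-up toy data (the integers 0 to 19); without shuffling, the folds are consecutive blocks of 4 points.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 toy data points
kfold = KFold(n_splits=5)         # no shuffling: folds are consecutive blocks

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    test_folds.append(test_idx.tolist())
    print(f"fold {fold}: test={test_idx.tolist()}, train size={len(train_idx)}")

# Collect all test indices to check that every point is tested exactly once.
all_test = sorted(i for fold in test_folds for i in fold)
print(all_test == list(range(20)))
```

Passing shuffle=True (optionally with a random_state) to KFold assigns points to folds at random instead of in order.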


Advantages:

i) Efficient use of data, as each data point is used for both training and testing.

Low bias, because most of the data is used for training in each iteration.

Lower variance than a single hold-out split, because every data point is used in a test set once and the scores are averaged.

ii) The accuracy estimate is more reliable than one obtained from a single split.

A value of K between 5 and 10 is usually preferred, but K can be any integer from 2 up to n. The higher the value of K, the closer the method gets to LOOCV, which is the case K = n.

Disadvantages:

i) It does not work well with imbalanced datasets. Say that in a binary classification problem a test fold contains mostly instances of class 1; the score on that fold will then not reflect the real performance of the model. Similarly, in price prediction, if all the data selected for a test fold have a high price, the score will again be distorted.

To overcome this, we use stratified cross validation.

4. Stratified K Fold Cross Validation

In this method, the folds are built so that each class (yes and no, 0 and 1, or highs and lows) appears in every train and test split in roughly the same proportion as in the whole dataset, so that every fold gives a representative score.
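The difference from plain K-fold is easiest to see on a deliberately bad case. The sketch below uses made-up labels (80 zeros followed by 20 ones, sorted by class) and a hypothetical helper positive_share_per_fold to compare the share of class 1 in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels, sorted by class: 80 zeros followed by 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values are irrelevant for the split itself


def positive_share_per_fold(splitter):
    """Fraction of class 1 in the test set of each fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


plain = positive_share_per_fold(KFold(n_splits=5))
stratified = positive_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With unshuffled KFold the first four test folds contain no class 1 at all and the last one contains nothing else, while StratifiedKFold keeps the 20% share of the full dataset in every fold.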

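5. Time Series Cross Validation

This method is meant for time series data such as stock price prediction or sales prediction. The data is not shuffled: the model is always trained on past observations and tested on the ones that follow, and in each iteration the next block of observations is added to the training data.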
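sklearn provides TimeSeriesSplit for this scheme. A minimal sketch on 8 made-up observations in time order, with 3 splits chosen arbitrarily, shows the growing training window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # 8 toy observations in time order
tscv = TimeSeriesSplit(n_splits=3)

splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv.split(X)
]
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In every split the test indices come strictly after the training indices, so the model never sees the future. The splitter can be passed as cv to cross_val_score in the same way as the other splitters used in this article.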


6. Repeated Random Test-Train Splits or Monte Carlo Cross Validation

It combines ideas from the traditional train-test split and K-fold CV. The dataset is randomly split into a train set and a test set, the model is trained and its performance measured, and this whole process is repeated for the number of times we specify. The scores are then averaged, as in the other cross validation methods.


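Disadvantages:

i) It is not suitable for imbalanced datasets, since the plain random splits are not stratified.

ii) Because every split is drawn independently, some samples may never be selected for the test set, while others may be selected several times.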
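The second point can be checked by counting how often each sample ends up in a test set. The sketch below uses 10 made-up data points and 4 splits (both arbitrary choices) with sklearn's ShuffleSplit:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(-1, 1)  # 10 toy data points
s_split = ShuffleSplit(n_splits=4, test_size=0.30, random_state=0)

test_counts = np.zeros(len(X), dtype=int)
for train_idx, test_idx in s_split.split(X):
    # Within one split every point is in exactly one of the two sets.
    assert len(set(train_idx) & set(test_idx)) == 0
    assert len(train_idx) + len(test_idx) == len(X)
    test_counts[test_idx] += 1
    print("test:", sorted(test_idx.tolist()))

print("times each point was tested:", test_counts.tolist())
```

Unlike K-fold, nothing forces these counts to be equal. For the imbalance problem in the first point, sklearn also offers StratifiedShuffleSplit, which draws random splits while preserving class proportions.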

Now we implement the above techniques using Python and sklearn to build a simple ML model. The goal is only to understand the cross validation techniques, so all other hyperparameters of the classifier are left at their default values.

We use a cancer dataset to predict the diagnosis, Benign (B) or Malignant (M), on the basis of various features.

import pandas as pd
data=pd.read_csv(r'/content/drive/MyDrive/cancer_dataset.csv')
data.head()
#Counting null values in each column
data.isnull().sum()
#last column has all NaN values so we can drop that
data1=data.drop(['Unnamed: 32'],axis='columns')
### Dividing dataset into dependent & independent features
#diagnosis is the output and rest all are input features.
x=data1.iloc[:,2:]
y=data1.iloc[:,1]
#to check if the dataset is balanced or not
y.value_counts()

Now we build ML models using the different CV techniques.

1. Hold Out Validation

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
#30% of the data is held out as the test set
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=0)
model=DecisionTreeClassifier()
model.fit(x_train,y_train)
#accuracy on the held-out test set
mod_score1=model.score(x_test,y_test)
mod_score1

2. Leave One Out Cross Validation (LOOCV)

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
model=DecisionTreeClassifier()
leave_val=LeaveOneOut()
#one score (0 or 1) per data point, since each test set holds a single sample
mod_score2=cross_val_score(model,x,y,cv=leave_val)
print(np.mean(mod_score2))

3. K Fold Cross Validation

import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
model=DecisionTreeClassifier()
kfold_validation=KFold(n_splits=10)
mod_score3=cross_val_score(model,x,y,cv=kfold_validation)
#one accuracy value per fold
print(mod_score3)
#Overall accuracy of the model will be average of all values.
print(np.mean(mod_score3))

4. Stratified K-Fold Cross Validation

from sklearn.model_selection import StratifiedKFold
sk_fold=StratifiedKFold(n_splits=5)
model=DecisionTreeClassifier()
mod_score4=cross_val_score(model,x,y,cv=sk_fold)
print(np.mean(mod_score4))
print(mod_score4)

5. Repeated Random Test-Train Split

from sklearn.model_selection import ShuffleSplit
model=DecisionTreeClassifier()
s_split=ShuffleSplit(n_splits=10,test_size=0.30)
mod_score5=cross_val_score(model,x,y,cv=s_split)
print(mod_score5)
print(np.mean(mod_score5))

With this, we have covered almost every point of cross validation.

Thanks for reading !!
