@Machine Learning #Cross Validation
CV Concepts, types & practical implications.
Deeksha Singh · Feb 25, 2022
In this article we look at the theory behind cross validation, its different types, and finally its practical implementation using Python and sklearn.
But first, why do we need cross validation? Let's understand.
Before building any ML model, we split the dataset into a train set and a test set. The proportions depend on how much data is available. Typically:
Test set: 20–30% of the data
Train set: 70–80% of the data
The accuracy (performance) of the model is then checked on the test set. However, the data points that land in each set are selected at random, so the measured accuracy fluctuates from one split to the next. The split can be made repeatable by assigning a fixed value to the random_state parameter.
random_state seeds the shuffling that decides which rows go to the train set and which to the test set. Using a fixed integer (any non-negative integer works) ensures that the same split, and therefore the same result, is reproduced on every run. A different random_state gives a different split and hence a different accuracy, so the estimate from a single train/test split depends on the seed.
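The effect of a seed can be illustrated with a minimal sketch in plain Python. This is not sklearn's implementation; the helper `holdout_split` is a hypothetical name introduced only for illustration, and it assumes a simple shuffle-then-cut strategy.

```python
import random

def holdout_split(n_samples, test_fraction, seed):
    """Return (train_idx, test_idx) for a seeded random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # the seed fixes the shuffle order
    n_test = int(round(n_samples * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_a, test_a = holdout_split(10, 0.3, seed=0)
train_b, test_b = holdout_split(10, 0.3, seed=0)
train_c, test_c = holdout_split(10, 0.3, seed=42)

print("seed 0 :", test_a)
print("seed 0 :", test_b)  # same seed, so the same test set as the line above
print("seed 42:", test_c)  # a different seed usually gives a different test set
```

Calling the helper twice with the same seed returns identical index lists, which is the behaviour a fixed random_state is meant to provide.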
We need a more reliable way to validate the accuracy of our ML model, and this is where cross validation comes in. It is a technique for evaluating an ML model by training and testing it on different subsets of the data over a certain number of iterations. The final score of the model is the average of the scores from all iterations. It also helps detect overfitting, because a model that only memorizes its training data will score poorly on the held-out subsets.
The six most common CV methods are:
- Hold Out Method
- Leave One Out Cross Validation
- K Fold Cross Validation
- Stratified K Fold Cross Validation
- Time Series Cross Validation
- Repeated Random Test-Train Splits or Monte Carlo Cross Validation
Let's look at them one by one:
Hold Out Method
This simply splits the data into a training set and a test set, with the training set being the larger of the two. The training set is used to train the model and the remaining test set is used to estimate its error.
Disadvantages: The estimate has high variance, because it depends on which data points, and which patterns in them, happen to fall into the test set. Since the model is validated on that single test set, the measured accuracy may not reflect how well the model generalizes.
Leave One Out Cross Validation
Here, one data point is held out as test data and the rest are used as training data. For n data points we therefore perform n iterations, so that each data point is used as the test point exactly once.
Leave One Out is the special case p = 1 of Leave-P-Out Cross Validation, which holds out p data points for testing and uses the remaining n-p for training.
Shortcomings:
i) High computing power is required, since the model has to be trained once per data point, which is expensive for large datasets.
ii) Since n-1 data points are used for training in every iteration, the estimate has low bias. However, each test set contains a single point and the training sets are nearly identical, so the estimate can have high variance. For these reasons it is rarely used on large datasets today.
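To make the computational cost concrete, the following sketch enumerates the splits for a tiny dataset. The generator `leave_p_out` is a hypothetical helper written for this illustration, not a library function.

```python
from itertools import combinations
from math import comb

def leave_p_out(n_samples, p):
    """Yield (train_idx, test_idx) for every way of holding out p samples."""
    all_idx = range(n_samples)
    for test_idx in combinations(all_idx, p):
        held_out = set(test_idx)
        train_idx = [i for i in all_idx if i not in held_out]
        yield train_idx, list(test_idx)

n = 6
loo_splits = list(leave_p_out(n, p=1))  # Leave One Out: one split per sample
lpo_splits = list(leave_p_out(n, p=2))  # Leave-2-Out: one split per pair

print(len(loo_splits))              # n model fits
print(len(lpo_splits), comb(n, 2))  # "n choose p" model fits
```

The number of model fits grows linearly with n for Leave One Out and combinatorially for Leave-P-Out with p > 1, which is why both become expensive on large datasets.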
K Fold Cross Validation
Here, the whole dataset of n points is divided into k parts (folds) of size p = n/k. The first fold is used as test data in the first iteration, the next fold in the second iteration, and so on for k iterations. For example, with 20 data points and 5-fold cross validation, 20/5 = 4, so the dataset is divided into 5 folds of 4 points each, and a different fold serves as the test set in each iteration. Each data point appears once in a test set and k-1 times in a training set, which makes this method effective. Each fold gives a different accuracy, and the final accuracy is the average of these 5 accuracies. We also obtain the minimum and maximum accuracy of the model across the folds.
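The 20-point, 5-fold example can be sketched in plain Python as follows. The function `k_fold_indices` is a hypothetical helper for illustration; it uses consecutive, unshuffled folds and assumes n is divisible by k.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k consecutive, equally sized folds."""
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx

splits = list(k_fold_indices(20, 5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: test={test_idx}, train size={len(train_idx)}")
```

Each index shows up in exactly one test fold and in the training set of the other four iterations.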
Advantages:
i) Efficient use of data, as each data point is used for both training and testing.
ii) Low bias, because most of the data is used for training in each iteration.
iii) Lower variance than a single hold-out split, because every data point is used in a test set exactly once.
iv) The accuracy estimate is more reliable than one obtained from a single split.
A value of K between 5 and 10 is usually preferred, but it can take any value from 2 up to n. As K grows, the method approaches LOOCV, and K = n is exactly LOOCV.
Disadvantages:
i) An imbalanced dataset gives unreliable accuracy with this method. In a binary classification problem, for example, a test fold may consist mostly of instances of class 1, so the score will not reflect the model's performance on both classes. Similarly, in price prediction, a test fold may contain only high prices, which again distorts the accuracy.
To overcome this, we use stratified cross validation.
Stratified K Fold Cross Validation
Here, the train and test sets are sampled so that in every iteration each class (yes and no, 0 and 1, or highs and lows) appears in roughly the same proportion as in the full dataset, so that the accuracy of each fold is representative.
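One simple way to obtain such folds is to deal the samples of each class out to the folds in turn. The sketch below does this in plain Python; `stratified_folds` is a hypothetical helper and the labels are made up for the illustration, so this is not how sklearn implements stratification.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal out round-robin
    return folds

labels = [0] * 12 + [1] * 8  # made-up imbalanced labels: 60% / 40%
folds = stratified_folds(labels, k=4)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Every fold ends up with the same 60/40 class ratio as the full label list, which is the property that plain K-fold cannot guarantee.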
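Time Series Cross Validation
This is meant specifically for time series data such as stock price or sales prediction. The data is never shuffled: each iteration trains on the observations up to a point in time and tests on the ones that follow, and the training set grows as later observations are added to it sequentially.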
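The expanding training window can be sketched like this. The helper `expanding_window_splits` and its parameters are hypothetical names chosen for the illustration, assuming fixed-size test blocks taken from the end of the series.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) where training data always precedes test data."""
    first_test_start = n_samples - n_splits * test_size
    for split in range(n_splits):
        test_start = first_test_start + split * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=10, n_splits=4, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In every split the model only sees observations that come before the ones it is tested on, so no information from the future leaks into training.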
Repeated Random Test-Train Splits (Monte Carlo Cross Validation)
This combines the traditional train-test split with the repetition of K-fold CV. The dataset is randomly split into a train set and a test set, the model is trained and scored, and this process of splitting and measuring performance is repeated for the number of times we specify. The final score is the average over all repetitions.
Disadvantages:
i) It is not suitable for imbalanced datasets.
ii) Because every split is drawn independently, some samples may never be selected for a test set, while others may be selected several times.
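The uneven coverage can be seen by counting how often each sample lands in a test set. This is a minimal sketch in plain Python; `shuffle_splits` is a hypothetical helper, and the sizes and seed are arbitrary choices for the illustration.

```python
import random
from collections import Counter

def shuffle_splits(n_samples, n_splits, test_size, seed):
    """Yield (train_idx, test_idx) for independently drawn random splits."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random order for every split
        yield sorted(indices[test_size:]), sorted(indices[:test_size])

splits = list(shuffle_splits(n_samples=10, n_splits=5, test_size=3, seed=0))

test_counts = Counter()
for train_idx, test_idx in splits:
    test_counts.update(test_idx)

for idx in range(10):
    print(f"sample {idx}: in the test set {test_counts[idx]} time(s)")
```

Unlike K-fold, nothing forces the counts to be equal, so a sample can show a count of 0 while another is tested several times.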
Now we implement the above techniques using Python and sklearn by building a simple ML model. The goal is only to understand the cross validation techniques, so all hyperparameters of the classifier are left at their default values.
We use a cancer dataset to predict the type of cancer, Benign (B) or Malignant (M), on the basis of various features.
import pandas as pd
data=pd.read_csv(r'/content/drive/MyDrive/cancer_dataset.csv')
data.head()
#Checking for null values
data.isnull().sum()
#The last column has only NaN values, so we can drop it
data1=data.drop(['Unnamed: 32'],axis='columns')
#Dividing the dataset into dependent & independent features
#diagnosis is the output and the rest are input features
x=data1.iloc[:,2:]
y=data1.iloc[:,1]
#To check whether the dataset is balanced or not
y.value_counts()
Now we build ML models using the different CV techniques.
1. Hold Out Validation
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=0)
model=DecisionTreeClassifier()
model.fit(x_train,y_train)
mod_score1=model.score(x_test,y_test)
mod_score1
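To see how much the hold-out score depends on the split, the same model can be scored under several values of random_state. The sketch below is an illustration under assumptions: it uses sklearn's built-in `load_breast_cancer` data as a stand-in for the CSV file above, and it fixes the tree's own `random_state` (which the code above does not) so that only the split changes between runs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in dataset stands in for cancer_dataset.csv.
X, y = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(5):
    # Only the seed of the split changes between iterations.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed)
    model = DecisionTreeClassifier(random_state=0)  # fix the tree's own randomness
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print(scores)
print("min:", min(scores), "max:", max(scores))
```

The spread between the minimum and maximum score is the fluctuation that the cross validation techniques below average out.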
2. Leave One Out Cross Validation (LOOCV)
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
model=DecisionTreeClassifier()
leave_val=LeaveOneOut()
mod_score2=cross_val_score(model,x,y,cv=leave_val)
print(np.mean(mod_score2))
3. K Fold Cross Validation
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
model=DecisionTreeClassifier()
kfold_validation=KFold(10)
mod_score3=cross_val_score(model,x,y,cv=kfold_validation)
print(mod_score3)
#Overall accuracy of the model is the average of all values.
print(np.mean(mod_score3))
4. Stratified K-Fold Cross Validation
from sklearn.model_selection import StratifiedKFold
sk_fold=StratifiedKFold(n_splits=5)
model=DecisionTreeClassifier()
mod_score4=cross_val_score(model,x,y,cv=sk_fold)
print(np.mean(mod_score4))
print(mod_score4)
5. Repeated Random Test-Train Split
from sklearn.model_selection import ShuffleSplit
model=DecisionTreeClassifier()
s_split=ShuffleSplit(n_splits=10,test_size=0.30)
mod_score5=cross_val_score(model,x,y,cv=s_split)
print(mod_score5)
print(np.mean(mod_score5))
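Time Series Cross Validation is not implemented above, since the cancer dataset has no time ordering. As a hedged sketch, sklearn's `TimeSeriesSplit` can be tried on a toy series instead; the array `X` below is made up for the illustration and is not part of the original dataset.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series of 8 time-ordered observations.
X = np.arange(8).reshape(-1, 1)

ts_split = TimeSeriesSplit(n_splits=3)
splits = list(ts_split.split(X))
for train_idx, test_idx in splits:
    # Training indices always come before the test indices.
    print("train:", train_idx, "test:", test_idx)
```

For real time series data, the `ts_split` object can be passed as the `cv` argument of `cross_val_score` in the same way as the splitters above, provided the rows are sorted by time.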
With this, we have covered the main points of cross validation.
Thanks for reading!