Dry beans (Phaseolus vulgaris) are one of the world's longest-cultivated crops. They are rich in protein, fiber, and various micronutrients and are widely consumed throughout the world. In some areas, such as Turkey and other Asian countries, dry beans are usually cultivated as populations containing seeds of mixed varieties. It is reported that mixed-variety seed lots decrease the market value, and classifying beans by hand is time-consuming and costly for producers[1]. Therefore, the purpose of analyzing this data set is to build suitable models that classify seven types of beans from a mixed population as precisely as possible.
In this report, six machine learning methods were applied to address the problem: linear discriminant analysis (LDA), logistic regression, k-nearest neighbors (KNN), support vector machines (SVM), decision trees, and random forests. SVM achieves the highest accuracy in classifying the seven types of beans. Such a model can help producers obtain uniform lots of dry beans.
This data set consists of morphological features extracted from digital images of different types of dry beans; the extracted features are described below. By their botanical characteristics, the beans are classified into seven types (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira).
The dataset can be found in the UCI Machine Learning Repository.
library(readxl)
library(class)
library(moments)
library(car)
library(ggplot2)
library(reshape2)
library(MASS)
library(e1071)
library(caret)
library(nnet)
library(tree)
library(randomForest)
library(maboost)
library(knitr)
library(KernelKnn)
beans <- data.frame(read_excel("./DryBeanDataset/Dry_Bean_Dataset.xlsx"))
In order to have a comprehensive view of the dataset and make it convenient for future use, I wrote a function to show the general characteristics of the data.
data_str <- function(df){
  rownames <- colnames(df)
  types <- c()
  nas <- c()
  na_ratio <- c()
  means <- c()
  stds <- c()
  skews <- c()
  kurts <- c()
  # summary statistics for each predictor (NA for non-numeric columns)
  for(p in 1:dim(df)[2]){
    if(typeof(df[,p]) == "character"){
      sk <- NA; ku <- NA; me <- NA; st <- NA
    } else {
      sk <- round(skewness(df[,p]), 6)
      ku <- round(kurtosis(df[,p]), 6)
      me <- round(mean(df[,p]), 6)
      st <- round(sd(df[,p]), 6)
    }
    types <- append(types, typeof(df[,p]))
    nas <- append(nas, sum(is.na(df[,p])))
    na_ratio <- append(na_ratio, nas[p]/dim(df)[1])
    skews <- append(skews, sk)
    kurts <- append(kurts, ku)
    means <- append(means, me)
    stds <- append(stds, st)
  }
  res <- data.frame(cbind(rownames, types, nas, na_ratio, means, stds, skews, kurts))
  colnames(res) <- c("", "types", "NA", "NA_ratio", "mean", "std", "skewness", "kurtosis")
  return(res)
}
res <- data_str(beans);res
## types NA NA_ratio mean std skewness
## 1 Area double 0 0 53048.284549 29324.095717 2.95228
## 2 Perimeter double 0 0 855.283459 214.289696 1.625765
## 3 MajorAxisLength double 0 0 320.141867 85.694186 1.357516
## 4 MinorAxisLength double 0 0 202.270714 44.970091 2.237717
## 5 AspectRation double 0 0 1.583242 0.246678 0.582445
## 6 Eccentricity double 0 0 0.750895 0.092002 -1.06259
## 7 ConvexArea double 0 0 53768.200206 29774.915817 2.941173
## 8 EquivDiameter double 0 0 253.06422 59.17712 1.948528
## 9 Extent double 0 0 0.749733 0.049086 -0.895151
## 10 Solidity double 0 0 0.987143 0.00466 -2.549531
## 11 roundness double 0 0 0.873282 0.05952 -0.635609
## 12 Compactness double 0 0 0.799864 0.061713 0.037107
## 13 ShapeFactor1 double 0 0 0.006564 0.001128 -0.534023
## 14 ShapeFactor2 double 0 0 0.001716 0.000596 0.30116
## 15 ShapeFactor3 double 0 0 0.64359 0.098996 0.242427
## 16 ShapeFactor4 double 0 0 0.995063 0.004366 -2.758875
## 17 Class character 0 0 <NA> <NA> <NA>
## kurtosis
## 1 10.794379
## 2 3.585397
## 3 2.529719
## 4 6.646765
## 5 0.112874
## 6 1.385861
## 7 10.737234
## 8 5.188506
## 9 0.642107
## 10 12.792158
## 11 0.373232
## 12 -0.224226
## 13 0.713106
## 14 -0.859694
## 15 -0.145282
## 16 13.030482
## 17 <NA>
We can notice that there are no missing values in this data set.
Area (A): The area of a bean zone and the number of pixels within its boundaries.
Perimeter (P): Bean circumference is defined as the length of its border.
Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.
Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
Aspect ratio (K): Defines the relationship between L and l.
Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.
Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area: \(\sqrt{4A/\pi}\)
Extent (Ex): The ratio of the bean area to the area of its bounding box.
Solidity (S): Also known as convexity. The ratio of the bean-area pixels to those in its convex hull.
Roundness (R): Calculated with the following formula: \((4\pi A/P^2)\)
Compactness (CO): Measures the roundness of an object: \(Ed/L\)
ShapeFactor1 (SF1) : \(\frac{L}{A}\)
ShapeFactor2 (SF2) : \(\frac{A}{L^3}\)
ShapeFactor3 (SF3) : \(\frac{A}{(L/2)(L/2)\pi}\)
ShapeFactor4 (SF4) : \(\frac{A}{(L/2)(l/2)\pi}\)
Class (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
These morphological features are extracted from images captured by a computer vision system[2]. Papers on wheat kernel morphological variation point out that the above features are informative in similar grain classification problems, such as rice and wheat grains[3].
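As a quick sanity check (a minimal sketch, assuming the UCI column names loaded above), we can recompute a few of the documented formulas from the raw measurements and compare them with the stored columns:
# Recompute a few documented features from the raw measurements (first 3 rows);
# close agreement confirms the formulas above match the stored columns.
head(with(beans, cbind(
  EquivDiameter, Ed.check  = sqrt(4 * Area / pi),                   # sqrt(4A/pi)
  roundness,     R.check   = 4 * pi * Area / Perimeter^2,           # 4*pi*A/P^2
  ShapeFactor3,  SF3.check = Area / ((MajorAxisLength / 2)^2 * pi)  # A/((L/2)^2*pi)
)), 3)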
Scale data
b.s <- scale(beans[-17])
cor_heatmap <- function(mydata){
  # Compute the correlation matrix
  cormat <- round(cor(mydata), 2)
  # Keep only the lower triangle of the correlation matrix
  cormat[upper.tri(cormat)] <- NA
  lower_tri <- cormat
  # Draw the correlation matrix heatmap
  melted_cormat <- melt(lower_tri, na.rm = TRUE)
  ggplot(data = melted_cormat, aes(Var1, Var2, fill = value)) +
    geom_tile(color = "white") +
    scale_fill_gradient2(low = "#267373", high = "#bf5940", mid = "white",
                         midpoint = 0, limit = c(-1,1), space = "Lab",
                         name = "Pearson\nCorrelation") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                     size = 12, hjust = 1)) +
    coord_fixed()
}
Correlation Heat Map
cor_heatmap(b.s)
The first normality plot is for the Area predictor, and we can see it is obviously not normal. Even after a Box-Cox transformation, the plot is still far from normal.
test <- beans[,1]
a1 <- qqPlot(test)
bc <- boxcox(test~1,lambda = seq(-5,1,0.1),plotit = TRUE)
lambda <- bc$x[which.max(bc$y)]
trans <- bcPower(test,lambda)
a2 <- qqPlot(trans)
Linear discriminant analysis (LDA) is a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.
LDA works when the measurements made on independent variables for each
observation are continuous quantities, which the data set here
satisfies.
The main idea of multi-class LDA is to find a coefficient vector \(w\) that maximizes the ratio of the between-class scatter to the within-class scatter, which is an optimization problem that can be written as: \[
\max_w \frac{w^TS_bw}{w^TS_ww}
\] The solution is given by solving the generalized eigenproblem \(S_bw = \lambda S_ww\).
\(S_b\): between-class scatter
matrix.
\(S_w\): within-class scatter
matrix.
Generally, at most m - 1 generalized eigenvectors are useful to
discriminate between m classes.
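To make the optimization concrete, below is a minimal illustrative sketch that builds the scatter matrices defined above and extracts the discriminant directions by solving the generalized eigenproblem directly. This is not what MASS::lda (used later) does internally; it is only a demonstration of the math.
# Illustrative sketch: solve Sb w = lambda Sw w via eigen(solve(Sw) %*% Sb).
X  <- as.matrix(beans[, -17])
cl <- factor(beans[, 17])
mu <- colMeans(X)
p  <- ncol(X)
Sw <- matrix(0, p, p); Sb <- matrix(0, p, p)
for (g in levels(cl)) {
  Xg  <- X[cl == g, , drop = FALSE]
  mug <- colMeans(Xg)
  Sw  <- Sw + crossprod(sweep(Xg, 2, mug))     # within-class scatter
  Sb  <- Sb + nrow(Xg) * tcrossprod(mug - mu)  # between-class scatter
}
# at most m - 1 = 6 useful directions for 7 classes
W <- Re(eigen(solve(Sw) %*% Sb)$vectors)[, 1:(nlevels(cl) - 1)]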
LDA makes two important assumptions: 1. each class density is multivariate Gaussian, and 2. the classes share a common covariance matrix.
LDA and PCA are similar in the sense that both of them reduce the
data dimensions but LDA provides better separation between groups of
experimental data compared to PCA. This is because LDA models the
differences between the classes of data, whereas PCA does not take
account of these differences.
LDA may not have ideal performance if the assumptions of LDA are not met.
Train 80%, test 20%.
set.seed(723)
ind.train <- sample(1:nrow(beans),round(0.8*nrow(beans)))
lda.train <- beans[ind.train,]
lda.test <- beans[-ind.train,]
The pie chart below shows the imbalanced samples across the different bean types; "BOMBAY" has the smallest portion in this dataset. It is reported that imbalanced datasets have a negative effect on the performance of LDA (Jigang Xie, Zhengding Qiu)[4]. There are several sampling methods to solve the imbalance problem: we can use random oversampling, following the paper[4], or stratified sampling. Considering that oversampling may introduce new problems, stratified sampling is the simpler and easier choice.
# separate data into classes
SE <- subset(lda.train,Class == "SEKER")
BA <- subset(lda.train,Class == "BARBUNYA")
BO <- subset(lda.train,Class == "BOMBAY")
CA <- subset(lda.train,Class == "CALI")
HO <- subset(lda.train,Class == "HOROZ")
SI <- subset(lda.train,Class == "SIRA")
DE <- subset(lda.train,Class == "DERMASON")
sep.samples <- list(SE,BA,BO,CA,HO,SI,DE)
# count the samples in each class
pie.se <- dim(SE)[1]
pie.ba <- dim(BA)[1]
pie.bo <- dim(BO)[1]
pie.ca <- dim(CA)[1]
pie.ho <- dim(HO)[1]
pie.si <- dim(SI)[1]
pie.de <- dim(DE)[1]
num_samples <- c(pie.se,pie.ba,pie.bo,pie.ca,pie.ho,pie.si,pie.de)
pie.labs <- c("SEKER","BARBUNYA","BOMBAY","CALI","HOROZ","SIRA","DERMASON")
ggdata <- data.frame(X = pie.labs,Y = num_samples)
# plot pie chart
piechart <- ggplot(data=ggdata,aes(x="", y=Y, fill=X)) +
geom_bar(width=1, stat="identity") +
coord_polar("y", start=0) +
geom_text(aes(label = paste0(Y,
" (",
scales::percent(Y / sum(Y)),
")")),
position = position_stack(vjust = 0.5)) +
xlab("") +
ylab("Value") +
theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())
piechart
By using stratified sampling, it is guaranteed that a fixed percentage of the samples of each class will be included in each fold of the training data.
# sampling
all <- lda.train
sep.samples <- list(SE,BA,BO,CA,HO,SI,DE)
set.seed(12)
ind <- sample(1:dim(all)[1])
# folds
folds <- 10
# result
stratified.samples <- data.frame()
# used for update
resp <- sep.samples
# iterate through data that has been classified by type for k(folds) times
# get a certain portion from each class population and then update it
for(k in 1:folds){
stratified.samples.k <- data.frame()
for(t in 1:length(resp)){
if( k == folds){
# for the last run, get the rest data
str.samples.i <- as.data.frame(resp[t])
stratified.samples.k <- rbind(stratified.samples.k,str.samples.i)
resp[t] <- list(data.frame())
}else{
resp.df <- as.data.frame(resp[t])
set.seed(2010)
# number of original class samples
s.i <- dim(data.frame(sep.samples[t]))[1]
n.i <- dim(data.frame(resp[t]))[1]
ind.i <- sample(1:n.i,floor(s.i/folds))
str.samples.i <- resp.df[ind.i,]
stratified.samples.k <- rbind(stratified.samples.k,str.samples.i)
resp[t] <- list(resp.df[-ind.i,])
}
}
set.seed(2021)
# random sample in each fold
ind.k <- sample(1:dim(stratified.samples.k)[1],dim(stratified.samples.k)[1])
stratified.samples <- rbind(stratified.samples,stratified.samples.k[ind.k,])
}
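For reference, caret (already loaded) provides createFolds, which performs this stratified fold assignment directly; a sketch of the equivalent, more concise approach:
# Alternative sketch: stratified fold assignment with caret::createFolds,
# which preserves the class proportions within each fold.
set.seed(12)
fold.id <- caret::createFolds(factor(lda.train$Class), k = folds, list = FALSE)
# check: class proportions are roughly equal across folds
round(prop.table(table(fold.id, lda.train$Class), margin = 1), 3)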
Below is a function with which we can apply, say, a 10-fold cross-validation LDA method to a data set.
# lda cv func
lda_cv <- function(x,cl,k) {
# x : training data
# cl : labels
# k : number of folds
n <- dim(x)[1]
cl <- as.factor(unlist(cl))
# runs
runs <- n %/% k
rem <- n %% k
preds <- c()
# calculate each part
for (i in 1:k) {
# test sub index
s.low <- (i-1)*runs + 1
s.high <- i*runs
# plus remainder in last run
if(i==k){
s.high <- i*runs + rem
}
# test set
test.i <- x[s.low:s.high,]
# train set
train.i <- x[ -s.low:-s.high, ]
# train label
train.lab.i <- cl[-s.low:-s.high]
# apply lda
zi <- lda(train.i,grouping = train.lab.i)
pred.i <- predict(zi,test.i)$class
preds <- append(preds,pred.i)
}
return(mean(preds != cl))
}
lda_cv(lda.train[,-17],lda.train[17],10)
## [1] 0.09505005
lda_cv(stratified.samples[,-17],stratified.samples[17],10)
## [1] 0.09468271
The error rate improved a little after applying stratified sampling.
# lda model
lda.train.model <- lda(lda.train[,-17],grouping = as.factor(lda.train[,17]));lda.train.model
## Call:
## lda(lda.train[, -17], grouping = as.factor(lda.train[, 17]))
##
## Prior probabilities of groups:
## BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
## 0.09909083 0.03802002 0.12076407 0.25998714 0.14170264 0.14950868 0.19092662
##
## Group means:
## Area Perimeter MajorAxisLength MinorAxisLength AspectRation
## BARBUNYA 69635.23 1044.5266 369.5023 240.0512 1.543494
## BOMBAY 173966.72 1587.6997 594.1228 374.7352 1.586388
## CALI 75510.91 1057.4184 409.2946 236.3801 1.732633
## DERMASON 32028.41 664.2551 246.2582 165.3972 1.491065
## HOROZ 53670.72 920.2530 372.8479 184.1156 2.028204
## SEKER 39865.47 727.3007 251.2236 201.8871 1.244990
## SIRA 44685.40 796.1663 299.3148 190.6648 1.570867
## Eccentricity ConvexArea EquivDiameter Extent Solidity roundness
## BARBUNYA 0.7544642 70849.64 296.9490 0.7496429 0.9828458 0.8005454
## BOMBAY 0.7707459 176303.92 469.5669 0.7767922 0.9869174 0.8645735
## CALI 0.8145843 76663.67 309.4691 0.7584368 0.9849875 0.8459386
## DERMASON 0.7368492 32407.27 201.4022 0.7527180 0.9882247 0.9081119
## HOROZ 0.8678106 54461.61 260.7932 0.7073376 0.9854818 0.7941008
## SEKER 0.5847062 40253.71 224.9061 0.7715761 0.9903483 0.9449821
## SIRA 0.7676287 45228.71 238.2270 0.7495036 0.9879727 0.8843964
## Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4
## BARBUNYA 0.8052266 0.005361971 0.0013969972 0.6494858 0.9958101
## BOMBAY 0.7924492 0.003438468 0.0008429636 0.6289350 0.9918752
## CALI 0.7569268 0.005458558 0.0011087572 0.5733574 0.9906071
## DERMASON 0.8189590 0.007767047 0.0021625004 0.6713938 0.9969159
## HOROZ 0.7004834 0.007008403 0.0010451582 0.4912143 0.9919209
## SEKER 0.8968908 0.006335015 0.0025413119 0.8052300 0.9983766
## SIRA 0.7971260 0.006724178 0.0016816646 0.6359963 0.9953882
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## Area 6.370327e-04 -8.548769e-04 -2.149029e-03 -1.236161e-03
## Perimeter 5.147173e-02 -3.456127e-02 -1.543476e-03 5.452820e-03
## MajorAxisLength 4.656642e-01 -4.031309e-01 -8.874835e-01 1.042715e-02
## MinorAxisLength 4.991452e-01 -2.925348e-01 -1.860430e+00 -1.544286e-01
## AspectRation -2.335707e+01 3.647306e+00 -1.072802e+02 1.111226e+02
## Eccentricity -8.827933e+00 4.201642e+01 1.377181e+02 -1.515717e+02
## ConvexArea -5.201061e-04 6.688401e-04 1.884725e-03 7.505564e-04
## EquivDiameter -1.367585e+00 9.226504e-01 2.746601e+00 3.956364e-01
## Extent -1.065512e+00 4.548037e-01 8.428483e-01 2.051990e+00
## Solidity -1.657671e+01 8.725235e+01 9.048091e+01 1.104365e+01
## roundness 2.963510e+01 -2.779383e+01 -9.412537e+00 -7.631068e+00
## Compactness 7.565722e-01 -7.744607e+02 -3.220193e+03 3.252973e+03
## ShapeFactor1 -1.974251e+03 -4.402880e+03 3.985271e+03 3.175435e+03
## ShapeFactor2 -9.496369e+03 1.425541e+04 -1.498958e+04 1.888510e+03
## ShapeFactor3 -2.908922e+00 3.833879e+02 2.035491e+03 -1.849051e+03
## ShapeFactor4 1.496561e+02 -1.755406e+02 -1.945698e+02 -1.008880e+02
## LD5 LD6
## Area -4.784732e-04 4.720198e-04
## Perimeter 8.342181e-02 -3.129432e-02
## MajorAxisLength -5.933094e-01 -3.799290e-01
## MinorAxisLength -8.253265e-01 -3.815315e-01
## AspectRation 8.537150e+01 9.520055e+01
## Eccentricity 1.404652e+01 -8.523257e+01
## ConvexArea 6.877649e-04 3.149340e-04
## EquivDiameter 9.894705e-01 2.850796e-01
## Extent -8.391612e-01 -8.234672e-01
## Solidity 4.376266e+01 6.272076e+01
## roundness 1.758497e+01 -1.669280e+01
## Compactness 7.161696e+02 1.496363e+03
## ShapeFactor1 8.675478e+02 -1.024055e+04
## ShapeFactor2 -8.608727e+03 -3.411115e+03
## ShapeFactor3 -1.954528e+02 -8.459609e+02
## ShapeFactor4 -2.371495e+02 -2.385934e+02
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5 LD6
## 0.5350 0.2204 0.1070 0.0922 0.0305 0.0148
There are 6 discriminant functions for classification; the proportion of trace above gives the percentage of separation achieved by each. The first three cover 86.2% of the separation. The result is not ideal, mainly because the data do not satisfy the LDA assumptions.
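A quick check of the 86.2% figure, reading the proportion-of-trace line off the model output above:
# cumulative separation captured by the leading discriminant functions
prop.trace <- c(0.5350, 0.2204, 0.1070, 0.0922, 0.0305, 0.0148)
cumsum(prop.trace)  # the first three sum to 0.8624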
# training error
pred.lda.train <- predict(lda.train.model,lda.train[,-17])$class
lda.train.err <- mean(pred.lda.train != lda.train[,17]);lda.train.err
## [1] 0.09459087
# test error
pred.lda.test <- predict(lda.train.model,lda.test[,-17])$class
lda.test.err <- mean(pred.lda.test != lda.test[,17]);lda.test.err
## [1] 0.09772226
Train error: 0.0945909, test error: 0.0977223.
# confusion matrix
lda.cm <- caret::confusionMatrix(pred.lda.test,as.factor(lda.test[,17]));lda.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
## BARBUNYA 201 0 1 2 0 5 1
## BOMBAY 0 107 0 0 0 0 0
## CALI 23 1 306 0 12 0 2
## DERMASON 0 0 0 600 3 4 28
## HOROZ 1 0 4 0 357 0 4
## SEKER 2 0 0 6 0 366 3
## SIRA 16 0 4 107 13 24 519
##
## Overall Statistics
##
## Accuracy : 0.9023
## 95% CI : (0.8905, 0.9132)
## No Information Rate : 0.2627
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8815
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity 0.82716 0.99074 0.9714 0.8392
## Specificity 0.99637 1.00000 0.9842 0.9826
## Pos Pred Value 0.95714 1.00000 0.8895 0.9449
## Neg Pred Value 0.98328 0.99962 0.9962 0.9449
## Prevalence 0.08927 0.03968 0.1157 0.2627
## Detection Rate 0.07384 0.03931 0.1124 0.2204
## Detection Prevalence 0.07715 0.03931 0.1264 0.2333
## Balanced Accuracy 0.91176 0.99537 0.9778 0.9109
## Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity 0.9273 0.9173 0.9318
## Specificity 0.9961 0.9953 0.9242
## Pos Pred Value 0.9754 0.9708 0.7599
## Neg Pred Value 0.9881 0.9859 0.9814
## Prevalence 0.1414 0.1466 0.2046
## Detection Rate 0.1312 0.1345 0.1907
## Detection Prevalence 0.1345 0.1385 0.2509
## Balanced Accuracy 0.9617 0.9563 0.9280
Logistic regression is a model used to estimate the posterior probabilities of the K classes via linear functions in x, while at the same time ensuring that they sum to one and remain in [0,1]. The model has the form:
\[ Pr(G=k|X=x) = \frac{\exp(\beta_{k0} + \beta_k^Tx)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^Tx)}, \quad k = 1,\dots,K-1 \]
Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of G given X.
\[ L(\theta) = \sum_{i=1}^N \log p_{g_i}(x_i;\theta), \] where \(p_k(x_i;\theta) = Pr(G=k|X=x_i;\theta).\)
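As a small illustration of the posterior above, the sketch below evaluates the K class probabilities from a set of coefficients; the coefficient values are hypothetical placeholders, not the fitted model.
# Softmax posterior with class K as the reference (its score is exp(0) = 1).
softmax_probs <- function(B0, B, x) {
  scores <- exp(B0 + as.vector(B %*% x))  # one score per non-reference class
  c(scores, 1) / (1 + sum(scores))        # length-K probabilities, sum to 1
}
# e.g., K = 3 classes, p = 2 features, made-up coefficients:
softmax_probs(B0 = c(0.1, -0.2), B = rbind(c(1, -1), c(0.5, 2)), x = c(0.3, 0.7))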
It makes no assumptions about distributions of classes in feature
space.
It achieves good accuracy on many simple data sets and performs well when the data set is linearly separable.
Logistic regression is less inclined to overfit on low-dimensional data.
Its model coefficients can be interpreted as indicators of feature importance.
The major limitation of logistic regression is the assumption of linearity between the dependent variable and the independent variables.
It can only be used to predict discrete outcomes (class labels).
Take 20% data for testing, 80% for training.
set.seed(7105)
# sampling
number_test <- floor(dim(beans)[1]/5)
index.list <- 1:dim(beans)[1]
sample.index.test <- sample(index.list,number_test)
sample.index.train <- index.list[-sample.index.test]
sample.index.train <- sample(sample.index.train,length(sample.index.train))
lg.sample.test <- beans[sample.index.test,]
lg.sample.train <- beans[sample.index.train,]
lg.model <- multinom(Class ~ ., data = lg.sample.train)
## # weights: 126 (102 variable)
## initial value 21189.015613
## iter 10 value 15950.879558
## iter 20 value 12477.784498
## iter 30 value 11271.034033
## iter 40 value 4628.086901
## iter 50 value 2389.189924
## iter 60 value 2278.372612
## iter 70 value 2240.838383
## iter 80 value 2217.585964
## iter 90 value 2210.672006
## iter 100 value 2205.986818
## final value 2205.986818
## stopped after 100 iterations
# training error
lg.pred <- predict(lg.model,lg.sample.train[,-17])
lg.train.error <- mean(lg.pred != as.factor(lg.sample.train[,17]));lg.train.error
## [1] 0.07227477
# test error
lg.pred <- predict(lg.model,lg.sample.test[,-17])
lg.test.error <- mean(lg.pred != as.factor(lg.sample.test[,17]));lg.test.error
## [1] 0.07861866
The training error is 0.07227477 and the test error is 0.07861866.
The k-nearest neighbors algorithm (k-NN) is a non-parametric classification method. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.
It does not make any strict assumptions about the data.
It does not learn anything in the training period. It just stores the
training data set and learns from it only at the time of making real
time predictions.
KNN is sensitive to the local structure of the data. The larger classes tend to dominate the prediction, as their examples show up more frequently in the neighborhood.
We need to do feature scaling before applying the KNN algorithm to any data set; otherwise features with large magnitudes dominate the distance computation, as the illustration below shows.
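A quick look at the raw feature scales in this data set shows why:
# Without scaling, Euclidean distances are dominated by large-scale features:
# Area's standard deviation is several orders of magnitude larger than Solidity's.
apply(beans[, -17], 2, sd)[c("Area", "Solidity")]
apply(scale(beans[, -17]), 2, sd)[c("Area", "Solidity")]  # both 1 after scaling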
A function to calculate the test error using KNN.
# set knn error result function
get_knn_err <- function(k.list,train_data,test_data,train_label,test_label){
# return test error list for different k
test_err = list()
for (ki in k.list) {
knn_i_train <- knn(train_data,test_data,train_label,k=ki)
knn_i_error <- mean(knn_i_train != test_label)
test_err <- append(test_err,knn_i_error)
}
return(test_err)
}
Take 20% data for testing, 80% for training.
set.seed(2031)
# sampling
number_test <- floor(dim(beans)[1]/5)
number_tune <- floor(dim(beans)[1]/5)
index.list <- 1:dim(beans)[1]
sample.index.test <- sample(index.list,number_test)
index.list.test <- index.list[-sample.index.test]
sample.index.tune <- sample(index.list.test,number_tune)
index.list.tune <- index.list.test[-sample.index.tune]
sample.index.train <- index.list.tune
k.sample.test <- beans[sample.index.test,]
k.sample.tune <- beans[sample.index.tune,]
k.sample.train <- beans[sample.index.train,]
# train data
k.train <- k.sample.train[,-17]
k.train.lab <- k.sample.train[,17]
# test data
k.test <- k.sample.test[,-17]
k.test.lab <- k.sample.test[,17]
k.train.scaled <- scale(k.train,center=T,scale = T)
k.test.scaled <- scale(k.test,center=T,scale = T)
k.list <- 1:30
set.seed(2089)
# tune error
knn.test.errs <- get_knn_err(k.list,k.train.scaled,k.test.scaled,k.train.lab,k.test.lab)
plot(k.list,knn.test.errs)
# the best k
knn.test.errs[which.min(knn.test.errs)];which.min(knn.test.errs)
## [[1]]
## [1] 0.07825129
## [1] 5
# 10 best k's
k10th <- sort(as.numeric(knn.test.errs))[10]
# best 10 k's
k10 <- which(knn.test.errs <= k10th );k10
## [1] 5 6 8 9 10 11 13 14 19 22 23
# train error
knn.train.err <- get_knn_err(k10[1],k.train.scaled,k.train.scaled,k.train.lab,k.train.lab);knn.train.err
## [[1]]
## [1] 0.05722822
# prediction error
knn.pred.err <- get_knn_err(k10[1],k.train.scaled,k.test.scaled,k.train.lab,k.test.lab);knn.pred.err
## [[1]]
## [1] 0.07788391
Train error: 0.05722822, test error: 0.07788391.
Considering the impact of imbalanced data on the majority-vote decision, we will take into account the distance from x to each of its k nearest neighbors.
Using a single held-out accuracy as the criterion for choosing the optimal k is not stable, so it is suggested to tune the model with K-fold cross-validation.
weighted_knn_cv_err <- function(K,x,cl,k,method){
  # K : number of folds
  # x : training data
  # cl : numeric labels
  # k : number of neighbors
  # method : distance metric
  n <- nrow(x)
  # runs
  runs <- n %/% K
  rem <- n %% K
  preds <- c()
  # calculate each part
  for (i in 1:K) {
    # test sub index
    s.low <- (i-1)*runs + 1
    s.high <- i*runs
    # plus remainder in last run
    if(i==K){
      s.high <- i*runs + rem
    }
    # test set
    test.i <- x[s.low:s.high,]
    # train set
    train.i <- x[ -s.low:-s.high, ]
    # train label
    train.lab.i <- cl[-s.low:-s.high]
    # apply weighted knn: fit on the training fold, predict the held-out fold
    w.knn.pred.matrix <- KernelKnn(train.i, TEST_data = test.i, y = train.lab.i,
                                   k = k, method = method, Levels = 1:7)
    w.knn.pred <- apply(w.knn.pred.matrix, 1, FUN = function(x) which.max(x))
    preds <- append(preds, w.knn.pred)
  }
  return(mean(preds != cl))
}
Convert the labels to numeric codes.
w.k.train.scaled <- cbind(data.frame(scale(k.sample.train[,-17])),Class = k.sample.train[,17])
w.k.test.scaled <- cbind(data.frame(scale(k.sample.test[,-17])),Class = k.sample.test[,17])
# "SEKER" "BARBUNYA" "BOMBAY" "CALI" "HOROZ" "SIRA" "DERMASON"
# 1 2 3 4 5 6 7
labels <- c("SEKER", "BARBUNYA", "BOMBAY", "CALI", "HOROZ" , "SIRA","DERMASON")
num_lab_table <- rbind(labels,1:length(labels))
for(i in 1:7){
ind.i.train <- which(w.k.train.scaled[,17] == labels[i])
ind.i.test <- which(w.k.test.scaled[,17] == labels[i])
w.k.train.scaled[ind.i.train,17] <- i
w.k.test.scaled[ind.i.test,17] <- i
}
w.k.train.scaled[,17] <- as.numeric(w.k.train.scaled[,17] )
w.k.test.scaled[,17] <- as.numeric(w.k.test.scaled[,17] )
# w.knn.pred.matrix <- KernelKnn(w.k.train.scaled[,-17],y=w.k.train.scaled[,17],k=5,method = "euclidean",Levels = 1:7)
# tune on a 10% subsample of the training data to reduce computation
n <- round(nrow(w.k.train.scaled) * 0.1)
ind.train <- sample(1:nrow(w.k.train.scaled), n)
k.list <- 6:10
# number of cross-validation folds
K_fold <- 5
# list of methods of distance
method.list <- c('euclidean', 'manhattan')
w.knn.cv.errs <- c()
ta.knn.df <- data.frame(matrix(0,length(method.list),length(k.list) ))
row.names(ta.knn.df) <- method.list
colnames(ta.knn.df) <- k.list
for(m in 1:length(method.list)){
for(k in 1:length(k.list)){
err.i <- weighted_knn_cv_err(K_fold,w.k.train.scaled[ind.train,-17],w.k.train.scaled[ind.train,17],k.list[k],method.list[m])
ta.knn.df[m,k] <- err.i
}
}
plot(k.list,ta.knn.df[1,],main = "method = euclidean")
best.k1 <- which.min(ta.knn.df[1,])
plot(k.list,ta.knn.df[2,],main = "method = manhattan")
best.k2 <- which.min(ta.knn.df[2,])
Trying the two distance metrics, it turns out that there is little difference between them. The optimal k is 8.
# train error
w.knn.pred.matrix <- KernelKnn(w.k.train.scaled[,-17],y=w.k.train.scaled[,17],k=8,method = "euclidean",Levels = 1:7)
w.knn.train.pred <- apply(w.knn.pred.matrix,1,FUN = function(x) which.max(x) )
w.knn.train.err <- mean(w.knn.train.pred != w.k.train.scaled[,17])
# test error
w.knn.pred.matrix.test <- KernelKnn(w.k.train.scaled[,-17],TEST_data = w.k.test.scaled[,-17] ,y=w.k.train.scaled[,17],k=8,method = "euclidean",Levels = 1:7)
w.knn.test.pred <- apply(w.knn.pred.matrix.test,1,FUN = function(x) which.max(x) )
w.knn.test.err <- mean(w.knn.test.pred != w.k.test.scaled[,17])
pred.knn <- knn(k.train.scaled,k.test.scaled,k.train.lab,k=14)
knn.cm <- caret::confusionMatrix(pred.knn,as.factor(k.test.lab));knn.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
## BARBUNYA 264 0 11 0 1 2 0
## BOMBAY 0 97 0 0 0 0 0
## CALI 25 0 296 0 8 0 0
## DERMASON 0 0 0 640 2 4 53
## HOROZ 1 0 9 4 355 0 8
## SEKER 2 0 0 15 0 390 3
## SIRA 6 0 4 44 12 12 454
##
## Overall Statistics
##
## Accuracy : 0.917
## 95% CI : (0.906, 0.9271)
## No Information Rate : 0.2583
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8997
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity 0.88591 1.00000 0.9250 0.9104
## Specificity 0.99422 1.00000 0.9863 0.9708
## Pos Pred Value 0.94964 1.00000 0.8997 0.9156
## Neg Pred Value 0.98609 1.00000 0.9900 0.9689
## Prevalence 0.10948 0.03564 0.1176 0.2583
## Detection Rate 0.09699 0.03564 0.1087 0.2351
## Detection Prevalence 0.10213 0.03564 0.1209 0.2568
## Balanced Accuracy 0.94007 1.00000 0.9556 0.9406
## Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity 0.9392 0.9559 0.8764
## Specificity 0.9906 0.9914 0.9646
## Pos Pred Value 0.9416 0.9512 0.8534
## Neg Pred Value 0.9902 0.9922 0.9708
## Prevalence 0.1389 0.1499 0.1903
## Detection Rate 0.1304 0.1433 0.1668
## Detection Prevalence 0.1385 0.1506 0.1954
## Balanced Accuracy 0.9649 0.9736 0.9205
A support vector machine tries to construct hyperplanes that maximize the margin between two or more classes.
It is efficient in solving high-dimensional problems.
When the sample size is large, SVM is computationally expensive.
SVM does not perform very well when the data set is noisy, i.e., when the classes overlap.
25% data for testing, 75% for training and tuning.
# sampling
n <- dim(beans)[1]
p <- dim(beans)[2]
size_test <- floor(n/4)
set.seed(2031)
sample.index <- sample(1:n,size_test)
# train data
s.sample.train <- beans[-sample.index,-17]
s.sample.train.class <- beans[-sample.index,17]
# test data
s.sample.test <- beans[sample.index,-17]
s.sample.test.class <- beans[sample.index,17]
# train data
s.sample.train <- scale(s.sample.train)
# test data
s.sample.test <- scale(s.sample.test)
# polynomial kernel
ranges_poly <- list(coef0 = 10^(-1:1),
gamma = 10^(-1:1),
degree = c(2)
)
tune.out.poly <- e1071::tune(svm,as.matrix(s.sample.train),as.factor(s.sample.train.class),
ranges = ranges_poly)
tune.out.poly$best.parameters
tune.out.poly$best.performance
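Only the polynomial grid search is shown above. For comparison, here is a sketch of what the corresponding RBF (radial) grid could look like; the parameter ranges here are assumptions, not the ones behind the reported results.
# Hedged sketch: tune an RBF kernel over cost and gamma for comparison.
ranges_rbf <- list(cost = 10^(-1:1), gamma = 10^(-2:0))
tune.out.rbf <- e1071::tune(svm, as.matrix(s.sample.train),
                            as.factor(s.sample.train.class),
                            kernel = "radial", ranges = ranges_rbf)
tune.out.rbf$best.performance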
Comparing the above SVMs with three different kernels, we choose the polynomial kernel to make predictions, in terms of prediction accuracy.
svm.poly.pred <- svm(x=s.sample.train,y=as.factor(s.sample.train.class),scale = F,
kernel = "polynomial",
coef0 = 10,
gamma = 0.1,
degree = 3
)
# train error
svm.train.err <- mean(predict(svm.poly.pred,newdata = s.sample.train) !=s.sample.train.class);svm.train.err
## [1] 0.06102459
# test error
svm.pred.err <- mean(predict(svm.poly.pred,newdata = s.sample.test) !=s.sample.test.class);svm.pred.err
## [1] 0.06907701
svm.cm <- caret::confusionMatrix(predict(svm.poly.pred,newdata = s.sample.test),as.factor(s.sample.test.class));svm.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
## BARBUNYA 352 0 14 0 4 2 2
## BOMBAY 1 118 0 0 0 0 0
## CALI 17 0 374 0 5 0 0
## DERMASON 0 0 0 822 4 7 62
## HOROZ 1 0 11 5 449 0 8
## SEKER 3 0 1 16 0 492 7
## SIRA 3 0 2 41 11 8 560
##
## Overall Statistics
##
## Accuracy : 0.9309
## 95% CI : (0.9219, 0.9392)
## No Information Rate : 0.2598
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9166
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity 0.9337 1.00000 0.9303 0.9299
## Specificity 0.9927 0.99970 0.9927 0.9710
## Pos Pred Value 0.9412 0.99160 0.9444 0.9184
## Neg Pred Value 0.9917 1.00000 0.9907 0.9753
## Prevalence 0.1108 0.03469 0.1182 0.2598
## Detection Rate 0.1035 0.03469 0.1099 0.2416
## Detection Prevalence 0.1099 0.03498 0.1164 0.2631
## Balanced Accuracy 0.9632 0.99985 0.9615 0.9504
## Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity 0.9493 0.9666 0.8764
## Specificity 0.9915 0.9907 0.9765
## Pos Pred Value 0.9473 0.9480 0.8960
## Neg Pred Value 0.9918 0.9941 0.9716
## Prevalence 0.1390 0.1496 0.1878
## Detection Rate 0.1320 0.1446 0.1646
## Detection Prevalence 0.1393 0.1526 0.1837
## Balanced Accuracy 0.9704 0.9786 0.9264
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
It does not make any assumptions about the input data.
Missing values do not affect the process of building a tree.
The tree model is easy to interpret and can even be visualized.
A small change in the training data can result in a completely different tree, as the sketch below illustrates.
Decision trees tend to overfit the data.
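A minimal sketch of that instability: fit the same tree specification on two bootstrap resamples of the data and compare the tree sizes.
# Fit the same tree spec on two bootstrap resamples; the sizes (number of
# terminal nodes) and structures typically differ noticeably.
set.seed(1)
fit_once <- function() {
  idx <- sample(nrow(beans), replace = TRUE)
  tree(as.factor(Class) ~ ., data = beans[idx, ])
}
sapply(list(fit_once(), fit_once()), function(t) summary(t)$size)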
Take 20% data for testing, 80% for 5 cross validation tuning and training.
set.seed(2068)
# sampling
number_test <- floor(dim(beans)[1]/5)
index.list <- 1:dim(beans)[1]
sample.index.test <- sample(index.list,number_test)
sample.index.train <- index.list[-sample.index.test]
sample.index.train <- sample(sample.index.train,length(sample.index.train))
tree.sample.test <- beans[sample.index.test,]
tree.sample.train <- beans[sample.index.train,]
Build a function to conduct k-fold cross-validation using a decision tree.
tree_cv_err <- function(x,cl,k,tree.controls,split){
# x : training data
# cl : labels
# k : number of folds
n <- dim(x)[1]
# runs
runs <- n %/% k
rem <- n %% k
preds <- c()
# calculate each part
for (i in 1:k) {
# test sub index
s.low <- (i-1)*runs + 1
s.high <- i*runs
# plus remainder in last run
if(i==k){
s.high <- i*runs + rem
}
# test set
test.i <- x[s.low:s.high,]
# train set
train.i <- x[ -s.low:-s.high, ]
# train label
train.lab.i <- cl[-s.low:-s.high]
tree1 <- tree(as.factor(train.lab.i) ~ ., data = train.i, split=split,control = tree.controls)
# compute the test error
test_pred1 <- predict(tree1, test.i, type="class")
preds <- append(preds,test_pred1)
}
return(cv.err = mean(preds != cl))
}
Tune the parameters in tree.control in terms of prediction accuracy.
There are two main parameters we need to tune here. mincut controls the minimum number of observations allowed in a child node. mindev requires the within-node deviance to be at least mindev times that of the root node for a split to be attempted, so it can be used to stop node splitting.
# tune parameters mincut and mindev using the deviance split
mincuts <- c(5,10,20,30,40,50,60)
mindev <- 10^(-1:-5)
nobs <- dim(tree.sample.train)[1]
cv.errs <- c()
for(i in mincuts){
for(mid in mindev){
cv.err <- tree_cv_err(tree.sample.train[,-17],tree.sample.train[,17],5,
tree.control(
nobs = nobs,
mincut = i,
mindev = mid),
split="deviance"
)
cv.errs <- append(cv.errs,cv.err)
}
}
which.min(cv.errs);cv.errs[which.min(cv.errs)]
## [1] 9
## [1] 0.0908256
The best parameters obtained from the above cross-validation are mincut = 10 and mindev = 0.0001 (the flat index 9 maps to this pair; see the sketch below).
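A sketch mapping the flat index returned by which.min back to the (mincut, mindev) pair; since mindev varies in the inner loop, it varies fastest in the flattened error vector, matching expand.grid's ordering:
# expand.grid varies its first argument fastest, matching the loop order above
grid <- expand.grid(mindev = mindev, mincut = mincuts)
grid[which.min(cv.errs), ]  # index 9 -> mindev = 1e-04, mincut = 10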
set.seed(1500)
# best parameter mincut = 10, mindev = 0.0001
tree.pred <- tree(as.factor(Class) ~ ., data = tree.sample.train, split="deviance",
control = tree.control(
nobs = dim(tree.sample.train)[1],
mincut = 10,
mindev = 0.0001))
# training error
tree.train.pred <- predict(tree.pred,tree.sample.train[,-17],type = "class")
tree.train.err <- mean(tree.train.pred != tree.sample.train[,17]);tree.train.err
## [1] 0.05758105
# test error
tree.test.pred <- predict(tree.pred,tree.sample.test[,-17],type = "class")
tree.test.err <- mean(tree.test.pred != tree.sample.test[,17]);tree.test.err
## [1] 0.09478325
The training error is 0.05758 and the test error is 0.09478.
tree.cm <- caret::confusionMatrix(tree.test.pred,as.factor(tree.sample.test[,17]));tree.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
## BARBUNYA 252 0 13 0 1 2 10
## BOMBAY 0 109 0 0 0 0 0
## CALI 23 0 285 0 12 0 2
## DERMASON 0 0 0 663 3 17 61
## HOROZ 6 0 7 1 342 0 8
## SEKER 4 0 0 3 0 379 8
## SIRA 3 0 3 46 9 16 434
##
## Overall Statistics
##
## Accuracy : 0.9052
## 95% CI : (0.8936, 0.916)
## No Information Rate : 0.2619
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8853
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity 0.87500 1.00000 0.9253 0.9299
## Specificity 0.98932 1.00000 0.9847 0.9597
## Pos Pred Value 0.90647 1.00000 0.8851 0.8911
## Neg Pred Value 0.98527 1.00000 0.9904 0.9747
## Prevalence 0.10580 0.04004 0.1132 0.2619
## Detection Rate 0.09258 0.04004 0.1047 0.2436
## Detection Prevalence 0.10213 0.04004 0.1183 0.2733
## Balanced Accuracy 0.93216 1.00000 0.9550 0.9448
## Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity 0.9319 0.9155 0.8298
## Specificity 0.9907 0.9935 0.9650
## Pos Pred Value 0.9396 0.9619 0.8493
## Neg Pred Value 0.9894 0.9850 0.9597
## Prevalence 0.1348 0.1521 0.1921
## Detection Rate 0.1256 0.1392 0.1594
## Detection Prevalence 0.1337 0.1447 0.1877
## Balanced Accuracy 0.9613 0.9545 0.8974
Random forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by the most trees.
It uses the bootstrap as its resampling method; as the number of observations grows, each bootstrap sample contains about 63.2% of the distinct observations (see the sketch below).
It keeps the advantages of decision trees while reducing variance, and it can handle very large data sets.
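A quick sketch of where the 63.2% figure comes from: the probability that a given observation appears in a bootstrap sample of size n is \(1-(1-1/n)^n \rightarrow 1-e^{-1} \approx 0.632\) as n grows.
n <- nrow(beans)
1 - (1 - 1/n)^n                                # analytic value for this n
set.seed(1)
mean(1:n %in% sample(1:n, n, replace = TRUE))  # one simulated bootstrap draw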
set.seed(4021)
# sampling
number_test <- floor(dim(beans)[1]/5)
index.list <- 1:dim(beans)[1]
sample.index.test <- sample(index.list,number_test)
sample.index.train <- index.list[-sample.index.test]
sample.index.train <- sample(sample.index.train,length(sample.index.train))
rf.sample.test <- beans[sample.index.test,]
rf.sample.train <- beans[sample.index.train,]
Fit the random forest with 500 trees.
set.seed(402)
rf.sample.train <- beans
rf.sample.train$Class <- factor(rf.sample.train$Class)
rf.model <- randomForest(formula = Class ~ ., data = rf.sample.train,ntree = 500,importance = T)
rf.model
##
## Call:
## randomForest(formula = Class ~ ., data = rf.sample.train, ntree = 500, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 7.45%
## Confusion matrix:
## BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA class.error
## BARBUNYA 1202 1 71 1 9 12 26 0.090771558
## BOMBAY 2 520 0 0 0 0 0 0.003831418
## CALI 53 0 1529 0 32 3 13 0.061963190
## DERMASON 0 0 0 3304 5 55 182 0.068245911
## HOROZ 6 0 31 13 1833 0 45 0.049273859
## SEKER 8 0 0 49 0 1927 43 0.049333991
## SIRA 10 0 5 262 35 42 2282 0.134294385
OOB estimate of the error rate: 7.45%. (Note that the forest here is fit on the full data set, so the out-of-bag error serves as the test-error estimate.)
results <- data.frame(rbind(lda.res,logisticregression.res,knn.res,svm.res,tree.res,randomforest.res))
results <- cbind(results, 1- unlist(results[,2]))
colnames(results) <- c("train error","test error","accuracy")
rownames(results) <- c("LDA","Logistic Regression" ,"KNN","SVM","Decision Tree","Random Forest")
kable(results)
| | train error | test error | accuracy |
|---|---|---|---|
| LDA | 0.09459087 | 0.09772226 | 0.9022777 |
| Logistic Regression | 0.07227477 | 0.07861866 | 0.9213813 |
| KNN | 0.05722822 | 0.07788391 | 0.9221161 |
| SVM | 0.06102459 | 0.06907701 | 0.9309230 |
| Decision Tree | 0.05758105 | 0.09478325 | 0.9052168 |
| Random Forest | 0 | 0.07449606 | 0.9255039 |
By applying six machine learning methods, the goal of classifying seven mixed types of beans is now achieved at an accuracy rate of around 93%.
Before applying any methods, we first checked the structure of the data set. The data set has 13,611 samples with 16 numeric predictors and 7 classes. All the predictors are numerical and have been shown to be informative in similar grain classification problems, such as rice and wheat grains. Although I did not try any dimension-reduction methods, it would still be worth using PCA, LDA, or similar techniques to simplify the data set.
In the data engineering section, I checked the normality of a feature, and it turned out not to be normally distributed. I also checked the distribution of each feature given a certain class, and those were not normally distributed either.
We already knew that the LDA assumption of Gaussian class densities was not met. After applying linear discriminant analysis (LDA), six linear discriminant functions were obtained, and the first three covered 86.2% of the between-class separation. The error rate is not ideal compared with the other methods' results, mainly because the assumptions of LDA were not met.
Logistic regression does not make strict assumptions like LDA, and its result turned out to be better than that of LDA.
When applying k-nearest neighbors, I also wrote a function applying a weighted KNN method in addition to unweighted KNN. However, after taking the distances into account in the majority vote, the improvement was not obvious. Further study could focus on methods that improve KNN performance on imbalanced data sets, and it is suggested to try different distance metrics.
In conclusion, the SVM model had the best performance. Therefore, we choose it as the best method for the dry bean classification problem.
1: X. Chen, Y. Xun, W. Li, J. Zhang (2010). Combining discriminant analysis and neural networks for corn variety identification. Computers and Electronics in Agriculture, 71, S48-S53. doi:10.1016/j.compag.2009.09.003
2: M. Koklu, I. A. Ozkan (2020). Multiclass classification of dry beans using computer vision and machine learning techniques. Computers and Electronics in Agriculture, 174, 105507. ISSN 0168-1699.
3: S. Symons, R. Fulcher (1988). Determination of wheat kernel morphological variation by digital image analysis: I. Variation in Eastern Canadian milling quality wheats. Journal of Cereal Science, 8, 211-218. doi:10.1016/S0733-5210(88)80032-8
4: J. Xie, Z. Qiu (2007). The effect of imbalanced data sets on LDA: A theoretical and empirical analysis. Pattern Recognition, 40, 557-562.