Table of Contents

Introduction

Dry beans (Phaseolus Vulgaris) are one of the world’s longest-cultivated crops. They are rich in protein, fiber, and various micronutrients and are widely consumed throughout the world. In some areas like Turkey and Asian Countries, dry beans are usually cultivated in the form of populations containing mixed species of seeds. It is reported that using bean seeds containing mixed species will decrease the market value and it is time-consuming and costly for bean producers to classify beans by manpower[1]. Therefore, the purpose of analyzing this data set is to build suitable models to classify seven beans from a mixed population as precise as possible.

In this report, six machine learning methods were applied to address the problem, including linear discriminant analysis, logistic regression, k nearest neighbor (KNN), support vector machine (SVM), decision tree and random forest. SVM achieves the highest accuracy in classifying seven different types of beans. This model can help dry beans producers to achieve uniform dry beans.

Data Set Information

This data set consists of the morphological information extracted from digital images of different types dry beans. The features includes: By the botanical characteristic of dry beans, they are classified as seven different types (Barbunya, Battal, Bombay, Calı, Dermason, Horoz, Tombul, Selanik and Seker).

The dataset can be found in UCI.

1 Preparing Environment and Loading Data

1.1 Import Packages

library(readxl)
library(class)
library(moments)
library(car)
library(ggplot2)
library(reshape2)
library(MASS)
library(e1071)   
library(caret)
library(nnet)
library(tree)
library(randomForest)
library(maboost)
library(knitr)
library(KernelKnn)

1.2 Load Dataset

beans <- data.frame(read_excel("./DryBeanDataset/Dry_Bean_Dataset.xlsx"))

2 Exploratory Data Analysis

2.1 Dataset Structure

In order to have a comprehensive view of the dataset and make it convenient for future use, I wrote a function to show the general characteristic of the data.

data_str <- function(df){
  rownames <- colnames(df)
  types <- c()
  nas <- c()
  na_ratio <- c()
  means <- c()
  stds <- c()
  skewness <- c()
  kurtosis <- c()
  
  # type of each predictor
  for(p in 1:dim(df)[2]){
    if(typeof(df[,p]) == "character"){
      sk <- NA
      ku <- NA
      me <- NA
      st <- NA
    }
    else{
      sk <- round(skewness(df[,p]),6)
      ku <- round(kurtosis(df[,p]),6)
      me <- round(mean(df[,p]),6)
      st <- round(sd(df[,p]),6)
    }
      types <- append(types,typeof(df[,p]))
      nas <- append(nas,sum(is.na(df[,p])))
      na_ratio <- append(na_ratio,nas[p]/dim(df)[1])
      
      skewness <- append(skewness,sk)
      kurtosis <- append(kurtosis,ku)
      means <- append(means,me)
      stds <- append(stds,st)
  }
  
  res <- data.frame(cbind(rownames,types,nas,na_ratio,means,stds,skewness,kurtosis))
  colnames(res) <- c("","types","NA","NA_ratio","mean","std","skewness","kurtosis")
  return(res)
}
res <- data_str(beans);res
##                        types NA NA_ratio         mean          std  skewness
## 1             Area    double  0        0 53048.284549 29324.095717   2.95228
## 2        Perimeter    double  0        0   855.283459   214.289696  1.625765
## 3  MajorAxisLength    double  0        0   320.141867    85.694186  1.357516
## 4  MinorAxisLength    double  0        0   202.270714    44.970091  2.237717
## 5     AspectRation    double  0        0     1.583242     0.246678  0.582445
## 6     Eccentricity    double  0        0     0.750895     0.092002  -1.06259
## 7       ConvexArea    double  0        0 53768.200206 29774.915817  2.941173
## 8    EquivDiameter    double  0        0    253.06422     59.17712  1.948528
## 9           Extent    double  0        0     0.749733     0.049086 -0.895151
## 10        Solidity    double  0        0     0.987143      0.00466 -2.549531
## 11       roundness    double  0        0     0.873282      0.05952 -0.635609
## 12     Compactness    double  0        0     0.799864     0.061713  0.037107
## 13    ShapeFactor1    double  0        0     0.006564     0.001128 -0.534023
## 14    ShapeFactor2    double  0        0     0.001716     0.000596   0.30116
## 15    ShapeFactor3    double  0        0      0.64359     0.098996  0.242427
## 16    ShapeFactor4    double  0        0     0.995063     0.004366 -2.758875
## 17           Class character  0        0         <NA>         <NA>      <NA>
##     kurtosis
## 1  10.794379
## 2   3.585397
## 3   2.529719
## 4   6.646765
## 5   0.112874
## 6   1.385861
## 7  10.737234
## 8   5.188506
## 9   0.642107
## 10 12.792158
## 11  0.373232
## 12 -0.224226
## 13  0.713106
## 14 -0.859694
## 15 -0.145282
## 16 13.030482
## 17      <NA>

2.2 Missing Data Check and Cleaning

We can notice that there is no missing value in this data set.

2.3 Understanding Features

  • Area (A): The area of a bean zone and the number of pixels within its boundaries.

  • Perimeter (P): Bean circumference is defined as the length of its border.

  • Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.

  • Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.

  • Aspect ratio (K): Defines the relationship between L and l.

  • Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.

  • Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.

  • Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area: \(\sqrt{4A/\pi}\)

  • Extent (Ex): The ratio of the pixels in the bounding box to the bean area.

  • Solidity (S): Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.

  • Roundness (R): Calculated with the following formula: \((4\pi A/P^2)\)

  • Compactness (CO): Measures the roundness of an object: \(Ed/L\)

  • ShapeFactor1 (SF1) : \(\frac{L}{A}\)

  • ShapeFactor2 (SF2) : \(\frac{A}{L^3}\)

  • ShapeFactor3 (SF3) : \(\frac{A}{(L/2)(L/2)\pi}\)

  • ShapeFactor4 (SF4) : \(\frac{A}{(L/2)(l/2)\pi}\)

  • Class (Seker, Barbunya, Bombay, Cali, Dermosan, Horoz and Sira)

These morphological features are extracted from images captured by a computer vision system[2]. There are papers about wheat kernel morphological variation pointing out the above features have significant influence to similar grain classification problems like rice and wheat grains[3].

2.4 Check for the Correlation

Scale data

b.s <- scale(beans[-17])
cor_heatmap <- function(mydata){
  # Compute the correlation matrix
  cormat <- round(cor(mydata),2)
  
  # Get lower triangle of the correlation matrix
  cormat[upper.tri(cormat)] <- NA
  
  upper_tri <- cormat
  # Finished correlation matrix heatmap
  melted_cormat <- melt(upper_tri, na.rm = TRUE)
  
  ggplot(data = melted_cormat, aes(Var1, Var2, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "#267373", high = "#bf5940", mid = "white",
     midpoint = 0, limit = c(-1,1), space = "Lab",
     name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
    size = 12, hjust = 1))+
    coord_fixed()
}

Correlation Heat Map

cor_heatmap(b.s)

3 Data Engineering

Normality Check

The first normality plot is for the area predictor and we can see it is obvious not normal. Even after the box-cox transformation, the plot is still not normal.

test <- beans[,1]
a1 <- qqPlot(test)

bc <- boxcox(test~1,lambda = seq(-5,1,0.1),plotit = TRUE)

lambda <- bc$x[which.max(bc$y)]
trans <- bcPower(test,lambda)
a2 <- qqPlot(trans)

4 Modeling

4.1 Linear Discriminant Analysis (LDA)

Defination

Linear discriminant analysis (LDA) is a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
LDA works when the measurements made on independent variables for each observation are continuous quantities, which the data set here satisfies.
The main idea of multi-class LDA is to find a linear combinations of coefficients vector that maximize the ratio of the between-class covariance scattering to the within-class covariance scattering, which is an optimization problem that can be written as: \[ max_w \frac{w^TS_bw}{w^TS_ww} \] The solution is given by solving \(S_bw = \lambda S_ww\).

\(S_b\): between-class scatter matrix.
\(S_w\): within-class scatter matrix.
Generally, at most m - 1 generalized eigenvectors are useful to discriminate between m classes.

Assumption

There are two important assumptions LDA make: 1. assume each class density is multivariate Gaussian and 2. they have equal covariance.

Advantages

LDA and PCA are similar in the sense that both of them reduce the data dimensions but LDA provides better separation between groups of experimental data compared to PCA. This is because LDA models the differences between the classes of data, whereas PCA does not take account of these differences.

Limitations

LDA may not have ideal performance if the assumptions of LDA are not met.

Sampling

Train 80%, test 20%.

set.seed(723)
ind.train <- sample(1:nrow(beans),round(0.8*nrow(beans)))
lda.train <- beans[ind.train,]
lda.test <- beans[-ind.train,]

Imbalanced Data

The below pie chart shows the imbalanced samples in different bean types. “BOMBAY” has the smallest portion in this dataset. It is reported that imbalanced dataset have negative effect on the performance of applying LDA according the paper (Jigang Xie∗, Zhengding Qiu)[4].
There are several sampling methods to solve the imbalanced problem. We can use random oversampling methods according to the paper[4] or stratified sampling. Considering using oversampling may introduce new problems, stratified sampling method is a more simple and easy choice.

Pie Charts in types

# seperate data into classes
SE <- subset(lda.train,Class == "SEKER")
BA <- subset(lda.train,Class == "BARBUNYA")
BO <- subset(lda.train,Class == "BOMBAY")
CA <- subset(lda.train,Class == "CALI")
HO <- subset(lda.train,Class == "HOROZ")
SI <- subset(lda.train,Class == "SIRA")
DE <- subset(lda.train,Class == "DERMASON")
sep.samples <- list(SE,BA,BO,CA,HO,SI,DE)

# seperate data into classes
pie.se <- dim(SE)[1]
pie.ba <- dim(BA)[1]
pie.bo <- dim(BO)[1]
pie.ca <- dim(CA)[1]
pie.ho <- dim(HO)[1]
pie.si <- dim(SI)[1]
pie.de <- dim(DE)[1]
num_samples <- c(pie.se,pie.ba,pie.bo,pie.ca,pie.ho,pie.si,pie.de)
pie.labs <- c("SEKER","BARBUNYA","BOMBAY","CALI","HOROZ","SIRA","DERMASON")
ggdata <- data.frame(X = pie.labs,Y = num_samples)

# plot pie chart
piechart <- ggplot(data=ggdata,aes(x="", y=Y, fill=X)) +
  geom_bar(width=1, stat="identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Y,
                               " (",
                               scales::percent(Y / sum(Y)),
                               ")")),
            position = position_stack(vjust = 0.5)) +
  
  xlab("") +
  ylab("Value") +
  theme(axis.text.x=element_blank(),axis.ticks.x=element_blank())

piechart

Stratified Sampling

By using stratified, it is guaranteed that certain percent of the samples of each class will be included into the training data.

# sampling
all <- lda.train

sep.samples <- list(SE,BA,BO,CA,HO,SI,DE)

set.seed(12)
ind <- sample(1:dim(all)[1])

# folds 
folds <- 10
# result
stratified.samples <- data.frame()
# used for update
resp <- sep.samples
# iterate through data that has been classified by type for k(folds) times
# get a certain portion from each class population and then update it 
for(k in 1:folds){
  stratified.samples.k <- data.frame()
  
  for(t in 1:length(resp)){
    if( k == folds){
       # for the last run, get the rest data
       str.samples.i <- as.data.frame(resp[t])
       stratified.samples.k <- rbind(stratified.samples.k,str.samples.i)
       resp[t] <- list(data.frame())
  }else{
      resp.df <- as.data.frame(resp[t])
      set.seed(2010)
      # number of original class samples
      s.i <- dim(data.frame(sep.samples[t]))[1]
      n.i <- dim(data.frame(resp[t]))[1]
      ind.i <- sample(1:n.i,floor(s.i/folds))
      str.samples.i <- resp.df[ind.i,]
      stratified.samples.k <- rbind(stratified.samples.k,str.samples.i)
      resp[t] <- list(resp.df[-ind.i,])
      
  }
    }
  set.seed(2021)
  # random sample in each fold
  ind.k <- sample(1:dim(stratified.samples.k)[1],dim(stratified.samples.k)[1])
  stratified.samples <- rbind(stratified.samples,stratified.samples.k[ind.k,])
}

Apply LDA

Below is a function by which we can apply, say a 5 fold cross validation LDA method on a data set.

# lda cv func
lda_cv <- function(x,cl,k) {
  # x : training data
  # cl : labels
  # k : number of folds
  n <- dim(x)[1]
  cl <- as.factor(unlist(cl))
  
  # runs 
  runs <- n %/% k
  rem <- n %% k
  
  preds <- c()
  # calculate each part
  for (i in 1:k) {
  # test sub index
  s.low <- (i-1)*runs + 1
  s.high <- i*runs
  # plus remainder in last run
  if(i==k){
    s.high <- i*runs + rem
  }
  
  # test set
  test.i <- x[s.low:s.high,]
  # train set
  train.i <- x[ -s.low:-s.high, ]
  # train label
  train.lab.i <- cl[-s.low:-s.high]
  
  # apply lda 
  zi <- lda(train.i,grouping = train.lab.i)
  pred.i <- predict(zi,test.i)$class
  preds <- append(preds,pred.i)
  }
  return(mean(preds != cl))
}

Before Stratified

lda_cv(lda.train[,-17],lda.train[17],10)
## [1] 0.09505005

After Stratified

lda_cv(stratified.samples[,-17],stratified.samples[17],10)
## [1] 0.09468271

The error rate improved a little after applying stratified sampling.

Prediction

# lda model
lda.train.model <- lda(lda.train[,-17],grouping = as.factor(lda.train[,17]));lda.train.model
## Call:
## lda(lda.train[, -17], grouping = as.factor(lda.train[, 17]))
## 
## Prior probabilities of groups:
##   BARBUNYA     BOMBAY       CALI   DERMASON      HOROZ      SEKER       SIRA 
## 0.09909083 0.03802002 0.12076407 0.25998714 0.14170264 0.14950868 0.19092662 
## 
## Group means:
##               Area Perimeter MajorAxisLength MinorAxisLength AspectRation
## BARBUNYA  69635.23 1044.5266        369.5023        240.0512     1.543494
## BOMBAY   173966.72 1587.6997        594.1228        374.7352     1.586388
## CALI      75510.91 1057.4184        409.2946        236.3801     1.732633
## DERMASON  32028.41  664.2551        246.2582        165.3972     1.491065
## HOROZ     53670.72  920.2530        372.8479        184.1156     2.028204
## SEKER     39865.47  727.3007        251.2236        201.8871     1.244990
## SIRA      44685.40  796.1663        299.3148        190.6648     1.570867
##          Eccentricity ConvexArea EquivDiameter    Extent  Solidity roundness
## BARBUNYA    0.7544642   70849.64      296.9490 0.7496429 0.9828458 0.8005454
## BOMBAY      0.7707459  176303.92      469.5669 0.7767922 0.9869174 0.8645735
## CALI        0.8145843   76663.67      309.4691 0.7584368 0.9849875 0.8459386
## DERMASON    0.7368492   32407.27      201.4022 0.7527180 0.9882247 0.9081119
## HOROZ       0.8678106   54461.61      260.7932 0.7073376 0.9854818 0.7941008
## SEKER       0.5847062   40253.71      224.9061 0.7715761 0.9903483 0.9449821
## SIRA        0.7676287   45228.71      238.2270 0.7495036 0.9879727 0.8843964
##          Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4
## BARBUNYA   0.8052266  0.005361971 0.0013969972    0.6494858    0.9958101
## BOMBAY     0.7924492  0.003438468 0.0008429636    0.6289350    0.9918752
## CALI       0.7569268  0.005458558 0.0011087572    0.5733574    0.9906071
## DERMASON   0.8189590  0.007767047 0.0021625004    0.6713938    0.9969159
## HOROZ      0.7004834  0.007008403 0.0010451582    0.4912143    0.9919209
## SEKER      0.8968908  0.006335015 0.0025413119    0.8052300    0.9983766
## SIRA       0.7971260  0.006724178 0.0016816646    0.6359963    0.9953882
## 
## Coefficients of linear discriminants:
##                           LD1           LD2           LD3           LD4
## Area             6.370327e-04 -8.548769e-04 -2.149029e-03 -1.236161e-03
## Perimeter        5.147173e-02 -3.456127e-02 -1.543476e-03  5.452820e-03
## MajorAxisLength  4.656642e-01 -4.031309e-01 -8.874835e-01  1.042715e-02
## MinorAxisLength  4.991452e-01 -2.925348e-01 -1.860430e+00 -1.544286e-01
## AspectRation    -2.335707e+01  3.647306e+00 -1.072802e+02  1.111226e+02
## Eccentricity    -8.827933e+00  4.201642e+01  1.377181e+02 -1.515717e+02
## ConvexArea      -5.201061e-04  6.688401e-04  1.884725e-03  7.505564e-04
## EquivDiameter   -1.367585e+00  9.226504e-01  2.746601e+00  3.956364e-01
## Extent          -1.065512e+00  4.548037e-01  8.428483e-01  2.051990e+00
## Solidity        -1.657671e+01  8.725235e+01  9.048091e+01  1.104365e+01
## roundness        2.963510e+01 -2.779383e+01 -9.412537e+00 -7.631068e+00
## Compactness      7.565722e-01 -7.744607e+02 -3.220193e+03  3.252973e+03
## ShapeFactor1    -1.974251e+03 -4.402880e+03  3.985271e+03  3.175435e+03
## ShapeFactor2    -9.496369e+03  1.425541e+04 -1.498958e+04  1.888510e+03
## ShapeFactor3    -2.908922e+00  3.833879e+02  2.035491e+03 -1.849051e+03
## ShapeFactor4     1.496561e+02 -1.755406e+02 -1.945698e+02 -1.008880e+02
##                           LD5           LD6
## Area            -4.784732e-04  4.720198e-04
## Perimeter        8.342181e-02 -3.129432e-02
## MajorAxisLength -5.933094e-01 -3.799290e-01
## MinorAxisLength -8.253265e-01 -3.815315e-01
## AspectRation     8.537150e+01  9.520055e+01
## Eccentricity     1.404652e+01 -8.523257e+01
## ConvexArea       6.877649e-04  3.149340e-04
## EquivDiameter    9.894705e-01  2.850796e-01
## Extent          -8.391612e-01 -8.234672e-01
## Solidity         4.376266e+01  6.272076e+01
## roundness        1.758497e+01 -1.669280e+01
## Compactness      7.161696e+02  1.496363e+03
## ShapeFactor1     8.675478e+02 -1.024055e+04
## ShapeFactor2    -8.608727e+03 -3.411115e+03
## ShapeFactor3    -1.954528e+02 -8.459609e+02
## ShapeFactor4    -2.371495e+02 -2.385934e+02
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5    LD6 
## 0.5350 0.2204 0.1070 0.0922 0.0305 0.0148

There are 6 discriminant functions for classification. The first 3 has covered 86.2%, percentage separation achieved by each discriminant function. It is not ideal mainly because the data does not satisfy the LDA assumptions.

# training error
pred.lda.train <- predict(lda.train.model,lda.train[,-17])$class
lda.train.err <- mean(pred.lda.train != lda.train[,17]);lda.train.err
## [1] 0.09459087
# test error
pred.lda.test <- predict(lda.train.model,lda.test[,-17])$class
lda.test.err <- mean(pred.lda.test != lda.test[,17]);lda.test.err
## [1] 0.09772226

Train error: 0.0945909, test error: 0.0977223.

Confusion Matrix

# confusion matrix
lda.cm <- caret::confusionMatrix(pred.lda.test,as.factor(lda.test[,17]));lda.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
##   BARBUNYA      201      0    1        2     0     5    1
##   BOMBAY          0    107    0        0     0     0    0
##   CALI           23      1  306        0    12     0    2
##   DERMASON        0      0    0      600     3     4   28
##   HOROZ           1      0    4        0   357     0    4
##   SEKER           2      0    0        6     0   366    3
##   SIRA           16      0    4      107    13    24  519
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9023          
##                  95% CI : (0.8905, 0.9132)
##     No Information Rate : 0.2627          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8815          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity                  0.82716       0.99074      0.9714          0.8392
## Specificity                  0.99637       1.00000      0.9842          0.9826
## Pos Pred Value               0.95714       1.00000      0.8895          0.9449
## Neg Pred Value               0.98328       0.99962      0.9962          0.9449
## Prevalence                   0.08927       0.03968      0.1157          0.2627
## Detection Rate               0.07384       0.03931      0.1124          0.2204
## Detection Prevalence         0.07715       0.03931      0.1264          0.2333
## Balanced Accuracy            0.91176       0.99537      0.9778          0.9109
##                      Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity                0.9273       0.9173      0.9318
## Specificity                0.9961       0.9953      0.9242
## Pos Pred Value             0.9754       0.9708      0.7599
## Neg Pred Value             0.9881       0.9859      0.9814
## Prevalence                 0.1414       0.1466      0.2046
## Detection Rate             0.1312       0.1345      0.1907
## Detection Prevalence       0.1345       0.1385      0.2509
## Balanced Accuracy          0.9617       0.9563      0.9280

4.2 Logistic Regression

Definition

logistic regression is a model that is used to model the posterior probabilities of the K classes via linear functions in x while at the same time ensuring they sun to one and remain in [0,1]. The model has the form:

\[ Pr(G=k|X=x) = \frac{exp(\beta_{k0} + \beta_k^Tx)}{1 + \sum_{l=1}^{K-1}exp(\beta_{l0} + \beta_l^Tx )}, k = 1,..., K-1 \]

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of G given X.

\[ L(\theta) = \sum_{i-1}^Nlogp_{g_i}(x_i;\theta), \] where \(p_k(x_i;\theta) = Pr(G=k|X=x_i;\theta).\)

Advantages

It makes no assumptions about distributions of classes in feature space.
Good accuracy for many simple data sets and it performs well when the data set is linearly separable.
Logistic regression is less inclined to over-fitting in low dimension data.
It can interpret model coefficients as indicators of feature importance.

Limitations

The major limitation of Logistic Regression is the assumption of linearity between the dependent variable and the independent variables.
It can only be used to predict discrete functions.

Apply Multinomial Logistic Regression

Sampling

Take 20% data for testing, 80% for training.

set.seed(7105)
# sampling
number_test <- floor(dim(beans)[1]/5)
index.list <- 1:dim(beans)[1]
sample.index.test <- sample(index.list,number_test)
sample.index.train <- index.list[-sample.index.test]
sample.index.train <- sample(sample.index.train,length(sample.index.train))

lg.sample.test <- beans[sample.index.test,]
lg.sample.train <- beans[sample.index.train,]

Prediction

lg.model <- multinom(Class ~ ., data = lg.sample.train)
## # weights:  126 (102 variable)
## initial  value 21189.015613 
## iter  10 value 15950.879558
## iter  20 value 12477.784498
## iter  30 value 11271.034033
## iter  40 value 4628.086901
## iter  50 value 2389.189924
## iter  60 value 2278.372612
## iter  70 value 2240.838383
## iter  80 value 2217.585964
## iter  90 value 2210.672006
## iter 100 value 2205.986818
## final  value 2205.986818 
## stopped after 100 iterations
# training error
lg.pred <- predict(lg.model,lg.sample.train[,-17])
lg.train.error <- mean(lg.pred != as.factor(lg.sample.train[,17]));lg.train.error
## [1] 0.07227477
# test error
lg.pred <- predict(lg.model,lg.sample.test[,-17])
lg.test.error <- mean(lg.pred != as.factor(lg.sample.test[,17]));lg.test.error
## [1] 0.07861866

The training error is 0.07227477 and the test error is 0.07861866

4.3 KNN

Definition

The k-nearest neighbors algorithm (k-NN) is a non-parametric classification method. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.

Advantages

It does not make any strict assumptions about the data.
It does not learn anything in the training period. It just stores the training data set and learns from it only at the time of making real time predictions.

Limitations

KNN is sensitive to the local data structure. The larger class tends to dominate the prediction, as their examples show up more frequently in the neighborhood.
We need to do feature scaling before applying KNN algorithm to any data set due to curse of dimensionality.

A function to calculate test error using knn.

# set knn error result function
get_knn_err <- function(k.list,train_data,test_data,train_label,test_label){
  # return test error list for different k
  test_err = list()
  for (ki in k.list) {
    knn_i_train <- knn(train_data,test_data,train_label,k=ki)
    knn_i_error <- mean(knn_i_train != test_label)
    test_err <- append(test_err,knn_i_error)
  }
  return(test_err)
}

Apply KNN

Sampling

Take 20% data for testing, 80% for training.

set.seed(2031)
# sampling
number_test <- floor(dim(beans)[1]/5)
number_tune <- floor(dim(beans)[1]/5)

index.list <- 1:dim(beans)[1]

sample.index.test <- sample(index.list,number_test)
index.list.test <- index.list[-sample.index.test]

sample.index.tune <- sample(index.list.test,number_tune)
index.list.tune <- index.list.test[-sample.index.tune]

sample.index.train <- index.list.tune

k.sample.test <- beans[sample.index.test,]
k.sample.tune <- beans[sample.index.tune,]
k.sample.train <- beans[sample.index.train,]

# train data
k.train <- k.sample.train[,-17]
k.train.lab <- k.sample.train[,17]
# test data
k.test <- k.sample.test[,-17]
k.test.lab <- k.sample.test[,17]

Scale

k.train.scaled <- scale(k.train,center=T,scale = T)
k.test.scaled <- scale(k.test,center=T,scale = T)

Tuning for k

k.list <- 1:30
set.seed(2089)
# tune error
knn.test.errs <- get_knn_err(k.list,k.train.scaled,k.test.scaled,k.train.lab,k.test.lab)
plot(k.list,knn.test.errs)

# the best k
knn.test.errs[which.min(knn.test.errs)];which.min(knn.test.errs)
## [[1]]
## [1] 0.07825129
## [1] 5
# 10 best k's
k10th <- sort(as.numeric(knn.test.errs))[10]
# best 10 k's
k10 <- which(knn.test.errs <= k10th );k10
##  [1]  5  6  8  9 10 11 13 14 19 22 23

KNN Prediction

# train error
knn.train.err <- get_knn_err(k10[1],k.train.scaled,k.train.scaled,k.train.lab,k.train.lab);knn.train.err
## [[1]]
## [1] 0.05722822
# prediction error
knn.pred.err <- get_knn_err(k10[1],k.train.scaled,k.test.scaled,k.train.lab,k.test.lab);knn.pred.err
## [[1]]
## [1] 0.07788391

Train error: 0.05734314, Test error: 0.07861866.

Apply Weighted KNN

Considering the impact of imbalanced data on majority vote decision, we will take into account the distance from x to each of the k nearest neighbors.

Using accuracy as a criterion for choosing optimal k is not stable, it is suggested to do a K-fold cross validation to tune the model.

weighted_knn_cv_err <- function(K,x,cl,k,method){
  # K : number of folds
  # x : training data
  # cl : labels
  # k.list : a list of k to be tuned
  n <- nrow(x)
  # runs 
  runs <- n %/% K
  rem <- n %% K
  
  preds <- c()
  # calculate each part
  for (i in 1:K) {
  # test sub index
  s.low <- (i-1)*runs + 1
  s.high <- i*runs
  # plus remainder in last run
  if(i==K){
    s.high <- i*runs + rem
  }
  
  # test set
  test.i <- x[s.low:s.high,]
  # train set
  train.i <- x[ -s.low:-s.high, ]
  # train label
  train.lab.i <- cl[-s.low:-s.high]

  # apply weighted knn
  w.knn.pred.matrix <- KernelKnn(x,y=cl,k=k,method = method,Levels = 1:7)
  w.knn.pred <- apply(w.knn.pred.matrix,1,FUN = function(x) which.max(x) )

  # knn_i_pred <- apply(test.i,1,function(x) weighted_knn_predict(train.i,train.lab.i,x,k))
  preds <- append(preds,w.knn.pred)
  }
  return(mean(preds != cl))
}

Transfer Label To Numeric

w.k.train.scaled <- cbind(data.frame(scale(k.sample.train[,-17])),Class = k.sample.train[,17])
w.k.test.scaled <-  cbind(data.frame(scale(k.sample.test[,-17])),Class = k.sample.test[,17])
# "SEKER"    "BARBUNYA" "BOMBAY"   "CALI"     "HOROZ"    "SIRA"     "DERMASON"
#  1           2           3         4          5          6           7
labels <- c("SEKER", "BARBUNYA", "BOMBAY",  "CALI", "HOROZ" ,  "SIRA","DERMASON")
num_lab_table <- rbind(labels,1:length(labels))
for(i in 1:7){
  ind.i.train <- which(w.k.train.scaled[,17] == labels[i])
  ind.i.test <- which(w.k.test.scaled[,17] == labels[i])
  w.k.train.scaled[ind.i.train,17] <- i
   w.k.test.scaled[ind.i.test,17] <- i
}



w.k.train.scaled[,17] <- as.numeric(w.k.train.scaled[,17] )
w.k.test.scaled[,17] <- as.numeric(w.k.test.scaled[,17] )
# w.knn.pred.matrix <- KernelKnn(w.k.train.scaled[,-17],y=w.k.train.scaled[,17],k=5,method = "euclidean",Levels = 1:7)

n <- nrow(round(w.k.train.scaled)*0.1)
ind.train <- sample(1:n,n)

k.list <- 6:10
# nunmber of cross validation
K_fold <- 5
# list of methods of distance
method.list <- c('euclidean', 'manhattan') 

w.knn.cv.errs <- c()
ta.knn.df <- data.frame(matrix(0,length(method.list),length(k.list) ))
row.names(ta.knn.df) <- method.list
colnames(ta.knn.df) <- k.list

for(m in 1:length(method.list)){
  for(k in 1:length(k.list)){
      err.i <- weighted_knn_cv_err(K_fold,w.k.train.scaled[ind.train,-17],w.k.train.scaled[ind.train,17],k.list[k],method.list[m]) 
  ta.knn.df[m,k] <- err.i
    }
  }
plot(k.list,ta.knn.df[1,],main = "method = euclidean")
best.k1 <- which.min(ta.knn.df[1,])

plot(k.list,ta.knn.df[2,],main = "method = manhattan")
best.k2 <- which.min(ta.knn.df[2,])

By trying two distance metrics, it turns out there there are little difference. The optimal is 8.

Weighted KNN Prediction

# train error
w.knn.pred.matrix <- KernelKnn(w.k.train.scaled[,-17],y=w.k.train.scaled[,17],k=8,method = "euclidean",Levels = 1:7)
w.knn.train.pred <- apply(w.knn.pred.matrix,1,FUN = function(x) which.max(x) )
w.knn.train.err <-  mean(w.knn.train.pred != w.k.train.scaled[,17])
# test error
w.knn.pred.matrix.test <- KernelKnn(w.k.train.scaled[,-17],TEST_data = w.k.test.scaled[,-17] ,y=w.k.train.scaled[,17],k=8,method = "euclidean",Levels = 1:7)
w.knn.test.pred <- apply(w.knn.pred.matrix.test,1,FUN = function(x) which.max(x) )
w.knn.test.err <-  mean(w.knn.test.pred != w.k.test.scaled[,17])

Confusion Matrix

pred.knn <- knn(k.train.scaled,k.test.scaled,k.train.lab,k=14)
knn.cm <- caret::confusionMatrix(pred.knn,as.factor(k.test.lab));knn.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
##   BARBUNYA      264      0   11        0     1     2    0
##   BOMBAY          0     97    0        0     0     0    0
##   CALI           25      0  296        0     8     0    0
##   DERMASON        0      0    0      640     2     4   53
##   HOROZ           1      0    9        4   355     0    8
##   SEKER           2      0    0       15     0   390    3
##   SIRA            6      0    4       44    12    12  454
## 
## Overall Statistics
##                                          
##                Accuracy : 0.917          
##                  95% CI : (0.906, 0.9271)
##     No Information Rate : 0.2583         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8997         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity                  0.88591       1.00000      0.9250          0.9104
## Specificity                  0.99422       1.00000      0.9863          0.9708
## Pos Pred Value               0.94964       1.00000      0.8997          0.9156
## Neg Pred Value               0.98609       1.00000      0.9900          0.9689
## Prevalence                   0.10948       0.03564      0.1176          0.2583
## Detection Rate               0.09699       0.03564      0.1087          0.2351
## Detection Prevalence         0.10213       0.03564      0.1209          0.2568
## Balanced Accuracy            0.94007       1.00000      0.9556          0.9406
##                      Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity                0.9392       0.9559      0.8764
## Specificity                0.9906       0.9914      0.9646
## Pos Pred Value             0.9416       0.9512      0.8534
## Neg Pred Value             0.9902       0.9922      0.9708
## Prevalence                 0.1389       0.1499      0.1903
## Detection Rate             0.1304       0.1433      0.1668
## Detection Prevalence       0.1385       0.1506      0.1954
## Balanced Accuracy          0.9649       0.9736      0.9205

4.4 Support Vector Machine (SVM)

Definition

Support vector machine tries to construct hyperplanes that maximize the margin between two or more classes.

Advantages

It is efficient in solving high dimensional problems.

Limitations

When data sample is large, SVM is computationally expensive.
SVM does not perform very well, when the data set has more noise.

Sampling

25% data for testing, 75% for training and tuning.

# sampling
n <- dim(beans)[1]
p <- dim(beans)[2]
size_test <- floor(n/4)

set.seed(2031)
sample.index <- sample(1:n,size_test)
# train data
s.sample.train <- beans[-sample.index,-17]
s.sample.train.class <- beans[-sample.index,17]
# test data
s.sample.test <- beans[sample.index,-17]
s.sample.test.class <- beans[sample.index,17]

Scaling

# train data
s.sample.train <- scale(s.sample.train)
# test data
s.sample.test <- scale(s.sample.test)

Apply SVM

Gaussian Kernel

Polynomial Kernal

# RBF kernel
ranges_poly <- list(coef0 = 10^(-1:1),
                 gamma = 10^(-1:1),
                 degree = c(2)
                 )
tune.out.poly <- e1071::tune(svm,as.matrix(s.sample.train),as.factor(s.sample.train.class),
                 ranges = ranges_poly)


tune.out.poly$best.parameters
tune.out.poly$best.performance

Prediction

Comparing the above svm with tree different kernels, we choose polynomial kernel to make prediction in terms of prediction accuracy.

svm.poly.pred <- svm(x=s.sample.train,y=as.factor(s.sample.train.class),scale = F,
                    kernel = "polynomial",
                    coef0 = 10, 
                    gamma = 0.1,
                    degree = 3
                    )
# train error
svm.train.err <- mean(predict(svm.poly.pred,newdata = s.sample.train) !=s.sample.train.class);svm.train.err
## [1] 0.06102459
# test error

svm.pred.err <- mean(predict(svm.poly.pred,newdata = s.sample.test) !=s.sample.test.class);svm.pred.err
## [1] 0.06907701

Condusion Matrix

svm.cm <- caret::confusionMatrix(predict(svm.poly.pred,newdata = s.sample.test),as.factor(s.sample.test.class));svm.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
##   BARBUNYA      352      0   14        0     4     2    2
##   BOMBAY          1    118    0        0     0     0    0
##   CALI           17      0  374        0     5     0    0
##   DERMASON        0      0    0      822     4     7   62
##   HOROZ           1      0   11        5   449     0    8
##   SEKER           3      0    1       16     0   492    7
##   SIRA            3      0    2       41    11     8  560
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9309          
##                  95% CI : (0.9219, 0.9392)
##     No Information Rate : 0.2598          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9166          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity                   0.9337       1.00000      0.9303          0.9299
## Specificity                   0.9927       0.99970      0.9927          0.9710
## Pos Pred Value                0.9412       0.99160      0.9444          0.9184
## Neg Pred Value                0.9917       1.00000      0.9907          0.9753
## Prevalence                    0.1108       0.03469      0.1182          0.2598
## Detection Rate                0.1035       0.03469      0.1099          0.2416
## Detection Prevalence          0.1099       0.03498      0.1164          0.2631
## Balanced Accuracy             0.9632       0.99985      0.9615          0.9504
##                      Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity                0.9493       0.9666      0.8764
## Specificity                0.9915       0.9907      0.9765
## Pos Pred Value             0.9473       0.9480      0.8960
## Neg Pred Value             0.9918       0.9941      0.9716
## Prevalence                 0.1390       0.1496      0.1878
## Detection Rate             0.1320       0.1446      0.1646
## Detection Prevalence       0.1393       0.1526      0.1837
## Balanced Accuracy          0.9704       0.9786      0.9264

4.5 Decision Tree

Definition

A Decision tree is a flowchart like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.

Advantages

It does not make any assumptions for the input data.
Missing values does not effect the process of building a tree.
The tree model is easy to interpret.
It has easy model interpretability and can even be visualized.

Limitations

A little change in the training data will result completely different tree models.
Decision trees tend to overfit the data.

Apply Decision Tree

Sampling

Take 20% data for testing, 80% for 5 cross validation tuning and training.

set.seed(2068)
# sampling
number_test <- floor(dim(beans)[1]/5)
index.list <- 1:dim(beans)[1]
sample.index.test <- sample(index.list,number_test)
sample.index.train <- index.list[-sample.index.test]
sample.index.train <- sample(sample.index.train,length(sample.index.train))

tree.sample.test <- beans[sample.index.test,]
tree.sample.train <- beans[sample.index.train,]

Build a function to conduct k fold cross validation by using decision tree.

tree_cv_err <- function(x,cl,k,tree.controls,split){
  # x : training data
  # cl : labels
  # k : number of folds
  n <- dim(x)[1]
  # runs 
  runs <- n %/% k
  rem <- n %% k
  
  preds <- c()
  
  # calculate each part
  for (i in 1:k) {
    
  # test sub index
  s.low <- (i-1)*runs + 1
  s.high <- i*runs
  # plus remainder in last run
  if(i==k){
    s.high <- i*runs + rem
  }
  
  # test set
  test.i <- x[s.low:s.high,]
  # train set
  train.i <- x[ -s.low:-s.high, ]
  # train label
  train.lab.i <- cl[-s.low:-s.high]
  
  tree1 <- tree(as.factor(train.lab.i) ~ ., data = train.i, split=split,control = tree.controls)
  
   # compute the test error
  test_pred1 <- predict(tree1, test.i, type="class")
  preds <- append(preds,test_pred1)
  }
  return(cv.err = mean(preds != cl))
}

Tuning

Tune the parameters in tree.control in terms of prediction accuracy.
There are two main parameters we need to tune here. The mincut controls the minimum numbers of observations to flow into a child node. The mindev controls the deviance between a root and its child node. It can be used to stop the node split.

# tune parameter minicuts using devicnace split
mincuts <- c(5,10,20,30,40,50,60)
mindev <- 10^(-1:-5)
nobs <- dim(tree.sample.train)[1]
cv.errs <- c()
for(i in mincuts){
  for(mid in mindev){
      cv.err <- tree_cv_err(tree.sample.train[,-17],tree.sample.train[,17],5,
                           tree.control(
                             nobs = nobs,
                             mincut = i,
                             mindev = mid),
                           split="deviance"
                           )
      cv.errs <- append(cv.errs,cv.err)
  }
}
which.min(cv.errs);cv.errs[which.min(cv.errs)]
## [1] 9
## [1] 0.0908256

The best parameters obtained according to the above cross validation is mincuts=10,mindev=0.0001.

Prediction

set.seed(1500)
# best parameter mincut = 10, mindev = 0.0001

tree.pred <- tree(as.factor(Class) ~ ., data = tree.sample.train, split="deviance",
              control = tree.control(
                nobs = dim(tree.sample.train)[1],
                mincut = 10,
                mindev = 0.0001))

# training error
tree.train.pred <- predict(tree.pred,tree.sample.train[,-17],type = "class")
tree.train.err <- mean(tree.train.pred != tree.sample.train[,17]);tree.train.err
## [1] 0.05758105
# test error
tree.test.pred <- predict(tree.pred,tree.sample.test[,-17],type = "class")
tree.test.err <- mean(tree.test.pred != tree.sample.test[,17]);tree.test.err
## [1] 0.09478325

The training error is 0.05758 and the test error is 0.09478.

Confusion Matrix

tree.cm <- caret::confusionMatrix(tree.test.pred,as.factor(tree.sample.test[,17]));tree.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
##   BARBUNYA      252      0   13        0     1     2   10
##   BOMBAY          0    109    0        0     0     0    0
##   CALI           23      0  285        0    12     0    2
##   DERMASON        0      0    0      663     3    17   61
##   HOROZ           6      0    7        1   342     0    8
##   SEKER           4      0    0        3     0   379    8
##   SIRA            3      0    3       46     9    16  434
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9052         
##                  95% CI : (0.8936, 0.916)
##     No Information Rate : 0.2619         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8853         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: BARBUNYA Class: BOMBAY Class: CALI Class: DERMASON
## Sensitivity                  0.87500       1.00000      0.9253          0.9299
## Specificity                  0.98932       1.00000      0.9847          0.9597
## Pos Pred Value               0.90647       1.00000      0.8851          0.8911
## Neg Pred Value               0.98527       1.00000      0.9904          0.9747
## Prevalence                   0.10580       0.04004      0.1132          0.2619
## Detection Rate               0.09258       0.04004      0.1047          0.2436
## Detection Prevalence         0.10213       0.04004      0.1183          0.2733
## Balanced Accuracy            0.93216       1.00000      0.9550          0.9448
##                      Class: HOROZ Class: SEKER Class: SIRA
## Sensitivity                0.9319       0.9155      0.8298
## Specificity                0.9907       0.9935      0.9650
## Pos Pred Value             0.9396       0.9619      0.8493
## Neg Pred Value             0.9894       0.9850      0.9597
## Prevalence                 0.1348       0.1521      0.1921
## Detection Rate             0.1256       0.1392      0.1594
## Detection Prevalence       0.1337       0.1447      0.1877
## Balanced Accuracy          0.9613       0.9545      0.8974

4.6 Random Forest

Definition

Random forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.For classification tasks, the output of the random forest is the class selected by most trees.
It uses bootstrap as a resampling method. As the number of observations get larger, the samples selected take a portion about 63.2%.

Advantages

It has the advantages of decision tree and meanwhile it reduces the variance. It can handle very large data set.

Limitations

set.seed(4021)
# sampling
number_test <- floor(dim(beans)[1]/5)
index.list <- 1:dim(beans)[1]
sample.index.test <- sample(index.list,number_test)
sample.index.train <- index.list[-sample.index.test]
sample.index.train <- sample(sample.index.train,length(sample.index.train))

rf.sample.test <- beans[sample.index.test,]
rf.sample.train <- beans[sample.index.train,]

Apply Random Froest

Give the random forest 2000 for the number of trees.

set.seed(402)
rf.sample.train <- beans
rf.sample.train$Class <- factor(rf.sample.train$Class)
rf.model <- randomForest(formula = Class ~ ., data = rf.sample.train,ntree = 500,importance = T)

Prediction

rf.model
## 
## Call:
##  randomForest(formula = Class ~ ., data = rf.sample.train, ntree = 500,      importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 7.45%
## Confusion matrix:
##          BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA class.error
## BARBUNYA     1202      1   71        1     9    12   26 0.090771558
## BOMBAY          2    520    0        0     0     0    0 0.003831418
## CALI           53      0 1529        0    32     3   13 0.061963190
## DERMASON        0      0    0     3304     5    55  182 0.068245911
## HOROZ           6      0   31       13  1833     0   45 0.049273859
## SEKER           8      0    0       49     0  1927   43 0.049333991
## SIRA           10      0    5      262    35    42 2282 0.134294385

Test error: 7.45%.

5 Conclusion

results <- data.frame(rbind(lda.res,logisticregression.res,knn.res,svm.res,tree.res,randomforest.res))
results <- cbind(results, 1- unlist(results[,2]))
colnames(results) <- c("train error","test error","accuracy")
rownames(results) <- c("LDA","Logistic Regression" ,"KNN","SVM","Decision Tree","Random Forest")
kable(results)
train error test error accuracy
LDA 0.09459087 0.09772226 0.9022777
Logistic Regression 0.07227477 0.07861866 0.9213813
KNN 0.05722822 0.07788391 0.9221161
SVM 0.06102459 0.06907701 0.9309230
Decision Tree 0.05758105 0.09478325 0.9052168
Random Forest 0 0.07449606 0.9255039

By applying six machine learning methods, the goal of classifying seven mixed types of beans is now achieved at an accuracy rate around 93%.
Before we started applying any methods, we first checked the structure of the data set. The data set has 13611 x 16 samples with 7 classes. All the predictors are numerical and are proved to be significant in similar grain classification problems like rice and wheat grains. Although I did not try any dimension reduction methods, it is still worth trying to use PCA or LDA and so on to simplify the data set.
In data engineering section, I checked the normality of a feature and it turned out it did not satisfy the normal distribution. I also checked the distribution of each feature given certain class and they did not have normal distribution either.
We already knew the assumption of LDA that the class pdf is normal distribution was not met. After applying linear discriminant analysis (LDA), six linear functions were obtained and the first three functions covered 86.2%, percentage separation. The error rate is not ideal comparing to others methods’ results mainly because the assumption of LDA was not met.
The logistic regression does not make strict assumptions like LDA and the result appeared to be better than that of LDA.
When applying K nearest neighbor, I also wrote a function applying weighted KNN method in addition to unweighted KNN. However, after taking into account the distance in doing majority vote, the improvement is not obvious. Further study could be put on methods that can improve KNN performance for imbalanced data set and it is suggested to try different different metrics.
In conclusion,the SVM model had the best performance. Therefore, we choose it as the best method for the dry beans classification problem.

6 References

1: X. Chen, Y. Xun, W. Li, J. Zhang Combining discriminant analysis and neural networks for corn variety identification Comput. Electron. Agric., 71 (2010), pp. S48-S53, 10.1016/j.compag.2009.09.003

2: Murat Koklu, Ilker Ali Ozkan,Multiclass classification of dry beans using computer vision and machine learning techniques,Computers and Electronics in Agriculture,Volume 174,2020,105507,ISSN 0168-1699.

3: Symons, Stephen & Fulcher, R.. (1988). Determination of wheat kernel morphological variation by digital image analysis: I. Variation in Eastern Canadian milling quality wheats. Journal of Cereal Science. 8. 211-218. 10.1016/S0733-5210(88)80032-8.

4: Jigang Xie∗, Zhengding Qiu (2006), The effect of imbalanced data sets on LDA:A theoretical and empirical analysis, Pattern Recognition 40 (2007) 557 – 562