Introduction to Classification Algorithms

In this post we will try out various classification algorithms on a data set and examine their performance. We will start by exploring the data numerically, and then try some visualization to get the big picture of the data set. The data set we are using is the "Weekly" data set from the ISLR library in R, and the solutions will be in R. As the title of the blog suggests, this is just an introduction, so I will try to keep it simple. This post will have two parts; in the second part we will try out a different data set.

The data set we are going to use contains the weekly percentage returns for the S&P 500 stock index between 1990 and 2010. It has 1089 observations and 9 variables:

  1. Year - the year in which the observation was recorded.
  2. Lag1 to Lag5 - Lag1 is the percentage return of the previous week, Lag2 of 2 weeks previous, Lag3 of 3 weeks previous, and so on.
  3. Volume - volume of shares traded on average, in billions.
  4. Today - the percentage return for this week.
  5. Direction - whether the market had a positive or negative return on a given week.

Let's take a quick glance at the data.

```R
# import the ISLR library
library(ISLR)
head(Weekly)
  Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
```

A few things we can observe:

  1. Direction is a qualitative variable and the rest of the variables are quantitative.
  2. A Direction value of Down indicates a negative market return and Up indicates a positive market return. These values are stored as factor levels in R; you can use levels(Weekly$Direction) to see the factor levels of the Direction column, as shown in the snippet after this list.
  3. Remember that the Volume variable is on a billion scale.
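A quick check of the factor levels mentioned above (a minimal sketch, assuming ISLR is loaded as in the snippet before the list):

```R
# the two factor levels of the response variable
levels(Weekly$Direction)
```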

So let's start with the numerical exploration.

```R
summary(Weekly)
```

#Summary of the predictor variables

Variables which we are going to use for prediction are shown in the summary below.

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume |
|---|---|---|---|---|---|---|---|
| Min. | 1990 | -18.1950 | -18.1950 | -18.1950 | -18.1950 | -18.1950 | 0.08747 |
| 1st Qu. | 1995 | -1.1540 | -1.1540 | -1.1580 | -1.1580 | -1.1660 | 0.33202 |
| Median | 2000 | 0.2410 | 0.2410 | 0.2410 | 0.2380 | 0.2340 | 1.00268 |
| Mean | 2000 | 0.1506 | 0.1511 | 0.1472 | 0.1458 | 0.1399 | 1.57462 |
| 3rd Qu. | 2005 | 1.4050 | 1.4090 | 1.4090 | 1.4090 | 1.4050 | 2.05373 |
| Max. | 2010 | 12.0260 | 12.0260 | 12.0260 | 12.0260 | 12.0260 | 9.32821 |

If you observe the table, the variables Lag1 to Lag5 are in the same range and look nearly identical, at least in the summary. The data set spans the years 1990 to 2010.

#Summary of the response variable

This is the label data which will be used for training the model.

| Direction | Count | Percentage |
|---|---|---|
| Down | 484 | 44.44% |
| Up | 605 | 55.56% |

From the above table it is clear that in 44.44% of the weeks between 1990 and 2010 the market return was negative, and in 55.56% it was positive.
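These counts and percentages can be reproduced directly (a small sketch using base R functions):

```R
# counts of Down and Up weeks
table(Weekly$Direction)
# the same counts expressed as percentages
round(prop.table(table(Weekly$Direction)) * 100, 2)
```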

#Correlation Matrix

The correlation matrix tells us to what extent two variables have a linear relationship with each other. The value is always between -1 and 1: a value close to zero indicates a weak linear relationship (and zero means none at all), while a value close to 1 in magnitude indicates a strong one. A positive or negative value indicates a positive or negative linear dependency, respectively.

```R
# correlation matrix of all variables except Direction (column 9)
cor(Weekly[,-9])
```

As we can see, the values in the matrix are too low to form any opinion about the correlation of any pair of variables, except for Volume and Year; these two variables have the highest correlation, at 0.84194. Let's take it a little further and visualize these two variables together with the Direction variable.
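To pull out just that pair (a one-line sketch):

```R
# correlation between the year and the average trading volume
cor(Weekly$Year, Weekly$Volume)
```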

#Visualize the data set

pairs(Weekly) displays the pairwise scatter plots. It can be thought of as a visual representation of the correlation matrix. The data set variables appear as the diagonal elements of the plot, and each variable is plotted against every other variable, so the scatter plot of Lag5 against Volume is the plot to the left of Lag5 and above Volume.
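A one-liner to produce that plot:

```R
# pairwise scatter plots of every pair of variables in the data set
pairs(Weekly)
```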

Let's visualize the scatter plot of Volume against Year; a point on the scatter plot will appear red if the market return for that observation is negative and blue if the market return is positive.

```R
attach(Weekly)
# a point on the scatter plot will appear red if the Direction in the data set
# is Down and blue otherwise
plot(Year, Volume, col = ifelse(Direction == "Down", "red", "blue"))
# note: the legend colors must be in the same order as the labels
legend("topleft", legend = c("Down", "Up"), col = c("red", "blue"), lwd = 3)
```

The visualization doesn't seem to reveal any pattern.

#Logistic Regression

Now let's perform logistic regression on the data set.

```R
# logistic regression on the Weekly data
# the model is trained on the entire data set so we can calculate the
# training error; the variables are visible thanks to the earlier attach(Weekly)

> glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial)
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6949 -1.2565  0.9913  1.0849  1.4579

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181
Lag2         0.05844    0.02686   2.175   0.0296 *
Lag3        -0.01606    0.02666  -0.602   0.5469
Lag4        -0.02779    0.02646  -1.050   0.2937
Lag5        -0.01447    0.02638  -0.549   0.5833
Volume      -0.02274    0.03690  -0.616   0.5377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4
```

The coefficients of all the variables are negative except Lag2's. Lag2 also has the lowest p-value, and the intercept has a low p-value as well; the other variables don't seem to be statistically significant.
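To inspect the estimates and p-values programmatically (a short sketch; summary(glm.fit)$coefficients is the standard accessor for this table):

```R
# matrix of estimates, standard errors, z values and p-values
summary(glm.fit)$coefficients
# just the p-values, sorted in increasing order
sort(summary(glm.fit)$coefficients[, 4])
```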

```R
# we haven't passed any data for prediction, so the function will use the
# training data; this is helpful for calculating the training error
> glm.probs = predict(glm.fit, type = "response")
# the rep function creates a vector of "Down" values of the same length as
# glm.probs and assigns it to glm.pred
> glm.pred = rep("Down", length(glm.probs))
# if the probability of a prediction is above 0.5 it is marked as Up
> glm.pred[glm.probs > 0.5] = "Up"
# print the confusion table
> table(glm.pred, Direction)
        Direction
glm.pred Down  Up
    Down   54  48
    Up    430 557
> mean(glm.pred == Direction) * 100
[1] 56.10652
```

As we can see, the accuracy of the model is 56%. Let's analyze the confusion table a bit more to get more insight into that accuracy.

Confusion table

| | Actual Down | Actual Up | Total Pred. | Pred. Accuracy | Pred. Error |
|---|---|---|---|---|---|
| Pred. Down | 54 | 48 | 102 | 52.94% | 47.06% |
| Pred. Up | 430 | 557 | 987 | 56.43% | 43.57% |

The confusion table gives us a deeper understanding of the performance of our model: when the model predicts Down it is correct about 53% of the time, when it predicts Up it is correct about 56% of the time, and the overall accuracy is 56%. All the metrics we have calculated so far are on the training data, so we have only measured the training error.
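The per-prediction accuracies in the table can be computed directly from the confusion table (a small sketch reusing the objects defined above):

```R
conf = table(glm.pred, Direction)
# fraction of correct predictions within each predicted class
round(diag(conf) / rowSums(conf) * 100, 2)
# overall training accuracy
round(mean(glm.pred == Direction) * 100, 2)
```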

Now let's break the data set into two parts: training data and a data set for testing our model. The idea is that the model is trained on one part of the data set and then tested on another part it has not seen before, which gives a better picture of how the model behaves when it faces new data. So now we can calculate both a training error and a test error.

```R
# breaking the data into train and test sets
library(ISLR)
attach(Weekly)
# the below expression returns a boolean/logical vector of the same length as
# the Year vector; all the positions where Year is less than 2009 are set to
# TRUE and the rest are set to FALSE, i.e. the positions for which
# Year is 2009 or 2010
train = Year < 2009
# Weekly[train, ] filters out the data between 1990 and 2008
Weekly.train = Weekly[train, ]
# Weekly.test will have the data from the years 2009 and 2010;
# !train flips the values of the train vector, so the data that was not
# included in training is used as the test set
Weekly.test = Weekly[!train, ]

# same thing as above, but for the labels, which are kept separate for the
# training and test sets

# labels for the training data
Direction.train = Direction[train]
# labels for the test data
Direction.test = Direction[!train]
# train our model on the training data; we pass the extra argument subset,
# which restricts the fit to the training rows in the same manner as above
> glm.fit_1990 = glm(Direction ~ Lag2, family = binomial, subset = train, data = Weekly)
```

Notice that here we considered only the Lag2 variable: we saw previously that only Lag2 was statistically significant, so we ignored all the other variables and built our model on Lag2 alone.

```R
# perform prediction on the test set (this step is needed before the
# predictions below can be made)
> glm.probs_1990 = predict(glm.fit_1990, Weekly.test, type = "response")
# create a vector of "Down" values as the starting point
> glm.pred_1990 = rep("Down", length(glm.probs_1990))
# if the probability of a prediction is above 0.5, mark it as Up;
# otherwise it stays Down
> glm.pred_1990[glm.probs_1990 > 0.5] = "Up"
# confusion table for the test data
> table(glm.pred_1990, Direction.test)
             Direction.test
glm.pred_1990 Down Up
         Down    9  5
         Up    34 56
# overall accuracy on the test data
> mean(glm.pred_1990 == Direction.test) * 100
[1] 62.5
```

62.5% accuracy is quite good. Let's take it a step further: in our current setting we have set the threshold to a probability of 0.5; let's try out different thresholds to see if we get better accuracy at some other value.

```R
# training the model
> glm.fit_1990 = glm(Direction ~ Lag2, family = binomial, subset = train, data = Weekly)
# perform prediction on the test set
> glm.probs_1990 = predict(glm.fit_1990, Weekly.test, type = "response")
# empty vector which will later be filled with the test error
# for each threshold we try
test_error = c()
# seq returns a vector of 20 numbers between 0 and 1 which we will use as
# the thresholds in prediction
prob_threshold = seq(0, 1, length.out = 20)
for (threshold in prob_threshold) {
  glm.pred_1990 = rep("Down", length(glm.probs_1990))
  # if the probability of a prediction is above the threshold
  # then mark the observation as Up
  glm.pred_1990[glm.probs_1990 > threshold] = "Up"
  # calculate the test error and append it to the test_error vector
  test_error = c(test_error, mean(glm.pred_1990 != Direction.test))
}
# plot error rate versus threshold
plot(prob_threshold, test_error, type = "l", xlab = "threshold", ylab = "error rate")
```
The graph seems to suggest that 0.5 is the optimal threshold.

#Linear Discriminant Analysis (LDA)

So far we have used logistic regression for classification; it makes no assumptions about the distribution of the data, but one drawback is that it can do a poor job when the classes are well separated. We will now use another algorithm called Linear Discriminant Analysis (LDA), which takes a different approach: it assumes that the data in each class is normally distributed with a common variance, which makes the boundary separating the classes linear. These assumptions may sound too stringent, but we will relax them in the next algorithm (Quadratic Discriminant Analysis). Since the classes don't actually all have the same variance, the LDA function internally computes a pooled estimate of the variance across all classes and uses that as the common variance.
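For a single predictor, these assumptions lead to the standard linear discriminant score (the one-dimensional form given in ISLR, Chapter 4): each class $k$ gets

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

where $\mu_k$ is the class mean, $\pi_k$ is the class prior, and $\sigma^2$ is the pooled variance shared by all classes; an observation is assigned to the class with the largest score. The score is linear in $x$, which is exactly why the decision boundary is linear.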

```R
# lda lives in the MASS library
library(MASS)
# train the model on the training subset (years 1990-2008)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
# plot the distribution of the discriminant values for each class
plot(lda.fit)
# perform prediction on the test data
lda.pred = predict(lda.fit, Weekly.test)
# get the predicted classes for the test data
lda.class = lda.pred$class
# print the confusion table
table(lda.class, Direction.test)
         Direction.test
lda.class Down Up
     Down    9  5
     Up     34 56
# print the test accuracy
mean(lda.class == Direction.test) * 100
[1] 62.5
```

62.5% test accuracy, the same as logistic regression. In fact, even the confusion table produced is the same.
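If you want to look behind the class labels, the object returned by predict for an LDA fit also carries the posterior probabilities of each class (a standard field of predict.lda in the MASS library):

```R
# posterior probabilities of Down and Up for the first few test observations
head(lda.pred$posterior)
```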

#Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis relaxes the assumption that all classes share the same variance, while still relying on the assumption that each class is normally distributed. Class boundaries are quadratic under this model. The code is almost the same as for LDA, except that the lda function call is replaced by qda.

```R
# train the model on the training subset (years 1990-2008)
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
# perform prediction on the test data
qda.pred = predict(qda.fit, Weekly.test)
# get the predicted classes for the test data
qda.class = qda.pred$class
table(qda.class, Direction.test)
         Direction.test
qda.class Down Up
     Down    0  0
     Up     43 61
# print the test accuracy
> mean(qda.class == Direction.test) * 100
[1] 58.65385
```

An accuracy of 58% even though it predicts Up the whole time: 58.65% of the test weeks actually were Up, so always predicting Up achieves exactly that accuracy. That is terrible performance, and it is quite clear the class boundary is not quadratic.

#K-Nearest Neighbours (KNN)

KNN is a nearest-neighbour algorithm: it predicts by majority vote among the K nearest neighbours of the point being predicted. For simplicity, we will use a value of 1 for K.

```R
# knn lives in the class library
library(class)
# to produce the same result every time we run the algorithm
set.seed(1)
# standard deviation of the Lag2 variable
> sd(Lag2)
[1] 2.357254
# standard deviation of Lag2 after standardizing
> sd(scale(Lag2))
[1] 1
# mean of the Lag2 variable
> mean(Lag2)
[1] 0.151079
# mean of Lag2 after standardizing (effectively zero)
> mean(scale(Lag2))
[1] 1.625766e-17
> train.X = as.matrix(scale(Lag2[train]))
> test.X = as.matrix(scale(Lag2[!train]))
> knn.pred = knn(train.X, test.X, Direction.train, k = 1)
> table(knn.pred, Direction.test)
        Direction.test
knn.pred Down Up
    Down   21 30
    Up     22 31
> mean(knn.pred == Direction.test) * 100
[1] 50
```

The scale function standardizes the Lag2 values so that they have mean 0 and standard deviation 1. If we didn't do this, a predictor with very large values (for example a salary variable, which typically takes values of 1000 or more) would dominate the distance calculation and bias KNN in its favour; to avoid that we scale each variable to have a standard deviation of 1 and a mean of 0.

50% accuracy, as good as a random guess. When it predicts Up it is correct 58.5% of the time, and when it predicts Down only 41.18% of the time.
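Since K = 1 was chosen only for simplicity, one natural follow-up is to sweep a few values of K and compare test accuracies. The loop below is a sketch under the same train/test split; it is not part of the original analysis:

```R
library(class)
set.seed(1)
# try a few neighbourhood sizes and report test accuracy for each
for (k in c(1, 3, 5, 10, 25)) {
  knn.pred = knn(train.X, test.X, Direction.train, k = k)
  cat("k =", k, ": accuracy =", round(mean(knn.pred == Direction.test) * 100, 2), "%\n")
}
```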

#Comparison of the performance of the different algorithms

| Algorithm | Test Accuracy (%) | Up Accuracy (%) | Down Accuracy (%) |
|---|---|---|---|
| Logistic regression | 62.5 | 62.22 | 64.29 |
| Linear Discriminant Analysis | 62.5 | 62.22 | 64.29 |
| Quadratic Discriminant Analysis | 58.65 | 58.65 | - (never predicted Down) |
| KNN | 50 | 58.5 | 41.18 |

Here Up/Down Accuracy is the fraction of correct predictions among the weeks the model predicted as Up/Down, taken from the confusion tables above.

Logistic regression and LDA had the same accuracy, while QDA and KNN performed poorly.

#Conclusion

We have seen how different algorithms give different performance depending on the nature of the data set. We could have used other techniques like decision trees or neural networks, but that will be the discussion of a future post.
