In this post we will try out various classification algorithms on a data set and examine their performance. We will start by exploring the data numerically, then try some visualizations to get the big picture. The data set we are using is the “Weekly” data set from the ISLR library in R, and the solutions will be in R. As the title of the blog suggests, this is just an introduction, so I will try to keep it simple. This post will have two parts; in the second part we will try out a different data set.

The data set we are going to use contains the weekly percentage returns for the S&P 500 stock index between 1990 and 2010. It has 1089 observations and 9 variables:

- **Year** - the year in which the observation was recorded.
- **Lag1** to **Lag5** - Lag1 is the percentage return of the previous week, Lag2 of two weeks before, Lag3 of three weeks before, and so on.
- **Volume** - the average daily volume of shares traded, in billions.
- **Today** - the percentage return for this week.
- **Direction** - whether the market had a positive or negative return on a given week.

Let's take a quick glance at the data.

```r
# import the ISLR library, which contains the Weekly data set
library(ISLR)
head(Weekly)  # first six observations
```

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1990 | 0.816 | 1.572 | -3.936 | -0.229 | -3.484 | 0.1549760 | -0.270 | Down |
| 2 | 1990 | -0.270 | 0.816 | 1.572 | -3.936 | -0.229 | 0.1485740 | -2.576 | Down |
| 3 | 1990 | -2.576 | -0.270 | 0.816 | 1.572 | -3.936 | 0.1598375 | 3.514 | Up |
| 4 | 1990 | 3.514 | -2.576 | -0.270 | 0.816 | 1.572 | 0.1616300 | 0.712 | Up |
| 5 | 1990 | 0.712 | 3.514 | -2.576 | -0.270 | 0.816 | 0.1537280 | 1.178 | Up |
| 6 | 1990 | 1.178 | 0.712 | 3.514 | -2.576 | -0.270 | 0.1544440 | -1.372 | Down |

A few things we can observe:

- Direction is a qualitative variable; the rest of the variables are quantitative.
- A Direction value of `Down` indicates a negative market return and `Up` a positive one. These values are stored as factor levels in R; you can use `levels(Weekly$Direction)` to see the factor levels of the Direction column.
- Remember that the Volume variable is on a billion scale.

So let's start with the numerical exploration.

```r
summary(Weekly)
```

#### #Summary of the predictor variables

The variables we are going to use for prediction are shown in the summary below.

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume |
|---|---|---|---|---|---|---|---|
| Min. | 1990 | -18.1950 | -18.1950 | -18.1950 | -18.1950 | -18.1950 | 0.08747 |
| 1st Qu. | 1995 | -1.1540 | -1.1540 | -1.1580 | -1.1580 | -1.1660 | 0.33202 |
| Median | 2000 | 0.2410 | 0.2410 | 0.2410 | 0.2380 | 0.2340 | 1.00268 |
| Mean | 2000 | 0.1506 | 0.1511 | 0.1472 | 0.1458 | 0.1399 | 1.57462 |
| 3rd Qu. | 2005 | 1.4050 | 1.4090 | 1.4090 | 1.4090 | 1.4050 | 2.05373 |
| Max. | 2010 | 12.0260 | 12.0260 | 12.0260 | 12.0260 | 12.0260 | 9.32821 |

If you look at the table, the variables Lag1 to Lag5 span the same range and look nearly identical, at least in summary. The data set covers the years 1990 to 2010.

#### #Summary of the response variable

This is the label that will be used for training the model.

| Direction | Count | Percentage |
|---|---|---|
| Down | 484 | 44.44% |
| Up | 605 | 55.56% |

From the table above it is clear that in 44.44% of the weeks between 1990 and 2010 the market return was negative, and in 55.56% of the weeks it was positive.
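The counts and percentages above can be reproduced directly in R (assuming the `Weekly` data set from the ISLR library is loaded):

```r
table(Weekly$Direction)                              # counts per class
round(prop.table(table(Weekly$Direction)) * 100, 2)  # percentages per class
```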

### #Correlation Matrix

A correlation matrix tells us to what extent two variables have a linear relationship with each other. The value always lies between -1 and 1: a value close to zero indicates little linear dependence (and exactly zero means no linear relationship), while a value close to 1 or -1 indicates a strong linear relationship. The sign indicates whether the linear dependency is positive or negative.

```r
cor(Weekly[, -9])  # drop column 9 (Direction), which is qualitative
```

As we can see, the values in the matrix are too low to form any opinion about the correlation of any pair of variables, except for Volume and Year: these two have the highest correlation, 0.84194. Let's take it a little further and visualize these two variables together with the Direction variable.

### #Visualize the data set

`pairs(Weekly)` displays the pairwise scatter plots. It can be thought of as a visual representation of the correlation matrix. The data set's variables appear as the diagonal elements of the plot, and each variable is plotted against every other variable, so the scatter plot of Lag5 against Volume is the plot to the left of Lag5 and above Volume.

Let's visualize a scatter plot of Volume against Year, where a point appears red if the market return for that observation is negative and blue if it is positive.

```r
attach(Weekly)
plot(Year, Volume, col = ifelse(Direction == "Down", "red", "blue"))
```

The visualization doesn't seem to reveal any pattern.

### #Logistic Regression

Now let's perform logistic regression on the data set.

```r
# logistic regression on the Weekly data
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)
summary(glm.fit)
```

The coefficients of all the variables are negative except for Lag2. Lag2 also has the lowest p-value, and the intercept's p-value is low as well; the other variables don't seem to be statistically significant.

```r
# we haven't passed any data for prediction, so the function will use the
# training data itself
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
table(glm.pred, Weekly$Direction)  # confusion table
mean(glm.pred == Weekly$Direction) # training accuracy
```

As we can see, the accuracy of the model is 56%. Let's analyze the confusion table a bit more to get more insight into that accuracy.

**Confusion table**

| | Actual Down | Actual Up | Total Pred. | Pred. Accuracy | Pred. Error |
|---|---|---|---|---|---|
| Pred. Down | 54 | 48 | 102 | 52.94% | 47.06% |
| Pred. Up | 430 | 557 | 987 | 56.43% | 43.57% |

The confusion table gives us a deeper understanding of the model's performance: when the model predicts Down it is correct about 53% of the time, when it predicts Up it is correct about 56% of the time, and the overall accuracy is 56%. All the metrics we have calculated so far are on the training data, so we have only measured the training error.
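As a sanity check, the overall accuracy can be recomputed from the table's raw counts (the counts 54/48/430/557 are taken from the confusion table above):

```r
# confusion counts: rows = predicted (Down, Up), cols = actual (Down, Up)
conf <- matrix(c(54, 430, 48, 557), nrow = 2,
               dimnames = list(pred = c("Down", "Up"), actual = c("Down", "Up")))
sum(diag(conf)) / sum(conf)  # overall accuracy: 611 / 1089, about 0.561
```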

Now let's break the data set into two parts: a training set and a test set. The idea is that the model is trained on one part of the data and then tested on data it has not seen before, which gives a better understanding of how the model behaves when it faces new data. This way we can calculate both the training error and the test error.

```r
# breaking the data into train and test sets
train <- (Weekly$Year < 2009)         # 1990-2008 used for training
Weekly.test <- Weekly[!train, ]       # 2009-2010 held out for testing
Direction.test <- Weekly$Direction[!train]
```

Notice that here we consider only the Lag2 variable: we saw previously that only Lag2 was statistically significant, so we ignore all the other variables and build the model with Lag2 alone.

```r
# fit the logistic model on the training data using only Lag2
glm.fit2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.probs <- predict(glm.fit2, Weekly.test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
table(glm.pred, Direction.test)    # confusion table on the test set
mean(glm.pred == Direction.test)   # test accuracy
```

62.5% accuracy is quite good. Let's take it a step further: in the current setup the threshold is a probability of 0.5. Let's try different thresholds to see if some other value gives better accuracy.

```r
# training the model
```
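As a sketch, one way to compare thresholds (the names `glm.probs` and `Direction.test` are assumptions here, standing for the predicted test-set probabilities and the true test labels):

```r
# sweep candidate thresholds and record the test accuracy for each
thresholds <- seq(0.35, 0.65, by = 0.01)
accuracy <- sapply(thresholds, function(t) {
  pred <- ifelse(glm.probs > t, "Up", "Down")  # classify at threshold t
  mean(pred == Direction.test)                 # fraction predicted correctly
})
plot(thresholds, accuracy, type = "b",
     xlab = "Probability threshold", ylab = "Test accuracy")
```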


The graph seems to suggest that 0.5 is the optimal threshold.

## #Linear Discriminant Analysis (LDA)

So far we have used logistic regression for classification. It makes no assumptions about the distribution of the data, and its drawback is that it does a terrible job when the classes are well separated. We will now use another algorithm, Linear Discriminant Analysis (LDA), which takes a different approach: it assumes that the data in each class is normally distributed with a common variance, and that the boundary separating the classes is linear. I understand these assumptions are stringent, but we will relax one of them in the next algorithm (Quadratic Discriminant Analysis). Since the classes don't actually have the same variance, the LDA function internally computes the average of the class variances and uses that as the common variance.
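For reference, under these assumptions with a single predictor, LDA assigns an observation $x$ to the class $k$ with the largest discriminant score (this is the standard one-dimensional form; the derivation is not covered in this post):

$$\delta_k(x) = x \, \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

where $\mu_k$ is the mean of class $k$, $\sigma^2$ is the shared variance, and $\pi_k$ is the prior probability of class $k$. Because $\delta_k$ is linear in $x$, the resulting class boundary is linear.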

```r
# train LDA on the training data with Lag2 and measure the test error
library(MASS)  # lda() lives in the MASS package
lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred <- predict(lda.fit, Weekly.test)
table(lda.pred$class, Direction.test)   # confusion table
mean(lda.pred$class == Direction.test)  # test accuracy
```

62.5% test accuracy, which is the same as logistic regression. In fact, even the confusion table produced is the same.

## #Quadratic Discriminant Analysis

Quadratic Discriminant Analysis relaxes the assumption that all classes share the same variance, but still assumes that each class is normally distributed. Class boundaries are assumed to be quadratic. The code is almost the same as for LDA, except that the `lda` function call is replaced by `qda`.

```r
# train QDA on the training data with Lag2 and measure the test error
qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.pred <- predict(qda.fit, Weekly.test)
table(qda.pred$class, Direction.test)   # confusion table
mean(qda.pred$class == Direction.test)  # test accuracy
```

Accuracy of 58%, even though it predicts Up the whole time. Terrible performance; it is quite clear that the class boundaries are not quadratic.

## #K-Nearest Neighbours (KNN)

KNN classifies a point by taking the majority vote of its K nearest neighbours. For simplicity, we will use K = 1.

```r
library(class)  # knn() lives in the class package
set.seed(1)     # to produce the same result every time we run the algorithm
Lag2.s <- scale(Weekly$Lag2)  # standardize Lag2: mean 0, standard deviation 1
knn.pred <- knn(as.matrix(Lag2.s[train]), as.matrix(Lag2.s[!train]),
                Weekly$Direction[train], k = 1)
table(knn.pred, Direction.test)   # confusion table
mean(knn.pred == Direction.test)  # test accuracy
```

The `scale` function is used to standardize the Lag2 values so that they have mean 0 and standard deviation 1. If we don't do this, a predictor with very large values (for example a salary variable, whose values are usually 1000 or more) will dominate the distance calculation and bias KNN in its favor. To avoid that, we scale the variables so that all of them have a standard deviation of 1 and a mean of 0.
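As a small illustration of what `scale` does (the salary numbers below are made up for the example):

```r
salary <- c(1000, 2500, 4000, 5500)  # hypothetical large-valued predictor
scaled <- scale(salary)              # subtract the mean, divide by the sd
mean(scaled)                         # approximately 0
sd(as.vector(scaled))                # 1
```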

50% accuracy, which is as good as a random guess. 58.5% of the time it correctly guessed Up, and 41.11% of the time Down.

### #Comparison of the performance of the different algorithms

| Algorithm | Test Accuracy (%) | Up Accuracy (%) | Down Accuracy (%) |
|---|---|---|---|
| Logistic regression | 62.5 | 62.22 | 64.29 |
| Linear Discriminant Analysis | 62.5 | 62.22 | 64.29 |
| Quadratic Discriminant Analysis | 58.65 | 0 | 55.23 |
| KNN | 50 | 58.5 | 41.11 |

Logistic regression and LDA had the same accuracy, while QDA and KNN performed very badly.

## #Conclusion

We have seen how different algorithms give different performance depending on the nature of the data set; no one algorithm rules them all. We could also have used other techniques, like decision trees or neural networks, but that will be the subject of a future post.
