Introduction to Classification Algorithms

In this post we will try out various classification algorithms on a data set and examine their performance. We will start by exploring the data numerically and then try some visualization to get the big picture of the data set. The data set we are using is the “Weekly” data set from the ISLR library in R, and the solutions will be in the R language. As the title suggests, this is just an introduction, so I will try to keep it simple. This post will have two parts; in the second part we will try out a different data set.

The data set contains the weekly percentage returns for the S&P 500 stock index between 1990 and 2010. It has 1089 observations and 9 variables:

1. Year - the year in which the observation was recorded.
2. Lag1 to Lag5 - Lag1 is the percentage return for the previous week, Lag2 for two weeks previous, Lag3 for three weeks previous, and so on.
3. Volume - the average number of shares traded daily, in billions.
4. Today - the percentage return for this week.
5. Direction - whether the market had a positive or negative return in a given week.

Let's take a quick glance at the data.

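| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1990 | 0.816 | 1.572 | -3.936 | -0.229 | -3.484 | 0.1549760 | -0.270 | Down |
| 2 | 1990 | -0.270 | 0.816 | 1.572 | -3.936 | -0.229 | 0.1485740 | -2.576 | Down |
| 3 | 1990 | -2.576 | -0.270 | 0.816 | 1.572 | -3.936 | 0.1598375 | 3.514 | Up |
| 4 | 1990 | 3.514 | -2.576 | -0.270 | 0.816 | 1.572 | 0.1616300 | 0.712 | Up |
| 5 | 1990 | 0.712 | 3.514 | -2.576 | -0.270 | 0.816 | 0.1537280 | 1.178 | Up |
| 6 | 1990 | 1.178 | 0.712 | 3.514 | -2.576 | -0.270 | 0.1544440 | -1.372 | Down |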
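A glance like the one above can be produced with a few base R calls. This is a minimal sketch, assuming the ISLR package is already installed (for example with install.packages("ISLR")).

```r
# Assumes the ISLR package is installed
library(ISLR)

dim(Weekly)    # number of observations and variables
names(Weekly)  # column names
head(Weekly)   # first six observations
```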

A few things we can observe:

1. Direction is a qualitative variable and the rest of the variables are quantitative.
2. A Direction value of Down indicates a negative market return and Up indicates a positive market return. These are stored as factor levels in R; you can use levels(Weekly$Direction) to see the factor levels of the Direction column.
3. Remember that the Volume variable is in billions.

Let's start with the numerical exploration.

#Summary of the predictor variables

The variables which we are going to use for prediction are shown in the summary below.

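| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume |
|---|---|---|---|---|---|---|---|
| Min | 1990 | -18.1950 | -18.1950 | -18.1950 | -18.1950 | -18.1950 | 0.08747 |
| 1st Qu. | 1995 | -1.1540 | -1.1540 | -1.1580 | -1.1580 | -1.1660 | 0.33202 |
| Median | 2000 | 0.2410 | 0.2410 | 0.2410 | 0.2380 | 0.2340 | 1.00268 |
| Mean | 2000 | 0.1506 | 0.1511 | 0.1472 | 0.1458 | 0.1399 | 1.57462 |
| 3rd Qu. | 2005 | 1.4050 | 1.4090 | 1.4090 | 1.4090 | 1.4050 | 2.05373 |
| Max. | 2010 | 12.0260 | 12.0260 | 12.0260 | 12.0260 | 12.0260 | 9.32821 |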

If you look at the table, the variables Lag1 to Lag5 are in the same range; they look nearly identical, at least in the summary. The data set covers the years 1990 to 2010.
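A summary like this can be obtained with summary() on the predictor columns. The sketch below assumes the ISLR package is installed; the name predictors is only an illustrative helper.

```r
library(ISLR)

# Keep only the columns used as predictors (everything except Today and Direction)
predictors <- Weekly[, c("Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume")]
summary(predictors)
```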

#Summary of the response variable

This is the label data which will be used for training the model.

| Direction | Count | Percentage |
|---|---|---|
| Down | 484 | 44.44% |
| Up | 605 | 55.56% |

From the above table it is clear that the market return was negative in 44.44% of the weeks between 1990 and 2010 and positive in 55.56%.
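The counts and percentages can be computed with table() and prop.table(). This is a small sketch assuming the ISLR package is installed; the variable names are illustrative.

```r
library(ISLR)

levels(Weekly$Direction)  # factor levels of the response

direction_counts <- table(Weekly$Direction)
direction_pct <- round(100 * prop.table(direction_counts), 2)

direction_counts
direction_pct
```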

#Correlation Matrix

The correlation matrix tells us to what extent two variables have a linear relationship with each other. The value is always between -1 and 1. A value close to zero indicates a weak linear relationship, zero means no linear relationship, and a value close to 1 or -1 indicates a strong linear relationship. The sign indicates whether the linear dependency is positive or negative.

As we can see, the values in the matrix are too low to form any opinion about the correlation of any two variables, except for Volume and Year. These two variables have the highest correlation, 0.84194. Let's take it a little further and visualise these two variables in relation to the Direction variable.
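The correlation matrix can be computed with cor(). Since cor() only accepts numeric columns, this sketch drops the Direction factor first; it assumes the ISLR package is installed.

```r
library(ISLR)

# Direction is a factor, so exclude it before calling cor()
cor_matrix <- cor(Weekly[, names(Weekly) != "Direction"])

round(cor_matrix, 3)
cor_matrix["Year", "Volume"]  # the one notably large entry
```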

#Visualize the data set

pairs(Weekly) displays the pairwise scatter plots. It can be thought of as the visual representation of the correlation matrix. The data set variables appear as the diagonal elements of the plot, and each variable is plotted against every other variable. For example, the scatter plot of Lag5 against Volume is the panel in the row of Lag5 and the column of Volume, that is, to the right of the Lag5 label and above the Volume label.

Let's visualise a scatter plot of Volume against Year, where a point appears red if the market return for that observation is negative and blue if the market return is positive.

The visualization doesn't seem to reveal any pattern.
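The two plots described above could be drawn roughly as follows. This is a sketch with base R graphics, assuming the ISLR package is installed; the legend position and axis labels are my own choices.

```r
library(ISLR)

pairs(Weekly)  # all pairwise scatter plots

# Red for weeks with a negative return, blue for weeks with a positive return
point_colors <- ifelse(Weekly$Direction == "Down", "red", "blue")

plot(Weekly$Year, Weekly$Volume, col = point_colors,
     xlab = "Year", ylab = "Volume (billions)")
legend("topleft", legend = c("Down", "Up"), col = c("red", "blue"), pch = 1)
```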

#Logistic Regression

Now let's perform logistic regression on the data set.

The coefficients of all the predictors are negative except for Lag2. Lag2 also has the lowest p-value among the predictors, and the intercept has a low p-value as well. The other variables don't seem to be statistically significant.
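A fit of this kind can be obtained with glm() and family = binomial. The sketch below assumes the predictors are the five lag variables plus Volume (Year and Today left out), which is consistent with the coefficients described above, and that the ISLR package is installed.

```r
library(ISLR)

# Logistic regression of Direction on the five lags and Volume (assumed predictor set)
glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)

# Estimates, standard errors, z values and p-values
summary(glm_fit)$coefficients
```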

As we can see, the accuracy of the model is 56%. Let's analyze the confusion table to get more insight into the accuracy.

Confusion table

| | Actual Down | Actual Up | Total Pred. | Pred. Accuracy | Pred. Error |
|---|---|---|---|---|---|
| Pred. Down | 54 | 48 | 102 | 52.94% | 47.06% |
| Pred. Up | 430 | 557 | 987 | 56.43% | 43.57% |

The confusion table gives us a deeper understanding of the performance of our model. When the model predicts Down it is correct about 53% of the time, and when it predicts Up it is correct about 56% of the time; the overall accuracy is 56%. All of these metrics are calculated on the training data, so we have only measured the training error.
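The confusion table and training accuracy can be computed from the fitted probabilities. This sketch refits the model so that it runs on its own; it assumes the same predictor set as before (five lags plus Volume) and a 0.5 threshold.

```r
library(ISLR)

glm_fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)

# predict() returns P(Direction == "Up"), because "Up" is the second factor level
prob_up <- predict(glm_fit, type = "response")
pred <- ifelse(prob_up > 0.5, "Up", "Down")

confusion <- table(Predicted = pred, Actual = Weekly$Direction)
confusion

training_accuracy <- mean(pred == Weekly$Direction)
training_accuracy
```

The row-wise accuracies in the table are the diagonal entries divided by the row totals, for example confusion["Down", "Down"] / sum(confusion["Down", ]).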

Now let's split the data set into two parts: training data and a data set to test our model. The idea is that the model is trained on one part of the data set and then tested on another part which it has not seen before. This gives a better understanding of how our model behaves when it faces new data. So now we can calculate both training error and test error.

Notice that here we have used only the Lag2 variable. We saw previously that Lag2 was the only statistically significant predictor, so we ignore all the other variables and build our model with Lag2 alone.
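One way to set this up is sketched below. The split itself is an assumption on my part: weeks from 1990 to 2008 are used for training and 2009 to 2010 are held out for testing, which is the split used in the ISLR exercise for this data set. The ISLR package is assumed to be installed.

```r
library(ISLR)

# Assumed split: 1990-2008 for training, 2009-2010 held out for testing
train <- Weekly$Year <= 2008
test_data <- Weekly[!train, ]

# Logistic regression with Lag2 as the only predictor, fitted on the training weeks
glm_lag2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)

prob_up <- predict(glm_lag2, newdata = test_data, type = "response")
pred <- ifelse(prob_up > 0.5, "Up", "Down")

table(Predicted = pred, Actual = test_data$Direction)
test_accuracy <- mean(pred == test_data$Direction)
test_accuracy
```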

62.5% accuracy on the test data is quite good. Let's take it a step further. So far we have set the threshold to a probability of 0.5; let's try out different thresholds to see if we get better accuracy at some other value.

The graph seems to suggest that 0.5 is the optimal threshold.
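A threshold sweep of this kind can be sketched as follows. The grid from 0.40 to 0.70 and the 1990-2008 training split are assumptions for illustration, and the ISLR package is assumed to be installed.

```r
library(ISLR)

train <- Weekly$Year <= 2008  # assumed training period
test_data <- Weekly[!train, ]
glm_lag2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
prob_up <- predict(glm_lag2, newdata = test_data, type = "response")

# Test accuracy for each candidate threshold (illustrative grid)
thresholds <- seq(0.40, 0.70, by = 0.01)
accuracy <- sapply(thresholds, function(cutoff) {
  pred <- ifelse(prob_up > cutoff, "Up", "Down")
  mean(pred == test_data$Direction)
})

plot(thresholds, accuracy, type = "b",
     xlab = "Probability threshold", ylab = "Test accuracy")
thresholds[which.max(accuracy)]  # threshold with the highest test accuracy
```

Keep in mind that picking the threshold on the test data makes the reported accuracy optimistic; a separate validation set or cross-validation would give a fairer estimate.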

#Linear Discriminant Analysis (LDA)

So far we have used logistic regression for classification. It makes no assumption about the distribution of the predictors, but its drawback is that its parameter estimates become unstable when the classes are well separated. We will now use another algorithm called Linear Discriminant Analysis (LDA), which takes a different approach: it assumes that the data from each class is normally distributed with a common variance, which makes the boundary separating the classes linear. I understand these assumptions are stringent, but we will relax the common-variance assumption in the next algorithm (Quadratic Discriminant Analysis). Since the classes don't actually have exactly the same variance, the lda function internally computes a weighted average of the class variances and uses that as the common variance.

LDA gives 62.5% test accuracy, which is the same as that of logistic regression. In fact, even the confusion table produced is the same.
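An LDA fit on Lag2 can be sketched with lda() from the MASS package, which ships with standard R installations. The 1990-2008 training split is again an assumption, and the ISLR package is assumed to be installed.

```r
library(ISLR)
library(MASS)  # provides lda()

train <- Weekly$Year <= 2008  # assumed training period
test_data <- Weekly[!train, ]

lda_fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)

# predict() on an lda fit returns a list; $class holds the predicted labels
lda_pred <- predict(lda_fit, newdata = test_data)$class

table(Predicted = lda_pred, Actual = test_data$Direction)
lda_accuracy <- mean(lda_pred == test_data$Direction)
lda_accuracy
```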

#Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis relaxes the assumption that all the classes share the same variance, but still relies on each class being normally distributed. The class boundaries are quadratic in this algorithm. The code is almost the same as for LDA, except that the lda function call is replaced by the qda function.

The accuracy is 58%, even though the model predicts Up the whole time. This is poor performance, and it suggests that the class boundaries are not quadratic.
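The QDA version differs from the LDA sketch only in the fitting call. As before, the 1990-2008 training split is an assumption and the ISLR package is assumed to be installed.

```r
library(ISLR)
library(MASS)  # provides qda()

train <- Weekly$Year <= 2008  # assumed training period
test_data <- Weekly[!train, ]

qda_fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda_pred <- predict(qda_fit, newdata = test_data)$class

table(Predicted = qda_pred, Actual = test_data$Direction)
qda_accuracy <- mean(qda_pred == test_data$Direction)
qda_accuracy
```

A confusion table with an empty Down row is the sign that the model never predicts Down.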

#K - Nearest Neighbour (KNN)

KNN is a nearest neighbour algorithm: it looks at the K training points nearest to the point being predicted and assigns the class with the most votes among them. For simplicity, we will use K = 1.

The scale function is used to scale the Lag2 values so that they have mean 0 and standard deviation 1. If we don't do this, a predictor with very large values (for example, a salary variable with values of 1000 or above) will dominate the distance calculation and influence KNN in its favor. To avoid that, we scale the variables so that all of them have a standard deviation of 1 and a mean of 0.

The accuracy is 50%, which is as good as a random guess. When the model predicts Up it is correct 58.5% of the time, and when it predicts Down it is correct 41.11% of the time.
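A KNN run with K = 1 can be sketched with knn() from the class package, which ships with standard R installations. The 1990-2008 training split and the seed are assumptions; the ISLR package is assumed to be installed.

```r
library(ISLR)
library(class)  # provides knn()

train <- Weekly$Year <= 2008  # assumed training period

# scale() returns a one-column matrix with mean 0 and standard deviation 1
lag2_scaled <- scale(Weekly$Lag2)

train_x <- lag2_scaled[train, , drop = FALSE]
test_x <- lag2_scaled[!train, , drop = FALSE]
train_y <- Weekly$Direction[train]
test_y <- Weekly$Direction[!train]

set.seed(1)  # knn() breaks ties at random, so fix the seed for repeatability
knn_pred <- knn(train_x, test_x, cl = train_y, k = 1)

table(Predicted = knn_pred, Actual = test_y)
knn_accuracy <- mean(knn_pred == test_y)
knn_accuracy
```

With a single predictor, scaling does not change which neighbour is nearest; it starts to matter once several predictors with different units are used together.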

#Comparison of performance of different algorithms

| Algorithm | Test Accuracy (%) | Up Accuracy (%) | Down Accuracy (%) |
|---|---|---|---|
| Logistic regression | 62.5 | 62.22 | 64.29 |
| Linear Discriminant Analysis | 62.5 | 62.22 | 64.29 |
| Quadratic Discriminant Analysis | 58 | 58 | n/a (never predicts Down) |
| KNN | 50 | 58.5 | 41.11 |

Logistic regression and LDA have the same accuracy, while QDA and KNN perform poorly.

#Conclusion

We have seen how different algorithms give different performance depending on the nature of the data set. We could have used other techniques such as decision trees or neural networks, but they will be the subject of a future post.
