Saturday, 16 September 2017

Logistic Regression 101 - Part 1: Why Predict Categorical Variables?

We all know that Ordinary Least Squares (OLS) doesn't work well with categorical Y variables. And we all know that Logistic Regression is the go-to tool when attempting to predict a binary Y variable. But why is it so great?
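To see what "doesn't work well" means in practice, here is a minimal sketch in Python (the numbers are made up purely for illustration, not taken from any real dataset): fitting a straight line to a 0/1 outcome by least squares happily produces fitted values below 0 and above 1, which make no sense for a binary variable.

```python
import numpy as np

# Made-up data: x is a continuous predictor, y is a 0/1 outcome
x = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
y = np.array([ 0,  0,  0,  0,  0,  1,  1,  1,  1,  1], dtype=float)

# Fit y = b0 + b1 * x by ordinary least squares
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b0 + b1 * 20)   # about -0.18: below 0, not a valid value for a 0/1 outcome
print(b0 + b1 * 65)   # about  1.18: above 1, equally nonsensical
```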

Most students of data science (including me) are familiar with the various tools you can use to generate logistic regression coefficients and predict a binary Y variable, but never really understand the intuitive logic behind logistic regression. We can usually interpret the coefficients and the predictions well enough. What I am trying to do with this series of blog posts is build a fundamental understanding of everything that makes logistic regression work, without shying away from admitting that I don't know everything about its building blocks: logarithms, logits, odds, maximum likelihood estimation, and the several other things that make it click.

Why Predict Categorical Variables at All?

Let's take a step back and look at the bigger picture. Why do data scientists so often need to predict a categorical variable? Because many phenomena in nature are qualitative: you are either dead or alive, you make one choice or another. Such outcomes usually have no cardinal or ordinal properties, i.e. no category is preferred over another or differs from it in size. Male doesn't come before female, nor does it differ from female in magnitude. Even if we choose to assign numerical values to the categories, the variable gains no numerical meaning. We can code males as 1 and females as 0, but the fact that 1 is greater than 0 tells us nothing we can use in the modeling process, and values between the categories are nonsensical: 0.5 is not a valid gender under this coding. Sometimes there are more than two outcomes: you vote for a Republican, a Democrat, or a third-party candidate.

Because such output variables are so prevalent in nature, it is important to be able to predict them just as accurately as we predict continuous output variables. As a starting point, it will serve us well to first understand clearly how to model a dichotomous output variable, and then extend that understanding to several categories. Throughout this series we will use the example of smoking: you smoke or you don't.
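A small sketch of the point about coding, using made-up smoking labels: the numbers we attach to categories are arbitrary. Two equally valid encodings of the same data produce different numbers, so the magnitudes themselves carry no information.

```python
# Hypothetical observations of smoking status
statuses = ["smoker", "non-smoker", "smoker", "non-smoker", "non-smoker"]

coding_a = {"non-smoker": 0, "smoker": 1}   # 1 means smoker
coding_b = {"non-smoker": 1, "smoker": 0}   # 1 means non-smoker; just as valid

print([coding_a[s] for s in statuses])   # [1, 0, 1, 0, 0]
print([coding_b[s] for s in statuses])   # [0, 1, 0, 1, 1]

# Neither encoding is "bigger" or "smaller" in any meaningful sense, and an
# intermediate value like 0.5 does not correspond to any category at all.
```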