Saturday, 16 September 2017

Logistic Regression 101 - Part 1: Why Predict Categorical Variables?

We all know Ordinary Least Squares (OLS) doesn't work well with categorical Y variables. And we all know that Logistic Regression is the go-to tool for predicting binary Y variables. But why is it so great?

Most students of data science (including me) are familiar with the various tools you can use to generate logistic regression coefficients to predict a binary Y variable, but never really understand the intuitive logic behind logistic regression. We are always able to interpret the coefficients and the predictions pretty well. What I am trying to do with this series of blog posts is to gain a fundamental understanding of everything that makes logistic regression work, without shying away from admitting that I don't know everything about its building blocks: logarithms, logits, odds, maximum likelihood estimation, and the several other things that make it click.

Why predict Categorical Variables at all?

Let's take a step back and look at the bigger picture. Why do data scientists so often need to predict a categorical variable? It is because many phenomena in nature are qualitative: you're either dead or alive, you make one choice or another. Most such outcomes have no cardinal or ordinal properties, i.e. none of the categories gets preference or differs in size. Male doesn't come before female, and isn't different in magnitude from female. Even if we choose to assign numerical values to the categories, no numerical meaning attaches to the variable. You can assign 1 to males and 0 to females, but the fact that 1 is greater than 0 cannot be used in the modeling process. Values between the categories are nonsensical: 0.5 is not a valid gender under this coding. Sometimes there are more than two outcomes: you vote for a Republican, a Democrat, or a third-party candidate.

Because such output variables are so prevalent, it is important to be able to predict them just as accurately as continuous output variables. As a starting point, it will serve us well to first clearly understand how to model a dichotomous output variable, which we will then extend to several categories. We will use the example of smoking: you smoke or you don't.
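To preview where this series is headed, here is a minimal sketch (using scikit-learn, which this post itself doesn't cover) of why fitting OLS to a 0/1 variable misbehaves while logistic regression does not. The data is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: X = some hypothetical exposure, y = smoker (1) or not (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# OLS treats 0/1 as ordinary numbers: extrapolated "probabilities"
# can fall outside [0, 1], which is nonsensical for a category
ols = LinearRegression().fit(X, y)
print(ols.predict([[10]]))  # exceeds 1

# Logistic regression pushes the linear predictor through a sigmoid,
# so predicted probabilities always stay strictly between 0 and 1
logit = LogisticRegression().fit(X, y)
print(logit.predict_proba([[10]])[:, 1])
```

The point is not the specific numbers but the shapes: a straight line is unbounded, while the logistic curve is confined to the interval a probability must live in.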

Sunday, 21 May 2017

What kind of nuts/seeds are best for bone health?


Have you ever wondered 'which nuts should I add to my diet to improve my bone health'?

I recently got a full body checkup, including a bone density test, and was shocked to find out that my bone density was very low. I consume a lot of dairy products and get some amount of exercise. My mother, who has tooth enamel problems besides bone density problems, really needed to supplement her diet with foods rich in calcium, iron, phosphorus and vitamins like B2 and B3 to aid bone and teeth health. We are both heavy snackers - I usually snack on chips or biscuits, knowing full well that they are not good for me. I love munching on nuts and seeds, but I have never known which ones are best for you, only that nuts can be very high in calories and saturated fats. When she came to me for advice, I was honestly clueless. I did a couple of Google searches, but there were too many variables. So I decided to put my BI hat on and solve this with data.

Nuts/seeds vary in nutritional value, and there are many myths surrounding what is good for you and what is not. I decided to look for a dataset that gave me a comprehensive picture of the nutritional value of different foods, specifically nuts. I found this page, and the section "USDA Nutrition Dataset for the Apps for Healthy Kids Competition" had an Excel dataset that looked pretty comprehensive. I imported it into Tableau, filtered it down to Food Group 12 - Nuts/Seeds, and started messing with it.

My aim was to build a dashboard that tells you which nuts/seeds to consume in order to maximize the intake of a particular nutrient while minimizing the carbs, fat, and calories that come with them.
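The same filter-and-rank idea behind the dashboard can be sketched in pandas. The column names and the handful of rows below are illustrative stand-ins, not the actual headers or values of the USDA file:

```python
import pandas as pd

# Illustrative rows with hypothetical column names; the real USDA
# spreadsheet uses different headers and has far more foods
df = pd.DataFrame({
    "Description": ["Almonds", "Chia seeds", "Sesame seeds", "Brazil nuts"],
    "FoodGroup":   [12, 12, 12, 12],
    "Calcium_mg":  [269, 631, 975, 160],
    "Energy_kcal": [579, 486, 573, 659],
})

# Keep Food Group 12 (Nuts/Seeds), then rank by nutrient per calorie -
# the same "maximize a nutrient, minimize calories" trade-off as the dashboard
nuts = df[df["FoodGroup"] == 12].copy()
nuts["Calcium_per_kcal"] = nuts["Calcium_mg"] / nuts["Energy_kcal"]
print(nuts.sort_values("Calcium_per_kcal", ascending=False)
          [["Description", "Calcium_per_kcal"]])
```

Swapping `Calcium_mg` for any other nutrient column reproduces the dashboard's nutrient selector.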

So here it is.

Understanding Nutritional Value of Nuts/Seeds (Data from USDA)




Use this dashboard to find out which nuts/seeds to include in your diet to maximize the intake of a certain nutrient while reducing the number of calories and saturated fat.

Here's how to use it:

1. Select the nutrient whose intake you want to maximize.
2. Set an upper limit on the number of calories you are willing to consume. (If you are like me, you can skip this step.)
3. Hover over the circles to see which nuts/seeds have the highest nutritional value in terms of the selected nutrient. The nuts/seeds in the lower right quadrant, and the ones with the smallest circles, are the best for you.
4. If you want to look at a specific group of nuts/seeds, select it from the group filter.

Make an informed choice to improve your bone health. Or not. But check out the dashboard anyway - it is quite fun.


Sunday, 2 April 2017

Mahalanobis Distance - The Most Interesting Measure by the Most Interesting Statistician

An Indian mathematician in 1927 devised a method to find outliers in data, a method that could have predicted the 2008 sub-prime mortgage crisis (Stockl & Hanke, 2014). If this statement has successfully captivated your attention, read ahead to know more about one of India’s most underappreciated mathematicians – Prasanta Chandra Mahalanobis, and yes, you guessed it, his greatest gift to humankind – The Mahalanobis Distance.

In this post, I will attempt to 1) introduce you to one of the most interesting statisticians in the world, 2) give you an intuitive explanation of the measure.



Mahalanobis distance is an oft-cited measure for detecting outliers, but few know the full power of this measure. In wildlife biology, it is used to find the landscape that best fits the niche of a wildlife species (Jenness Enterprises, 2016). In finance, it is used in asset classification and portfolio surveillance. As I mentioned earlier, it could have been used to flag the underlying problems in the financial numbers leading up to the Great Recession. It is used to test and improve the weld quality of robotic arc welding (Chand, et al., 2013). It is used in visual surveillance to search for criminals in crowds by facial re-identification (Roth, Hirzer, Kostinger, Beleznai, & Bischof, 2014). And I am going to stop here, because I am tired of listing research papers that use the Mahalanobis Distance. You can take my word for it when I say that the Mahalanobis Distance is kind of a big deal in the world of statistics.



Let us look at the creator of this widely used measure. Prasanta Chandra Mahalanobis was born into a Bengali family in Bikrampur (now in Bangladesh). Now that I have gotten the most boring fact about him out of the way, let me start listing the mind-blowing ones. Jagdish Chandra Bose, the Father of Radio Science, was his teacher in school. Subhash Chandra Bose was his classmate in college! He was a Physics honors student - not Mathematics or Statistics, but Physics (not unlike the best data scientists today). After completing his Bachelors, he worked under J. J. Thomson, the Nobel Prize-winning physicist. He had a flair for the theatrical too - P.C. Mahalanobis once played the protagonist in a Rabindranath Tagore play! He accompanied Tagore on his Europe tour, where they met Albert Einstein and Sigmund Freud. And if you think the multivariate measure of distance is the only noteworthy achievement of his life, he also introduced the concept of pilot surveys! Wow, this guy was a rockstar!

Moving on to the Mahalanobis Distance itself, there is no easy way to explain it; I must admit I have not fully grasped its finer mathematical workings myself. The simplest definition is this: the Mahalanobis distance of a data point is its distance from the center of mass of the other points, normalized by the spread of the sample distribution in the direction of that point.

Consider a set of data points in an N-dimensional Euclidean space. The first step is to figure out the center of mass of all the data points. Given any data point, our task is to decide whether it is an outlier. The usual approach would be to measure the Euclidean distance of the point from the center of mass and see if it is beyond a certain number of standard deviations. However, this "usual" method does not consider the spread of the data points themselves: it assumes that their distribution is spherical, so that the normalized distance alone is enough to decide whether a point is an outlier. If the distribution is instead ellipsoidal, or some other non-spherical form, then the direction of the point also matters. Mahalanobis understood this shortcoming of the usual method and decided to incorporate the shape of the distribution as well. Hence, if an ellipsoid best represents the spread of the data points, the Mahalanobis distance of a point is its distance from the center of mass divided by the width of the ellipsoid in the direction of that point.
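A minimal NumPy sketch of this idea (not from the post; the example cloud and test points are made up) shows how two points at similar Euclidean distances can have wildly different Mahalanobis distances once the direction-dependent spread is accounted for:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: an ellipsoidal cloud, not a spherical one
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[4.0, 1.9], [1.9, 1.0]], size=500)

mu = X.mean(axis=0)                 # center of mass
S_inv = np.linalg.inv(np.cov(X.T))  # inverse of the sample covariance

def mahalanobis(x, mu, S_inv):
    """D(x) = sqrt((x - mu)^T S^-1 (x - mu))."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

# Two points at comparable Euclidean distances from the center...
a = np.array([3.0, 1.5])   # roughly along the cloud's long axis
b = np.array([-1.5, 3.0])  # across the cloud's narrow axis

# ...but very different Mahalanobis distances, because the data's
# spread differs by direction: b is the outlier, a is not
print(mahalanobis(a, mu, S_inv), mahalanobis(b, mu, S_inv))
```

For one dimension this collapses to the familiar z-score; the covariance matrix is what generalizes "how many standard deviations away" to an ellipsoidal cloud. (SciPy also ships a ready-made `scipy.spatial.distance.mahalanobis`.)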


This direction-aware scaling is why the Mahalanobis distance is often considered a cousin of Linear Discriminant Analysis. Here's a rather intuitive explanation of the Mahalanobis distance in two dimensions (WHUBER, 2013); you will see why it is the cousin of Linear Discriminant Analysis in one scroll-through.


Now that you know the man behind this awesome measure, go ahead and let the world know! There is a new most interesting statistician in town.


Works Cited

Chand, R. R., Kim, I. S., Lee, J. H., Lee, J. P., Shim, J. Y., & Kim, Y. S. (2013, November). A Study on Welding Quality of Robotic Arc Welding Process Using Mahalanobis Distance Method. (H. Z. A. Kiet Tieu, Ed.) Materials Science Forum, 773-774, pp. 759-765. doi:10.4028/www.scientific.net/MSF.773-774.759
Jenness Enterprises. (2016, February 01). Jenness Enterprises - ArcView Extensions; Mahalanobis Description. Retrieved from www.jennessent.com: http://www.jennessent.com/arcview/mahalanobis_description.htm
Roth, P. M., Hirzer, M., Kostinger, M., Beleznai, C., & Bischof, H. (2014). Mahalanobis Distance Learning for Person Re-Identification.
Stockl, S., & Hanke, M. (2014, November). Financial Applications of the Mahalanobis Distance. Applied Economics and Finance, 1(2), 79-84. doi:http://dx.doi.org/10.11114/aef.v1i2.511
WHUBER. (2013, July 8). Bottom to top explanation of the Mahalanobis distance? Retrieved from Cross Validated (Stack Exchange): http://stats.stackexchange.com/a/62147