An Indian mathematician in 1927 devised a method to find outliers in data, a method that could have predicted the 2008 sub-prime mortgage crisis (Stockl & Hanke, 2014). If this statement has successfully captivated your attention, read ahead to know more about one of India’s most underappreciated mathematicians – Prasanta Chandra Mahalanobis, and yes, you guessed it, his greatest gift to humankind – The Mahalanobis Distance.

In this post, I will attempt to 1) introduce you to one of the most interesting statisticians in the world, 2) give you an intuitive explanation of the measure.

Mahalanobis distance is an oft repeated measure of determining outliers, but very few know the power of this measure. In wildlife biology, it is used to find the ideal landscape which best fits the niche of some wildlife species (Jenness Enterprises, 2016). In finance, it is used in asset classification and portfolio surveillance. Like I mentioned earlier, it could have been used to flag the underlying problems in the financial numbers leading up to the Great Recession. It is used to test and improve the welding quality of Robotic arc welding (Chand, et al., 2013). It is used in visual surveillance to search for criminals in crowds by facial re-identification (Roth, Hirzer, Kostinger, Beleznai, & Bischof, 2014). And I am going to stop here, because I am tired of listing references to research papers which use Mahalanobis Distance. You can take my word for it when I say, Mahalanobis Distance, is kind of a big deal in the world of Statistics.

Let us look at the creator of this widely used measure. Prasanta Chandra Mahalanobis was born in a Bengali family in Bikrampur (now in Bangladesh). Now that I have gotten the most boring fact about him out of the way, let me start listing all the mind-blowing facts about him. Jagdish Chandra Bose, The Father of Radio Science, was his teacher in school. Subhash Chandra Bose was his college mate in college! He was a Physics Honor Student, not Mathematics or Statistics, but in fact, Physics (not unlike the best data scientists today). He worked under J.J.Thompson, Nobel Prize winning Physicist after he completed his Bachelors. He had a flair for the theatricals too – P.C. Mahalanobis once played the protagonist in Rabindranath Tagore’s play! He accompanied Tagore on his Europe tour where they met Albert Einstein and Sigmund Freud. If you think the multi-variate measure of distance is his the only noteworthy achievement of his life, he also introduced the concept of Pilot surveys! Wow, this guy was a Rockstar!

Moving on to Mahalanobis Distance, there is no easy way to explain it. I must admit, I have not really understood the finer mathematical workings of it so well myself. The simplest definition of Mahalanobis Distance is that it is the Euclidean distance of a data point from the other points’ center of mass, normalized and vectorized based on their sample point distribution.

Consider a set of data points in an N-dimensional Euclidean space. The first step is the figure out the center of mass of all the data points. Given any data point, our task is to figure out if it is an outlier. To know if it is an outlier, the usual approach would be to measure the Euclidean distance of the data point in question and see if it is beyond a certain number of standard deviations from the center of mass. However, what this “usual” method does not consider, is the spread of all the data points themselves. What it assumes is that the distribution of the data points is spherical. This assumption means that only the normalized distance is sufficient to determine if any data point is an outlier or not. However, if the distribution is indeed ellipsoidal, or some other non-spherical form, then the direction of the data point in question also becomes important. Mahalanobis understood this shortcoming of the usual method and decided to incorporate the shape of the distribution as well. Hence, if an ellipsoid represents the spread of the data points the best, the Mahalanobis distance of a data point is the distance of that data point from the center of mass divided by the width of the ellipsoid in the direction of the test point.

This vectorized approach is why Mahalanobis is often considered to be a cousin of the Linear Discriminant Analysis. Here’s a rather intuitive explanation of Mahalanobis distance in two dimensions. You will realize why it is the cousin of Linear Discriminant Analysis in one scroll-through.

Now that you know the man behind this awesome measure, go ahead and let the world know! There is a new most interesting ~~man~~ statistician in town.

Works Cited

Chand, R. R., Kim, I. S., Lee, J. H., Lee, J. P., Shim, J. Y., & Kim, Y. S. (2013, November). A Study on Welding Quality of Robotic Arc Welding Process Using Mahalanobis Distance Method. (H. Z. A. Kiet Tieu, Ed.) Materials Science Forum, 773-774, pp. 759-765. doi:10.4028/www.scientific.net/MSF.773-774.759

Jenness Enterprises. (2016, February 01). Jenness Enterprises - ArcView Extensions; Mahalanobis Description. Retrieved from www.jennessent.com: http://www.jennessent.com/arcview/mahalanobis_description.htm

Roth, P. M., Hirzer, M., Kostinger, M., Beleznai, C., & Bischof, H. (2014). Mahalanobis Distance Learning for Person Re-Identification.

Stockl, S., & Hanke, M. (2014, November). Financial Applications of the Mahalanobis Distance. Applied Economics and Finance, 1(2), 79-84. doi:http://dx.doi.org/10.11114/aef.v1i2.511

WHUBER. (2013, July 8). Bottom to top explanation of the Mahalanobis distance? Retrieved from Statexchange: http://stats.stackexchange.com/a/62147

Statistemophilia

Sunday, 2 April 2017

Mahalanobis Distance - The Most Interesting Measure by the Most Interesting Statistician

Works Cited

No comments:

Post a Comment