Mean vs. Median
An average can mislead us and hide valuable information in our data, so it is not always obvious which number to report. In everyday statistics, in the news and even in scientific research, results are usually summarized by the mean. When the data is skewed or contains outliers, the median usually gives a more reliable picture of a typical value in your dataset, so we recommend using the median instead of the mean.
The difference between mean and median
The “mean” is the “average” you’re used to, where you add up all the numbers and then divide by the number of numbers. The “median” is the “middle” value in the list of numbers. To find the median, your numbers have to be listed in numerical order, so you may have to rewrite your list first. Source of definitions
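The two definitions above can be sketched in a few lines of Python. This is only an illustration: the `incomes` list is made up and is not our dataset, and the helper names `manual_mean` and `manual_median` are our own. Python's standard `statistics` module already provides `mean` and `median`, so the helpers are here just to make the steps visible.

```python
from statistics import mean, median

# Hypothetical values for illustration only (not the article's dataset).
incomes = [30, 40, 50, 60, 1000]


def manual_mean(values):
    """Add up all the numbers, then divide by how many there are."""
    return sum(values) / len(values)


def manual_median(values):
    """Sort the numbers and take the middle one.

    With an even count there is no single middle value, so the two
    middle values are averaged (the usual convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print("mean:  ", manual_mean(incomes))    # 236.0
print("median:", manual_median(incomes))  # 50

# The standard library gives the same answers.
assert manual_mean(incomes) == mean(incomes)
assert manual_median(incomes) == median(incomes)
```

A single large value (1000) pulls the mean far above four of the five numbers, while the median stays at the middle value.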
To see the difference, here is an example built from our own dataset, purely to illustrate the case.
On the first chart you can see the income of companies in the middle range (25.56 to 1213). We restrict the chart to this range because our dataset contains some outliers that distort both the result and the visualization, and this view shows the main difference between the median and the mean clearly. The green line marks the median ($157K) of our data and the red line marks the mean ($603K). We wanted to find the average income of our customers, and this is where the problem appeared: the mean and the median told us very different things. So which number should we trust? We recommend choosing the median instead of the mean. Below you can read the reasons.
The outlier problem
On the second chart you can see the same dataset, visualized over its full range (7.3 to 27302.93). That is a big range. It also shows income inequality, but that is another topic.
The difference between the median and the mean is still easy to see, and now we can also see clearly how far away the outliers are.
The change of the mean
So we can ask the question: what if we leave out the outliers? Let’s see what happens when we exclude more and more “outlier” values.
The mean changes (it decreases) with every value we exclude, while the median barely moves until we remove a much larger share of the data. This means the median is a more stable and reliable number than the mean: outliers do not distort it.
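The same experiment can be tried on a small scale. The sketch below uses made-up numbers, not our dataset, and the helper `drop_largest` is a name we introduce here. It removes the largest values one at a time and prints the mean and the median after each step.

```python
from statistics import mean, median

# Hypothetical values with three large outliers (illustration only).
values = [10, 20, 30, 40, 50, 60, 70, 500, 2000, 9000]


def drop_largest(data, k):
    """Return the data sorted ascending, without its k largest values."""
    ordered = sorted(data)
    return ordered[:len(ordered) - k]


for k in range(4):
    kept = drop_largest(values, k)
    print(f"excluded {k}: mean = {mean(kept):8.1f}   median = {median(kept):5.1f}")
```

With these numbers the mean falls from over a thousand to 40 as the three outliers are removed, while the median only moves from 55 to 40.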
Standard deviation is often reported alongside the mean to describe the spread of the data, so that the result is not just a single number. However, not everybody understands it, so it is often better to report a range around the middle of the data instead. For this we can use the IQR (interquartile range), which covers the middle 50% of the values. But what about the other 50%? It is better to also check the 10th to 90th percentile range, which covers the middle 80% of the values. Its two endpoints also tell us where the bottom 10% and the top 10% of our data begin.
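As a rough sketch, both ranges can be computed with `statistics.quantiles` from the Python standard library. The data is made up, the helper name `middle_range` is ours, and the choice of `method="inclusive"` is an assumption: different tools interpolate percentiles slightly differently, so the exact numbers may not match a spreadsheet or another library.

```python
from statistics import quantiles, pstdev

# Hypothetical values for illustration only.
values = [10, 20, 30, 40, 50, 60, 70, 500, 2000, 9000]


def middle_range(data, lower_pct, upper_pct):
    """Return the (lower, upper) percentile values of the data.

    quantiles(..., n=100) returns the 99 cut points for the 1st to
    99th percentile, so percentile p sits at index p - 1.
    """
    cuts = quantiles(data, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


q1, q3 = middle_range(values, 25, 75)    # middle 50% of the values
p10, p90 = middle_range(values, 10, 90)  # middle 80% of the values

print("standard deviation:", round(pstdev(values), 1))
print("IQR:", q1, "to", q3, "(width", q3 - q1, ")")
print("10-90 percentile range:", p10, "to", p90)
```

The two ranges are stated in the same units as the data, which makes them easier to explain than a standard deviation.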
People generally understand shapes and visualized data better than plain numbers. For this reason we recommend using a histogram. Below you can see one more reason to use it.
It can happen that you have a large number of values near the minimum and the maximum, but just a few in the middle. In our example the mean and the median are the same number (5), and without a histogram you would not be able to see the real shape of your data.
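A minimal sketch of this situation, with two made-up lists (neither is the data behind our chart): both have a mean and a median of 5, but a simple text histogram shows how different they are. The names `u_shaped`, `peaked` and `text_histogram` are our own.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical data: many values at the extremes, few in the middle.
u_shaped = [1] * 10 + [5] * 2 + [9] * 10
# Hypothetical data: most values close to the middle.
peaked = [4] * 5 + [5] * 12 + [6] * 5


def text_histogram(data):
    """Return one line per distinct value, with one '#' per occurrence."""
    counts = Counter(data)
    return "\n".join(f"{value:>2} | {'#' * counts[value]}" for value in sorted(counts))


for name, data in [("u_shaped", u_shaped), ("peaked", peaked)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
    print(text_histogram(data))
    print()
```

Both lists report 5 for the mean and the median, yet in the first one almost nobody is actually near 5.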
- Using the mean can often mislead you and hide the truth
- Outliers can distort the results, and the mean will then give a false picture of your data
- Visualize your data (graphs, charts, histograms…)
- Always know the background
+1 interesting fact: the friendship paradox. On average, your friends have more friends than you do (explanation on the link), and on average your friends sleep, drink, work and so on … more than you. The mean doesn’t show the truth.
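The friend-count part of the paradox can be checked on a tiny example. The network below is invented for illustration: one well-connected person ("hub") who is friends with four people who only know the hub.

```python
from statistics import mean

# Hypothetical friendship network (illustration only).
friends = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub"],
}

# How many friends each person has.
own_counts = {person: len(fs) for person, fs in friends.items()}

# For each person, the average number of friends their friends have.
friends_avg = {
    person: mean(own_counts[f] for f in fs) for person, fs in friends.items()
}

mean_own = mean(own_counts.values())
mean_of_friends = mean(friends_avg.values())
fewer = [p for p in friends if own_counts[p] < friends_avg[p]]

print("average number of friends:", mean_own)
print("average of the friends' averages:", mean_of_friends)
print("people with fewer friends than their friends' average:", len(fewer), "of", len(friends))
```

The hub is counted once in everybody else's friend list, so it pulls up the "friends of friends" average for four of the five people.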
So use median and be a real data scientist!