Histogram – The ultimate guide to binning
About histograms
“A graphical representation, similar to a bar chart in structure, that organizes a group of data points into user-specified ranges. The histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges or bins.” – according to investopedia.com
A histogram is often the most useful tool for describing a set of numeric values. Compared with other summarising methods, it carries the richest description while remaining the fastest to interpret, because the human brain favours visual perception. But if you are not careful, viewers won’t understand it, or you may fail to get the most out of it. Choosing a suitable bin size is especially important.
Why choose it?
If you have a set of data values, like
- ages of your customers
- revenues by months
- website visitors’ visit times
- number of sold cars by agents
or any other important numbers related to your business, you probably want to share them with your boss or co-workers so that you can build a better business together based on the information they contain. You need to share that information in a compact way, because nobody wants to read numeric values one by one.
Alternatives are wrong
Suppose you have a set of numbers: 1; 23; 24; 25; 25; 25; 26; 27; 30; 32; 999
Mean: 112.45
Very sensitive to outliers. Almost all real-world data contain outliers, so the mean can be very misleading.
Median: 25
It doesn’t tell you anything about the distribution.
Full range: 1 – 999
Just shows you the outliers.
Standard deviation: 294.1436
Hard to interpret without a statistical background.
Variance: 86520.47
Hard to interpret without a statistical background.
Interquartile range (IQR): 24.5 – 28.5
The central 50% of your values tells you nothing about the other 50%.
Which of these do you think describes the numbers best? None of them, because these numeric summaries carry no information about spikes or about the shape of the distribution. That is why you should always use a histogram.
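To see where these figures come from, here is a minimal sketch using Python’s standard `statistics` module. It assumes the sample (n - 1) formulas for standard deviation and variance and the "inclusive" quartile method; with those assumptions it reproduces the numbers listed above, while other quartile conventions would give slightly different values.

```python
import statistics

values = [1, 23, 24, 25, 25, 25, 26, 27, 30, 32, 999]

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)        # sample standard deviation (n - 1)
variance = statistics.variance(values)  # sample variance (n - 1)

# Quartiles with the "inclusive" method (an assumption about how the
# quartiles in the text were computed).
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(f"Mean: {mean:.2f}")
print(f"Median: {median}")
print(f"Full range: {min(values)} - {max(values)}")
print(f"Standard deviation: {stdev:.4f}")
print(f"Variance: {variance:.2f}")
print(f"Quartiles (Q1 - Q3): {q1} - {q3}")
```

Every one of these summaries is a single number or a pair of numbers, which is why none of them can show where the values pile up.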
Bin carefully
Histograms are column charts in which each column represents a range of values, and the height of a column corresponds to the number of values that fall in that range.
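The counting behind those columns can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts` is made up for this example, bins are assumed to start at multiples of the bin width, and empty bins are simply left out of the result.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of the given width.

    Keys are the left edges of the bins; each bin covers
    [left, left + bin_width). Empty bins do not appear in the result.
    """
    return Counter((v // bin_width) * bin_width for v in values)


values = [1, 23, 24, 25, 25, 25, 26, 27, 30, 32, 999]
bin_width = 5
counts = histogram_counts(values, bin_width)

for left in sorted(counts):
    # One '#' per value: the bar length plays the role of the column height.
    print(f"{left:>4}-{left + bin_width - 1:<4} {'#' * counts[left]}")
```

The text bars are a stand-in for a real chart; any plotting tool does the same counting before it draws the columns.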
Bin width, bin count
The wider the ranges (bin width) you use, the fewer columns (bins) you have:
number_of_bins = ceil( (maximum_value - minimum_value) / bin_width )
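As a quick check of the formula, here is a small Python sketch. The function and parameter names are chosen for this example only; the inputs 12 and 69 are the minimum and maximum of the example dataset described in this article.

```python
import math


def number_of_bins(minimum_value, maximum_value, bin_width):
    """Number of equal-width bins needed to cover the data range."""
    return math.ceil((maximum_value - minimum_value) / bin_width)


# Values between 12 and 69 span a range of 57.
print(number_of_bins(12, 69, 5))   # 57 / 5 = 11.4, rounded up to 12
print(number_of_bins(12, 69, 10))  # 57 / 10 = 5.7, rounded up to 6
```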
Bins that are too wide can hide important details about the distribution, while bins that are too narrow create a lot of noise, which can also hide important information about the distribution.
Bin widths should be equal, and you should use only pretty values such as 1, 2, 5, 10, 20, 25, 50, 100, … so that viewers can interpret the chart more easily.
The following histograms are created from the same example dataset, which contains 550 values between 12 and 69.
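One way to arrive at a pretty width is to compute a raw width first and then snap it to the nearest value in the list above. The sketch below is a simple illustration of that idea, not a standard algorithm: the name `pretty_bin_width` is hypothetical, and the candidate list stops at 100, so it would need extending for data with a larger range.

```python
# Candidate widths taken from the list of pretty values in the text.
PRETTY_WIDTHS = [1, 2, 5, 10, 20, 25, 50, 100]


def pretty_bin_width(raw_width):
    """Return the pretty width closest to raw_width.

    On a tie, the smaller candidate wins because min() keeps the first match.
    """
    return min(PRETTY_WIDTHS, key=lambda width: abs(width - raw_width))


print(pretty_bin_width(7))     # an unpretty width of 7 snaps to 5
print(pretty_bin_width(2.48))  # snaps to 2
```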
Too wide bins
The bins are too wide, so the unusual spike at around 53 cannot be detected.
Too narrow bins
The bins are too narrow, so many spikes appear purely by coincidence.
Unpretty bins
Hard to read, because the bins have an unpretty width of 7.
Unequal bins
Hard to read, because the bin widths are not equal.
Ideal bins
This one is good.
TIPS
If you have only a small amount of data, use wider bins to reduce noise.
If you have lots of data, you can use narrower bins, because the histogram will be less noisy.
Applying the common rules of thumb for choosing the number of bins (square-root, Sturges, Rice, Scott and Freedman-Diaconis) to the dataset used above (550 values between 12 and 69) gives the following results:
Rule | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
Number of bins | 23 | 11 | 17 | 14 | 16 |
Bin width | 2 | 5 | 3 | 4 | 4 |
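The article does not spell out the formulas behind these rule names, so the sketch below uses their commonly cited textbook forms: sqrt(n) for the square-root rule, ceil(log2 n) + 1 for Sturges, 2·n^(1/3) for Rice, a bin width of 3.49·σ/n^(1/3) for Scott and 2·IQR/n^(1/3) for Freedman-Diaconis. Rounding up with `ceil`, the "inclusive" quartile method and the function name `suggested_bin_counts` are assumptions made for this example, so the results may differ by a bin or so from the tables in this article.

```python
import math
import statistics


def suggested_bin_counts(values):
    """Suggest a number of bins under five common rules of thumb.

    Assumes the data have a non-zero standard deviation and a non-zero
    interquartile range; otherwise the width-based rules divide by zero.
    """
    n = len(values)
    data_range = max(values) - min(values)
    cube_root_n = n ** (1 / 3)

    # Scott and Freedman-Diaconis produce a bin width first.
    scott_width = 3.49 * statistics.stdev(values) / cube_root_n
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fd_width = 2 * (q3 - q1) / cube_root_n

    return {
        "square-root": math.ceil(math.sqrt(n)),
        "sturges": math.ceil(math.log2(n)) + 1,
        "rice": math.ceil(2 * cube_root_n),
        "scott": math.ceil(data_range / scott_width),
        "freedman-diaconis": math.ceil(data_range / fd_width),
    }


# Hypothetical, evenly spread data used only to exercise the function.
example = list(range(1000))
for rule, bins in suggested_bin_counts(example).items():
    print(f"{rule}: {bins} bins")
```

Note that the first three rules depend only on the number of values, while Scott and Freedman-Diaconis also react to how spread out the data are.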
More examples
100/1000/10000 normally distributed numbers with mean 50 and standard deviation 10: (Number of bins)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
100 | 10 | 8 | 10 | 6 | 7 |
1000 | 32 | 11 | 20 | 20 | 26 |
10000 | 100 | 15 | 44 | 51 | 66 |
100/1000/10000 normally distributed numbers with mean 50 and standard deviation 10: (Bin width)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
100 | 4 | 5 | 4 | 8 | 6 |
1000 | 2 | 6 | 3 | 4 | 3 |
10000 | 1 | 6 | 2 | 2 | 1 |
100/1000/10000 uniformly distributed numbers with mean 50 and standard deviation 10: (Number of bins)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
100 | 10 | 8 | 10 | 5 | 5 |
1000 | 32 | 11 | 20 | 10 | 10 |
10000 | 100 | 15 | 44 | 21 | 21 |
100/1000/10000 uniformly distributed numbers with mean 50 and standard deviation 10: (Bin width)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
100 | 10 | 12 | 10 | 20 | 19 |
1000 | 3 | 9 | 5 | 10 | 10 |
10000 | 1 | 7 | 2 | 5 | 5 |
Open? Closed?
Not so easy, huh? Now comes the trouble. If you look at a histogram binned at 10, 15, 20, 25, …, are the occurrences of the value “20” counted in the second column or the third? Or both? Each value has to go into exactly one bin.
There are two ways to do this:
Option A: all bins are left-open, right-closed intervals
Bin | Interval | Values contained |
First bin | (10,15] | 11, 12, 13, 14, 15 |
Second bin | (15,20] | 16, 17, 18, 19, 20 |
Third bin | (20,25] | 21, 22, 23, 24, 25 |
Option B: all bins are left-closed, right-open intervals
Bin | Interval | Values contained |
First bin | [10,15) | 10, 11, 12, 13, 14 |
Second bin | [15,20) | 15, 16, 17, 18, 19 |
Third bin | [20,25) | 20, 21, 22, 23, 24 |
Avoid the trap
You are free to choose either option, but be careful! With either one, one boundary value is left out of the histogram. If you choose Option A, the value “10” is not included in any bin. If you choose Option B, the value “25” is not included in any bin.
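The trap is easy to reproduce. The following sketch assigns the integers 10 to 25 to three bins of width 5 under each convention; the function names are invented for this example, and the arithmetic is only meant for integer data like the example here.

```python
import math


def bin_option_a(value, start, width, n_bins):
    """Left-open, right-closed bins: (start, start + width], and so on."""
    index = math.ceil((value - start) / width) - 1
    return index if 0 <= index < n_bins else None


def bin_option_b(value, start, width, n_bins):
    """Left-closed, right-open bins: [start, start + width), and so on."""
    index = math.floor((value - start) / width)
    return index if 0 <= index < n_bins else None


values = range(10, 26)  # the integers 10, 11, ..., 25

# Bin indices are zero-based: 1 is the second bin, 2 is the third bin.
print("Option A puts 20 in bin index", bin_option_a(20, 10, 5, 3))
print("Option B puts 20 in bin index", bin_option_b(20, 10, 5, 3))

# Values that end up in no bin at all.
print([v for v in values if bin_option_a(v, 10, 5, 3) is None])
print([v for v in values if bin_option_b(v, 10, 5, 3) is None])
```

Each convention silently drops exactly one boundary value, which is easy to miss when the data contain many values.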
The solution is to make either the first or the last bin a fully closed interval. We suggest closing the last bin and using Option B, because uniform bins are usually more important on the left side than on the right. If you have integer values, label the bins “10-14”, “15-19”, “20-25” rather than just writing 10, 15, 20, 25, so that viewers understand the histogram better.
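A minimal sketch of that suggestion, assuming integer data and sorted bin edges: every bin is left-closed and right-open except the last one, which is closed on both sides. The helper names `bin_index` and `integer_labels` are hypothetical.

```python
import bisect


def bin_index(value, edges):
    """Return the zero-based bin for value, or None if it is out of range.

    Bins are [edges[i], edges[i + 1]) except the last bin, which also
    includes its right edge.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return len(edges) - 2  # the fully closed last bin
    # bisect_right counts the edges that are <= value.
    return bisect.bisect_right(edges, value) - 1


def integer_labels(edges):
    """Build labels such as '10-14' for integer bins, closing the last one."""
    last = len(edges) - 2
    labels = []
    for i in range(len(edges) - 1):
        upper = edges[i + 1] if i == last else edges[i + 1] - 1
        labels.append(f"{edges[i]}-{upper}")
    return labels


edges = [10, 15, 20, 25]
counts = [0] * (len(edges) - 1)
for value in range(10, 26):  # the integers 10, 11, ..., 25
    counts[bin_index(value, edges)] += 1

for label, count in zip(integer_labels(edges), counts):
    print(label, count)
```

With this convention every value from 10 to 25 lands in exactly one bin, at the cost of the last bin holding one more integer than the others.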
Summary
Remember to always ask for a histogram when you are about to be tricked by a single average:
- If your marketing specialist says that your campaigns usually reach 1000 people
- If your salesman tells you that your purchasers spend about $100 in your shop
- If your car mechanic says your vehicle will be ready in 7 days
- If your family physician tells you that you will recover from the disease in about 5 days
- If your mom says that lunch is going to be ready in approximately 15 minutes
Keep in mind
AnswerMiner creates histograms automatically, so you don’t need to work out the ideal settings yourself.