Histogram: The Ultimate Guide to Binning
About Histograms
“A graphical representation, similar to a bar chart in structure, that organizes a group of data points into user-specified ranges. The histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges or bins.” – investopedia.com
Histograms are among the most useful tools for describing a set of numeric values. Compared to other summarizing methods, they offer rich descriptive power and are quick to interpret, because the human brain prefers visual perception. However, if you are not careful, viewers will not be able to understand your histogram, or you may fail to get the most out of it. Choosing a good bin size is especially important.
Why Choose Histograms?
If you have a set of data values, you probably want to share this information with your boss or co-workers to build a better business based on the information contained in these data. These data values could be any of the following:
• Customers’ ages
• Monthly revenues
• Length of time visitors spend on your website
• The number of cars sold by each agent
• Any other important numbers related to your business
You should share the information in a compact way because nobody wants to read numeric values one by one.
Alternatives are Wrong
Suppose you have a set of numbers: 1, 23, 24, 25, 25, 25, 26, 27, 30, 32, 999
The mean value (112.45) is very sensitive to outliers. Almost all real-world data has outliers, so the mean value can be very misleading.
The median value (25) does not tell you anything about the distribution.
The full range (1 – 999) just shows the outliers.
The standard deviation (294.1436) can be hard to interpret without a statistical background.
The variance (86520.47) can also be hard to interpret without a statistical background.
The interquartile range (IQR) (24.5 – 28.5) covers the central 50% of your values and does not tell you anything about the other 50%.
Which do you think describes the numbers best? The answer is none of them, because these numeric summaries do not include any information about spikes or the shape of the distribution. Therefore, you should always use a histogram.
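The summary values for the example numbers can be reproduced with Python's standard statistics module. This is only a minimal sketch: the standard deviation and variance are the sample versions, and the quartiles assume the "inclusive" interpolation method, which matches the 24.5 and 28.5 quoted above. Other quartile conventions give slightly different numbers.

```python
import statistics

values = [1, 23, 24, 25, 25, 25, 26, 27, 30, 32, 999]

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)        # sample standard deviation
variance = statistics.variance(values)  # sample variance

# Quartiles with linear interpolation between data points (assumed convention).
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(f"mean     = {mean:.2f}")
print(f"median   = {median}")
print(f"range    = {min(values)} to {max(values)}")
print(f"stdev    = {stdev:.4f}")
print(f"variance = {variance:.2f}")
print(f"IQR      = {q1} to {q3}")
```

None of these numbers reveals that ten of the eleven values sit between 1 and 32 while a single value sits at 999.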
Bin Carefully
Histograms are column charts in which each column represents a range of values, and the height of a column corresponds to how many values fall in that range.
Bin Width, Bin Count
The wider the range (bin width) you use, the fewer columns (bins) you will have.
number of bins = ceil( (maximum value - minimum value) / bin width )
Bins that are too wide can hide important details about the distribution, while bins that are too narrow can cause a lot of noise and hide important information about the distribution as well.
The width of the bins should be equal, and you should only use round values like 1, 2, 5, 10, 20, 25, 50, 100, and so on to make it easier for the viewer to interpret the data.
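The relationship between bin width and bin count can be sketched in a few lines of Python. The function names are hypothetical, and aligning the bin edges to multiples of the bin width is an assumption made here so that the edges land on round values; with aligned edges the number of bins can be one more than the formula gives.

```python
import math


def number_of_bins(minimum, maximum, bin_width):
    """Bin count from the formula: ceil((max - min) / bin width)."""
    return math.ceil((maximum - minimum) / bin_width)


def round_bin_edges(minimum, maximum, bin_width):
    """Equal-width bin edges aligned to multiples of bin_width."""
    start = math.floor(minimum / bin_width) * bin_width
    stop = math.ceil(maximum / bin_width) * bin_width
    count = round((stop - start) / bin_width)
    return [start + i * bin_width for i in range(count + 1)]


# Example: values between 12 and 69 with a bin width of 5.
bins = number_of_bins(12, 69, 5)
edges = round_bin_edges(12, 69, 5)
print(bins)
print(edges)
```

For values between 12 and 69 and a bin width of 5, both approaches give 12 bins, with edges running from 10 to 70.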
The following histograms were all created from the same example dataset, which contains 550 values between 12 and 69.
Too wide: the bins are too wide to reveal the unusual spike at around 53.
Too narrow: the bins are too narrow, so many spikes appear just by coincidence.
Unpretty: hard to read, because the bins have an awkward width of 7.
Unequal: hard to read, because the bin widths are not equal.
Ideal: this one is good.
TIPS
If you have a small amount of data, use wider bins to eliminate noise.
If you have a lot of data, use narrower bins because the histogram will not be that noisy.
For the example dataset used above (550 values between 12 and 69), the common bin-selection rules give the following results:
Measure | Square-root | Sturges | Rice | Scott | Freedman-Diaconis
Number of bins | 23 | 11 | 17 | 14 | 16
Bin width | 2 | 5 | 3 | 4 | 4
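The column names in the table refer to widely used bin-selection rules. The sketch below uses their textbook forms, which is an assumption because the formulas are not spelled out here: square-root (√n bins), Sturges (log2(n) + 1 bins), Rice (2·n^(1/3) bins), Scott (bin width 3.49·σ·n^(-1/3)), and Freedman-Diaconis (bin width 2·IQR·n^(-1/3)). Rounding conventions vary between tools, so the counts can differ by one from the table above. The function names and the synthetic sample are hypothetical, not the article's dataset.

```python
import math
import random
import statistics


def bins_from_width(data, width):
    """Convert a suggested bin width into a bin count over the data range."""
    if width <= 0:
        return 1  # degenerate case, e.g. an IQR of zero
    return max(1, math.ceil((max(data) - min(data)) / width))


def suggest_bin_counts(data):
    """Suggested number of bins under five common rules (rounded up)."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    scott_width = 3.49 * statistics.stdev(data) / n ** (1 / 3)
    fd_width = 2 * (q3 - q1) / n ** (1 / 3)
    return {
        "square-root": math.ceil(math.sqrt(n)),
        "sturges": math.ceil(math.log2(n)) + 1,
        "rice": math.ceil(2 * n ** (1 / 3)),
        "scott": bins_from_width(data, scott_width),
        "freedman-diaconis": bins_from_width(data, fd_width),
    }


# Synthetic sample of 550 values, for illustration only.
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(550)]
counts = suggest_bin_counts(sample)
for rule, count in counts.items():
    print(f"{rule}: {count} bins")
```

The first three rules depend only on the number of values, while Scott and Freedman-Diaconis also react to the spread of the data. Whichever rule you start from, rounding the resulting bin width to a round value keeps the histogram easy to read.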
More examples
Number of bins for 100/1000/10000 normally distributed numbers with mean 50 and standard deviation 10:
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis
100 | 10 | 8 | 10 | 6 | 7
1000 | 32 | 11 | 20 | 20 | 26
10000 | 100 | 15 | 44 | 51 | 66
Bin width for 100/1000/10000 normally distributed numbers with mean 50 and standard deviation 10:
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis
100 | 4 | 5 | 4 | 8 | 6
1000 | 2 | 6 | 3 | 4 | 3
10000 | 1 | 6 | 2 | 2 | 1
Number of bins for 100/1000/10000 uniformly distributed numbers with mean 50 and standard deviation 10:
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis
100 | 10 | 8 | 10 | 5 | 5
1000 | 32 | 11 | 20 | 10 | 10
10000 | 100 | 15 | 44 | 21 | 21
Bin width for 100/1000/10000 uniformly distributed numbers with mean 50 and standard deviation 10:
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis
100 | 10 | 12 | 10 | 20 | 19
1000 | 3 | 9 | 5 | 10 | 10
10000 | 1 | 7 | 2 | 5 | 5
Open or Closed
Deciding which bin a boundary value belongs to is not so easy. If you look at the 10-15-20-25… binned histogram, are the occurrences of the value “20” represented in the second column, the third column, or both? Each value must be put into exactly one bin.
Two options are available:
Option A: all bins are left-open, right-closed intervals
Bin | Interval | Contains these values
First bin | (10,15] | 11, 12, 13, 14, 15
Second bin | (15,20] | 16, 17, 18, 19, 20
Third bin | (20,25] | 21, 22, 23, 24, 25
Option B: all bins are left-closed, right-open intervals
Bin | Interval | Contains these values
First bin | [10,15) | 10, 11, 12, 13, 14
Second bin | [15,20) | 15, 16, 17, 18, 19
Third bin | [20,25) | 20, 21, 22, 23, 24
Avoid the Trap
You are free to choose either of these options, but be careful! With either option, one boundary value will not be included in the histogram. If you choose Option A, the value “10” will not be included in any of the bins. If you choose Option B, the value “25” will not be included in any of the bins.
The solution is to make the first or the last bin a fully closed interval. We suggest closing the last bin when using Option B, because uniform bins are usually more important on the left side than on the right. If you have integer values, it is recommended to label the bins “10-14,” “15-19,” and “20-25” instead of writing out “10,” “15,” “20,” “25.” Viewers of the histogram will then understand it better.
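The suggested scheme, left-closed and right-open bins with a fully closed last bin, can be sketched as follows. The function names are hypothetical, and the edges 10, 15, 20, 25 are taken from the example above.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the index of the bin that holds value, or None if out of range.

    Bins are left-closed, right-open: [edges[i], edges[i + 1]),
    except the last bin, which is closed on both sides.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return len(edges) - 2  # the maximum goes into the closed last bin
    return bisect_right(edges, value) - 1


def integer_labels(edges):
    """Readable labels for integer data, e.g. '10-14', '15-19', '20-25'."""
    last = len(edges) - 2
    labels = []
    for i in range(len(edges) - 1):
        upper = edges[i + 1] if i == last else edges[i + 1] - 1
        labels.append(f"{edges[i]}-{upper}")
    return labels


edges = [10, 15, 20, 25]
counts = [0] * (len(edges) - 1)
for value in [10, 14, 15, 20, 24, 25]:
    counts[bin_index(value, edges)] += 1

labels = integer_labels(edges)
print(list(zip(labels, counts)))
```

Both boundary values are counted exactly once: “10” falls into the first bin and “25” into the last one.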
Summary
Remember to ask for a histogram whenever you are about to be tricked by a single average, for example:
• If your marketing specialist says that your campaigns usually reach 1000 people
• If your salesman tells you that your purchasers spend approximately $100 in your shop
• If your car mechanic says your vehicle will be ready in seven days
• If your family physician tells you that you will recover from the disease in five days
• If your mom says that the lunch is going to be ready in approximately 15 minutes
Keep in Mind
AnswerMiner helps you to create automatic histograms, so you do not need to bother with finding ideal settings.