Histogram – The ultimate guide of binning

About histogram

“A graphical representation, similar to a bar chart in structure, that organizes a group of data points into user-specified ranges. The histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges or bins.” (according to investopedia.com)

A histogram is arguably the most useful tool for saying something about a set of numeric values. Compared to other summarising methods, it has the richest descriptive power while still being the fastest to interpret, because the human brain prefers visual perception. But if you are not careful, viewers won’t understand it, or you may fail to get the most out of it. Choosing a suitable bin size is especially important.

Why choose it?

If you have a set of data values, like

  •  ages of your customers
  •  revenue by month
  •  website visitors’ visit times
  •  number of cars sold per agent

or any other important numbers related to your business, you will probably want to share them with your boss or co-workers, so that together you can build a better business on the information these data contain. You need to share that information in a compact form, because nobody wants to read numeric values one by one.

Alternatives are wrong

Suppose you have a set of numbers: 1; 23; 24; 25; 25; 25; 26; 27; 30; 32; 999

Mean: 112.45
Very sensitive to outliers. Almost all real-world data contain outliers, so the mean can be very misleading.

Median: 25
It doesn’t tell you anything about the distribution.

Full range: 1 – 999
It shows you only the outliers.

Standard deviation: 294.1436
Hard to interpret without a statistical background.

Variance: 86520.47
Hard to interpret without a statistical background.

Interquartile range (IQR): 24.5 – 28.5
The central 50% of your values doesn’t tell you anything about the other 50%.

Which do you think describes the numbers best? None of them, because these numeric summaries carry no information about spikes or about the shape of the distribution. That is why you should use a histogram in every case.
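The summary figures above can be reproduced with Python's standard `statistics` module. This is a minimal sketch that assumes the sample (n - 1) versions of the standard deviation and variance, and quartiles computed by linear interpolation between data points; those choices are the ones that match the numbers quoted above, but they are assumptions, not something the article states.

```python
import statistics

values = [1, 23, 24, 25, 25, 25, 26, 27, 30, 32, 999]

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)        # sample standard deviation (divides by n - 1)
variance = statistics.variance(values)  # sample variance (divides by n - 1)

# "inclusive" treats the smallest and largest value as the 0th and 100th
# percentiles and interpolates linearly in between.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(f"mean     {mean:.2f}")
print(f"median   {median}")
print(f"range    {min(values)} to {max(values)}")
print(f"stdev    {stdev:.4f}")
print(f"variance {variance:.2f}")
print(f"IQR      {q1} to {q3}")
```

Every one of these numbers is dominated by, or blind to, the single value 999, which is exactly the point made above.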

Bin carefully

Histograms are column charts in which each column represents a range of values, and the height of a column corresponds to the number of values that fall in that range.

Bin width, bins count

The wider the ranges (bin width) you use, the fewer columns (bins) you have:

number of bins = ceil( (maximum value - minimum value) / bin width )
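As a small illustration, the formula translates directly into code. The function name `number_of_bins` is made up for this sketch, and the inputs use the range 12 to 69 of the example dataset described later in the article.

```python
import math

def number_of_bins(minimum_value, maximum_value, bin_width):
    """Number of equal-width bins needed to cover the range of the data."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return math.ceil((maximum_value - minimum_value) / bin_width)

print(number_of_bins(12, 69, 10))  # range 57, width 10 -> 6 bins
print(number_of_bins(12, 69, 25))  # range 57, width 25 -> 3 bins
```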

Bins that are too wide can hide important details about the distribution, while bins that are too narrow produce a lot of noise, which can also hide important information about the distribution.
Bin widths should be equal, and you should use only pretty values such as 1, 2, 5, 10, 20, 25, 50, 100, … so that the histogram is easier for the viewer to interpret.
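One possible way to apply this advice in code is sketched below. It assumes you already have a rough "raw" width in mind and want to snap it up to the next pretty value from the list above, then align the bin edges to multiples of that width. Both helper names (`pretty_width`, `aligned_edges`) and the "round up" policy are illustrative assumptions, not a rule given in the article.

```python
import math

# The pretty values listed in the text.
PRETTY_WIDTHS = (1, 2, 5, 10, 20, 25, 50, 100)

def pretty_width(raw_width):
    """Return the smallest listed pretty width that is not below raw_width."""
    for candidate in PRETTY_WIDTHS:
        if candidate >= raw_width:
            return candidate
    raise ValueError("raw width exceeds the largest listed pretty value")

def aligned_edges(minimum_value, maximum_value, width):
    """Equal-width bin edges that start on a multiple of width and cover the data."""
    start = math.floor(minimum_value / width) * width
    edges = [start]
    while edges[-1] <= maximum_value:
        edges.append(edges[-1] + width)
    return edges

width = pretty_width(3.56)
print(width)
print(aligned_edges(12, 69, width))
```

Aligning the first edge to a multiple of the width keeps the axis labels pretty as well, not just the bin size.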

The following histograms were all created from the same example dataset, which contains 550 values between 12 and 69.

Too wide bins

Figure: Too wide bins. The unusual spike at around 53 cannot be detected.

Too narrow bins

Figure: Too narrow bins. There are lots of spikes just by coincidence.

Unpretty bins

Figure: Unpretty bins. Hard to read, because the bins have an unpretty width of 7.

Unequal bins

Figure: Unequal bins. Hard to read, because the bin widths are not equal.

Ideal bins

Figure: Ideal bins. This one is good.

TIPS

If you have a small amount of data, use wider bins to reduce noise.
If you have lots of data, you can use narrower bins, because the histogram will not be as noisy.
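To get a feel for the first tip, the sketch below bins a small synthetic sample twice, once with narrow and once with wide bins, and prints both as text histograms. The sample is randomly generated for illustration only (40 draws from a normal distribution with mean 50 and standard deviation 10); it is not the article's dataset, and the helper names are hypothetical.

```python
import math
import random
from collections import Counter

def bin_counts(values, width):
    """Count values per equal-width, left-closed bin; keys are the left bin edges."""
    return Counter(math.floor(v / width) * width for v in values)

def text_histogram(values, width):
    """Render the bin counts as one row of '#' characters per bin."""
    counts = bin_counts(values, width)
    lines = []
    left = min(counts)
    while left <= max(counts):
        lines.append(f"{left:>4}-{left + width:<4} {'#' * counts[left]}")
        left += width
    return "\n".join(lines)

random.seed(0)  # fixed seed so the illustration is repeatable
small_sample = [random.gauss(50, 10) for _ in range(40)]

print(text_histogram(small_sample, 2))   # narrow bins
print()
print(text_histogram(small_sample, 10))  # wide bins
```

With only 40 values, the narrow-bin version is mostly a scattering of one- and two-value rows, while the wide-bin version shows the overall shape.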

The methods

| Method | Year | Formula (number of bins) | Formula (bin width) |
|---|---|---|---|
| Square-root | N/A | [formula image] | [formula image] |
| Sturges | 1926 | [formula image] | [formula image] |
| Rice | 1944 | [formula image] | [formula image] |
| Scott | 1979 | [formula image] | [formula image] |
| Freedman-Diaconis | 1981 | [formula image] | [formula image] |
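As a rough sketch, the five rules can be written in their commonly cited textbook forms: the first three give a number of bins from the sample size n, while Scott and Freedman-Diaconis give a bin width from the spread of the data. The function names are made up for illustration, the quartile method is an assumption, and the rounding used for the figures in this article may differ (for example, taking the ceiling of the square root of 550 gives 24, not 23).

```python
import math
import statistics

def sqrt_bins(n):
    """Square-root rule: k = ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n))

def sturges_bins(n):
    """Sturges' rule: k = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def rice_bins(n):
    """Rice rule: k = ceil(2 * n^(1/3))."""
    return math.ceil(2 * n ** (1 / 3))

def scott_width(values):
    """Scott's rule: h = 3.49 * s / n^(1/3), with s the sample standard deviation."""
    return 3.49 * statistics.stdev(values) / len(values) ** (1 / 3)

def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

def bins_from_width(values, width):
    """Convert a bin width into a bin count over the range of the data."""
    return math.ceil((max(values) - min(values)) / width)

for n in (100, 550, 1000, 10000):
    print(n, sqrt_bins(n), sturges_bins(n), rice_bins(n))
```

Note how slowly Sturges' count grows with n compared to the square-root rule; the width-based rules instead adapt to how spread out the values are.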


For the dataset used above (550 values between 12 and 69), we get the following results:

| | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
|---|---|---|---|---|---|
| Number of bins | 23 | 11 | 17 | 14 | 16 |
| Bin width | 2 | 5 | 3 | 4 | 4 |
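The bin widths in this table are consistent with dividing the range of the data (69 - 12 = 57) by the number of bins and rounding to the nearest whole number. That reading is an assumption inferred from the numbers, not something the article states; the short check below shows the arithmetic.

```python
value_range = 69 - 12  # range of the example dataset

# Number of bins per method, copied from the table above.
bins_by_method = {
    "Square-root": 23,
    "Sturges": 11,
    "Rice": 17,
    "Scott": 14,
    "Freedman-Diaconis": 16,
}

# Assumed relationship: width = range / bins, rounded to the nearest integer.
widths = {method: round(value_range / bins) for method, bins in bins_by_method.items()}

for method, width in widths.items():
    print(f"{method:<18} {value_range / bins_by_method[method]:.2f} -> {width}")
```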

More examples

Normally distributed numbers with mean 50 and standard deviation 10 (sample sizes 100, 1000 and 10000)

Number of bins:

| Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
|---|---|---|---|---|---|
| 100 | 10 | 8 | 10 | 6 | 7 |
| 1000 | 32 | 11 | 20 | 20 | 26 |
| 10000 | 100 | 15 | 44 | 51 | 66 |

Bin width:

| Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
|---|---|---|---|---|---|
| 100 | 4 | 5 | 4 | 8 | 6 |
| 1000 | 2 | 6 | 3 | 4 | 3 |
| 10000 | 1 | 6 | 2 | 2 | 1 |

Uniformly distributed numbers with mean 50 and standard deviation 10 (sample sizes 100, 1000 and 10000)

Number of bins:

| Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
|---|---|---|---|---|---|
| 100 | 10 | 8 | 10 | 5 | 5 |
| 1000 | 32 | 11 | 20 | 10 | 10 |
| 10000 | 100 | 15 | 44 | 21 | 21 |

Bin width:

| Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
|---|---|---|---|---|---|
| 100 | 10 | 12 | 10 | 20 | 19 |
| 1000 | 3 | 9 | 5 | 10 | 10 |
| 10000 | 1 | 7 | 2 | 5 | 5 |

Open? Closed?

Not so easy, huh? Now comes the trouble. If you look at a histogram binned at 10-15-20-25-…, are the occurrences of the value “20” counted in the second or the third column? Or both? Clearly, each value has to go into exactly one bin.

There are two ways to do this:

Option A: all bins are left-open, right-closed intervals

  •  First bin: (10,15] contains these values: 11 12 13 14 15
  •  Second bin: (15,20] contains these values: 16 17 18 19 20
  •  Third bin: (20,25] contains these values: 21 22 23 24 25

Option B: all bins are left-closed, right-open intervals

  •  First bin: [10,15) contains these values: 10 11 12 13 14
  •  Second bin: [15,20) contains these values: 15 16 17 18 19
  •  Third bin: [20,25) contains these values: 20 21 22 23 24
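The two conventions can be sketched with Python's standard `bisect` module: searching from the left side of equal elements gives left-open, right-closed bins, and searching from the right side gives left-closed, right-open bins. The function names are hypothetical, and bin indices are 0-based (0 is the first bin); `None` means the value falls in no bin.

```python
from bisect import bisect_left, bisect_right

EDGES = [10, 15, 20, 25]  # three bins: 10-15, 15-20, 20-25

def bin_left_open(value, edges):
    """Option A: bins are (a, b]. A value equal to an edge goes to the bin on its left."""
    index = bisect_left(edges, value) - 1
    return index if 0 <= index < len(edges) - 1 else None

def bin_left_closed(value, edges):
    """Option B: bins are [a, b). A value equal to an edge goes to the bin on its right."""
    index = bisect_right(edges, value) - 1
    return index if 0 <= index < len(edges) - 1 else None

for value in (10, 15, 20, 25):
    print(value, bin_left_open(value, EDGES), bin_left_closed(value, EDGES))
```

The value 20 lands in the second bin (index 1) under option A and in the third bin (index 2) under option B, and each convention leaves exactly one of the outer edges, 10 or 25, without a bin.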

Avoid the trap

You are free to choose either of these options, but be careful! With either one, one boundary value is left out of the histogram. If you choose option A, the value “10” will not be included in any of the bins. If you choose option B, the value “25” will not be included in any of the bins.

The solution is to make either the first or the last bin a fully closed interval. We suggest doing this on the last bin while using option B, because uniform bins are usually more important on the left side than on the right. If you have integer values, it is better to label the bins “10-14”, “15-19”, “20-25” instead of just writing out 10, 15, 20, 25, so that viewers of the histogram understand it more easily.
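A minimal sketch of this suggestion: left-closed, right-open bins with the last bin fully closed, plus the integer-style labels described above. The helper names are made up for illustration. (NumPy's `numpy.histogram` follows the same convention: all bins are half-open except the last, which includes its right edge.)

```python
from bisect import bisect_right

def bin_index(value, edges):
    """0-based bin index using [a, b) bins, with the last bin closed as [a, b]."""
    last = len(edges) - 2
    if value == edges[-1]:
        return last  # the right-most edge belongs to the last bin
    index = bisect_right(edges, value) - 1
    return index if 0 <= index <= last else None

def integer_labels(edges):
    """Labels for integer data, e.g. '10-14'; the last label includes its upper edge."""
    labels = []
    for i in range(len(edges) - 1):
        is_last = i == len(edges) - 2
        upper = edges[i + 1] if is_last else edges[i + 1] - 1
        labels.append(f"{edges[i]}-{upper}")
    return labels

edges = [10, 15, 20, 25]
print(integer_labels(edges))
print([bin_index(v, edges) for v in (10, 14, 15, 20, 25)])
```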

Summary

Remember to always ask for a histogram when you are about to be tricked by a single average:

  • If your marketing specialist says that your campaigns usually reach 1000 people
  • If your salesperson tells you that your purchasers spend ~$100 in your shop
  • If your car mechanic says that your vehicle will be ready in 7 days
  • If your family physician tells you that you will recover from the disease in about 5 days
  • If your mom says that lunch is going to be ready in approx. 15 minutes

Keep in mind

AnswerMiner creates histograms for you automatically, so you don’t need to bother with finding the ideal settings.
