
Correlation coefficient demystified

Correlation and the correlation coefficient can seem difficult to understand; they sound like some weird mathematical or statistical thing. But once you understand them, you will think in a totally new way about causality and about how things are related in all aspects of life. Read this article to find out what Pearson and Spearman contributed to statistics.

What is a correlation coefficient?

It’s a metric that measures the strength of the relationship between two numerical datasets. For example, you may have a list of students whose age and height you know. You can ask: what is the correlation between age and height? In most cases, the taller a student is, the older she/he is. And vice versa: if someone is rather old, you can guess that she/he is tall. Of course, this correlation doesn’t exist among adults.

So, simply speaking, correlation means:
The bigger (or more) something is, the bigger (or more) something else is.

[Figure: correlation between age and height]

The coefficient always lies between -1 and 1. If the absolute value of the calculated correlation coefficient is high, the connection between the variables is strong. If it is low, there might be only a weak connection, or maybe no relationship at all.

A negative correlation coefficient means reverse (inverse) correlation, i.e.:
The bigger (or more) something is, the smaller (or less) something else is.

As a rule of thumb, you can use this table:

If this relationship applies in    the correlation is    and the absolute value of the coefficient is
all cases                          perfect               1
almost all cases                   almost perfect        between 0.9 and 1
most cases                         very strong           between 0.8 and 0.9
many cases                         strong                between 0.7 and 0.8
some cases                         moderate              between 0.5 and 0.7
a few cases                        weak                  between 0.3 and 0.5
few cases                          very weak             between 0.2 and 0.3
very few cases                     negligible            below 0.2
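
If you want to apply this rule of thumb in code, a lookup like the following may help. It is only a sketch: the function name describe_strength is made up for this example, and because the table does not say which band a boundary value such as 0.8 belongs to, I assume here that it falls into the stronger band.

```python
def describe_strength(coefficient: float) -> str:
    """Map a correlation coefficient to the rule-of-thumb label from the table."""
    size = abs(coefficient)  # the sign only tells the direction, not the strength
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size == 1:
        return "perfect"
    # (lower bound, label) pairs, checked from strongest to weakest.
    # Assumption: a boundary value belongs to the stronger band.
    bands = [
        (0.9, "almost perfect"),
        (0.8, "very strong"),
        (0.7, "strong"),
        (0.5, "moderate"),
        (0.3, "weak"),
        (0.2, "very weak"),
    ]
    for lower_bound, label in bands:
        if size >= lower_bound:
            return label
    return "negligible"


print(describe_strength(0.85))   # very strong
print(describe_strength(-0.42))  # weak
```

Note that a negative coefficient gets the same label as its positive counterpart, because only the absolute value describes the strength.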

Different correlation algorithms

There are many different algorithms for calculating correlation, each with different properties and variants. Pearson is the most popular, but I would suggest Spearman, because it has fewer limitations and can be applied more widely.

 

Pearson correlation

Inventor: Karl Pearson ~ 1895
Other names: Pearson product-moment correlation coefficient; PPMCC; PCC; Pearson’s r
Population coefficient is denoted by: Greek letter ρ (rho)
Sample coefficient is denoted by: r
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Good for:

  • if you care about the amount of growth (how much one variable changes with the other, not just the direction)
  • if you also want to calculate a confidence interval
  • if you have no outliers at all, because Pearson (unlike Spearman) is very sensitive to outliers
  • if you want to check for a linear association (it’s not good at all for nonlinear relationships)

Formula:

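$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where \bar{x} and \bar{y} are the sample means and n is the number of (x, y) pairs.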
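
To make the definition concrete, here is a minimal sketch that computes Pearson’s r directly from the sums of deviations from the mean. The function name pearson_r and the age/height numbers are invented for illustration; they are not real measurements.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of the deviations from the means
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return co_deviation / math.sqrt(ss_x * ss_y)


# Invented example data: ages (years) and heights (cm) of five students
ages = [6, 8, 10, 12, 14]
heights = [115, 128, 138, 149, 163]
print(round(pearson_r(ages, heights), 3))  # about 0.998
```

The result is close to 1 because in this made-up sample height grows almost linearly with age.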

Spearman correlation

Inventor: Charles Spearman ~ 1904
Other names: Spearman’s rank correlation coefficient; Spearman’s rho
Coefficient is denoted by: Greek letter ρ (rho)
https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

Good for:

  • if outliers exist, because Spearman (unlike Pearson) is far less sensitive to outliers
  • if you also want to calculate a confidence interval
  • linear and nonlinear relationships, as long as they are monotonic
  • if there are no repeated values (several identical x or y values)
  • if you care about the relationship only, not the amount of growth (Spearman only checks monotonicity)

Formula:

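$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} $$

where d_i is the difference between the rank of x_i and the rank of y_i, and n is the number of pairs. This form assumes that there are no repeated values.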
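
As a rough sketch of how Spearman’s rho works: replace every value by its rank, then apply the rank-difference formula 1 - 6 Σd² / (n(n² - 1)). The helper names ranks and spearman_rho are made up for this example, and the code deliberately refuses repeated values, because this simple formula assumes there are none.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; this sketch assumes no repeated values."""
    if len(set(values)) != len(values):
        raise ValueError("the simple rank-difference formula assumes no repeated values")
    position = {value: index + 1 for index, value in enumerate(sorted(values))}
    return [position[value] for value in values]


def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (no ties allowed)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    # d is the difference between the rank of x and the rank of y in each pair
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


xs = [1, 2, 3, 4, 5]
cubes = [1, 8, 27, 64, 125]         # nonlinear but monotonic: y = x ** 3
with_outlier = [1, 8, 27, 64, 10 ** 6]  # same order, last value is extreme

print(spearman_rho(xs, cubes))         # 1.0
print(spearman_rho(xs, with_outlier))  # 1.0
```

Both results are 1.0 because only the order of the values matters: the relationship is perfectly monotonic, and the extreme last value does not change any rank.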

Kendall correlation

Inventor: Maurice Kendall ~ 1938
Other names: Kendall rank correlation coefficient; Kendall’s tau coefficient
Coefficient is denoted by: Greek letter τ (tau)
https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient

Good for:

  • if outliers exist
  • linear and nonlinear relationships
  • when repeated values exist
  • if you do not want to calculate a confidence interval

Formula:

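$$ \tau = \frac{n_c - n_d}{n (n - 1) / 2} $$

where n_c is the number of concordant pairs, n_d is the number of discordant pairs, and n is the number of observations. This is the simplest form (tau-a); tie-corrected variants such as tau-b adjust the denominator when repeated values exist.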
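
Kendall’s tau can be sketched by counting pairs of observations: a pair is concordant if x and y move in the same direction between the two observations, and discordant if they move in opposite directions. The code below implements only the simplest variant (tau-a), which is my choice for illustration; the function name kendall_tau_a is made up, and tied pairs are simply counted as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    concordant = 0
    discordant = 0
    # Compare every observation with every other observation exactly once
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 means a tie in x or y: counted as neither
    return (concordant - discordant) / (n * (n - 1) / 2)


print(kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau_a([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

With many repeated values tau-a can no longer reach 1 or -1, which is why the tie-corrected tau-b variant is commonly preferred in that situation.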


Correlation is not causation

It’s very important not to forget: Correlation does not imply causation!
So if you find some strong correlation in your data, the following relationships are possible:

  • X causes Y (this is what most people incorrectly assume)
  • Y causes X (this is what most people might incorrectly think)
  • X and Y are consequences of a common cause (this is very frequent)
  • X causes Y and Y causes X
  • X causes Z which causes Y
  • There is no connection between X and Y (it’s just a coincidence)

If there is no mathematical correlation between variables, it does not mean that there is no relationship. There might be a strong connection, but other factors can prevent you from seeing any correlation.
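
One way to see that “no correlation” does not mean “no relationship” is a relationship that is not linear at all. In the sketch below y is completely determined by x (y = x²), yet Pearson’s r comes out as zero because the rising and falling halves cancel each other. The data and the helper name pearson_r are invented for this example.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y depends entirely on x, but not linearly

r = pearson_r(xs, ys)
print(abs(r) < 1e-12)  # True: no linear correlation despite a perfect relationship
```

A scatter plot would reveal the U-shaped pattern immediately, so it is worth looking at the data and not only at the coefficient.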

So what is correlation good for?

  • There are mathematical algorithms to filter out the effect of other variables, so you may find real relationships if you take into account many factors
  • If the correlation is strong, you can predict X from Y, and Y from X
  • Based on the results, you may investigate further if you find a surprisingly weak or surprisingly strong correlation, or if the calculated coefficient conflicts with your hypothesis
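
One of the algorithms for filtering out the effect of another variable is partial correlation. The sketch below uses the standard first-order formula r_xy.z = (r_xy - r_xz · r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)), which asks how strongly X and Y are related once a third variable Z is held constant. The article does not name a specific method, so this choice is mine, and the data are invented so that X and Y are both driven by the common cause Z.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)


def partial_correlation(xs, ys, zs):
    """Correlation of X and Y after removing the linear effect of Z."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented data: Z is a common cause; X and Y are 3 * Z plus unrelated noise
z = [1, 1, 1, 1, -1, -1, -1, -1]
x = [4, 4, 2, 2, -2, -2, -4, -4]
y = [4, 2, 4, 2, -2, -4, -2, -4]

print(round(pearson_r(x, y), 3))                    # 0.9
print(abs(partial_correlation(x, y, z)) < 1e-9)     # True
```

The raw correlation between X and Y is strong, but it vanishes once Z is taken into account, which is the “common cause” situation from the list of possible relationships.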