Correlation

A common task in statistics is to examine the relationship between two variables. For example, suppose you have collected anxiety and depression scores from a sample of people. You will probably want to know how the two are related: do people with high anxiety also tend to have high depressive symptoms?

One way to observe relationships is to inspect a scatter plot. In a scatter plot, each observation is represented by a dot whose location is determined by its two measurements. For example, a person's depression score may be plotted on the x-axis and their anxiety score on the y-axis. If enough data is available, a visual inspection of the scatter plot will reveal patterns in the underlying data. For example, if the points fall close to a straight line, there is a linear relationship between the variables. Furthermore, if the points rise as you move from left to right along the x-axis, the relationship is positive, and if they fall as you move to the right, the relationship is negative.

One way to quantify the linear relationship between two variables is with their covariance, which measures the degree to which the two variables vary in the same direction. Unfortunately, a covariance value alone is difficult to interpret because it depends on the units in which the variables are measured. A correlation coefficient is typically more useful because it is a standardized measure: the Pearson coefficient, for example, is the covariance divided by the product of the two standard deviations, so it always lies between -1 and +1. This means that correlation coefficients can be compared even when the data sets have different scales. The value of a correlation coefficient can be interpreted as follows:

  -1    A perfect negative relationship: all points fall on a line with a negative slope.
   0    No linear relationship.
  +1    A perfect positive relationship: all points fall on a line with a positive slope.
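
To illustrate why covariance is hard to interpret while a correlation coefficient is not, here is a minimal sketch in plain Python. The scores are invented for illustration, and the function names `covariance` and `pearson` are assumptions of this example rather than part of any particular library. Rescaling one variable by a factor of 10 multiplies the covariance by 10 but leaves the correlation unchanged.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical scores for five people (invented for illustration).
anxiety = [2, 4, 5, 7, 9]
depression = [1, 3, 6, 6, 10]

# The same anxiety scores expressed on a scale ten times larger.
anxiety_rescaled = [10 * a for a in anxiety]

print("covariance: ", covariance(anxiety, depression),
      covariance(anxiety_rescaled, depression))
print("correlation:", pearson(anxiety, depression),
      pearson(anxiety_rescaled, depression))
```

The two covariance values differ by the scaling factor, whereas the two correlation values agree up to floating-point rounding.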

By convention, a coefficient of about ±0.1 is considered a small effect, ±0.3 a medium effect, and ±0.5 a large effect. Here are the most common correlation coefficients:

  • Pearson correlation coefficient: This is the most common of the correlation coefficients. One disadvantage of the Pearson correlation coefficient is that it can be sensitive to the presence of outliers in the data set.
  • Spearman's rank correlation coefficient: This is a non-parametric measure based on the ranks of the values rather than the values themselves, so it is a more robust way to measure the relationship between two variables when there are outliers in the data set.
  • Kendall's tau coefficient: Another non-parametric approach. Kendall's tau is appropriate when there is a small number of possible values (e.g. a Likert scale) with many tied responses.
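
The three coefficients above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not a replacement for a statistics package: the function names are hypothetical, Spearman's coefficient is computed as the Pearson coefficient of average ranks, the Kendall variant shown is tau-b (which adjusts for ties), and both data sets are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson coefficient from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson applied to the ranks of the data."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant, and tied pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: drops out of the formula
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_x = concordant + discordant + tied_y_only  # pairs not tied on x
    untied_y = concordant + discordant + tied_x_only  # pairs not tied on y
    return (concordant - discordant) / sqrt(untied_x * untied_y)


# Invented data: a perfectly increasing pattern with one extreme final value.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 3, 4, 5, 6, 7, 40]
print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
print("Kendall: ", kendall_tau_b(x, y))

# Invented Likert-style responses (1 to 5) with many ties.
likert_a = [1, 2, 2, 3, 4, 5, 5]
likert_b = [1, 1, 2, 3, 3, 4, 5]
print("Kendall tau-b on tied data:", kendall_tau_b(likert_a, likert_b))
```

In the first data set the extreme value pulls the Pearson coefficient well below 1, while the two rank-based coefficients still report a perfect increasing relationship because the ordering of the points is unaffected.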

Warning: The Pearson correlation coefficient measures the strength of the linear relationship between two variables, so it may give misleading results if the relationship is non-linear (e.g. the points fall along a curve instead of a line). Furthermore, be wary of the impact of outliers. Both of these situations can be identified by looking at a scatter plot. For some interesting examples, see Anscombe's quartet.
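
Both pitfalls can be reproduced with a few invented numbers. The sketch below is illustrative only; the helper `pearson` and all data values are assumptions of this example. In the first case the points lie exactly on a curve, yet the coefficient is zero; in the second, a single extreme point turns a weak relationship into an apparently strong one.

```python
from math import sqrt


def pearson(x, y):
    """Pearson coefficient from sums of squared deviations and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Non-linear relationship: y is fully determined by x, but along a curve.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x_curve]
r_curve = pearson(x_curve, y_curve)

# Outlier: five weakly related points, then the same points plus one extreme value.
x_base = [1, 2, 3, 4, 5]
y_base = [3, 1, 4, 2, 3]
r_without = pearson(x_base, y_base)
r_with = pearson(x_base + [20], y_base + [20])

print("curve:            ", r_curve)
print("without outlier:  ", r_without)
print("with one outlier: ", r_with)
```

Plotting either data set would make the problem obvious at a glance, which is why inspecting a scatter plot alongside the coefficient is worthwhile.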