Correlation

© Paul Cooijmans

Explanation

A correlation coefficient is a number between -1 and 1 that denotes the extent to which two variables tend to go together; the extent to which the one tends to go up when the other goes up. A correlation therefore is never "good" or "bad", but merely "high" or "low". If the variances of two correlated variables are visualized as overlapping squares (with sides the length of that variable's standard deviation), the little square formed by their overlap represents the proportion of the variance that is shared, common, between those variables, and any side of that square has the size of their correlation. (Note: The proportion of the variance that is common is not the same as the "covariance", explained elsewhere).

The basic type of correlation is called "Pearson r", and is computed between the actual values of two paired variables (x and y), by dividing their covariance by the product of the two variables' standard deviations, this division serving to arrive at an index between -1 and 1:

r_xy = covariance_xy / (σ_x · σ_y)
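As an illustration, the formula above can be sketched in a few lines of Python. The function name pearson_r and the example scores are made up for this sketch, and the population forms (dividing by n) are an assumption; dividing by n - 1 throughout would give the same r, since the factor cancels in the ratio.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and standard deviations (assumption: divide by n).
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return cov_xy / (sd_x * sd_y)

# Hypothetical paired scores, for illustration only.
scores_x = [1, 2, 3, 4, 5]
scores_y = [2, 4, 5, 4, 5]
print(round(pearson_r(scores_x, scores_y), 2))  # 0.77
```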

Theoretically this type of correlation requires the variables to be expressed on linear scales, whereon one would use the means and standard deviations as measures of central tendency and spread.

A less used type of correlation is the rank correlation, which is computed between the rank vectors of two paired variables, rather than between the actual values. Each actual value is then replaced by the rank it has in its data set, so that for instance the highest value receives rank 1, the next rank 2, and so on. Rank correlations are appropriate for non-linear scales, whereon one would use the medians and quartile deviations as measures of central tendency and spread. Rank correlations tend to be lower than Pearson r's, as the information contained in the distances between the raw values, and therefore also part of the variance, is thrown away.
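A minimal sketch of this procedure follows: each value is replaced by its rank (highest value receives rank 1), after which an ordinary Pearson r is computed between the two rank vectors. The function names and data are hypothetical, and giving tied values the mean of their ranks is an assumption of this sketch, not something stated above.

```python
import math

def ranks(values):
    """Rank 1 for the highest value; tied values share the mean of their ranks (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result

def pearson_r(x, y):
    """Pearson r, written with sums of squares (the factor n cancels in the ratio)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_correlation(x, y):
    """Pearson r computed between the rank vectors instead of the actual values."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical paired scores, for illustration only.
x = [10, 20, 30, 40, 50]
y = [12, 25, 29, 48, 41]
print(ranks(y))                          # [5.0, 4.0, 3.0, 1.0, 2.0]
print(round(rank_correlation(x, y), 2))  # 0.9
```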

Correlations can also be computed when one or both of the variables are dichotomous, that is, can only take two values (typically represented by 0 and 1). This type of correlation too tends to be lower than Pearson r's, as the variance is limited to .25 in a dichotomy.
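The limit of .25 can be checked with a short sketch. For a variable coded 0 and 1 in which a proportion p of the values equals 1, the variance is p(1 - p), which peaks at p = .5. The function name below is hypothetical.

```python
def dichotomy_variance(p):
    """Variance of a 0/1 variable in which a proportion p of the values equals 1."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    return p * (1 - p)

# The variance is largest when the two values occur equally often.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(dichotomy_variance(p), 2))
```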

In principle, correlations in the reports on intelligence tests are given with all of the tests wherewith there are at least five score pairs. If this results in too few usable pairs for norming, the threshold is lowered to four, three, or two pairs, until there are enough pairs. "Enough" means at least 70 with a positive correlation.

The correlations are given per test (and not aggregate) as that is the only objective way to obtain correlations. Aggregate correlations (over a number of tests combined) require interpretation of test scores onto a common scale, and are therefore not objective. Aggregate correlations with intelligence tests as seen in statistical reports by some authors should not be compared with the weighted means of objective correlations as explained here, as they (the aggregate correlations) may be inflated by subjective interpretation.

Interpretation of correlations

A rule of thumb:

Significance of correlations depends on two factors:

  1. Height of correlation — Higher correlations, either positive or negative, are more significant;
  2. Number of pairs — Greater numbers of pairs give greater significance.

Below is a table that for each number of pairs shows the minimal Pearson r required for significance at the .05 level; that is, the level where the probability of a correlation that far or farther from zero (in either direction) occurring by chance if the true correlation were zero is 5 %. This computation of significance rests on the assumption that correlation values resulting from mere chance ("error") have a "normal distribution". This is the common way of reporting significance in statistics, but it is (in Paul Cooijmans' opinion) of limited practical value and meaning. The formula that shows the standard deviation of random error (σ_error, or standard error) of a correlation as a function of the number of pairs (n) under the assumption of a true correlation of zero, is:

σ_error = 1 / √(n - 1)

n          Minimal r required for significance at .05
5          .98
6          .88
7          .80
8          .74
9          .69
10         .65
11         .62
12         .59
13         .57
14         .54
15         .52
16         .51
17         .49
20         .45
25         .40
30         .36
40         .31
50         .28
70         .24
100        .20
200        .14
500        .09
1000       .06
10 000     .02
100 000    .01
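The tabulated values appear to follow from the standard error formula multiplied by 1.96, the two-sided .05 critical value of the normal distribution. That multiplier is an assumption of this sketch (it is not stated in the text), as are the function names; it does reproduce the table entries checked here.

```python
import math

# Assumption: two-sided .05 critical value of the standard normal distribution.
Z_05 = 1.96

def standard_error(n):
    """Standard error of a correlation for n pairs, assuming a true correlation of zero."""
    if n < 2:
        raise ValueError("at least two pairs are needed")
    return 1 / math.sqrt(n - 1)

def minimal_r(n):
    """Smallest Pearson r counted as significant at the .05 level under these assumptions."""
    return Z_05 * standard_error(n)

for n in (5, 10, 25, 100, 1000):
    print(n, f"{minimal_r(n):.2f}")
```

For other numbers of pairs than those in the table, minimal_r(n) can be called directly.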

For rank correlations, significance should properly be assessed in a different way, to wit by probability calculation. One computes the correlations of all possible pairings between the data sets and counts how many thereof equal or exceed the value of which one wants to know the significance. Then, one divides that number by the total number of possible pairings to obtain the significance. This is so labour-intensive that it is only doable for very low numbers of pairs (otherwise the number of pairings becomes astronomical). It is more precise than the common way of reporting significance though, as probability calculation is "hard" while the assumption of a normal distribution of error is "soft". There exists a formula to test the significance of rank correlations in a simpler way, known as "Student's t test".
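The probability calculation described above can be sketched as follows, under the assumption that there are no tied values. Without ties, the rank correlation equals 1 - 6 Σd² / (n(n² - 1)), where d is the difference between paired ranks, so a pairing equals or exceeds the observed correlation exactly when its Σd² is not larger than the observed Σd². The function names and data are hypothetical.

```python
from itertools import permutations
from math import factorial

def simple_ranks(values):
    """Rank 1 for the highest value; assumes no tied values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def sum_sq_diff(ranks_x, ranks_y):
    """Sum of squared differences between paired ranks."""
    return sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))

def exact_rank_significance(x, y):
    """Proportion of all possible pairings with a rank correlation >= the observed one."""
    rx, ry = simple_ranks(x), simple_ranks(y)
    observed = sum_sq_diff(rx, ry)
    # Keep x fixed and try every ordering of the y ranks: n! pairings in total.
    count = sum(1 for perm in permutations(ry) if sum_sq_diff(rx, perm) <= observed)
    return count / factorial(len(rx))

# Hypothetical paired scores, for illustration only.
x = [10, 20, 30, 40, 50]
y = [12, 25, 29, 48, 41]
print(round(exact_rank_significance(x, y), 3))  # 0.042 (5 of 120 pairings)
```

With ten pairs there are already 3 628 800 pairings to evaluate, which illustrates why this approach is limited to very low numbers of pairs.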
