Correlation

Explanation

A correlation coefficient is a number between -1 and 1 that denotes the extent to which two variables tend to go together, that is, the extent to which the one tends to go up when the other goes up. A correlation therefore is never "good" or "bad", but merely "high" or "low". If the variances of two correlated variables are visualized as overlapping squares (with sides the length of that variable's standard deviation), the little square formed by their overlap represents the proportion of the variance that is shared, common, between those variables, and any side of that square has the size of their correlation; the shared proportion of variance thus equals the square of the correlation. (Note: The proportion of the variance that is common is not the same as the "covariance", explained elsewhere.)

The basic type of correlation is called "Pearson r". It is computed between the actual values of two paired variables (x and y) by dividing their covariance by the product of the two variables' standard deviations; this division serves to arrive at an index between -1 and 1:

r_xy = covariance_xy / (σ_x σ_y)
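
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name pearson_r is chosen here for the example, and the divide-by-n ("population") forms of the covariance and standard deviations are an assumption (the divisor cancels out in the ratio, so dividing by n - 1 throughout would give the same r).

```python
import math

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Divide-by-n forms (an assumption of this sketch); the divisor cancels in the ratio.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    # If either variable is constant, its standard deviation is 0 and
    # this division raises ZeroDivisionError: r is undefined in that case.
    return covariance / (sd_x * sd_y)
```

For example, pearson_r([1, 2, 3], [2, 4, 6]) yields a value of 1 (up to rounding error), since the second variable rises in exact proportion to the first.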

Theoretically this type of correlation requires the variables to be expressed on linear scales, whereon one would use the means and standard deviations as measures of central tendency and spread.

A less used type of correlation is the rank correlation, which is computed between the rank vectors of two paired variables, rather than between the actual values. Each actual value is then replaced by the rank it has in its data set, so that for instance the highest value receives rank 1, the next rank 2, and so on. Rank correlations are appropriate for non-linear scales, whereon one would use the medians and quartile deviations as measures of central tendency and spread. Rank correlations tend to be lower than Pearson r's, as the information contained in the distances between the raw values, and therefore also part of the variance, is thrown away.
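
As a rough sketch of the ranking step described above, the Python below replaces each value by its rank (rank 1 for the highest value) and then correlates the two rank vectors. The function names are chosen for this example only. The handling of ties, where tied values share the average of the ranks they occupy, is an assumption of this sketch and is not specified in the text above.

```python
import math

def ranks(values):
    """Rank 1 for the highest value, 2 for the next, and so on.

    Tied values share the average of the ranks they occupy
    (an assumption of this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def rank_correlation(x, y):
    """Correlation computed between the rank vectors instead of the raw values."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    return cross / math.sqrt(ss_x * ss_y)
```

Because only the order of the values matters, rank_correlation([1, 2, 3, 4], [1, 4, 9, 16]) comes out at 1 even though the relation between the raw values is not a straight line.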

Correlations can also be computed when one or both of the variables are dichotomous, that is, can only take two values (typically represented by 0 and 1). This type of correlation too tends to be lower than Pearson r's, as the variance is limited to .25 in a dichotomy.
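
The .25 limit can be checked with a small sketch. It assumes the dichotomy is coded 0 and 1, in which case the variance equals p(1 - p), where p is the proportion of 1s; the function name is hypothetical.

```python
def dichotomy_variance(p):
    """Variance of a 0/1 variable in which a proportion p of the values is 1."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    return p * (1 - p)

# Evaluate the variance for proportions 0, .1, .2, ... 1.
variances = [dichotomy_variance(k / 10) for k in range(11)]
largest_variance = max(variances)
```

The largest value is reached at an even split (p = .5), and the variance shrinks toward zero as the split becomes more lopsided.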

In principle, correlations in the reports on intelligence tests are given with all of the tests wherewith there are at least four score pairs. If this results in too few usable pairs for norming, the threshold is lowered to three or two pairs, until there are enough pairs.

Interpretation of correlations

A rule of thumb:

.9 to 1 Very high
.7 to .89 High
.5 to .69 Moderate
.3 to .49 Low
-.29 to .29 (As good as) no correlation
-.49 to -.3 Low negative
-.69 to -.5 Moderate negative
-.89 to -.7 High negative
-1 to -.9 Very high negative
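
The rule of thumb above could be applied mechanically along the following lines. This is only a sketch: the function name is hypothetical, and treating .9, .7, .5 and .3 as lower bounds (so that, for instance, .895 counts as "High") is an assumption, since the table lists values to two decimals only.

```python
def describe_correlation(r):
    """Return the rule-of-thumb label for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation must lie between -1 and 1")
    size = abs(r)
    if size < 0.3:
        return "(As good as) no correlation"
    if size >= 0.9:
        label = "Very high"
    elif size >= 0.7:
        label = "High"
    elif size >= 0.5:
        label = "Moderate"
    else:
        label = "Low"
    # Negative correlations get the same label with "negative" appended.
    return label if r > 0 else label + " negative"
```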

Significance of correlations depends on two factors:

Size of correlation — Higher correlations, either positive or negative, are more significant;
Number of pairs — Greater numbers of pairs give greater significance.

p value

The p value is typically used to indicate to what extent a Pearson r correlation is significant; it is the probability that a correlation of at least the observed size would result by coincidence, given a true correlation of zero ("under the null hypothesis"). This computation rests on the assumption that correlation values resulting from mere chance ("error") have a "normal distribution". Under the assumption of a true correlation of zero, the standard deviation of random error (σ_error, or standard error) of a correlation is the following function of the number of pairs (n):

σ_error = 1 / √(n - 1)

The correlation coefficient actually found is then divided by this standard error, resulting in a z-score, which can be looked up in a cumulative normal distribution table to find the probability of that z-score being exceeded. Usually this value is then doubled to arrive at the "two-tailed" p value, which covers both the negative and the positive correlations exceeding that size. If only the one-tailed p is desired, the value is not doubled.
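
The procedure just described can be sketched in Python as follows. The function name and the two_tailed parameter are chosen for this example; the table lookup is replaced by the complementary error function from the standard library, which gives the same upper-tail probability of the normal distribution.

```python
import math

def p_value(r, n, two_tailed=True):
    """Approximate p value of a Pearson r found with n pairs, assuming a true
    correlation of zero and normally distributed error."""
    if n < 2:
        raise ValueError("at least two pairs are needed")
    standard_error = 1 / math.sqrt(n - 1)
    # The absolute value is used, so a one-tailed result presumes that the
    # correlation lies in the expected direction.
    z = abs(r) / standard_error
    # Probability of a standard normal value exceeding z (upper tail).
    one_tailed = 0.5 * math.erfc(z / math.sqrt(2))
    return 2 * one_tailed if two_tailed else one_tailed
```

With r = .98 and n = 5, the standard error is .5 and the z-score is 1.96, which gives a two-tailed p of about .05.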

For convenience, the table below shows, for each number of pairs, the minimal Pearson r required for significance at the .05 level (two-tailed); that is, the level where the probability of a correlation of that size or larger, in either direction, occurring by chance if the true correlation were zero is 5 %.

n	Minimal r required for significance at .05
5	.98
6	.88
7	.80
8	.74
9	.69
10	.65
11	.62
12	.59
13	.57
14	.54
15	.52
16	.51
17	.49
20	.45
25	.40
30	.36
40	.31
50	.28
70	.24
100	.20
200	.14
500	.09
1000	.06
10 000	.02
100 000	.01
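
The values in this table appear to follow from the standard error formula given earlier: the minimal r is the two-tailed critical z-score (about 1.96 at the .05 level) multiplied by 1 / √(n - 1). The sketch below recomputes them under that assumption; the function name and the alpha parameter are hypothetical.

```python
import math
from statistics import NormalDist

def minimal_r(n, alpha=0.05):
    """Smallest r that is significant (two-tailed) at level alpha with n pairs,
    using the normal approximation with standard error 1 / sqrt(n - 1)."""
    if n < 2:
        raise ValueError("at least two pairs are needed")
    # z-score that leaves alpha / 2 in each tail of the normal distribution.
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z_critical / math.sqrt(n - 1)

# Recompute a few rows of the table, rounded to two decimals.
rows = {n: round(minimal_r(n), 2) for n in (5, 10, 20, 100)}
```

Under this approximation the result for very small numbers of pairs can exceed 1, in which case no correlation can reach significance.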

For rank correlations, significance should properly be assessed in a different way, to wit by probability calculation. One computes the correlations of all possible pairings between the data sets and counts how many thereof equal or exceed the value of which one wants to know the significance. Then, one divides that number by the total number of possible pairings to obtain the significance. This is so labour-intensive that it is only doable for very low numbers of pairs (otherwise the number of pairings, which is n! for n pairs, becomes astronomical). It is more precise than the common way of reporting significance though, as probability calculation is "hard" while the assumption of a normal distribution of error is "soft". There also exists a simpler way to test the significance of rank correlations, by means of a formula known as "Student's t test".
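
The counting procedure can be sketched as follows for small data sets. The function names are chosen for this example, the inputs are assumed to be rank vectors already, and the small tolerance used when comparing correlations is an added safeguard against rounding error rather than part of the method described above.

```python
import itertools
import math

def correlation(a, b):
    """Correlation between two equally long sequences of ranks."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cross = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    ss_a = sum((p - mean_a) ** 2 for p in a)
    ss_b = sum((q - mean_b) ** 2 for q in b)
    return cross / math.sqrt(ss_a * ss_b)

def exact_significance(ranks_x, ranks_y):
    """Share of all possible pairings whose correlation equals or exceeds
    the observed one."""
    observed = correlation(ranks_x, ranks_y)
    tolerance = 1e-12  # guards the "equal or exceed" test against rounding error
    hits = 0
    total = 0
    # Keep ranks_x fixed and pair it with every ordering of ranks_y.
    for reordered in itertools.permutations(ranks_y):
        total += 1
        if correlation(ranks_x, reordered) >= observed - tolerance:
            hits += 1
    return hits / total
```

With four pairs in perfect agreement, only one of the 24 possible pairings reaches the observed correlation of 1, so the result is 1/24. The loop visits n! pairings, which is why this approach is only practical for very few pairs.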