© Paul Cooijmans

Reliability should be understood as the correlation between two (hypothetical) very similar versions of the test. Some (less accurately) call this "test-retest correlation". A typical reliability for an I.Q. test is .9.

Although one may use two actual test versions to assess reliability, reliability is mostly computed internally, for instance by splitting the test into halves (typically the odd and even numbered items, but other divisions may be used too) and computing the correlation between the two halves (*r*_{oe}), after which a correction is applied to obtain the reliability (*r*_{xx}) of the test as a whole. This is called the "split-half method". The correction is made with the Spearman-Brown formula:

*r*_{xx} = (2 × *r*_{oe}) / (1 + *r*_{oe})

The split-half method is robust in that it works for almost any test, even when the item scores are not uniform (*e.g.*, when some items yield more points than others, or yield negative points). It becomes more informative when it is reported for two different ways of splitting the test into halves. An alternative to the odd-even split is the split between the items whose numbers are divisible by 3 or 4 (or both) and the remaining items.
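The split-half computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`pearson`, `spearman_brown`, `split_half_reliability`, `is_odd`, `divisible_by_3_or_4`) and the data layout (one row per candidate, one column per item, items numbered from 1) are hypothetical choices, not part of the original text. The sketch does not guard against a half with zero variance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_oe):
    """Correct the half-test correlation r_oe to the full-test reliability r_xx."""
    return (2 * r_oe) / (1 + r_oe)


def is_odd(item_number):
    """Odd-even split: odd-numbered items form the first half."""
    return item_number % 2 == 1


def divisible_by_3_or_4(item_number):
    """Alternative split: items whose number is divisible by 3 or 4 (or both)."""
    return item_number % 3 == 0 or item_number % 4 == 0


def split_half_reliability(scores, in_first_half=is_odd):
    """Split-half reliability for a table of item scores.

    scores: one row per candidate, one column per item (item numbers start at 1).
    in_first_half: predicate on the item number that decides the split.
    """
    first, second = [], []
    for row in scores:
        numbered = list(enumerate(row, start=1))
        first.append(sum(s for n, s in numbered if in_first_half(n)))
        second.append(sum(s for n, s in numbered if not in_first_half(n)))
    return spearman_brown(pearson(first, second))
```

Calling `split_half_reliability(scores)` uses the odd-even split; passing `in_first_half=divisible_by_3_or_4` gives the second split mentioned above, so that both values can be reported side by side.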

Another formula is "Cronbach's alpha", sometimes less accurately called the "internal consistency method". It uses the covariances between all of the individual test items to estimate the mean of all possible split-half reliabilities, which is effectively the lower limit of the test's actual reliability. A simplified version of "alpha", for tests with only dichotomous items, is called "Kuder-Richardson 20". Note that Cronbach's alpha itself is applicable to such tests too and gives the same result. Although "alpha" is considered a better indicator than the split-half reliability, it is more limited in its use and requires the items to be scored uniformly; typically one uses values from 0 to 1. If some items give scores outside that range, the outcome of Cronbach's alpha is meaningless and may even be greater than 1, so one needs to be alert to this to avoid using it incorrectly. In such cases, the split-half method can be used.
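As an illustration, the sketch below computes Cronbach's alpha in its common variance form, *k* / (*k* - 1) × (1 - sum of item variances / variance of total scores), and Kuder-Richardson 20 for dichotomous items. The function names and the data layout (one row per candidate, one column per item) are assumptions made for this example; the text above does not prescribe a particular implementation. The two results agree as long as the same variance convention is used for the items and for the totals within each formula.

```python
from statistics import pvariance, variance


def cronbach_alpha(scores):
    """Cronbach's alpha from item scores (rows: candidates, columns: items)."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    # Sample variance of each item column, summed over all items.
    item_variance_sum = sum(variance(col) for col in zip(*scores))
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def kr20(scores):
    """Kuder-Richardson 20; assumes every item is scored 0 or 1."""
    n = len(scores)
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    pq_sum = 0.0
    for col in zip(*scores):
        p = sum(col) / n          # proportion of candidates solving the item
        pq_sum += p * (1 - p)     # population variance of a 0/1 item
    return k / (k - 1) * (1 - pq_sum / pvariance(totals))


# Made-up data for four candidates and three dichotomous items.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha_value = cronbach_alpha(example)
kr20_value = kr20(example)
```

For this made-up table both functions return 0.75, which illustrates the statement that "alpha" and Kuder-Richardson 20 coincide on dichotomous items.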

What should not be used to estimate reliability is the correlation between sections of a test containing different item types, such as between a numerical and a visual-spatial section. Such correlations are informative but should be understood as measures of internal consistency, which is a concept differing from reliability in that it, as it were, maximizes the polarization within a test, while reliability maximizes the coherence.

Theoretically, reliability represents the proportion of true score variance (σ^{2}_{t}) within a test's actual variance (σ^{2}), in other words:

*r*_{xx} = σ^{2}_{t} / σ^{2}

It follows that the proportion of error variance within a test's actual variance is (1 - *r*_{xx}), assuming that test score variance consists of a true score component and an error component. It also follows that a test's reliability is the square of the correlation between "true scores" and actual scores.
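The two consequences stated above can be written out step by step. The following derivation is a sketch that assumes the usual decomposition of an actual score *X* into a true score *T* and an error *E* that is uncorrelated with *T*; the symbols *X*, *T*, *E*, σ^{2}_{e} and ρ_{XT} are introduced here for illustration and do not occur in the text above.

```latex
\begin{align*}
X &= T + E, \qquad \operatorname{Cov}(T, E) = 0 \\
\sigma^{2} &= \sigma^{2}_{t} + \sigma^{2}_{e} \\
1 - r_{xx} &= 1 - \frac{\sigma^{2}_{t}}{\sigma^{2}} = \frac{\sigma^{2}_{e}}{\sigma^{2}} \\
\operatorname{Cov}(X, T) &= \operatorname{Cov}(T + E,\, T) = \sigma^{2}_{t} \\
\rho_{XT} &= \frac{\operatorname{Cov}(X, T)}{\sigma \, \sigma_{t}} = \frac{\sigma_{t}}{\sigma} \\
\rho_{XT}^{2} &= \frac{\sigma^{2}_{t}}{\sigma^{2}} = r_{xx}
\end{align*}
```

For example, with a reliability of .9 the correlation between true scores and actual scores would be the square root of .9, which is about .95.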

To estimate the reliability of a test consisting of subtests with known reliabilities *r*_{1}, *r*_{2}, *r*_{3}, ..., an expanded form of the Spearman-Brown formula can be used:

*r*_{123…} = (2 × *r*_{1} + 2 × *r*_{2} + 2 × *r*_{3} + …) / ((1 + *r*_{1}) + (1 + *r*_{2}) + (1 + *r*_{3}) + …)

This formula assumes subtests of equal weight (or equal length, in the case of a compound score which is a simple sum of subtest raw scores). When subtests have different weights, a more complex form of the formula is needed, with coefficients representing the subtests' weights or lengths.

Notice that the subtest reliabilities cannot simply be averaged to obtain the compound reliability *r*_{123…}; that would yield a marked underestimation of the true compound reliability.
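The expanded formula for equally weighted subtests can be transcribed directly. The sketch below follows the formula exactly as written above and makes no attempt at the weighted form, which the text does not spell out; the function name `compound_reliability` and the example reliabilities are hypothetical.

```python
def compound_reliability(subtest_reliabilities):
    """Expanded Spearman-Brown formula as given above, for equally weighted subtests."""
    numerator = sum(2 * r for r in subtest_reliabilities)
    denominator = sum(1 + r for r in subtest_reliabilities)
    return numerator / denominator


# Made-up subtest reliabilities, for illustration only.
reliabilities = [0.9, 0.8, 0.7]
compound = compound_reliability(reliabilities)
simple_mean = sum(reliabilities) / len(reliabilities)
```

With these made-up values the formula gives 4.8 / 5.4, or about .89, while the simple mean is .8, in line with the remark that averaging underestimates the compound reliability.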