© Paul Cooijmans


Reliability should be understood as the correlation between two (hypothetical) very similar versions of the test. Some (less accurately) call this "test-retest correlation". A typical reliability for an I.Q. test is .9.

The split-half method

Although one may use two actual test versions to assess reliability, reliability is mostly computed internally, for instance by splitting the test into halves (typically the odd and even numbered items, but other divisions may be used too) and computing the correlation between those halves (roe). A correction is then applied to obtain the reliability (rxx) for the test as a whole. This is called the "split-half method". The correction in question uses the Spearman-Brown formula:

rxx = (2 × roe) / (1 + roe)

The split-half method is robust in that it works for almost any test, even when the item scores are not uniform (e.g., when some items yield more points than others, or yield negative points). Its informative value can be improved by reporting it for two different ways of splitting the test into halves. An alternative to the odd-even split is that between the items with numbers divisible by 3 and/or 4, and the remaining items.
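As an illustration of the odd-even procedure described above, here is a minimal Python sketch. The data layout (one list of item scores per candidate), the function names, and the sample scores are assumptions made for this example only; they do not come from any particular test.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r_oe):
    """Correct a half-test correlation to the reliability of the whole test."""
    return (2 * r_oe) / (1 + r_oe)

def split_half_reliability(item_scores):
    """item_scores: one list of item scores per candidate (assumed layout)."""
    # Items numbered 1, 3, 5, ... sit at list indices 0, 2, 4, ...
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson(odd, even))

# Made-up scores of five candidates on six items, for illustration only.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 3))
```

Because only the half-test totals are correlated, the sketch also accepts items that yield more than one point or negative points. A different split, such as the one by item numbers divisible by 3 and/or 4, would only change how the two halves are formed.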

Cronbach's alpha

Another formula is "Cronbach's alpha", sometimes less accurately called the "internal consistency method". It uses the covariances between all of the individual test items to estimate the mean of all possible split-half reliabilities, which is effectively the lower limit of the test's actual reliability. A simplified version of "alpha", for tests with only dichotomous items, is called "Kuder-Richardson 20"; Cronbach's alpha itself is applicable to such tests too and gives the same result. Although "alpha" is considered a better indicator than the split-half reliability, it is more limited in its use: it requires the items to be scored uniformly, typically with values from 0 to 1. If some items give scores outside that range, the outcome of Cronbach's alpha is meaningless and may even be greater than 1, so one must take care not to use it in such cases. The split-half method can be used instead.
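A minimal Python sketch of both coefficients follows, assuming the common textbook forms of the formulas. Alpha is written here in its variance form, which is algebraically equivalent to the covariance form because the variance of the total score equals the sum of the item variances plus twice the sum of the item covariances. The function names and the sample data are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one list of item scores per candidate (assumed layout)."""
    k = len(item_scores[0])
    # Variance of each item across candidates (population variance throughout).
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr20(item_scores):
    """Kuder-Richardson 20, for items scored 0 or 1 only."""
    n = len(item_scores)
    k = len(item_scores[0])
    # p is the proportion of candidates solving each item; p * (1 - p) is its variance.
    p = [sum(row[i] for row in item_scores) / n for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / total_var)

# Made-up dichotomous scores of five candidates on four items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(cronbach_alpha(scores), kr20(scores))  # both about 0.8 for these data
```

For dichotomous items the two functions agree, since the variance of a 0/1 item is p × (1 - p). The sketch does not check whether the items are scored uniformly; that remains the user's responsibility.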

Confusion with internal consistency

What should not be used to estimate reliability is the correlation between sections of a test containing different item types, such as between a numerical and a visual-spatial section. Such correlations are informative but should be understood as measures of internal consistency, which is a concept differing from reliability in that it, as it were, maximizes the polarization within a test, while reliability maximizes the coherence.

Relation with true score variance and error variance

Theoretically, reliability represents the proportion of true score variance (σ²ₜ) within a test's actual variance (σ²), in other words:

rxx = σ²ₜ / σ²

It follows that the proportion of error variance within a test's actual variance is (1 - rxx), assuming that test score variance consists of a true score component and an error component. It also follows that a test's reliability is the square of the correlation between "true scores" and actual scores.
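These relations can be made concrete with a small Python sketch. The function name is hypothetical, and the standard deviation of 15 used in the example is an assumption for illustration (a common I.Q. scale), not something the formulas require.

```python
from math import sqrt

def variance_components(r_xx, observed_variance):
    """Split a test's actual variance into true score and error parts."""
    true_var = r_xx * observed_variance
    error_var = (1 - r_xx) * observed_variance
    # Reliability is the square of the true-actual correlation,
    # so that correlation is the square root of the reliability.
    r_true_actual = sqrt(r_xx)
    return true_var, error_var, r_true_actual

# Reliability .9 on a scale with standard deviation 15 (variance 225).
true_var, error_var, r_true_actual = variance_components(0.9, 15 ** 2)
print(true_var, error_var, round(r_true_actual, 3))
```

With a reliability of .9, nine tenths of the score variance is true score variance, one tenth is error variance, and the correlation between true and actual scores is about .95.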

Reliability of compound tests

To estimate the reliability of a test consisting of subtests with known reliabilities r1, r2, r3, ..., an expanded form of the Spearman-Brown formula can be used:

r123… = (2 × r1 + 2 × r2 + 2 × r3 + …) / (1 + r1 + 1 + r2 + 1 + r3 + …)

This formula assumes subtests of equal weight (or equal length, in the case of a compound score which is a simple sum of subtest raw scores). When subtests have different weights, a more complex form of the formula is needed, with coefficients representing the subtests' weights or lengths.

Notice that the subtest reliabilities cannot simply be averaged to obtain the compound reliability r123…; that would yield a marked underestimation of the true compound reliability.
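The equal-weight formula given above can be sketched in Python as follows. The function name is hypothetical, and the sketch covers only the equal-weight case; the weighted form is not attempted here because its coefficients are not spelled out.

```python
def compound_reliability(subtest_reliabilities):
    """Expanded Spearman-Brown form for subtests of equal weight."""
    rs = list(subtest_reliabilities)
    numerator = sum(2 * r for r in rs)
    denominator = sum(1 + r for r in rs)
    return numerator / denominator

# Two subtests with reliabilities .8 and .9 (made-up values).
rs = [0.8, 0.9]
compound = compound_reliability(rs)
mean = sum(rs) / len(rs)
print(round(compound, 3), round(mean, 3))
```

For reliabilities of .8 and .9 the formula gives 3.4 / 3.7, about .92, whereas the plain mean is .85, which illustrates the underestimation that simple averaging would cause.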
