© Paul Cooijmans

Reliability should be understood as the correlation between two (hypothetical) very similar versions of the test. Some (less accurately) call this "test-retest correlation". A typical reliability of a full-scale or stand-alone I.Q. test is .9.

Although one may use two actual test versions to assess reliability, reliability is mostly computed internally, for instance by splitting the test into halves (typically the odd- and even-numbered items, but other divisions may be used too) and computing the correlation therebetween (*r*_{oe}), after which a correction is applied to obtain the reliability (*r*_{xx}) for the test as a whole. This is called the "split-half method". The correction meant here is done via the Spearman-Brown formula:

*r*_{xx} = (2 × *r*_{oe}) / (1 + *r*_{oe})

The split-half method is robust in that it works for almost any test, even when the item scores are not uniform (*e.g.*, when some items yield more points than others, or yield negative points). Its informative value can be improved by providing it for two different ways of splitting the test into halves. An alternative to the odd-even split is that between the items with numbers divisible by 3 and/or 4, and the remaining items.

Another formula is "Cronbach's alpha", sometimes less accurately called the "internal consistency method". This uses the covariances between all of the individual test items to estimate the mean of all possible split-half reliabilities, which is effectively a lower bound of the test's actual reliability. A simplified version of "alpha", for tests with only dichotomous items, is called "Kuder-Richardson 20". Note that Cronbach's alpha itself is applicable to such tests too and gives the same result. Although "alpha" is considered a better indicator than the split-half reliability, it is more limited in its use, and requires the items to be scored uniformly; typically one uses values from 0 to 1. If some items give scores outside that range, the outcome of Cronbach's alpha may be misleading (it may even fall outside the 0 to 1 range), so one needs to be alert to that in order not to use it incorrectly. In such cases, the split-half method can be used.
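A sketch of Cronbach's alpha in its algebraically equivalent item-variance form (the data and names are invented for the example; computed this way, alpha coincides with Kuder-Richardson 20 when the items are dichotomous):

```python
# Sketch of Cronbach's alpha via the item-variance form:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
# For dichotomous (0/1) items this coincides with Kuder-Richardson 20.

def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """scores: rows = persons, columns = uniformly scored items."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented dichotomous example data (4 persons, 3 items):
data = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
print(round(cronbach_alpha(data), 3))  # 0.75
```

Note that the item variances here are well defined for any scoring scheme, which is why the arithmetic itself does not fail for non-uniformly scored items; it is the interpretation of the result that becomes unreliable in that case, as described above.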

What should not be used to estimate reliability is the correlation between sections of a test containing different item types, such as between a numerical and a visual-spatial section. Such correlations are informative but should be understood as measures of internal consistency, which is a concept differing from reliability in that it indicates the polarization within a test, while reliability indicates the coherence.

Another misunderstanding about reliability and internal consistency is that a very high reliability coefficient is bad because, supposedly, it means that "all items measure the same" and there is not enough diversity of items. This is mistaken because test reliability is also a function of test length; *ceteris paribus*, lengthening a test raises its reliability (the true-score standard deviation grows in proportion with test length, while the error standard deviation grows only with its square root), so a sufficiently long test will have very high reliability despite containing very diverse items. And that is the way to go.
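To make the length effect concrete, here is a small sketch using the general Spearman-Brown prediction formula, of which the doubling formula given earlier is the twofold special case (the starting reliability of .5 is an invented example value):

```python
# Spearman-Brown prediction: reliability of a test lengthened n-fold with
# comparable items, given the current reliability r.
def lengthened_reliability(r, n):
    return n * r / (1 + (n - 1) * r)

# Even a modest starting reliability of .5 reaches .9 with ninefold length:
for n in (1, 2, 4, 9):
    print(n, round(lengthened_reliability(0.5, n), 3))
# prints 0.5, 0.667, 0.8, and 0.9 respectively
```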

The square root of a test's reliability forms the upper limit of the test's possible correlation with any other variable. This is so because a test can not correlate higher with an external variable than it correlates with its own true scores; when the other variable is itself imperfectly measured, the limit is lower still, namely the square root of the product of the two reliabilities. Therefore, in order to have high validity (which means correlation with other variables) a test needs to have very high reliability. For a full-scale or stand-alone I.Q. test, a reliability coefficient of .9 is the minimum to strive for. A common mistake in test construction is to make the test too short, resulting in too low reliability and thus too low validity.

Sceptics of I.Q. testing sometimes claim that typical I.Q. tests have low reliability and validity, and cite values in the order of .4 or .5. These claims are misleading; upon closer investigation, one will discover those claims to concern mere one-sided subtests of comprehensive tests like the Wechsler Adult Intelligence Scale, the full-scale reliability of which is always well above .9.

Theoretically, reliability represents the proportion of true score variance (σ^{2}_{t}) within a test's actual variance (σ^{2}), in other words:

*r*_{xx} = σ^{2}_{t} / σ^{2}

It follows that the proportion of error variance within a test's actual variance is (1 - *r*_{xx}), assuming that test score variance consists of a true score component and an error component. It also follows that a test's reliability is the square of the correlation between "true scores" and actual scores.
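These relations can be verified with a small simulation (a sketch only; the true-score and error distributions are invented for the example):

```python
import random

# Simulate classical test theory: observed score = true score + error.
random.seed(42)
N = 50000
TRUE_SD, ERROR_SD = 15.0, 7.0
true = [random.gauss(100, TRUE_SD) for _ in range(N)]
obs = [t + random.gauss(0, ERROR_SD) for t in true]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (variance(x) * variance(y)) ** 0.5 / n

r_xx = variance(true) / variance(obs)   # true score variance / actual variance
r_true_obs = pearson(true, obs)         # correlation of true with actual scores
# r_xx approximates TRUE_SD**2 / (TRUE_SD**2 + ERROR_SD**2) = 225/274 ≈ .82,
# and r_true_obs squared approximates r_xx, as the theory requires.
print(round(r_xx, 3), round(r_true_obs ** 2, 3))
```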

Factors like answer leakage and fraud may actually increase the computed reliability of a test (by increasing inter-item correlations), so one should be aware that high reliability does not imply that all is well with the test.

To estimate the reliability of a test consisting of subtests with known reliabilities *r*_{1}, *r*_{2}, *r*_{3}, …, an expanded form of the Spearman-Brown formula can be used. Assuming the subtests' true scores measure the same trait, and writing *S* = √*r*_{1} + √*r*_{2} + √*r*_{3} + … and *T* = *r*_{1} + *r*_{2} + *r*_{3} + … for *k* subtests:

*r*_{123…} = *S*^{2} / (*k* + *S*^{2} − *T*)

For subtests of equal reliability *r* this reduces to the general Spearman-Brown prediction *k* × *r* / (1 + (*k* − 1) × *r*), of which the doubling formula given earlier is the case *k* = 2.
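A sketch of this computation (assuming equal-length subtests whose true scores measure the same trait; the reliability values are invented example figures):

```python
# Compound reliability for equal-length subtests measuring the same trait.
# For equal subtest reliabilities this reduces to the general Spearman-Brown
# prediction k*r / (1 + (k-1)*r).
def compound_reliability(rs):
    s = sum(r ** 0.5 for r in rs)   # sum of square roots of the reliabilities
    t = sum(rs)
    k = len(rs)
    return s * s / (k + s * s - t)

# Three subtests of reliability .7 combine to a far higher compound value:
print(round(compound_reliability([0.7, 0.7, 0.7]), 3))  # 0.875, not 0.7
```

Note how the naive average of the subtest reliabilities (.7) falls well short of the compound value.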

This formula assumes subtests of equal length (in the case of a compound score which is a simple sum of subtest raw scores). When subtests have different lengths, a more complex form of the formula is needed, with weight coefficients representing the subtests' lengths in front of the relevant terms in the above formula. The placement of the weight coefficients should be self-explanatory to whoever understands the formula.

Notice that the subtest reliabilities can not be simply averaged to obtain the compound reliability *r*_{123…}; that would yield a marked underestimation of the true compound reliability by failing to take into account that reliability increases with test length when subtests are combined into a total score. This mistake is sometimes made by I.Q. sceptics who cite (low) subtest reliabilities of large comprehensive tests to discredit I.Q. testing but fail to understand that full-scale reliability is not a simple average of subtest reliabilities.