© Paul Cooijmans

Reliability should be understood as the correlation between two (hypothetical) very similar versions of the test. Some (less accurately) call this "test-retest correlation". A typical reliability of a full-scale or stand-alone I.Q. test is .9.

Although one may use two actual test versions to assess reliability, reliability is mostly computed internally, for instance by splitting the test into halves (typically the odd- and even-numbered items, but other divisions may be used too) and computing the correlation therebetween (*r*_{oe}), after which a correction is applied to obtain the reliability (*r*_{xx}) for the test as a whole. This is called the "split-half method". The correction meant here is done via the Spearman-Brown formula:

*r*_{xx} = (2 × *r*_{oe}) / (1 + *r*_{oe})

The split-half method is robust in that it works for almost any test, even when the item scores are not uniform (*e.g.*, when some items yield more points than others, or yield negative points). Its informative value can be improved by providing it for two different ways of splitting the test into halves. An alternative to the odd-even split is that between the items with numbers divisible by 3 and/or 4, and the remaining items.

Another formula is "Cronbach's alpha", sometimes less accurately called the "internal consistency method". This uses the covariances between all of the individual test items to estimate the mean of all possible split-half reliabilities, which is effectively a lower bound of the test's actual reliability. A simplified version of "alpha", for tests with only dichotomous items, is called "Kuder-Richardson 20". Note that Cronbach's alpha itself is applicable to such tests too and gives the same result. Although "alpha" is considered a better indicator than the split-half reliability, it is more limited in its use, and requires the items to be scored uniformly; typically one uses values from 0 to 1. If some items give scores outside that range, the outcome of Cronbach's alpha may be misleading (it may even fall outside the 0 to 1 range), so one needs to be alert to that in order not to use it incorrectly. In such cases, the split-half method can be used.
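A sketch of Cronbach's alpha in its algebraically equivalent item-variance form (the data and names are invented for the example; computed this way, alpha coincides with Kuder-Richardson 20 when the items are dichotomous):

```python
# Sketch of Cronbach's alpha via the item-variance form:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
# For dichotomous (0/1) items this coincides with Kuder-Richardson 20.

def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """scores: rows = persons, columns = uniformly scored items."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented dichotomous example data (4 persons, 3 items):
data = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
print(round(cronbach_alpha(data), 3))  # 0.75
```

Note that the item variances here are well defined for any scoring scheme, which is why the arithmetic itself does not fail for non-uniformly scored items; it is the interpretation of the result that becomes unreliable in that case, as described above.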

What should not be used to estimate reliability is the correlation between sections of a test containing different item types, such as between a numerical and a visual-spatial section. Such correlations are informative but should be understood as measures of internal consistency, which is a concept differing from reliability in that it indicates the polarization within a test, while reliability indicates the coherence.

Another misunderstanding about reliability and internal consistency is that a very high reliability coefficient is bad because, supposedly, it means that "all items measure the same" and there is not enough diversity of items. This is mistaken because test reliability is also a function of test length; *ceteris paribus*, lengthening a test raises its reliability (the true-score standard deviation grows in proportion with test length, while the error standard deviation grows only with its square root), so a sufficiently long test will have very high reliability despite containing very diverse items. And that is the way to go.
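To make the length effect concrete, here is a small sketch using the general Spearman-Brown prediction formula, of which the doubling formula given earlier is the twofold special case (the starting reliability of .5 is an invented example value):

```python
# Spearman-Brown prediction: reliability of a test lengthened n-fold with
# comparable items, given the current reliability r.
def lengthened_reliability(r, n):
    return n * r / (1 + (n - 1) * r)

# Even a modest starting reliability of .5 reaches .9 with ninefold length:
for n in (1, 2, 4, 9):
    print(n, round(lengthened_reliability(0.5, n), 3))
# prints 0.5, 0.667, 0.8, and 0.9 respectively
```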

The square root of a test's reliability forms the upper limit of the test's possible correlation with any other variable. This is so because a test can not correlate higher with an external variable than it correlates with its own true scores; when the other variable is itself imperfectly measured, the limit is lower still, namely the square root of the product of the two reliabilities. Therefore, in order to have high validity (which means correlation with other variables) a test needs to have very high reliability. For a full-scale or stand-alone I.Q. test, a reliability coefficient of .9 is the minimum to strive for. A common mistake in test construction is to make the test too short, resulting in too low reliability and thus too low validity.

Sceptics of I.Q. testing sometimes claim that typical I.Q. tests have low reliability and validity, and cite values in the order of .4 or .5. These claims are misleading; upon closer investigation, one will discover those claims to concern mere one-sided subtests of comprehensive tests like the Wechsler Adult Intelligence Scale, the full-scale reliability of which is always well above .9.

Theoretically, reliability represents the proportion of true score variance (σ^{2}_{t}) within a test's actual variance (σ^{2}), in other words:

*r*_{xx} = σ^{2}_{t} / σ^{2}

It follows that the proportion of error variance within a test's actual variance is (1 - *r*_{xx}), assuming that test score variance consists of a true score component and an error component. It also follows that a test's reliability is the square of the correlation between "true scores" and actual scores.
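These relations can be verified with a small simulation (a sketch only; the true-score and error distributions are invented for the example):

```python
import random

# Simulate classical test theory: observed score = true score + error.
random.seed(42)
N = 50000
TRUE_SD, ERROR_SD = 15.0, 7.0
true = [random.gauss(100, TRUE_SD) for _ in range(N)]
obs = [t + random.gauss(0, ERROR_SD) for t in true]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (variance(x) * variance(y)) ** 0.5 / n

r_xx = variance(true) / variance(obs)   # true score variance / actual variance
r_true_obs = pearson(true, obs)         # correlation of true with actual scores
# r_xx approximates TRUE_SD**2 / (TRUE_SD**2 + ERROR_SD**2) = 225/274 ≈ .82,
# and r_true_obs squared approximates r_xx, as the theory requires.
print(round(r_xx, 3), round(r_true_obs ** 2, 3))
```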

Factors like answer leakage and fraud may actually increase the computed reliability of a test (by increasing inter-item correlations), so one should be aware that high reliability does not imply that all is well with the test.

To estimate the reliability of a test consisting of subtests with known reliabilities *r*_{1}, *r*_{2}, *r*_{3}, …, an expanded form of the Spearman-Brown formula can be used. Assuming the subtests' true scores measure the same trait, and writing *S* = √*r*_{1} + √*r*_{2} + √*r*_{3} + … and *T* = *r*_{1} + *r*_{2} + *r*_{3} + … for *k* subtests:

*r*_{123…} = *S*^{2} / (*k* + *S*^{2} − *T*)

For subtests of equal reliability *r* this reduces to the general Spearman-Brown prediction *k* × *r* / (1 + (*k* − 1) × *r*), of which the doubling formula given earlier is the case *k* = 2.
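A sketch of this computation (assuming equal-length subtests whose true scores measure the same trait; the reliability values are invented example figures):

```python
# Compound reliability for equal-length subtests measuring the same trait.
# For equal subtest reliabilities this reduces to the general Spearman-Brown
# prediction k*r / (1 + (k-1)*r).
def compound_reliability(rs):
    s = sum(r ** 0.5 for r in rs)   # sum of square roots of the reliabilities
    t = sum(rs)
    k = len(rs)
    return s * s / (k + s * s - t)

# Three subtests of reliability .7 combine to a far higher compound value:
print(round(compound_reliability([0.7, 0.7, 0.7]), 3))  # 0.875, not 0.7
```

Note how the naive average of the subtest reliabilities (.7) falls well short of the compound value.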

This formula assumes subtests of equal length (in the case of a compound score which is a simple sum of subtest raw scores). When subtests have different lengths, a more complex form of the formula is needed, with weight coefficients representing the subtests' lengths in front of the relevant terms in the above formula. The placement of the weight coefficients should be self-explanatory to whoever understands the formula.

Notice that the subtest reliabilities can not be simply averaged to obtain the compound reliability *r*_{123…}; that would yield a marked underestimation of the true compound reliability by failing to take into account that reliability increases with test length when subtests are combined into a total score. This mistake is sometimes made by I.Q. sceptics who cite (low) subtest reliabilities of large comprehensive tests to discredit I.Q. testing but fail to understand that full-scale reliability is not a simple average of subtest reliabilities.