Compound variables

© Paul Cooijmans

Explanation

A compound variable is a variable created by combining two or more single variables. When those variables are expressed on different scales, one or more of them must first be converted to a common scale; combining them without such conversion is nonsensical.
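As an illustration of what conversion to a common scale can look like, here is a minimal Python sketch. It assumes a linear conversion based on a norm group's mean and standard deviation, and a target scale with mean 100 and standard deviation 15; the text above does not prescribe any particular conversion method. The function name, its parameters and the example numbers are hypothetical.

```python
def to_common_scale(raw, norm_mean, norm_sd, target_mean=100.0, target_sd=15.0):
    """Linearly convert a raw score to a target scale.

    Assumption: the raw scale's mean and standard deviation in some
    norm group are known. All default values are illustrative only.
    """
    z = (raw - norm_mean) / norm_sd      # distance from the norm mean, in SD units
    return target_mean + target_sd * z   # the same distance expressed on the target scale


# Hypothetical raw scores on a test whose norm group has mean 20 and SD 5.
raw_scores = [22, 25, 30, 27]
converted = [to_common_scale(r, norm_mean=20, norm_sd=5) for r in raw_scores]
print(converted)
```

Once every constituent variable has been converted this way, the scores are at least expressed in the same units and can be put into one compound.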

When to use a compound variable

A common reason to create a compound variable is a lack of data for each of a number of single variables. When there is so little data for each that they cannot be used for statistical purposes on their own, one may combine them into a compound to arrive at enough data. Notice that this is a "second choice" method; ideally one wants to use each variable on its own.

Disadvantage of compound variables

A compound variable will generally have lower correlations with any other variable than its constituent single variables would have. This is so because the constituent variables, even when converted to a common scale, will generally differ from each other in absolute level and in loadings on various group factors. Even when the constituent variables each have a perfect correlation with a certain target variable, their compound will normally have a lower correlation with it.

An example will make this clear; this example is purposely somewhat exaggerated to make visible what is going on in a compound variable:

Suppose we have an object I.Q. test whereon four candidates have the respective set of scores (105, 110, 120, 115).

Suppose there is another test on which those candidates have scored respectively (106, 111, 121, 116), expressed on the same scale.

And again another test whereon they scored (125, 130, 140, 135), expressed on the same scale.

Notice that each of the other tests has a correlation of 1.00 (perfect, unity) with the object I.Q. test, and that the two other tests also correlate 1.00 with each other; the scores on each differ from those on the object test only by a constant (1 and 20 points, respectively). If you do not see and understand this at once, you have not yet understood what a correlation is; see the relevant explanation.

Now we create a compound variable of the two other tests, with first the two scores of the first candidate, then those of the second, and so on: (106, 125, 111, 130, 121, 140, 116, 135).

The therewith paired scores on the object I.Q. test are (105, 105, 110, 110, 120, 120, 115, 115).

The correlation of the compound variable with the object test is only .51, rather than the 1.00 of each of the single variables. This is so because the two other tests yield scores on different absolute levels, though perfectly correlated.
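The figures in this example can be checked with a short Python sketch. It assumes that "correlation" here means the Pearson product-moment correlation; the function and variable names are chosen for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


object_test = [105, 110, 120, 115]
other_test_1 = [106, 111, 121, 116]
other_test_2 = [125, 130, 140, 135]

# Compound: both scores of the first candidate, then both of the second, and so on.
compound = [score for pair in zip(other_test_1, other_test_2) for score in pair]
# Each object test score is repeated so that it pairs with both compound scores.
paired_object = [score for score in object_test for _ in range(2)]

r_1 = pearson(other_test_1, object_test)
r_2 = pearson(other_test_2, object_test)
r_compound = pearson(compound, paired_object)

print(round(r_1, 2), round(r_2, 2), round(r_compound, 2))  # prints: 1.0 1.0 0.51
```

The 20-point difference in level between the two other tests adds variance to the compound that is unrelated to the object test, and that is what pulls the correlation down.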

The conclusion is that it is better to compute the correlations of the object test with each of the other tests separately; a (weighted) average of the resulting correlations will be superior (as an indicator) to the object test's correlation with a compound of the other tests' scores. When one occasionally uses a compound variable out of necessity (lack of data for each single variable), one should keep in mind that its correlations will be attenuated through the act of compounding.
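A weighted average of separately computed correlations might be sketched in Python as follows. The choice of weights is an assumption: here each correlation is weighted by the number of score pairs on which it is based, which the text above does not specify. The function name is hypothetical.

```python
def weighted_mean_correlation(correlations, weights):
    """Weighted arithmetic mean of correlations.

    Assumption: weights are the numbers of score pairs behind each
    correlation; other weighting schemes are possible.
    """
    total_weight = sum(weights)
    return sum(r * w for r, w in zip(correlations, weights)) / total_weight


# The two separate correlations from the example, each based on four candidates.
separate_correlations = [1.00, 1.00]
pair_counts = [4, 4]

print(weighted_mean_correlation(separate_correlations, pair_counts))  # prints: 1.0
```

For the example this gives 1.00, against .51 for the compound, which illustrates why the separate correlations are the better indicator.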

