Sex differences on high-range I.Q. tests analysed

© September 2013 Paul Cooijmans


When dealing with high-range tests, two types of sex difference are conspicuous:

  1. More males than females take the tests. Since high-range candidates are self-selected, this difference in participation is a behavioural variable in its own right. Its significance is staggering: the probability that the participation difference is mere coincidence is almost infinitely small, on the order of thirty standard deviations removed from the mean in a normal distribution;
  2. On most of the tests, males score on average higher than females, to varying degrees. Because of the low number of females, this difference is less significant than the difference in participation.
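
To give a feel for how a significance figure of that magnitude can arise, here is a minimal sketch that compares an observed male count with a null hypothesis of equal participation, using the normal approximation to the binomial distribution. The counts are hypothetical and chosen only for illustration, the function name is made up for this example, and the computation behind the figure quoted above may well have been different.

```python
import math

def participation_z(n_male: int, n_female: int, p_null: float = 0.5) -> float:
    """Z-score of the observed male count under a null of equal participation.

    Uses the normal approximation to the binomial distribution.
    p_null = 0.5 (equal participation) is an assumption of this sketch.
    """
    n = n_male + n_female
    expected = n * p_null                         # expected male count under the null
    sd = math.sqrt(n * p_null * (1.0 - p_null))   # binomial standard deviation
    return (n_male - expected) / sd

# Hypothetical counts, not taken from the study.
z = participation_z(n_male=900, n_female=100)
print(f"z = {z:.1f} standard deviations")
```

With these made-up counts the result is roughly 25 standard deviations; the z-score grows with the square root of the number of submissions, so a larger sample with the same proportion yields a still larger value.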

The goal of this study is to find possible relations between either of the two sex differences on the one hand, and any of a number of test properties on the other hand. The test properties in question are hardness, estimated g factor loading, and contents type. What follows is first a legend of the variables that appear in the tables, then a number of tables with numerical results, and finally some conclusions in verbal form.

The sex of a test candidate is simply that which the candidate reported when registering to take the tests, and can be either female or male. No "in between" option has been offered to date.


The following fields appear in the main table of test variables:

Remark: The male-female score difference could also have been computed in other ways, such as using medians instead of means, or expressing it in raw score standard deviations instead of protonorms. Whichever way one chooses, there remains a fair amount of error caused by the low number of females per test. The current mode of calculating M-F has been chosen because it allows a conversion of the difference to I.Q. points (1 protonorm point is on average .18 I.Q. points in the range where the female and male averages lie). Means are used instead of medians because, with a very low number of (female) candidates on most of the tests, the mean seems to give a better estimate of the average per test.
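
The conversion mentioned in this remark is a simple multiplication. The sketch below applies the stated average factor of .18 to the protonorm differences quoted elsewhere in this report (41, 43.5, and 50 points); the constant and function names are invented for the example.

```python
# Average I.Q. points per protonorm point, as stated in the remark above.
PROTONORM_TO_IQ = 0.18

def protonorm_diff_to_iq(diff_protonorm: float) -> float:
    """Convert a male-female difference in protonorm points to I.Q. points."""
    return diff_protonorm * PROTONORM_TO_IQ

# Differences quoted in this report, in protonorm points.
for diff in (41, 43.5, 50):
    print(f"{diff} protonorm points ~ {protonorm_diff_to_iq(diff):.1f} I.Q. points")
```

Since the factor is an average that holds only in the range where the female and male means lie, the converted values are estimates rather than exact figures.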

The results

The following correlations between the various variables (columns in the table) have been computed:

Correlates of PropM (the participation difference)

Correlate of M-F (the score difference)

This moderate correlation between the two indicators of the score difference reflects the amount of error in many a test's score difference, caused by the low number of females on most of the tests.
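
For readers who want to see what the second indicator amounts to: rsex × score is the correlation of sex with score on a test. Below is a minimal sketch that computes it as an ordinary Pearson correlation with sex coded numerically. The coding (male = 1, female = 0), the function name, and the scores are assumptions made for illustration only.

```python
import math

def sex_score_correlation(is_male: list[int], scores: list[float]) -> float:
    """Pearson correlation between sex (assumed coding: male = 1, female = 0) and score.

    A positive value means that males score higher on average.
    """
    n = len(scores)
    if n != len(is_male) or n < 2:
        raise ValueError("need two equally long lists with at least two candidates")
    mean_sex = sum(is_male) / n
    mean_score = sum(scores) / n
    cov = sum((s - mean_sex) * (x - mean_score) for s, x in zip(is_male, scores))
    ss_sex = sum((s - mean_sex) ** 2 for s in is_male)
    ss_score = sum((x - mean_score) ** 2 for x in scores)
    if ss_sex == 0 or ss_score == 0:
        # Happens, for example, when a test has no female candidates at all.
        raise ValueError("correlation is undefined without variation in both variables")
    return cov / math.sqrt(ss_sex * ss_score)

# Hypothetical raw scores of three males and one female on a single test.
r = sex_score_correlation([1, 1, 1, 0], [10, 12, 14, 8])
print(f"r = {r:.3f}")
```

With only one or a few females on a test, a single score can move this value considerably, which is the source of error referred to above.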

Remaining correlations

None of these are significant:

Note the last one: hardness and g loading are apparently not related, but are distinct properties. A highly g-loaded test is not necessarily a difficult test.

Median score difference per contents type

Contents type   In prot.   In I.Q.   In rsex × score

Apparently, and contrary to females' preference for one-sided verbal tests, the male-female score difference is on average smallest on heterogeneous tests, containing a mixture of item types. The somewhat anomalous behaviour of Numerical (In prot. and I.Q.) results from the fact that there is only one numerical test in the study, so that the error of the score difference on that one test is not averaged out. This table also suggests that rsex × score is the better indicator of the sex difference.
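
A table of medians per contents type can be produced by grouping the per-test differences by type and taking the median of each group. The sketch below shows the idea; the values are hypothetical and are not the study's data, and the function name is made up for the example.

```python
from collections import defaultdict
from statistics import median

def median_by_content(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Group per-test score differences by contents type and return each group's median."""
    groups: dict[str, list[float]] = defaultdict(list)
    for content, diff in rows:
        groups[content].append(diff)
    return {content: median(diffs) for content, diffs in groups.items()}

# Hypothetical (contents type, M-F in protonorm points) pairs, for illustration only.
rows = [
    ("Verbal", 40.0),
    ("Verbal", 50.0),
    ("Heterogeneous", 20.0),
    ("Heterogeneous", 30.0),
    ("Heterogeneous", 25.0),
    ("Numerical", 60.0),
]
for content, med in median_by_content(rows).items():
    print(f"{content}: {med}")
```

A type represented by a single test, like the numerical one here, simply returns that one test's value, so whatever error that value contains passes straight into the table.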

The computed values per test

Sorted by PropM

Test   PropM   Hardness   g   Cont   M-F   rsex × score

n = 30


Looking at the table with computed values per test, one thing stands out immediately: The four tests at the bottom, that have drawn the most female candidates (PropM lower than .9), have four things in common:

  1. They are easy (low hardness), in fact about the easiest tests in the table;
  2. They are multiple-choice, in fact about the only multiple-choice tests in the table;
  3. They are verbal;
  4. They have rather low estimated g factor loadings, in fact about the lowest in the table.

Another thing that can be observed about these tests is that, despite their popularity with females, female performance on them is not better but even slightly worse than over all of the tests combined. The median male-female difference on these four tests is 43.5 protonorm points, which is about 7.8 I.Q. points. The median of the correlations of sex with score is .255, which is even higher, relative to the corresponding value for all of the tests together (see "Total" in the table "Median score difference per contents type"). In other words, female candidates appear to have had an unlucky hand in their choice of tests. In actuality, the male-female difference is smallest on heterogeneous tests.

On the correlates of PropM

The highly significant correlation of .60 with g factor loadings implies that female candidates are seeking out tests with low g loadings, whether they are aware of it or not. The possibility that they somehow cause those loadings to be low by participating can be excluded with certainty, because the females who took the lowly g-loaded tests in this study had as good as no known scores on other tests, and have therefore not contributed to the computation of the estimated g factor loadings of those tests (which are based on a test's correlations with other tests). The fact that they had no scores on other tests is a logical consequence of the rareness of female high-range test candidates. It follows that females are actively avoiding tests with high g loadings, as if they possess a "sixth sense" - call it a g-spot - that enables them to detect how much g a task requires.
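
As a rough check on what "highly significant" means for a correlation of .60, one can use the Fisher z-transformation with a normal approximation. The sketch below assumes that the correlation is computed over n = 30 tests (the number of tests in the table); that n, the two-sided test, and the function name are assumptions of this example and not necessarily the method used for the study.

```python
import math
from statistics import NormalDist

def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson correlation r over n pairs.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is approximately
    standard normal when the true correlation is zero.
    """
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# r = .60 as reported; n = 30 is an assumption (the number of tests in the table).
p = correlation_p_value(0.60, 30)
print(f"two-sided p is approximately {p:.4f}")
```

Under these assumptions the p-value lies well below .001, which is in line with the description of the correlation as highly significant.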

The low but significant correlation of PropM with hardness implies that female candidates are also avoiding difficult tests. This, together with their avoidance of highly g-loaded tests, provides a clue as to the low participation of females in high-range testing on the whole. High-range testing is all about difficult and g-loaded tests, and that appears to be why females avoid it.

On the correlates of M-F and rsex × score

The absence of significant correlations with the two indicators of the male-female score difference means that the sex difference is not greater (nor smaller) on difficult tests, and not greater (nor smaller) on highly g-loaded tests. In combination with the foregoing, this suggests that females are avoiding difficult and g-loaded tests for no reason. With regard to test hardness this is entirely logical; after all, it should make no difference for one's score whether one takes an easy or a hard test. As long as one's true level falls within the measured range, one's score should on average be the same on either hardness level of testing.

Also, the absence of a significant correlation with g factor loadings means that there is currently no evidence that the male-female difference on high-range tests is a difference in g. Had there been a significant positive correlation with g loadings, this would have suggested that the difference was at least partly due to a sex difference in g. The absence of such a correlation leaves open the possibility that the observed sex difference lies partly or wholly in non-g factors. To understand this, one needs to know that while most of the variance in I.Q. test scores is accounted for by the g factor (typically 60 to 70 %), other portions of the variance are due to so-called "group factors" like the verbal, numerical, and spatial factor. The observed sex difference may lie in group factors as well as, or instead of, in g.

On the score difference per contents type

The median sex difference is markedly smaller on heterogeneous tests (with a mixture of item types) than on homogeneous tests (with only one item type), as both indicators of the sex difference reveal. This lends support to the possibility that the difference lies partly or wholly in group factors rather than in g.

The total median of the size of the sex difference (over all of the 28 tests that have female scores) is 41 protonorm points in this sample of tests, an estimated 7.4 I.Q. points. This is smaller than the difference of 50 protonorm points and 9 I.Q. points found in the current protonorm table (from 2011 for males and 2013 for females), and smaller than the difference of 11 I.Q. points found in a 2004 report. These studies used different samples and different methods of computing the difference. The variation in outcome is due to "sampling error" rather than an actual shrinking of the difference; an analysis of the development of the sex difference over time would be problematic because of the rareness of female test scores. For clarity: in the present study, all existing scores have simply been used for all of the tests.

The conclusions summarized

  1. The male-female participation difference on high-range I.Q. tests is related to the g factor loading and to the hardness of the tests, such that females apparently avoid highly g-loaded tests and avoid difficult tests. This phenomenon can logically also explain the low female participation on high-range tests in general.
  2. The male-female score difference on high-range tests is smallest on heterogeneous tests (containing a mixture of item types), and largest on homogeneous tests (containing only one item type, almost regardless of which item type that is).
  3. The male-female score difference on high-range tests has no significant correlation with the tests' g factor loadings; there is therefore no evidence that the score difference is a difference in g.
  4. Conclusions (2) and (3), combined, support the possibility that the observed male-female score difference lies partly or wholly in group factors (e.g., verbal, numerical, spatial) rather than in g.
  5. The preferred test choice strategy of females - one-sided verbal tests, multiple-choice tests, lowly g-loaded tests, easy tests - is keeping them from taking the tests on which the male-female score difference is smallest.
  6. g factor loading and hardness are not correlated across tests, but are independent properties; therefore a highly g-loaded test need not be a difficult test.