Sex differences on high-range I.Q. tests analysed

Introduction

When dealing with high-range tests, two types of sex difference are conspicuous:

More males than females take the tests. Since high-range candidates are self-selected, this difference in participation is a behavioural variable in its own right. Its statistical significance is extreme: the probability that the participation difference is mere coincidence is vanishingly small, on the order of thirty standard deviations removed from the mean of a normal distribution;
On most of the tests, males score higher on average than females, to varying degrees. Because of the low number of females, this difference is less significant than the difference in participation.
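
The report does not state how the figure of about thirty standard deviations was obtained. One common way to express such a difference, sketched here purely as an assumption, is a normal-approximation z-score of the observed number of male candidates against a null hypothesis of equal participation (p = 0.5). The function name participation_z and the counts in the usage lines are hypothetical and are not the report's data.

```python
import math

def participation_z(males: int, total: int, p_null: float = 0.5) -> float:
    """z-score of the observed male count under a binomial null hypothesis.

    Assumption: under the null, each candidate is male with probability
    p_null. The report does not say which null hypothesis it used.
    """
    expected = total * p_null
    sd = math.sqrt(total * p_null * (1.0 - p_null))
    return (males - expected) / sd

# Hypothetical counts for illustration only (not taken from the report).
z = participation_z(males=950, total=1000)
print(f"z = {z:.2f} standard deviations from the null expectation")
```

With a fixed proportion of males, the z-score grows with the square root of the number of candidates, so a large pool of submissions can push it to very high values.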

The goal of this study is to find possible relations between either of the two sex differences on the one hand, and any of a number of test properties on the other hand. The test properties in question are hardness, estimated g factor loading, and contents type. What follows is first a legend of the variables that appear in the tables, then a number of tables with numerical results, and finally some conclusions in verbal form.

The sex of a test candidate is simply that which the candidate reported when registering to take the tests, and can be either female or male. An "Intersex" option has been offered for about a decade now, but has been chosen only once to date.

Legend

The following fields appear in the main table of test variables:

Test - The name of a high-range test. Only tests that have received at least 36 submissions are included; the actual numbers of submissions range from 36 to 413 per test;
PropM - The proportion of males among the candidates for a given test; this quantifies the participation difference. Notice that a value of 1.00 means that no females took the test; in those cases, the rightmost column, dealing with the sex difference in score, is logically empty;
Hardness - Hardness of a test; to be understood as the proportion of the possible score range that is on average missed on that test;
g - The estimated g factor loading of the test;
Cont - The contents type of the test, abbreviated as v (verbal), n (numerical), s (spatial), and l (logical);
r_{sex × score} - The correlation between sex (0 for female and 1 for male) and raw score on the test. This is the best within-test indicator of the sex difference.
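
The three computed fields can be sketched from per-candidate data as follows. This is a minimal illustration under stated assumptions, not the report's own procedure: hardness is assumed to be (maximum possible score minus mean score) divided by the possible score range, r_{sex × score} is taken to be an ordinary Pearson correlation with sex coded 0/1, and the function names and the sample data are hypothetical.

```python
import math
from statistics import mean

def prop_m(sexes):
    """Proportion of males; sexes is a list of 0 (female) and 1 (male)."""
    return sum(sexes) / len(sexes)

def hardness(scores, min_score, max_score):
    """Share of the possible score range that is missed on average.

    Assumption: (max - mean) / (max - min); the report gives no formula.
    """
    return (max_score - mean(scores)) / (max_score - min_score)

def pearson_r(xs, ys):
    """Pearson correlation; None when either variable has no variance."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # E.g. PropM = 1.00: sex is constant, so the field stays empty.
        return None
    return sxy / math.sqrt(sxx * syy)

# Hypothetical submissions for one test (sex code, raw score).
sexes = [1, 1, 1, 0, 1, 0, 1, 1]
scores = [12, 20, 15, 9, 18, 14, 22, 10]

print("PropM:", prop_m(sexes))
print("Hardness:", hardness(scores, min_score=0, max_score=40))
print("r_sex_x_score:", pearson_r(sexes, scores))
```

Returning None when sex does not vary mirrors the empty rightmost column for tests with a PropM of 1.00.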

Remark: An earlier version of this report also contained the actual sex difference per test (male minus female mean in protonorm points and I.Q.), but analysis proved this to be an inferior and insignificant indicator of the within-test sex difference compared to r_{sex × score}. To know the sex difference in I.Q. over all tests combined, consult the protonorm table and look up the within-sex medians. At the time of the present report, the difference found there is 5 I.Q. points, favouring males. To know the sex difference in I.Q. for any particular test, consult the statistical report for that test.

The results

The following correlations between the variables (columns in the table) have been computed:

Correlations significant beyond the .05 level:

PropM × g = .50 (n = 54)
hardness × g = .28 (n = 54)

So, the only sizeable and significant correlation is that between PropM (proportion of males among the candidates) and g loading, at .50. In other words, the more g-loaded a test is, the smaller the proportion of females among its candidates.

The correlation between hardness and g, while just significant, is very small, meaning there is little relation between those variables; a hard test is not necessarily a g-loaded test and vice versa (a not infrequent misconception is that g loading is something like hardness).

Remaining correlations

PropM × r_{sex × score} = -.27 (n = 52)
PropM × hardness = .26 (n = 54)
r_{sex × score} × g = -.14 (n = 52)
r_{sex × score} × hardness = .11 (n = 52)

The remaining correlations are small and not significant, meaning there is insufficient evidence that (1) the sex participation difference is related to the sex score difference, (2) the sex participation difference is related to test hardness, (3) the sex score difference is related to g loading, and (4) the sex score difference is related to test hardness.
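
The report does not say which significance test was applied to these correlations. As a rough check, the sketch below uses the Fisher z-transformation with a two-sided normal approximation; that choice of test and the function name fisher_p_value are assumptions made for illustration. The r and n values are the six reported above.

```python
import math
from statistics import NormalDist

def fisher_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0 via the Fisher z-transformation.

    Assumption: normal approximation with standard error 1 / sqrt(n - 3).
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Correlations and numbers of tests as reported in the text.
correlations = {
    "PropM x g": (0.50, 54),
    "hardness x g": (0.28, 54),
    "PropM x r_sex_x_score": (-0.27, 52),
    "PropM x hardness": (0.26, 54),
    "r_sex_x_score x g": (-0.14, 52),
    "r_sex_x_score x hardness": (0.11, 52),
}

for label, (r, n) in correlations.items():
    p = fisher_p_value(r, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: r = {r:+.2f}, n = {n}, p = {p:.3f} ({verdict} at .05)")
```

Under this approximation the first two correlations fall below the .05 level and the other four do not, which matches the grouping in the text. Several of the values lie close to the threshold, so a different test could shift the exact p-values.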

Median participation and score difference per contents type

Contents type	PropM	r_{sex × score}
Verbal	.93	.15
Numerical	.95	.15
Spatial	.96	.10
Logical	.98	.10
Heterogeneous	.95	.10
Total	.95	.115

The tests with the greatest female participation (lowest PropM) appear to have a somewhat greater sex difference (favouring males), but this table is not conclusive, just as the correlation PropM × r_{sex × score} reported above (which suggests the same) is not significant.
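
The medians per contents type can be recomputed from the per-test table. The sketch below assumes that "Verbal", "Numerical", "Spatial", and "Logical" refer to tests whose Cont field is exactly v, n, s, or l, and that tests with more than one letter count as heterogeneous (those are left out here for brevity). The values are copied from the per-test table; the names ROWS and medians_by_type are hypothetical.

```python
from statistics import median

# (contents type, PropM, r_sex_x_score) for the single-content tests in the
# per-test table; None marks a test that no females took.
ROWS = [
    ("v", 0.97, 0.14), ("v", 0.96, 0.16), ("v", 0.96, 0.18),
    ("v", 0.95, 0.01), ("v", 0.95, 0.04), ("v", 0.94, 0.02),
    ("v", 0.93, 0.21), ("v", 0.92, 0.11), ("v", 0.91, 0.14),
    ("v", 0.85, 0.26), ("v", 0.75, 0.26), ("v", 0.71, 0.15),
    ("v", 0.57, 0.24),
    ("n", 0.96, 0.15), ("n", 0.95, 0.07), ("n", 0.95, 0.16),
    ("s", 0.98, 0.19), ("s", 0.97, 0.07), ("s", 0.96, 0.10),
    ("s", 0.95, 0.03), ("s", 0.91, 0.19),
    ("l", 1.00, None), ("l", 0.98, 0.13), ("l", 0.95, 0.07),
]

def medians_by_type(rows):
    """Median PropM and median r_sex_x_score per contents type."""
    result = {}
    for cont in sorted({c for c, _, _ in rows}):
        props = [p for c, p, _ in rows if c == cont]
        # Tests without female candidates have no r value and are skipped.
        rs = [r for c, _, r in rows if c == cont and r is not None]
        result[cont] = (round(median(props), 3), round(median(rs), 3))
    return result

for cont, (med_prop, med_r) in medians_by_type(ROWS).items():
    print(f"{cont}: median PropM = {med_prop:.2f}, median r = {med_r:.2f}")
```

Under that reading of the contents types, the sketch reproduces the Verbal, Numerical, Spatial, and Logical rows of the table of medians.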

The computed values per test

Sorted by PropM (proportion of males among the test's candidates)

Test	PropM	Hardness	g	Cont	r_{sex × score}
Cooijmans Intelligence Test 5	1.00	.81	.77	vns
Reason	1.00	.27	.58	l
Test For Genius - Revision 2004	.99	.58	.80	vs	.29
Daedalus Test	.98	.61	.67	l	.13
A Paranoiac's Torture: Intelligence Test Utilizing Diabolic Exactitude	.98	.84	.73	vns	.10
Problems In Gentle Slopes of the third degree	.98	.23	.79	vs	.01
Combined Numerical and Spatial sections of Test For Genius - Revision 2010	.98	.60	.78	ns	.05
Problems In Gentle Slopes of the second degree	.98	.65	.74	ns	.24
Spatial section of Test For Genius - Revision 2004	.98	.48	.74	s	.19
Lieshout International Mesospheric Intelligence Test	.97	.41	.75	s	.07
Verbal section of Test For Genius - Revision 2004	.97	.70	.74	v	.14
Cooijmans Intelligence Test - Form 1	.97	.30	.64	vns	.01
Numerical section of Test For Genius - Revision 2010	.96	.76	.75	n	.15
The Sargasso Test	.96	.49	.72	vnsl	.02
Reflections In Peroxide	.96	.68	.86	ns	.25
Space, Time, and Hyperspace - Revision 2016	.96	.40	.77	s	.10
Test For Genius - Revision 2016	.96	.66	.87	vns	.30
Combined Numerical and Spatial sections of Test For Genius - Revision 2016	.96	.53	.81	ns	.15
The Test To End All Tests	.96	.65	.80	v	.16
Verbal section of Test For Genius - Revision 2016	.96	.76	.79	v	.18
Test For Genius - Revision 2010	.95	.67	.83	vs	-.02
Genius Association Test	.95	.45	.69	v	.01
Cooijmans On-Line Test - Two-barrelled version	.95	1.00	.70	vnl	.07
Verbal section of The Marathon Test	.95	.39	.82	v	.04
Numerical section of The Marathon Test	.95	.23	.84	n	.07
Spatial section of The Marathon Test	.95	.54	.83	s	.03
Reason - Revision 2008	.95	.23	.65	l	.07
Reason Behind Multiple-Choice - Revision 2008	.95	.20	.74	vl	.11
The Marathon Test	.95	.36	.88	vns	.10
Associative LIMIT	.95	.45	.78	vs	-.00
Numerical and spatial sections of The Marathon Test	.95	.39	.86	ns	.06
The Bonsai Test - Revision 2016	.95	.36	.83	vns	.17
Numbers	.95	.40	.62	n	.16
The Nemesis Test	.94	.87	.79	vnsl	.16
The Alchemist Test	.94	.58	.85	nl	.13
Association subtest of Long Test For Genius	.94	.63	.69	v	.02
Isis Test	.93	.96	.61	vn	.10
Test of the Beheaded Man	.93	.53	.87	vnsl	-.05
The Final Test	.93	.48	.76	v	.21
Cooijmans Intelligence Test - Form 4	.92	.62	.81	vnsl	.10
Analogies of Long Test For Genius	.92	.68	.73	v	.11
Long Test For Genius	.92	.59	.75	vns	.12
Cooijmans Intelligence Test - Form 2	.92	.44	.75	vns	.31
Cooijmans Intelligence Test - Form 3	.91	.50	.78	vns	-.01
Qoymans Multiple-Choice #5	.91	.18	.73	v	.14
Gliaweb Riddled Intelligence Test - Revision 2011	.91	.22	.77	vns	.08
Cartoons of Shock	.91	.42	.77	vns	.06
Space, Time, and Hyperspace	.91	.52	.64	s	.19
Gliaweb Riddled Intelligence Test (old version)	.91	.17	.37	vns	.12
Short Test For Genius	.90	.82	.74	vns	.30
Qoymans Multiple-Choice #4	.85	.24	.59	v	.26
Qoymans Multiple-Choice #3	.75	.30	.57	v	.26
Qoymans Multiple-Choice #1	.71	.29	.33	v	.15
Qoymans Multiple-Choice #2	.57	.35	.60	v	.24

n = 54

Observations

Looking at the table with computed values per test, one thing stands out immediately: The four tests at the bottom, which have drawn the most female candidates (PropM lower than .9), have five things in common:

They are easy (low hardness), in fact about the easiest tests in the table;
They are multiple-choice, in fact about the only multiple-choice tests in the table;
They are verbal;
They have rather low estimated g factor loadings, in fact about the lowest in the table;
They have a large sex score difference (high value of r_{sex × score}) favouring males.

This gives the impression that females have an unlucky hand in choosing tests to take, selecting tests that appear "easy" but on which they perform worst. But it is more complicated, because the lower female participation on the other tests may reflect a stricter selection of females: a smaller and higher-scoring group than those who took the bottom four tests.

On the correlation of PropM (proportion of males) with g

The highly significant correlation of .50 with g factor loadings implies that female candidates are seeking out tests with low g loadings, whether they are aware of it or not. The possibility that they somehow cause those loadings to be low by participating can be excluded with certainty, because the females who took the tests with low g loadings in this study had almost no known scores on other tests, and have therefore not contributed to the computation of the estimated g factor loadings of those tests (which are based on a test's correlations with other tests). The fact that they had almost no scores on other tests is a logical consequence of the rarity of female high-range test candidates. It follows that females are actively avoiding tests with high g loadings, as if they possess a "sixth sense" - call it a g-spot - that enables them to detect how much g a task requires.

This also provides a clue as to the low participation of females in high-range testing on the whole. High-range testing is all about g-loaded tests. In fact, in society at large, any activities, professions, and fields requiring high g suffer from a worrying under-representation of females, as demonstrated by decades-long, depressingly unsuccessful, campaigns for more women in high positions.

On r_{sex × score}

The absence of significant correlations with the indicators of the male-female score difference means that the sex difference is not greater (nor smaller) on difficult tests, and not greater (nor smaller) on highly g-loaded tests. In combination with the foregoing, this suggests that females are avoiding g-loaded tests for no reason.

Also, the absence of a significant correlation with g factor loadings means that there is currently no evidence that the male-female difference on high-range tests is a difference in g. Had there been a significant positive correlation with g loadings, this would have suggested that the difference was at least partly due to a sex difference in g. The absence of such a correlation (in fact, what little correlation there is, is negative, implying a smaller sex difference on tests with higher g loadings) leaves open the possibility that the observed sex difference lies partly or wholly in non-g factors. To understand this, one needs to know that while most of the variance in I.Q. test scores is accounted for by the g factor (typically 60 to 70%), other portions of the variance are due to so-called "group factors" like the verbal, numerical, and spatial factors. The observed sex difference may lie in group factors as well as, or instead of, in g.
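
As a rough guide to how a g loading relates to a share of variance: in a standard factor model with standardized scores, the squared loading of a test on a factor is the proportion of that test's variance attributed to the factor. The sketch below applies this textbook relation; whether it holds exactly for the estimated loadings in this report is an assumption, and the function name variance_share is hypothetical.

```python
def variance_share(loading: float) -> float:
    """Proportion of a test's score variance attributed to one factor.

    Assumption: standardized scores and a standard factor model, in which
    this proportion equals the squared loading.
    """
    return loading ** 2

# Example loadings chosen for illustration.
for g in (0.77, 0.80, 0.84):
    print(f"g loading {g:.2f}: {variance_share(g):.0%} of variance due to g")
```

The remainder, one minus the squared loading, is what is left for group factors, test-specific variance, and error.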

The conclusions summarized

The male-female participation difference on high-range I.Q. tests is related to g factor loadings, such that females apparently avoid highly g-loaded tests. This phenomenon can logically also explain the low female participation on high-range tests in general (and even the general female under-representation in g-demanding activities, professions, and positions in society at large).
The male-female score difference on high-range tests has no significant correlation with the tests' g factor loadings; there is therefore no evidence that the score difference is a difference in g.
g factor loading and hardness have little if any correlation across tests, so are mostly independent properties; therefore a highly g-loaded test need not be a difficult test and vice versa.
