Raw score trends over the years

© Feb. 2011 Paul Cooijmans

Score trend per item type category

For each test, the correlation between the median score per year and the year number is computed. Per item type category, the weighted mean of those correlations is then computed across the tests in that category.

Test type               r      n
Heterogeneous          -.08    537
Verbal (non-analogy)    .04    434
Logical                 .09    113
Numerical               .26    196
Verbal analogy          .46    337
Spatial                 .55    265
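The computation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the sample scores are hypothetical, and weighting each test by its number of scores is an assumption borrowed from the older version of this report (the text above says only "weighted mean").

```python
from collections import defaultdict
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Fails with ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def trend_for_test(scores):
    """Correlate the median raw score per year with the year number.

    `scores` is a list of (year, raw_score) pairs for a single test.
    """
    by_year = defaultdict(list)
    for year, raw in scores:
        by_year[year].append(raw)
    years = sorted(by_year)
    medians = [median(by_year[y]) for y in years]
    return pearson(years, medians)


def trend_for_category(tests):
    """Weighted mean of the per-test correlations in one category.

    Assumption: each test is weighted by its number of scores.
    """
    total = sum(len(scores) for scores in tests)
    return sum(trend_for_test(s) * len(s) for s in tests) / total


# Hypothetical data: two tests in one category, (year, raw score) pairs.
test_a = [(2005, 10), (2005, 12), (2006, 13), (2006, 15), (2007, 16)]
test_b = [(2005, 20), (2006, 18), (2007, 16)]
print(trend_for_test(test_a), trend_for_test(test_b))
print(trend_for_category([test_a, test_b]))
```

In this made-up example one test rises and the other falls, so the category value lands between the two, pulled toward the test with more scores.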

Considerations

Clearly, heterogeneous (mixed-item) tests are the way to go. Score inflation over time is highest on one-sided spatial or visual-spatial tests, verbal analogy tests, and numerical tests.

Since many of the heterogeneous tests do contain verbal analogies and/or spatial and/or numerical problems, it follows that the mere act of mixing those item types, which are not robust in themselves, makes a test robust; that mere diversity in item types helps to prevent score inflation. One may speculate this has to do with the personality and aptitude profiles of the megalomaniac high-score chasers we all detest so much in the high-I.Q. world; those obsessed with high scores may tend to be disturbed personalities with uneven aptitude profiles (those are known to go together), so that one-sided tests are necessarily their prime targets, and the diversity of skills required by heterogeneous tests scares them off or surpasses their abilities.

Raw score trends over the years (old version)

© Paul Cooijmans 2005

I.Q. tests used in regular psychology have shown a steady rise in raw scores over the 20th century; this rise occurs mostly in the lower I.Q. ranges, is greater on nonverbal than on verbal tests, and seems to be a function of the tests' g loadings (the higher the g loading, the greater the rise).

There is little doubt the rise is real; that is, that it reflects a real increase of the g level in the lower I.Q. ranges, a pulling up, a shrinking of the standard deviation from below. Most likely it is an increase of the environmental component of g, which accounts for about 25% of the variance in I.Q. scores. Its cause is probably the improvement of medical care and nutrition during pregnancy, around birth and in the first few months or years of life, and this improvement has affected the lower social classes (which were extremely poor in the early 20th century in industrial countries) much more than the upper classes.

On the political level, this rise reflects the success of the labour movements in the industrial countries. It belongs to the second phase of the buildup of wealth in such societies; first, a small group of industrialists gets rich, and then a century or so later the wealth spreads out to the rest of the country. I.Q. tests were invented just in time to have their norms messed up by this second phase. The rise in scores is called the "Flynn effect", after its discoverer.

Now that my tests have been in use for ten years, I have the opportunity to see if such a rise also occurs in the high range. Below I report some results; in short, there are trends in my tests, but the causes and mechanisms appear to be different from those that operate on regular I.Q. tests. The trends in high-range tests are not related to the Flynn effect, but are still real and should be cause for occasional renorming, depending on the type of test.

I took those tests that have existed for at least three years and computed for each test the correlation between the average raw score for each year and the year number. I then computed the weighted average (weighted by the number of scores for each test) over all tests as .194, using 977 scores in total.

This is a very low but significant correlation that indicates a very slight but steady rise. More informative are the correlations for each type of test:

Test type                                          Cor. mean score × year taken    # scores
Mixture of item types                              -.199                           196
Logic                                              -.034                           54
Numerical                                           .024                           121
Verbal (other than analogies, e.g. association)     .047                           166
Spatial                                             .185                           199
Verbal analogies                                    .62                            241

Clearly the least affected (unaffected one can safely say) are tests with a mixture of item types. The negative correlation is significant and I interpret it as a "bleed out" effect; these tests are so hard and have such a strong attraction to the very highest scorers that they actually "bleed out" the top of the population in their (the tests') first few years of existence, after which mainly lower scores come in, simply because there are no potential testees left capable of scoring as high as the earlier ones. That is what you get when your tests aim at the summit of cognition. There are not so many people there.

Then come Logic, Numerical and Verbal (other than analogies) tests. They show little change over the years.

Spatial tests suffer a bit, but not much. Purely spatial tests will need to be renormed occasionally I would say.

The true culprits are verbal analogies. I suspect it has to do with the increasing quality of the Internet as a research instrument. The correlation of .62 is probably the rate at which Google is learning to solve verbal analogies. This means tests with only verbal analogies should be renormed quite regularly, and one had better not create new tests with only verbal analogies, and perhaps even break up existing tests and integrate their items into mixed-item tests. Mixed with other item types they do better. It is a pity, because verbal analogies are the item type with the highest g loading.