Raw score trends over the years

Score trends per item type category

For each test, the correlation is computed between the median score per year and the year number. Per item type category, the mean of those correlations is then computed across the tests in that category. Older and newer tests are reported separately to show that score inflation took place mainly in the period 1995-2004 (this was already obvious from a preliminary visual inspection). The tests overlap in time, so no sharp separation can be made, but roughly speaking, the older tests have most of their scores from 1995-2004, and the newer tests from 2005 to the present.
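The procedure just described can be sketched as follows. This is a minimal illustration under assumptions, not the code used for the report: the function names (`pearson`, `yearly_trend`, `category_means`) and the input layout (pairs of year and raw score) are hypothetical, and the example data are invented.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences.

    Raises ZeroDivisionError if either sequence has no variance.
    """
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def yearly_trend(submissions):
    """Correlate the median score per year with the year number.

    `submissions` is an iterable of (year, raw_score) pairs (assumed layout).
    """
    by_year = defaultdict(list)
    for year, score in submissions:
        by_year[year].append(score)
    years = sorted(by_year)
    medians = [median(by_year[y]) for y in years]
    return pearson(years, medians)


def category_means(tests):
    """Mean trend per item type category.

    `tests` is an iterable of (category, submissions) pairs (assumed layout).
    """
    trends = defaultdict(list)
    for category, submissions in tests:
        trends[category].append(yearly_trend(submissions))
    return {category: mean(rs) for category, rs in trends.items()}


# Invented data: one test with rising medians, one with falling medians.
rising = [(2000, 10), (2000, 12), (2001, 13), (2001, 15), (2002, 16), (2002, 18)]
falling = [(2000, 20), (2001, 18), (2002, 16)]
result = category_means([("Spatial", rising), ("Spatial", falling)])
```

With these invented inputs, the two per-test correlations have opposite signs and cancel in the category mean, which illustrates why a category mean near zero need not imply that every test in the category is stable.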

One may speculate that the earlier inflation was caused by the rise of the Internet as a tool for research and fraud. From 2005 on, various methods have been applied to counteract fraud, and these have been effective.

Test type	mean r newer tests	mean r older tests
Heterogeneous	.01	-.07
Verbal (non-analogy)	-.23	.03
Logical	-.21	-.18
Numerical	-.09	.32
Verbal analogy	-.12	.13
Spatial	-.07	.76
Overall means	-.142	.198

In hindsight, almost all of the raw score inflation (positive correlations) took place on homogeneous Spatial, Numerical, and Verbal (analogy) tests before 2005. Other test types have never shown serious inflation, and on the newer tests there is, on the whole, no inflation but instead a steady decline of raw scores, as if people are getting stupider.

Addition November 2024

Another study of score trends was undertaken to examine a possible relation between score trend (robustness) and reliability, using a large number of tests for which these two statistics can be computed directly. In particular, this investigation was prompted by the possibility that score inflation (through answer leakage and fraud) might misleadingly increase reliability by increasing inter-item correlations. Given the nature of these statistics, and considering that "when taken" information is stored at the subtest level, compound tests are necessarily excluded, while subtests that are not tests in their own right are included.

The score trend for a test is computed as the correlation between test scores and the months of test submission (with months numbered consecutively, starting at January 1995 = 1). The results:
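As an illustration of the month numbering and the resulting trend statistic, here is a minimal sketch. It assumes that each submission is available as a (year, month, score) triple and that individual scores, not monthly aggregates, enter the correlation; the report does not spell this out. The names `month_number` and `score_trend` are hypothetical.

```python
from math import sqrt
from statistics import mean


def month_number(year, month):
    """Consecutive month index with January 1995 = 1 (so January 1996 = 13)."""
    return (year - 1995) * 12 + month


def score_trend(submissions):
    """Pearson correlation between score and month number of submission.

    `submissions` is a list of (year, month, score) triples (assumed layout).
    Raises ZeroDivisionError if scores or months have no variance.
    """
    months = [month_number(y, m) for y, m, _ in submissions]
    scores = [s for _, _, s in submissions]
    mean_m, mean_s = mean(months), mean(scores)
    cov = sum((a - mean_m) * (b - mean_s) for a, b in zip(months, scores))
    var_m = sum((a - mean_m) ** 2 for a in months)
    var_s = sum((b - mean_s) ** 2 for b in scores)
    return cov / sqrt(var_m * var_s)


# Invented example: scores rising steadily over three consecutive months.
example = [(1995, 1, 10), (1995, 2, 12), (1995, 3, 14)]
trend = score_trend(example)
```

A positive value then indicates scores rising over time (inflation), a negative value indicates declining scores.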

Correlation reliability versus score trend

r_{reliability × score trend} = -0.21 (n = 67 tests)
Mean of the reliability coefficients = 0.879
Mean of the score trends = 0.010

So, the correlation is weakly negative and not significant, meaning that there is no significant tendency for tests with score inflation to have either higher or lower reliability. That is reassuring and probably means that the traditional reliability coefficient does not become worthless when score inflation is going on; if anything, the negative correlation suggests that reliability tends to be lower where scores have gone up, which is what one should want.
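The report does not state which significance test was used. As one possible check, the sketch below applies the Fisher z-transformation with a normal approximation to r = -0.21 and n = 67; the function name `fisher_p_value` is hypothetical, and a t-test on r would be a common alternative.

```python
from math import atanh, erfc, sqrt


def fisher_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, via the Fisher z-transformation.

    Under H0, atanh(r) * sqrt(n - 3) is approximately standard normal.
    This is an approximation, assumed here for illustration only.
    """
    z = atanh(r) * sqrt(n - 3)
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


p = fisher_p_value(-0.21, 67)
print(f"two-sided p = {p:.3f}")
```

Under this approximation the p-value comes out at roughly 0.09, above the conventional .05 threshold, which agrees with the statement that the correlation is not significant.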

Regarding the mean of the reliabilities, one should keep in mind that subtests are included, which tend to have lower reliability than tests used in their own right (which are intended to have a reliability at or above .9). The mean of the score trends probably indicates mainly the extent to which tests suffer from answer leakage and fraud. When new tests are created or old ones revised, an attempt is made to avoid item types that have suffered from score inflation. This may cause new tests to appear difficult, weird, or daunting at first.

Tests used

(Reliability, score trend, test name)

0.95 -0.14 Test of the Beheaded Man
0.88 0.09 Cartoons of Shock
0.95 -0.13 Cooijmans Intelligence Test - Form 3
0.81 -0.06 A Paranoiac's Torture: Intelligence Test Utilizing Diabolic Exactitude
0.84 -0.64 GliaWeb Raadselachtige Associatie- en Analogieëntest
0.88 -0.15 Lieshout International Mesospheric Intelligence Test
0.89 -0.24 The Nemesis Test
0.98 -0.09 Psychometric Qrosswords
0.93 0.11 Gliaweb Raadselachtig Analogieënproefwerk
0.83 -0.06 The Sargasso Test
0.89 0.19 The Test To End All Tests
0.98 -0.35 De Roskam
0.91 -0.33 Only idiots
0.75 0.10 Reflections In Peroxide
0.89 0.20 Problems In Gentle Slopes of the third degree
0.77 -0.11 Random Feickery (Brandon Feick)
0.81 0.00 The Gate
0.91 0.13 Problems In Gentle Slopes of the second degree
0.89 0.13 Space, Time, and Hyperspace - Revision 2016
0.90 0.11 Cooijmans Intelligence Test - Form 4
0.94 0.47 The Alchemist Test (Anas El Husseini)
0.88 0.14 Verbal section of Test For Genius - Revision 2016
0.88 0.11 The Bonsai Test - Revision 2016
0.92 0.16 Cooijmans Intelligence Test 5
0.92 0.17 The Piper's Test
0.94 -0.05 Dicing with death
0.96 0.30 Divine Psychometry (Matthew Scillitani)
0.94 -0.29 A Relaxing Test (David Miller)
0.92 -0.16 The Smell Test
0.88 0.09 Cartoons of Shock
0.93 0.02 Qoymans Multiple-Choice #5
0.95 0.06 De Laatste Test
0.93 0.09 The Final Test
0.90 0.04 Genius Association Test
0.95 -0.16 Letters
0.96 -0.18 De Golfstroomtest
0.86 0.17 Gliaweb Riddled Intelligence Test - Revision 2011
0.87 0.09 Reason - Revision 2008
0.87 0.07 Verbal section of Test For Genius - Revision 2004
0.86 0.02 Spatial section of Test For Genius - Revision 2004
0.95 -0.11 Words
0.95 0.09 Verbal section of The Marathon Test
0.97 0.09 Numerical section of The Marathon Test
0.98 -0.14 Spatial section of The Marathon Test
0.97 -0.43 Problems In Gentle Slopes of the first degree
0.92 -0.08 Qoymans Multiple-Choice #3 (batch scored by Paul Cooijmans)
0.87 -0.31 Test of Shock and Awe
0.83 0.46 Spatial Insight Test
0.91 0.06 Short Test For Genius
0.80 0.19 Space, Time, and Hyperspace
0.79 -0.12 Numbers
0.87 0.29 Odds
0.75 0.07 Analogies of Long Test For Genius
0.82 0.09 Analogies subtest of Long Test For Genius (Netherlandic)
0.82 0.07 Analogies #1
0.75 0.13 Association subtest of Long Test For Genius
0.94 0.06 Qoymans Multiple-Choice #4
0.87 0.18 Association subtest of Long Test For Genius (Netherlandic)
0.59 0.07 Reason
0.72 0.07 Bonsai Test
0.88 -0.31 Cooijmans Intelligence Test - Form 1
0.85 0.09 Evens
0.91 -0.03 Cooijmans Intelligence Test - Form 2
0.87 -0.02 Analogies subtest of Long Test For Genius (French)
0.84 0.18 Association subtest of Long Test For Genius (French)
0.87 -0.09 The Final Test - Revision 2013
0.68 0.24 Gliaweb Riddled Intelligence Test (old version)
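For readers who want to recompute the correlation from the list above, the following sketch parses lines in the "(Reliability, score trend, test name)" layout and correlates the two numeric columns. The helper names `parse_line` and `pearson` are hypothetical, and only three lines from the list are used here as a sample, so the resulting value is not the reported r.

```python
from math import sqrt


def parse_line(line):
    """Split 'reliability trend name' into (float, float, str)."""
    reliability, trend, name = line.split(maxsplit=2)
    return float(reliability), float(trend), name


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Three lines copied from the list; the full list would be supplied in practice.
sample = [
    "0.95 -0.14 Test of the Beheaded Man",
    "0.98 -0.35 De Roskam",
    "0.59 0.07 Reason",
]
rows = [parse_line(line) for line in sample]
r_sample = pearson([row[0] for row in rows], [row[1] for row in rows])
```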
