
The Hollow Shell of a Test: Comment on the 1995 Technical Report Describing the New Nassau County Police Entrance Examination

Linda S. Gottfredson
University of Delaware

September 17, 1996


The Nassau County technical report purports, in effect, to have found the Holy Grail of personnel selection, namely, a job-related test for a mid- to high-level job (in this case, police officer) that has virtually no disparate impact against blacks. A careful reading shows, however, that disparate impact was effectively eliminated from the County's new police examination by systematically stripping it of cognitive demands that are known to be important on the job.

The test will buy racial balance at the cost of lowering the quality of police officers hired, whatever their race or gender. Blacks, Hispanics, and women who could have passed a more job-related test on the basis of their merit are now more subject to the laws of chance.

Race-Based Test Development and Rescoring

The now-illegal practice of race-norming produced racial balance in employment test scores by converting individuals' raw test scores into percentile ranks, separately by race, which in effect gave all black and Hispanic test takers enough bonus points to boost their average scores to that among whites. The Nassau project took a different and more surreptitious route to racial balance. It appears to have substantially eliminated disparate impact in test scores by adjusting test content and scoring procedures to rig the results. Scores on Nassau's new police test battery need not be artificially boosted for blacks and Hispanics, because only tests on which blacks and Hispanics do about equally well as whites are actually counted toward applicants' test scores, regardless of whether those tests measure crucial job skills.

Both race-based testing procedures lower the quality of the persons hired, but the new procedure has a more devastating effect on workforce quality. Race-norming resulted in hiring less capable minority than majority individuals. However, race-based test rigging will often lower the standards for everyone, because it avoids testing for key job competencies that blacks more often lack. Race-norming was a fairly obvious decision to sacrifice merit in the name of racial parity. Race-based test rigging hides that same decision from public view behind an impenetrable wall of very sophisticated and obscure test development expertise and incomplete reporting of results.

The Nassau report does not describe its strategy so baldly, of course. Indeed, it is long, complicated, exhausting, and confusing reading, even for employment testing experts. Contrary to professional standards and federal Uniform Guidelines requirements, the report does not provide enough data to know or evaluate fully what was done. However, it is precisely what is missing from the report (for example, appropriate data on test scores and validity, comparisons with Nassau's earlier police test) that confirms the suspicion that test development was guided by attention to the race, not the quality, of the applicants who would score well on the test. The concern for disparate impact trumped the concern for validity, which behavior accords with neither good practice nor the law.

The Shocking End Product

There is much that is good in the report. It is the work of a highly knowledgeable team of nationally recognized experts. A glance at its end product is enough to suggest, however, that theirs was expertise bent to a political purpose, in particular, by the U.S. Justice Department, which contributed major funding to the study and hired its own consultant to oversee the work.

The report begins by providing a compelling portrait of the demands of police work as documented by the study's extensive job analysis. To quote just part of the report's summary description (p. 15):

Patrol officers are regularly assigned to deal with a wide variety of complex emergency situations requiring specialized knowledge and training. These may include hazardous material incidents or disasters, child abuse or domestic violence incidents, and hostage or crime-in-progress scenes. In each situation, the assigned officer must call upon both their (sic) training and their knowledge of laws and procedures to provide timely and effective response to the problems he or she encounters. In some cases, an immediate, decisive action on the part of the officer may be required to protect life or property, or to thwart criminal activity.

Good judgment and quick reasoning are clearly critical in police work. Not surprisingly, "reasoning, judgment, and inferential thinking" was the single largest of the 18 categories of "skills, abilities, and personal characteristics" to emerge from the project's job analysis (p. 33). Expert police officers also judged this category to contain the greatest number of "critical" skills (p. 61) and, unlike all but one other skills category, to contain skills critical to all duty areas or "task clusters" (pp. 65-68).

Reasoning, judgment, and inferential thinking represent a very general cognitive ability that turns out to be important in all kinds of mid- to high-level work. This fact is well known in personnel testing. As the report describes (Suppl. App. 4), virtually all large police departments test applicants for "judgment/decision making skills." Not surprisingly, then, all three of the project's centerpiece "video-based situation" exercises, one of the two "paper-and-pencil" cognitive tests, and two of the 20 "personality/temperament measures" in the full experimental test battery were meant to measure reasoning and judgment (pp. 107-110).

Nonetheless, by the end of the project, only one of those measures remained in the test battery--the personality scale "Openness to Experience." Moreover, that scale does not measure the capacity for reasoning and judgment, even according to the project's own definition ("job involvement, commitment, work ethic, and extent to which work is...an important part of the individual's life," App. S). The winnowed test battery that the project recommended for operational use--its "refined" model--consists of eight personality/temperament scales plus truly rock-bottom reading skills (being better than the worst one percent of readers on the police force). What was ascertained to be crucial early in the study--reasoning, judgment, and inferential thinking--had no place in the examination the project finally recommended.

How the Project Created and Justified a Cognitively Empty Test

At best, it seems odd for a test development procedure to exclude precisely that which seems most important to include. However, anyone familiar with the employment testing literature knows that the surest and easiest way to avoid disparate impact is to avoid testing for cognitive ability. Witting or not, that is just what the Nassau County study did.

I highlight below only some of the project decisions that had the effect of stripping cognitive content from the Nassau test battery. Each decision can be questioned on technical grounds. Taken together, they reveal a clear pattern of race-driven choices in test development and scoring.

To preview what is detailed below, the project assembled an experimental battery of 25 measures that had limited promise, at best, for identifying applicants who possess the cognitive skills crucial for police work. Even that limited promise was stripped away as the validity of its already cognitively impoverished cognitive tests was minimized with skewed measures of test validity, allowing the project to eliminate those tests on the basis of their disparate impact. On the other hand, the low-impact non-cognitive tests were retained, despite evidence that their validity in the field might be illusory. In a final crescendo of statistical errors, the validity of the project's now cognitively-denuded test battery was overestimated by 100%, boosting it falsely into a respectable range.

1. Project omitted best cognitive tests. A voluminous literature, which was well known to the project team, shows that traditional cognitive tests are important in predicting performance in many or most jobs. It also shows, however, that such tests have considerable disparate impact. Despite their proven record in measuring key job demands, none was included in the experimental battery. (The one exception, "Map Reading" from the 1987 Nassau police test, was included in the experimental battery explicitly so that it could be used as a benchmark for comparing the "psychometric and validity characteristics" of the new test with the old [p. 91], but--strikingly--no such comparisons were reported.) The report is written as if that literature on cognitive tests simply does not exist. However, the project clearly acted on that unmentioned knowledge, judging from its explicit rejection of traditional cognitive tests due to their disparate impact ("in the interest of minimizing adverse impact," p. 86). The project did not use validity as a criterion for including tests in the experimental battery; rather, it used disparate impact as a criterion for excluding them.

2. Project developed weak substitutes for good cognitive tests. The project opted instead to develop its own "innovative" measures of judgment, reasoning, and reading that would have less disparate impact. While not described in this way, all the new measures limited or eliminated the need to reason, read, or write during the exam. Cognitive content was reduced to reduce disparate impact.

Three of the project's four new tests consisted of scenarios or situations acted out on video. They required no reading or writing of applicants during the examination, except to mark an answer sheet in response to video-administered oral questions. One of the three ("Remembering and Using Information") required applicants to remember written material made available to them for study up to 30 days prior to the exam, material which they then had to apply in answering oral questions about scenes enacted in the video. (The research sample of full-time Nassau County police officers got the materials only one week before taking the exam.)

The fourth new test ("Understanding Written Material") was intended specifically to measure reading comprehension and was administered in standard "paper-and-pencil" form. It required applicants to read short passages during the exam and answer questions about them. However, that material, too, was made available to applicants up to 30 days before the exam and then administered with relaxed time limits (so as "not to penalize accurate, but slow readers"). "Slow" reading (which generally reflects slow comprehension) is widely known to be (negatively) correlated with general cognitive ability (that is, slow readers tend not to be very bright). Moreover, traditional practice is to keep the content of cognitive tests secret and to administer them under identical conditions in order to gauge people's ability levels more accurately. This minimizes the impact of extraneous factors, such as differences in motivation or the amount of help and time available to comprehend the material.

All four of the project's "innovative" tests appear to have been intended to reduce disparate impact by relaxing their cognitive demands. That relaxation succeeded to some extent, because the disparate impact of the two tests for which the project reported disparate impact data was, in fact, only about half that normally expected of cognitive tests (p. 184). But a glass half empty is still a glass half full, and those two tests were in the end eliminated too.

3. Project retained non-cognitive tests despite evidence that they may be less valid for applicants than for the research sample. The other 20 scales in the experimental battery were "non-cognitive" measures of personality and temperament. Personality scales are generally thought to be valid for fewer occupations or at a lower level than are cognitive tests, but interest in them has grown as personnel selection has sought alternatives with less disparate impact than is typical of cognitive tests. Such scales typically have little, if any, disparate impact (which turned out to be true in this study too, p. 184). Many personality tests are available on the market. These particular 20 scales are from two proprietary job selection instruments (the LEAP and the WRAP) belonging to several consultants on the project team.

The project selected the personality and temperament measures because they lack the disparate impact of cognitive tests, not because they are equally valid or job-related, which they are not. The report presents considerable evidence that they have a degree of validity, but is silent about the more impressive evidence for cognitive tests. It also fails to mention the special weaknesses of the former, in particular, the possibility that job applicants may be able to "fake good" on them (unlike on ability tests, where doing well requires the actual ability to do well).

The problem is this. The data for calculating validity come from the research sample, in this case, incumbent patrol officers. These data are used to estimate what the validity will be if the test is then used to test and hire from an applicant pool. The research-based estimate will be appropriate only if the research and applicant groups took the same test under the same conditions. That is not always the case. For example, if more applicants than incumbents cheat on a cognitive test or fake good on a non-cognitive test, then the test's operational validity will be lower than estimated in the research sample. It may be zero.

There is evidence of such problems in this study. Although such a pattern can arise for reasons related to tenure-linked differences in age, education, or motivation, the report never explains the following odd finding. Applicants (who have a strong incentive to fake) got better scores than did incumbent patrol officers (who were assured that their scores would be confidential and used only for research purposes) on all the personality scales retained in the final battery (but especially on "Achievement Motivation," p. 185). In contrast, the applicants did substantially worse on the paper-and-pencil reading test (p. 185), as is normally found for ability measures, even though they had more incentive and three weeks longer than incumbents to study those test materials. (Applicants scored better on the "Situational Judgment" exam, which raises suspicions of widespread cheating on that video-administered "cognitive" exam.)

4. Project created opportunity to guide validation decisions by race. The experimental battery, which was administered to 500 incumbent patrol officers (the research sample) and to 25,000 job applicants, included 25 tests. Fifteen of the tests were eliminated in the first round of validation research, which looked at whether the 25 tests correlated with several dozen highly specific aspects of job performance in the pattern the project hypothesized. The project does not report the validity of any of the individual 25 tests in predicting the composite (summary) performance score, which is the study's crucial criterion measure.

The remaining 10 measures were examined further in the second round of validation research, which assessed how well each of five different statistical equations or models predicted job performance. The five equations contained different subsets of the 10 measures, with four of the five including at least one cognitive test. As already noted, the project chose to recommend a sixth "refined" model, which includes eight personality scales plus a bare minimum score in reading. Like the one purely non-cognitive model, the "refined" one effectively eliminated disparate impact. It also, the report said, had distinctly higher "true" validity than all the others (although we shall see this is impossible).

I describe below some of the questionable procedures used to winnow the experimental battery in such a way as to strip it of any remaining meaningful cognitive content. Before doing so, however, it should be noted that the project provided itself with ample data by which to exclude all test material on which blacks would score less well, but to do so in a manner that would not seem to be guided by race.

Specifically, it administered the experimental battery to the applicant population before the research population on which the validation research was to be conducted. It thus reversed the usual order of administration, and it did so explicitly in order to have disparate impact data available during the validation research. However, disparate impact and test validity can and should always be measured completely independently. It is not unheard of for people to advocate a tradeoff between the two once they are accurately determined (say, by opting for a less valid test to obtain lower disparate impact). However, all would agree that it is entirely inappropriate to engineer or "cook" the validity statistics in order to accomplish the same ends more covertly. Because the project obtained information on race differences in performance before it assessed how predictive the tests were, it was able, if it chose, to do precisely that--to shape the results and reporting of the validity data in order to favor the tests with the least disparate impact. The following actions seem incriminating.

5. Project miscalculated validity to favor non-cognitive tests (and failed to report the required statistics). The validation process began with correlating incumbents' job performance ratings with their scores on the 25 tests in the experimental battery. Such correlations (validities) are the basis for estimating how useful an employment test will be for predicting the job performance of prospective hires. The validities that the project team reported are calculated inappropriately. Moreover, they are miscalculated in a way that surely depressed the apparent validity of the cognitive tests and raised it for the non-cognitive ones. Inadvertently or not, the project thereby stacked the deck against finding the cognitive tests as useful as the non-cognitive ones.

It did so by reporting only "adjusted" (what it called "simple") correlations, not the usual unadjusted "zero-order" correlations which professional standards and the Uniform Guidelines require. The project had observed that performance levels rise with tenure on the job (which is typical), so they argued that tenure should be statistically "partialled" out of the performance ratings. This is not unusual. What is unusual was the project's unexplained decision to partial tenure out of both the predictors and the criterion. It makes sense to assume that experience improves job performance, and therefore to remove the effects of experience on the performance criterion. However, there is ample reason to believe that job experience does not change incumbents' general personality and ability traits, which means there is no justification for "partialling" experience out of these incumbents' scores on the test battery as well.

The import of this strange decision to use doubly "adjusted" validities becomes clear by noting another troubling oddity in the test results. The project reports but does not explain it. Among incumbents, tenure on the job is positively correlated with scores on the two paper-and-pencil cognitive tests but negatively correlated with scores on virtually all the non-cognitive tests (p. 175). That is, more experienced (and better performing) officers scored better than less experienced ones on the cognitive tests but worse on the non-cognitive ones (which, recall, make up the new Nassau test).

Partialling job tenure out of test scores on the predictor battery therefore had the consequence of partialling out cognitive ability to some extent. This would level the cognitive scores somewhat, which in turn could be expected to reduce their capacity to predict job performance. Just as the apparent predictive validity of the cognitive measures is depressed by this means, the apparent predictive validity of the non-cognitive ones is boosted. These adjustments to validity therefore amount to "handicapping" the cognitive tests in demonstrating predictive validity. The impact of this selective handicapping on validities is unclear, however, because the project violates professional standards and the Uniform Guidelines by never reporting the usual unadjusted "zero-order" correlations.
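The direction of this effect can be illustrated with the standard formula for a first-order partial correlation, in which tenure is removed from both the test score and the performance rating. The sketch below is purely illustrative. The correlation values are hypothetical (the report does not supply the zero-order figures), and the function and variable names are invented here. It assumes two tests with the same zero-order validity, one positively and one negatively correlated with tenure.

```python
import math

def partial_corr(r_xy: float, r_xt: float, r_yt: float) -> float:
    """Correlation of x and y after removing t from BOTH variables."""
    return (r_xy - r_xt * r_yt) / math.sqrt((1 - r_xt ** 2) * (1 - r_yt ** 2))

# All values below are hypothetical, chosen only to show the direction of the effect.
R_PERF_TENURE = 0.30   # assumed: performance ratings rise with tenure
ZERO_ORDER = 0.20      # assumed: same unadjusted validity for both tests

# Test scores positively correlated with tenure (the pattern reported for cognitive tests)
cognitive_adj = partial_corr(ZERO_ORDER, 0.20, R_PERF_TENURE)
# Test scores negatively correlated with tenure (the pattern reported for non-cognitive tests)
noncognitive_adj = partial_corr(ZERO_ORDER, -0.20, R_PERF_TENURE)

print(f"zero-order validity (both tests): {ZERO_ORDER:.2f}")
print(f"adjusted, cognitive test:         {cognitive_adj:.2f}")
print(f"adjusted, non-cognitive test:     {noncognitive_adj:.2f}")
```

Under these assumed values, the adjusted figure falls below the zero-order validity for the test that correlates positively with tenure and rises above it for the test that correlates negatively. Without the zero-order correlations, a reader cannot tell how large this shift was in the actual data.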

We may suspect, however, that this handicapping helps to account for why the four prediction models (scoring systems) that include both the non-cognitive and (admittedly impoverished) cognitive tests appeared to predict job performance only marginally better than the ones including only non-cognitive measures. (In contrast, the Army, in its big "Project A", found just the opposite pattern in predicting proficiency among its police--non-cognitive tests add virtually nothing to the [high] predictive validity of its [better] cognitive tests.) The report therefore concluded that the five models have "nearly identical validity" (p. 135), implying that all the project's cognitive tests could be ignored with virtually no effect on the quality of subsequent hires. This opened the way for eliminating them due to their higher disparate impact.

In short, the inappropriate partialling procedure allowed the project to capitalize on an oddity in the data to suppress the apparent value of its already cognitively-impoverished "cognitive" tests. That oddity is troubling in itself, however, and should have acted as a red flag to stop all analyses until it could be explained. As already noted, the higher non-cognitive scores for applicants than for experienced police officers raise the possibility that many of the applicants "faked good" on those scales. This would mean, in turn, that the test validities calculated on the incumbent sample cannot be applied to the applicant population. In a word, the non-cognitive tests may have no validity in practice.

The report does not provide the unadjusted correlations for the tests in its experimental battery, and it provides no data at all on the validity of the 1987 battery the new one is meant to replace. It is thus impossible to know what the experimental test battery validities truly are, either in absolute terms or relative to each other. It is thus also impossible to ascertain whether the new Nassau test did in fact "maintain validity" (as the Nassau County consent decree requires) while essentially eliminating disparate impact. The earlier test contained a test of reasoning (no doubt accounting for its higher disparate impact), so it is not unreasonable to assume that it also had higher validity as well (because more valid cognitive tests tend to have more disparate impact). But the report does not allow us to know. The report makes quite clear what the pattern of disparate impact is across the different tests; in contrast, it obscures the pattern of validities.

6. Project overstated validity of cognitively denuded test battery (used wrong "shrinkage" corrections). In addition to using inappropriate "simple" validity coefficients, the project also grossly overestimated how valid five combinations of tests ("basic models") are. As other reviewers have explained (in detail that I will not repeat here), the project made two mistakes. One was that it used the wrong statistical formula to "shrink" the five models' estimates of validity in the research population to compensate for their capitalization on chance factors in the data. The second mistake was in shrinking the wrong validities (for all five models, it used the same higher validity of another model using all 25 tests). Reviewers have estimated that the project's two errors together had the effect of inflating the "true" validities of the five equations by 100% (in the end, yielding a false value of .30 rather than a more accurate .15). Thus, even if one insists that the doubly adjusted validities are appropriate, the true validity of the new Nassau battery is much lower than claimed--and probably lower than both the earlier one it replaced in Nassau County and those which the Justice Department has started pressing other jurisdictions to replace.
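To make the idea of "shrinkage" concrete, the sketch below applies one widely used adjustment, the Wherry-type formula for an adjusted multiple correlation. It is not offered as the formula the project used or the one the reviewers recommended, since neither is specified here. The sample size of 500 is the research sample described earlier; the observed validities and the number of predictors are hypothetical, and the function name is invented for illustration. The sketch also shows why shrinking a higher validity borrowed from a larger model overstates the result, and it checks the arithmetic behind the "100%" figure.

```python
import math

def shrunken_r(r_observed: float, n: int, k: int) -> float:
    """Wherry-type adjusted multiple correlation for k predictors and n cases."""
    adj_r2 = 1 - (1 - r_observed ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adj_r2, 0.0))  # floor at zero if the adjustment goes negative

N = 500              # incumbents in the research sample (from the report)
R_OWN = 0.25         # hypothetical observed validity of a 10-test model
R_BORROWED = 0.35    # hypothetical observed validity of the 25-test model

own = shrunken_r(R_OWN, N, 10)            # shrink the model's own observed validity
borrowed = shrunken_r(R_BORROWED, N, 10)  # shrink a validity borrowed from the larger model

# Inflation implied by the reviewers' figures: .30 claimed versus .15 estimated.
inflation = (0.30 - 0.15) / 0.15

print(f"shrunken from own validity:      {own:.3f}")
print(f"shrunken from borrowed validity: {borrowed:.3f}")
print(f"inflation of .30 over .15:       {inflation:.0%}")
```

Whatever formula is used, starting from a validity that belongs to a different, larger model carries that model's extra capitalization on chance into the final estimate.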

7. Project overstated, even further, the validity of its most favored (re)scoring system (made a mistake in correcting for "restriction in range"). The project did one final required correction (for "restriction in range") on the validities calculated from the research sample so that they would estimate more accurately the validity to be expected in the more heterogeneous applicant group. The project estimated the true validities of all the "basic" models (test rescoring systems) to be about .30, regardless of whether its cognitive tests were included with the non-cognitive ones. In contrast, it estimated that the true validity was .35 for the "refined" model (non-cognitive plus the first percentile cutoff for reading) it recommended for operational use. This validity is substantially higher than that for any of the three models that use the very same reading scores but which do not collapse them into two much less informative categories (as did the refined model). It is statistically impossible for the less efficient use of the same reading scores in the same population to result in higher true validities. The project team apparently did not question its good fortune in finding this startling superiority for its favored model.
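The cost of collapsing a score into two categories can be quantified under textbook assumptions. If a test score and job performance are bivariate normal with a linear relation (an assumption made here for illustration, not something the report establishes), a pass-fail split with proportion p falling below the cut multiplies the score's correlation with performance by pdf(z) / sqrt(p(1 - p)), where z is the standard normal quantile at p. The sketch computes that factor for a first-percentile cut and, for comparison, a median split. The continuous validity of .20 is hypothetical, and the function name is invented for illustration.

```python
from statistics import NormalDist
import math

def dichotomization_factor(p_below: float) -> float:
    """Fraction of a correlation retained after a pass-fail split (bivariate normal case)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(p_below)  # cut point on the standard normal scale
    return std_normal.pdf(z) / math.sqrt(p_below * (1 - p_below))

R_CONTINUOUS = 0.20  # hypothetical validity of the full-range reading score

for label, p in [("first-percentile cut", 0.01), ("median split", 0.50)]:
    factor = dichotomization_factor(p)
    print(f"{label}: retains {factor:.0%} of validity, r = {R_CONTINUOUS * factor:.3f}")
```

On these assumptions, a first-percentile cut keeps only roughly a quarter of the score's predictive correlation, considerably less than even a median split would keep.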

Some reviewers have suggested that the project must have made a computational error. Another possibility is that it used the wrong procedures in correcting for restriction in range. Once again, however, the report does not provide enough information to know what procedures the project used.
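For reference, one standard correction for direct restriction in range (often called Thorndike's Case II) is sketched below. Whether the project used this or some other procedure cannot be determined from the report, so the sketch is offered only to show what such a correction does. The input values are hypothetical and the function name is invented for illustration. Here u is the ratio of the standard deviation of test scores in the applicant pool to that in the research sample.

```python
import math

def correct_range_restriction(r_restricted: float, u: float) -> float:
    """Thorndike Case II: estimate the unrestricted correlation from a restricted one."""
    return (r_restricted * u) / math.sqrt(1 + r_restricted ** 2 * (u ** 2 - 1))

r_sample = 0.15  # hypothetical validity observed in the incumbent sample

for u in (1.0, 1.25, 1.5):  # hypothetical SD ratios (applicants / incumbents)
    corrected = correct_range_restriction(r_sample, u)
    print(f"u = {u:.2f}: corrected r = {corrected:.3f}")
```

With u = 1 (no restriction) the correlation is unchanged, and larger ratios yield larger corrected values.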

Queasiness at Its Own Rhetoric

Others have rightly pointed out that there is no justification for the project having treated reading scores in a pass-fail manner, let alone setting the pass level so low. I would point instead to the project's motivation in even introducing this standard, as weak as it is. It hints that the project was well aware of the importance of cognitive ability and the consequences of omitting it altogether. The report spends hundreds of pages of text, tables, and appendices building up to its conclusion that no cognitive test, not even its own "innovative" ones, need be included in the new Nassau test battery. But then, as if not convinced by its own rhetoric, the report suddenly and virtually without explanation adds back the faint shadow of one (reading above the first percentile). This is the "refined model" whose obviously inflated validity has just been discussed. The report states simply in its closing text that implementation of the strictly non-cognitive test battery, although having "nearly identical validity" as the other options, "could potentially admit applicants to Police Academy training who would fail in the training program" (p. 139). Adding the minimum reading level, it assures us, "would effectively limit selection of [such] individuals" (p. 140). The report says nothing about whether the test would effectively select highly capable officers with a capacity for "deal[ing] with a wide variety of complex emergency situations requiring specialized knowledge and training," often (as the job description continues) "to protect life or property, or to thwart criminal activity."

Need for Congressional Investigation

The Nassau County report appears to be technical camouflage, purchased at great expense by the U.S. Department of Justice, in order that its Division of Civil Rights might coerce police jurisdictions into what amounts to quota hiring. The issue of how to contend with disparate impact in selection is a vexing matter of social policy that should be debated publicly. It is not a decision to be made behind the scenes by unelected government bureaucrats and enforced via intimidation, federal cash, and the misleading of District Court judges, as seems to be the case here. And it certainly should not be decided in a way that threatens the public safety, as lowered police hiring standards are bound to do. A Congressional investigation could get at the truth of this important matter.

A Congressional inquiry might also consider whether race-based construction and scoring of employment tests is, or ought to be, illegal. Does excluding job-related tests on which whites score better, precisely because they do so, constitute intentional discrimination? Is it, or should it be, illegal to rescore an examination, after the fact, in order to reduce the percentage of high scorers among whites relative to blacks, either when that causes test validity (properly estimated) to drop or when evidence concerning validity remains unreported? Legal or not, should the Justice Department be underwriting and promulgating any particular employment test--especially one whose construction and scoring was (mis)shaped by racial considerations? Also, it ought to be determined how this kind of standards-lowering comports with the standards-raising goals of the C.O.P.S. and Police Corps programs and with the President's and Attorney General's pronouncements on the subject.