
VACUOUS DEFENSE OF A HOLLOW TEST: COMMENTARY ON THE 1994 NASSAU COUNTY POLICE EXAM

Linda S. Gottfredson
College of Education
University of Delaware
Newark, DE 19716
(302) 831-1650
Fax (302) 831-6058
gottfred@udel.edu


A paper presented at the annual meeting of the Society for Industrial and Organizational Psychology, in the symposium "Police Selection in Nassau County: Validity and Demographic Diversity." St. Louis, April 12, 1997.

You're all well aware of the testing dilemma that bedevils personnel selection. On the one hand, there's a very large racial gap in job-related cognitive skills that makes much disparate impact inevitable. The Department of Education reports, for example, that 25% of white adults but over 75% of black adults have such poor functional literacy that they're "not likely to be able to perform the complex literacy tasks that...[are] important for competing successfully" in our economy (cited in Gottfredson, 1997).

As shown in Table 1, people at this literacy level--levels 1 and 2 in the National Adult Literacy Survey--can't routinely perform tasks any more difficult than locating the expiration date on a driver's license or an intersection on a map. Surely we want police officers who are able to handle more complexity than that, which means that we must expect disparate impact in hiring them.

On the other hand, we have EEO laws and regulations that define disparate impact as evidence of illegal discrimination. This might not be so bad, except that we also have a Justice Department that defines a perfect test as one that has no impact and therefore treats job-related cognitive tests as impediments rather than contributions to fair hiring. For each of Nassau County's two prior police exams, Justice allowed plaintiffs to opportunistically ransack and disaggregate the validation data in order to make the cognitive tests' criterion validity seem to disappear so that plaintiffs could rescore them to reduce impact. David Jones knows this well. He was involved in producing both those exams.

Sackett and Wilk (1994), among others, have shown that to satisfy the four-fifths disparate impact rule, you have to get the mean black-white difference on a predictor battery down to .1 to .2 standard deviations for selection ratios of 10-50%. However, as you may realize, you cannot expect to narrow the racial gap in scores to that extent unless you eliminate most of your test battery's cognitive component.
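The arithmetic behind that range can be sketched as follows. This is a minimal illustration, not Sackett and Wilk's own computation. It assumes normally distributed predictor scores with equal variances in both groups, a single top-down cutoff, and a selection ratio defined as the higher-scoring group's pass rate. The function name adverse_impact_ratio is hypothetical.

```python
from statistics import NormalDist


def adverse_impact_ratio(d: float, majority_selection_ratio: float) -> float:
    """Ratio of the lower-scoring group's selection rate to the higher's.

    Assumptions (not from the source): both groups' scores are normal with
    unit variance, their means differ by d standard deviations, and everyone
    at or above one common cutoff is selected.
    """
    standard_normal = NormalDist()
    # Cutoff that passes the stated share of the higher-scoring group.
    cutoff = standard_normal.inv_cdf(1.0 - majority_selection_ratio)
    # Lower-scoring group: same shape, mean shifted down by d.
    minority_rate = 1.0 - NormalDist(mu=-d, sigma=1.0).cdf(cutoff)
    return minority_rate / majority_selection_ratio


if __name__ == "__main__":
    for d in (0.1, 0.2, 0.5, 1.0):
        for sr in (0.10, 0.50):
            ratio = adverse_impact_ratio(d, sr)
            verdict = "meets" if ratio >= 0.8 else "fails"
            print(f"d={d:.1f}  selection ratio={sr:.0%}  "
                  f"impact ratio={ratio:.2f}  ({verdict} four-fifths)")
```

Under these assumptions a gap of 0.2 SD meets the four-fifths rule at a 50% selection ratio but not at 10%, while a gap of 0.1 SD meets it at both, which is in line with the range quoted above.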

And that is precisely what the Nassau consultants did. In fact, their exam came close enough to meeting the four-fifths criterion that the Justice Department lawyer testified that theirs was "the closest ['to a perfect exam, vis-a-vis disparate impact'] that I've seen in my years of practice." He's the same lawyer who has tried to get other test developers to cut back the valid cognitive component of their police tests.

The Nassau project's job analysis had unambiguously reconfirmed that cognitive skills are critical for good police work, and the civil service told candidates that the exam would indeed test for such skills. Nonetheless, candidates later learned that all that actually counted toward their scores were 8 personality scales and being able to read as well as the worst one percent of readers in the validation sample.

The Nassau project has defended the virtual elimination of its battery's cognitive component by arguing that it simply heeded the criterion-related validities of the component tests. However, as their recently released bit of data shows, the zero-order validities of their personality scales were no better on average than were those for their cognitive tests (.05). And as David Jones told you last year at SIOP, a battery should make sense in terms of the job analysis data (that is, its content validity), not just the criterion data. This test does not.

I'll say a bit about how the project stripped its battery of crucial cognitive content, but I'd first note that I'm hardly the only critic of the Nassau test. Frank Schmidt (1996a), for example, wrote in the Wall Street Journal that the test is "intellectually dishonest" and will "be a disaster" wherever it's used. I too believe that Nassau County faces the specter of having armed incompetents, black and white, patrolling its streets.

Now, if you want to limit the cognitive content of a predictor battery, the first step--one taken in Nassau County--is to exclude from your validation study all traditional cognitive tests and all highly g-loaded performance criteria, such as success in training and job knowledge. That's not hard to figure out. It does take some skill, however, to justify excluding those measures, because an enormous literature shows that all three are critical precursors to good job performance. Rather than confront that literature, the project's technical report simply ignores it and then creates the impression that we should have grave doubts about the value of cognitive tests due to their paper-and-pencil format. Not surprisingly, the report expresses no such concern about the paper-and-pencil format of the project's personality tests.

Instead of using any traditional cognitive tests, the project created weaker video tests of cognitive ability which it extolled as "innovative" for their not requiring any reading or writing but otherwise bearing greater superficial resemblance to job duties. I'd note, however, that it's impossible to know much about the actual merits of those tests, because the 1995 technical report provides very little of the data that the Uniform Guidelines and professional test standards require, as Table 2 shows for the Uniform Guidelines, APA Test Standards, and SIOP Principles. For example, the technical report provides no zero-order correlations of predictors with any criterion measure; no correlations of any sort of the predictors either with each other or with the composite criterion; no means, standard deviations, or disparate impact data for the 16 tests eliminated from the final battery; and no regression weights for the final battery.

With its 25-part experimental battery in hand, the Nassau project then administered it first--not to the validation sample--but to over 25,000 applicants. The twice-repeated and only reason its technical report gave for this reversal of standard procedure, which the report itself championed as "unique," was that the project wanted to look at the impact data before deciding which tests to keep.

I don't have time to go through all the impact-driven decisions that shaped the Nassau test, but you can see discussions of them on the IPMAAC webpage (www.ipmaac.org/Nassau). Craig Russell (1996) summed them up nicely, however, when he wrote that "we see the authors bending over backwards to eliminate cognitive test remnants from the predictor domain."

It's clear that by administering the experimental exam before validating it, the project had committed itself and Nassau County to a battery with much lower criterion validity than they had expected. The project avoided public embarrassment, however, by making a series of three statistical errors that inflated by over 100% the estimated true validity of its final battery. As Frank Schmidt (1996b) has shown, the project used the wrong shrinkage formula, applied it to the wrong multiple R, and then grossly overcorrected for restriction in range in its most favored regression model. Although the project claims a true validity of .35, Schmidt re-estimates it at closer to .14. At .14, it's not clear that the new test is even as valid as the previous one that Jones had produced in 1987. Once again, however, we can't judge for ourselves because the project's technical report doesn't compare the two tests as required, and Aon (Jones' company) hasn't answered requests for the 1988 technical report.
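To make the two kinds of adjustment concrete, here is a minimal sketch using common textbook formulas: a Wherry-type shrinkage adjustment and the Thorndike Case II correction for direct range restriction. These are assumptions chosen for illustration. This commentary does not say which formulas the project or Schmidt actually used, and every input except the .35 and .14 figures is hypothetical, as are the function names.

```python
import math


def wherry_adjusted_r2(r2: float, n: int, k: int) -> float:
    """Wherry-type shrinkage of a sample R^2 (n cases, k predictors)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def correct_range_restriction(r: float, u: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    u is the unrestricted predictor SD divided by the restricted SD;
    u = 1 means no restriction and leaves r unchanged.
    """
    return r * u / math.sqrt(1.0 + r * r * (u * u - 1.0))


def percent_inflation(claimed: float, reestimated: float) -> float:
    """How far the claimed value exceeds the re-estimate, in percent."""
    return 100.0 * (claimed - reestimated) / reestimated


if __name__ == "__main__":
    # The only figures taken from the text: claimed .35 vs. re-estimated .14.
    print(f"Inflation: {percent_inflation(0.35, 0.14):.0f}%")

    # Hypothetical inputs, for illustration only.
    sample_r2, n_cases, n_predictors = 0.10, 400, 9
    shrunken = wherry_adjusted_r2(sample_r2, n_cases, n_predictors)
    print(f"Sample R^2 {sample_r2:.2f} shrinks to {shrunken:.3f}")

    # The larger the assumed SD ratio, the larger the corrected validity,
    # so overstating u overstates the result.
    observed_r = 0.15
    for u in (1.0, 1.5, 2.0):
        corrected = correct_range_restriction(observed_r, u)
        print(f"u={u:.1f}: corrected r = {corrected:.3f}")
```

The first printed line reproduces the arithmetic in the text: .35 is 150% above .14, hence "over 100%." The loop shows why the size of the assumed restriction matters so much, since the corrected validity rises steadily with the assumed SD ratio.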

The project's response to criticism has been as disturbing as its technical report. Its replies to date (see www.ipmaac.org/nassau/) give more ad hominem commentary than straight answers. It refuses to debate its critics unless, like today, it has virtual veto power over format and who speaks. It has offered to provide the missing data but doesn't actually do so, while suggesting that I would have gotten it long ago had I only made "a simple phone call" before making my criticism public. It has admitted making the one statistical error it could hardly deny (using the wrong shrinkage formula), considering that one of its members had published articles on avoiding that error. But the project has diverted attention from its other errors by showering us with irrelevancies and shifting, post hoc rationales that collapse upon inspection.

After hearing the talks today, I would add that the project members don't read rebuttals. Neal Schmitt just said that, besides being appropriate, the tenure adjustment made no difference anyway. However, the adjustment does indeed make a difference, as my earlier reply to the project had shown with the very same data that he exhibits today. The two cognitive and eight non-cognitive tests that were tried out for the final battery started out with equal average validities (.08), but when tenure was partialled out of the criterion, and then out of both the criterion and the predictors, the validities for the cognitive tests fell and those for the non-cognitive tests rose. The tenure adjustments produced validities for the non-cognitive tests that ended up, respectively, 27% (criterion adjustment only) and 35% (both predictors and criterion adjusted) greater than those of the cognitive tests.
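How partialling out tenure can push two equally valid tests in opposite directions can be sketched with the standard semipartial and partial correlation formulas. This is an illustration only. The starting validity of .08 comes from the text, but the tenure correlations below are hypothetical and are not the project's data, and the function names are my own.

```python
import math


def semipartial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x with y after z is removed from y only.

    Here x = predictor, y = criterion, z = tenure.
    """
    return (r_xy - r_xz * r_yz) / math.sqrt(1.0 - r_yz ** 2)


def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x with y after z is removed from both x and y."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


if __name__ == "__main__":
    zero_order = 0.08          # equal starting validity, from the text
    tenure_criterion = 0.30    # hypothetical
    # Hypothetical: one test correlates positively with tenure, one negatively.
    for label, test_tenure in (("test A", 0.10), ("test B", -0.10)):
        sp = semipartial_r(zero_order, test_tenure, tenure_criterion)
        pr = partial_r(zero_order, test_tenure, tenure_criterion)
        print(f"{label}: zero-order {zero_order:.2f}, "
              f"criterion-adjusted {sp:.3f}, both-adjusted {pr:.3f}")
```

With these made-up inputs, the test that correlates positively with tenure loses validity under either adjustment and the one that correlates negatively gains it, so identical zero-order validities can diverge once tenure is removed.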

In conclusion, the project's exam, technical report, and responses to criticism all suggest that test development in Nassau County was bent to the Justice Department's political will. Which brings me to what should concern SIOP most--the Justice Department. We should debate how this organization and its members can best protect themselves and their clients from Justice's much-flexed power to intimidate and corrupt. The seriousness of the matter is illustrated by Justice's apparent willingness to sacrifice public safety for racial balance in police hiring.

References

Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24, 77-130.

Russell, C. J. (1996). The Nassau County police case: Impressions. University of Oklahoma. Available at www.ipmaac.org/nassau

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.

Schmidt, F. L. (1996a, December 10). New police test will be a disaster. Wall Street Journal, A23.

Schmidt, F. L. (1996b). Some comments on the Nassau County police validity study. University of Iowa. Available at www.ipmaac.org/nassau


Table 1
National Adult Literacy Survey

LEVEL   % WHITES   % BLACKS   SAMPLE TASKS

  5         4          0      Summarize 2 ways lawyers challenge jurors
                              Calculate cost of carpet for a room

  4        21          4      Restate argument from long news article
                              Calculate money to raise child from info
                                 in article

  3        36         21      Write brief letter explaining billing error
                              Use flight schedule to plan travel

  2        25         37      Locate intersection on street map
                              Enter info on application for SS card

  1        14         38      Locate expiration date on driver's license
                              Total a bank deposit entry


Table 2
Major Test Development and Documentation Standards Not Met by Technical Report for Nassau County Exam

________________________________________________________________

Information required by the federal government's Uniform
Guidelines (Equal Employment Opportunity Commission et al., 1978)
________________________________________________________________

15.B.2  description of existing selection procedures

     No comparisons of new procedure with old.  Tech report
     refers readers to 1988 report that is not attached.

15.B.8  means and standard deviations

     Not reported for 16 tests winnowed out of experimental
         battery or by race for any test.
     Not reported for any of the trial batteries tested or used.

15.B.8  intercorrelations among predictors and with criteria

     Not reported for either applicants or incumbents.

15.B.8  unadjusted correlation coefficients

     Not reported for any of the 25 tests.

15.B.8  basis for categorization of continuous data

     No basis given for 1st percentile reading cutoff.

15.B.10  weights for different parts of selection procedure

     Regression weights not reported.
________________________________________________________________

Procedures/data/explanations recommended by professional testing
standards
________________________________________________________________

APA Test Standards (AERA/APA/NCME, 1985)

Primary:
1.11  For criterion-related studies, provide basic statistics
      including measures of central tendency and variability,
      relationship, and a description of any marked nonnormality
      of distributions
1.17  When statistical adjustments made, report both the
      unadjusted and adjusted results
6.2   Revalidate test when conditions of test administration
      changed
10.9  Give clear technical basis for any cut score

Secondary:
3.12  Provide evidence from research to justify novel item or
      test formats
3.15  Provide evidence on susceptibility of personality
      measures to faking
__________________________________________________________

SIOP Principles (Society for Industrial and Organizational
Psychology, 1987)
__________________________________________________________

Procedures in Criterion-Related Study:

4c    Test administration procedures in validation research
      must be consistent with those utilized in practice (p. 14)
5d    Regression equations should be adjusted using the
      appropriate shrinkage formula (p. 17)
5e    Criterion-related studies should be evaluated against
      background of relevant research literature (p. 17)

Research reports:

2     Deficiencies in previous selection procedures (p. 29)
9     Summary statistics including means, standard deviations,
      intercorrelations of all variables measured, with
      unadjusted results reported if statistical adjustments made
      (pp. 29-30)
(Summary)  Provide enough detail in technical report to allow
      others to evaluate and replicate the study (p. 31)

Use of Research Results:

12    Take particular care to prevent advantages (such as
      coaching) that were not present during validation effort.
      If present, evaluate their effect on validity (p. 34)