Response to Criticisms of Nassau County Test Construction and Validation Project

DRAFT

Marvin Dunnette, University of Minnesota
Irwin Goldstein, University of Maryland
Leaetta Hough, The Dunnette Group
David Jones, Aon Consulting, Inc.
James Outtz, Outtz & Associates
Erich Prien, Performance Management Associates
Neal Schmitt, Michigan State University
Bernard Siskin, Center for Forensic Economic Studies
Sheldon Zedeck, CORE Corporation


In an October 24, 1996 Wall Street Journal letter to the editor titled "Racially Gerrymandered Police Tests," Linda Gottfredson described a project designed to develop selection tests for the Nassau County police department. In her letter, she claimed that a deliberate attempt had been made by the test designers to develop tests that would select equal proportions of minority and majority group members, at the expense of validity.

Moreover, Gottfredson claimed that various statistical and analytic errors had been made during the project and that the end result was a test that "removed competence as an advantage, denying job opportunities to talented individuals of all races." She then distributed a longer paper describing some of these "errors" on various E-mail networks. Since that time, Gottfredson has undertaken a great deal of further discussion on E-mail networks. Her comments on the study have ranged from reasonable questions to misunderstandings and simple misrepresentations.

We, the group responsible for the Nassau County work, are aware of the escalation in commentary regarding our study. At the outset, we wish to indicate that Dr. Gottfredson published her comments in the Wall Street Journal and released her discussions on the Internet without ever asking one of us for clarifying data or information. We believe this would have been a normal step and would likely have answered many of her concerns. We believe Gottfredson's approach suggests an interest and agenda beyond one of technical concern.

It is unusual for us to have to take this step to explain this project and our rationale and respond to criticisms of the study, as we have heard them. However, many of our colleagues have asked us to do so because some might assume that silence implies that the criticisms are justified. To the contrary, the criticisms are unjustified. We are also responding because public attack with political innuendo and lack of professionalism are very unhealthy for our field. Hence, this response to Gottfredson, authored jointly by members of our group.

Before beginning our response to the issues, we want to indicate how we became involved in this project. Each of us was invited by Nassau County and the Department of Justice to help resolve a long-standing legal conflict between the parties, to design a new selection procedure, and to conduct a criterion-related validity study. Each of us agreed to participate and we formed as a group, the Technical Development Advisory Committee (TDAC). All members of the TDAC participated in all aspects of the entire project, including job analysis, test development, and validation processes. Decisions concerning the conduct of the work were made at group meetings of the TDAC.

One reason we all agreed to participate in this project was the opportunity to work together to evaluate various approaches which might advance the state of the art. Nassau County cooperated fully in this highly research-oriented endeavor. Thus, for example, we were able to evaluate many more than the usual number of rating scales, which permitted us to make comparisons between task oriented and worker oriented performance rating scales. We were also able to arrange for use of the Nassau Coliseum and Madison Square Garden, which enabled us to design certain tests utilizing video technology, even though well over 25,000 applicants were examined.

It was the opportunity to develop new selection procedures and methods that attracted each of us to work in this assignment. We all knew of the relatively problematic history of law enforcement selection research, as demonstrated by the relatively poor validity of traditional cognitive ability tests reported in the Hirsh, Northrop, and Schmidt (1986) meta-analysis. Our group was large in number because both Nassau County and the Department of Justice sought out individuals with substantial hands-on experience in the design of law enforcement selection procedures.

In her various E-mail broadcasts, Gottfredson has detailed a variety of criticisms. In the following pages, we present the reasons why we believe the majority of this commentary is unsubstantiated, irrelevant, or simply mistaken.

Zero Order Correlations between Predictors and Criteria not Provided in Report

There was no intention to keep these data confidential, or to hide them, as implied. Early in the project, we decided that using partial correlations was appropriate (see the next section). We simply never thought in terms of the zero order correlations later. Zero order correlations for the final composite criterion have now been attached to copies of our technical report, and are also attached to this report. These correlations would have been made available to anyone requesting them. We can also provide the zero order correlations with the various components of the criterion and the intercorrelations of the predictors if that is important to one's examination of the study.

Inappropriate Use of Tenure Corrected Validity Coefficients

Before responding to the claim that correcting validity coefficients for the effects of tenure is inappropriate, it is important to note that multiple criteria, both task-based and trait or dimension-based measures, were assessed in this study. Both peers and superiors provided ratings of these criteria. Because of the high intercorrelations among these different criteria, decisions about individual predictors did not change as a function of the type of rating or source of rating considered.

Gottfredson (E-mail communication) expressed concern that tenure should not have been partialled out of both criterion and predictor in arriving at our estimates of the validity of the Nassau County tests. We have two responses: First, this approach was correct; and, second, it did not matter, since the zero order correlations and partial correlations were nearly identical. The fact that the final report did not include the zero order correlations raised concern that something was "hidden" from interested readers. As stated above, these correlations have been made available to any person who made such a request. If anyone has made a request that has gone unanswered, please see Table 1.

Gottfredson (E-mail communication) maintained that tenure should have been partialled out of the criterion only and that this was standard practice. This is not standard practice. We can find virtually no mention of corrections for tenure in either the criterion or predictor in the extant literature on employment testing. In one meta-analytic study (Schmidt, Hunter, & Outerbridge, 1986) of the relationship among ability, job knowledge, experience, and performance, experience was partialled out of both the "predictor" and "criterion." The zero order correlation between ability and performance was .16 while the partialled correlation was .17. The validity of the cognitive ability measure actually went up, not down, as Gottfredson claims must be the case.

In fact, Schmidt et al (1986) argued that "one can assume that individual differences in job knowledge and (indirectly) performance capabilities as assessed by job sample measures to be more strongly a function of mental ability than when employees have unequal amounts of job experience" (p. 438). Interestingly also, only four studies were included in the Schmidt et al study; if the use of job tenure or experience were commonly partialled out of ability-performance relationships, one would have expected that these authors would have included many more studies in this effort.

Our rationale for partialling tenure out of both predictor and criterion was based upon our knowledge, specific to Nassau County, concerning changes in the use of various screening devices in Nassau County over the period of time (1985 to 1995) during which the officers in our validation sample were hired into the police force. In particular, our concern was that only cognitive ability measures were used in the earlier years in this range. Personality measures were added later in this time frame. In addition, we reasoned that applicants do not have tenure as Nassau County police, hence we believed that tenure should not be a factor in estimating validity. Therefore, tenure was controlled in making estimates of validity; and we believe, correctly so.

Beyond arguments about the appropriateness of partialling tenure out of these relationships, an important question is the impact of this procedure. Gottfredson claimed that using partial correlations would serve to inflate estimates of the validity of non-cognitive measures and deflate estimates of the validity of cognitive tests. Zero order validities and partial correlations for the composite performance rating (standardized and summed across all supervisor ratings) are presented in Table 1. The first column in Table 1 contains the zero order correlations. The second column contains correlations between predictors and criterion for which tenure has been partialled out of the criterion. The last column shows the predictor and criterion correlations from which the effects of tenure are partialled from both the predictor and criterion.
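The three columns described for Table 1 correspond to three standard statistics: the zero order correlation, the semipartial correlation (tenure removed from the criterion only), and the partial correlation (tenure removed from both). The following is a minimal sketch of how each follows from the pairwise correlations among a predictor x, the criterion y, and tenure t. The function names and the input values are illustrative assumptions, not figures from the study.

```python
import math

def semipartial_r(r_xy, r_xt, r_yt):
    """Correlation of x with y after tenure t is removed from the criterion y only."""
    return (r_xy - r_xt * r_yt) / math.sqrt(1.0 - r_yt ** 2)

def partial_r(r_xy, r_xt, r_yt):
    """Correlation of x with y after tenure t is removed from both x and y."""
    return (r_xy - r_xt * r_yt) / math.sqrt((1.0 - r_xt ** 2) * (1.0 - r_yt ** 2))

# Hypothetical inputs (NOT study data): a predictor weakly related to tenure.
r_xy, r_xt, r_yt = 0.16, -0.05, 0.10

zero_order = r_xy
criterion_only = semipartial_r(r_xy, r_xt, r_yt)
both = partial_r(r_xy, r_xt, r_yt)

print(f"zero order:                  {zero_order:.3f}")
print(f"tenure out of criterion:     {criterion_only:.3f}")
print(f"tenure out of both:          {both:.3f}")
```

When tenure is only weakly correlated with the predictor or the criterion, the product r_xt * r_yt is very small and both denominators are close to 1, so the three values nearly coincide. A partial correlation can also exceed the zero order value, as it does with these illustrative inputs.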

As is evident, tenure played a trivial role in the estimate of these validities. The zero order correlations involving cognitive ability were virtually identical to the partial correlations. The average of the zero order correlations for non-cognitive measures was .01 less than the two sets of partial correlations. The averages of the two sets of partial correlations were almost exactly the same. Decisions with respect to the inclusion or exclusion of specific tests would have been unaffected by these differences.

So the conclusions are two. First, we were correct in partialling tenure out of the predictor and criterion to estimate validity. Second, it didn't matter which approach was used. In short, much ado about absolutely nothing. Had Gottfredson made a professional inquiry prior to going to press, she would have reached the same conclusion.

Inappropriate Cross-Validity Corrections

A second Gottfredson concern was use of the Wherry formula as the basis for estimating the cross-validity of the test battery. The Wherry formula is used in most studies to estimate the shrinkage that would occur upon cross-validation of a set of tests, probably because it is the formula used in statistical packages like SPSS and is labelled as the "shrunken R".

Schmidt (unpublished manuscript) noted that the correct formula to use as an estimate of cross-validity is the Cattin (1980) formula. This formula is referred to in the SIOP Principles (1987) and in some textbooks (e.g., Schneider & Schmitt, 1986). The Cattin formula starts with an estimate of the population validity (which the Wherry formula estimates) and shrinks down from that value as a function of the sample size to number of predictors ratio. The Cattin formula usually produces a slightly lower estimate of cross-validity.

When either formula is applied, the critical values in determining the amount of estimated shrinkage are the number of research participants available and the number of predictors used to estimate the predictor weights. The latter is always, to our knowledge, taken as the number of predictors in the final regression equation (i.e., the number of significant predictors). Again, this is likely because that is the value printed by various computer programs.

However, most researchers, including the Nassau County group, examine and discard some of the predictors before they get to the point of doing regression analyses or estimating shrinkage. We controlled for this "taking advantage of chance" factor by hypothesizing a priori the predictors that would be related to each criterion measure. The process by which these judgments were collected and used is described in the technical report.

Partly because we used a composite criterion in the end, we opted to be maximally conservative with respect to the estimate of shrinkage by using 25 as the number of predictors when estimating shrinkage, even though only nine predictors were included in our final recommended test battery. When using the full number of predictors (i.e., 25) to estimate shrinkage, we reasoned that we should also use the Multiple R that would result from use of the full set of 25 as the starting point for estimating shrinkage, even though the extra set of 16 predictors yields minimal incremental validity.

Schmidt disagrees with the use of this Multiple R as the starting point and claims instead that we should have used the R associated with only nine predictors. Using the R associated with nine predictors, but using 25 as the number of predictors, Schmidt computes a lower bound estimate of cross-validity of .05. Application of the Cattin formula to our estimate of Multiple R based on all 25 predictors (i.e., .30) yields a value of .14 (as opposed to the Wherry estimate of .20).

If we had used nine as the number of predictors, as is conventional in personnel selection applications, the cross-validated R using Cattin's formula would be .17. So, using the Cattin correction, the range of estimates of cross-validated validity from conservative to liberal would be .14 to .17 from our perspective. Note that using the Cattin correction (or the Wherry correction) with the number of predictors equal to 25 is proceeding in an optimally conservative fashion, as was true of the other corrections we employed.
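The two shrinkage estimates discussed above can be sketched as follows. This assumes the "shrunken R" form of the Wherry formula printed by statistical packages and a commonly cited form of the Cattin (1980) cross-validity estimator. The sample size below is a hypothetical placeholder, since the validation sample size is not stated in this section, so the printed values are illustrative rather than a reproduction of the reported figures.

```python
import math

def wherry_adjusted_r2(r2, n, k):
    """Wherry-type ("shrunken R") estimate of the squared population multiple correlation."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def cattin_cross_validity_r2(r2, n, k):
    """Cattin (1980)-style estimate of the squared cross-validity.

    Starts from the Wherry estimate of the population value and shrinks it
    further as a function of sample size n and number of predictors k.
    """
    rho2 = max(wherry_adjusted_r2(r2, n, k), 0.0)  # guard against negative estimates
    return ((n - k - 3) * rho2 ** 2 + rho2) / ((n - 2 * k - 2) * rho2 + k)

N_HYPOTHETICAL = 500   # assumption for illustration only; not the study's N
R_OBSERVED = 0.30      # Multiple R for the full set of 25 predictors (from the text)
K = 25                 # number of predictors used for the conservative estimate

wherry_r = math.sqrt(wherry_adjusted_r2(R_OBSERVED ** 2, N_HYPOTHETICAL, K))
cattin_r = math.sqrt(cattin_cross_validity_r2(R_OBSERVED ** 2, N_HYPOTHETICAL, K))

print(f"Wherry estimate of population R: {wherry_r:.2f}")
print(f"Cattin estimate of cross-validity: {cattin_r:.2f}")
```

Because the Cattin estimate starts from the Wherry value and then shrinks it again, it is typically the lower of the two when the number of predictors is large relative to the sample.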

Inappropriate Corrections Due to Unreliability in the Criterion

Corrections to the estimate of cross-validated R were then made for attenuation due to unreliability in the performance ratings. The interrater reliability of the composite ratings used as the final criterion in our validity analyses was .63. As is reported in Exhibit 44 of our report, the interrater reliabilities of individual components of the criterion ranged from .22 to .63. Meta-analyses of interrater reliability usually provide estimates of approximately .60 (Rothstein, 1990) after relatively long periods of opportunity to observe. The .60 value is also routinely used as the mean reliability of supervisory ratings criteria by Hunter and colleagues in their meta-analyses to correct observed validity coefficients.

If we erred in estimating the reliability of the criterion, we erred in a direction that would produce a more conservative estimate of the validity of the test battery. Both observed dimensional ratings and meta-analytic values are lower than the estimate we employed. However, the difference between the use of .63 (which was the actual reliability of the composite criterion) and .60 (the meta-analytic estimate) as reliability estimates would be trivial. Applying the correction for attenuation due to unreliability to the shrunken validity coefficients presented in the previous section would yield a range of .18 (.14) to .21 (.17).
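The arithmetic behind the .18 and .21 values can be checked with the standard correction for attenuation due to criterion unreliability, which divides the validity by the square root of the criterion reliability. The sketch below assumes that this standard formula is the one applied. The function name is illustrative.

```python
import math

def correct_for_criterion_unreliability(r, r_yy):
    """Disattenuate a validity coefficient r for criterion reliability r_yy."""
    return r / math.sqrt(r_yy)

# Shrunken validities from the previous section, criterion reliability of .63.
for shrunken in (0.14, 0.17):
    corrected = correct_for_criterion_unreliability(shrunken, 0.63)
    print(f"{shrunken:.2f} -> {corrected:.2f}")
# prints "0.14 -> 0.18" and "0.17 -> 0.21"
```

Substituting a reliability of .60 for .63 changes the divisor only from about .79 to about .77, which is why the choice between the two estimates has little effect on the corrected values.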

Inappropriate Corrections for Restriction in Range

There have also been questions as to the accuracy of the range restriction corrections we employed. Specifically, Schmidt (unpublished manuscript) has questioned what he perceives to be mistakes in our Exhibit 67. He based his belief on the fact that the battery that included the dichotomized cognitive ability test yielded a larger corrected validity than did the battery that included the continuous version of the same score. No such direct comparison was possible with the data presented in that table. The battery that included the continuous version of the test also included other tests that were not included in our final battery.

We provide data below regarding the tradeoff between the various batteries. They show that dichotomization results in loss of information. It will always produce lower validity. Corrections for restriction of range to the revised estimates of validity provided in the last paragraph yielded values of .20 to .23 across the different batteries for the lower bound estimate of validity (i.e., .18) and .24 to .27 for the upper bound estimate of validity (i.e., .21).
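Two of the ideas in this section can be illustrated with textbook formulas. The sketch below assumes the univariate Thorndike Case II correction for direct range restriction and a bivariate-normal model for the effect of dichotomizing a predictor. The correction procedure actually applied to the batteries is not specified here, and the ratio of standard deviations and the cut points are hypothetical.

```python
import math
from statistics import NormalDist

def correct_range_restriction(r, u):
    """Thorndike Case II correction.

    r: validity observed in the restricted (incumbent) group
    u: ratio of unrestricted to restricted predictor standard deviations
    """
    return r * u / math.sqrt(1.0 + r ** 2 * (u ** 2 - 1.0))

def dichotomization_factor(p):
    """Multiplier on a correlation when a normally distributed predictor
    is split into 0/1 at the quantile p (proportion scoring 0)."""
    z = NormalDist().inv_cdf(p)
    return NormalDist().pdf(z) / math.sqrt(p * (1.0 - p))

# Hypothetical SD ratio of 1.2 applied to the lower bound validity of .18.
print(f"corrected validity: {correct_range_restriction(0.18, 1.2):.3f}")

# The factor is below 1 at every cut point, largest at the median.
for p in (0.01, 0.10, 0.50):
    print(f"cut at quantile {p:.2f}: factor {dichotomization_factor(p):.3f}")
```

Under these assumptions the dichotomization factor never reaches 1, which is the formal counterpart of the statement that dichotomization always produces lower validity. The loss grows as the cut point moves toward either tail.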

Lack of Data on the Relationship with the Map Reading Test

The set of tests administered to the validation sample included a test that also had been included in a previous battery of tests evaluated for use in the Nassau County police department (Personnel Designs, Inc. 1988). This test was thought to be cognitive in nature and was used partly as a "marker" of the degree to which the new tests we developed were cognitive in nature. It also was used partly as a basis of ascertaining whether there were significant time-related changes in the applicant pool between its initial use in 1987 and its re-use in 1994.

Gottfredson (E-mail communication) complained because no results of the analysis on this test were reported, but she also complained about the length and incomprehensibility of the report. In point of fact, considerable analyses of this test were conducted. In Table 1, we report the validities (which range from .01 to .03) of the test against the composite criterion.

In addition, multiple regression analyses of the test were conducted in which scores on that test were regressed on the remainder of the battery to determine to what degree the same constructs were assessed. The best correlate of the Reading and Using Maps Test was the Understanding Written Materials Test, which was included in the final Nassau County battery. The correlation between these two tests was .43; corrected for unreliability in both measures, the correlation was .57. The two are not measures of the identical construct, which is not surprising given the substantially different content and format of the two tests, but there is considerable overlap between the two. Both are, we think, similar in many respects to the type of cognitive ability measure Gottfredson has championed in her various commentaries on the Nassau study. Had Gottfredson asked for data regarding the utility of this cognitive predictor (the zero order correlation with the criterion was .03), we would have provided it. Instead, we were presented with allegations about another data cover-up, a consistent theme in her writing.

Over-Concern for Adverse Impact

All members of our work group were concerned with the degree to which the measures we developed displayed subgroup differences. We do not apologize for that concern. A solution to two decades of litigation, as was the experience of Nassau County, would certainly motivate such a concern. At a higher level, we believe professionals in our field have a responsibility to assess skills in a job-related manner which is least likely to adversely impact members of various groups. This remains the law of our land, at least as we interpret it, and is consistent with similar concerns voiced frequently in our professional and scientific publications. It has guided our actions during the majority of all our careers in this field.

We engaged in several activities that we hoped would minimize the adverse impact of our procedures, without lessening their validity and even, hopefully, increasing it. First, we assessed the skills and abilities identified in the job analysis in a manner that closely replicated the manner in which they were used in daily activities on the job. This included work to develop cognitive ability tests that closely mirrored the cognitive demands of the job. It also involved use of audio-visual presentations for situational judgment tests, as well as collection of criterion data from both peers and supervisors, using two different formats in criterion data collection.

We used empirical validities to make our decisions about the utility of tests. We did not rely on adverse impact statistics, as has been alleged. Our decision criteria are spelled out on Page 133 of the report that Gottfredson claims she reviewed. Decisions to consider a predictor for inclusion in the final predictor battery were made on the basis of one of the following conditions: 1) the predictor demonstrated validity for "overall performance," as hypothesized, against one or both of the supervisor summary criteria; 2) the predictor demonstrated validity, as hypothesized, against individual criteria in one or more of the three criterion domains; or 3) the predictor was hypothesized to show validity in only one criterion domain, and the predictor demonstrated validity within that domain. If she read the report, our only conclusion is that she simply doesn't believe us. We do not have audio tapes of our meetings to prove that these criteria were employed. However, the reader can examine the validity data presented in Table 1 to judge whether this was true.

We did administer the test to applicants prior to the time at which we administered the test to incumbents for validation purposes. This was done not to allow us to "select out" tests that displayed high levels of adverse impact as has been implied by Gottfredson (in fact, the battery of tests given to applicants and incumbents was identical), but to ensure that the content of the test was not compromised in any way before the test was given to applicants. Anyone with practical experience who has done large scale test design in a highly competitive situation, such as that in Nassau County, will certainly relate to this concern. Tests were presented to applicants and incumbents with identical time limits, instructions, and presentation media.

What follows is a brief description of the steps we took in developing and selecting tests that are relevant to criticisms by various parties including that of Gottfredson. Our "cognitive ability" tests were oriented around a hypothetical security company. Applicants received a "Policy and Procedures Manual" that they were instructed to commit to memory with the idea that they would need to remember and use these policies in the actual test. These materials were modeled on police procedures in their structure, style, and reading level, but they did not require knowledge of specific police security procedures. The test was used to test applicants' ability to apply principles they would be taught and expected to know without reference to written material. This parallels how police officers are trained in the police academy; they read manuals, listen to lectures or training videos, and are expected to know and apply what they have learned.

A second test, presented in video format, consisted of a series of lectures, about which the applicant was subsequently asked questions. This approach simulated what frequently happens in the police training academy, and in subsequent on-the-job training as learned in our observations of training academy classes.

The third test, the Understanding Written Materials Exercise, was most like a traditional reading comprehension test. The examinee was provided with a section of text and then asked questions about the text. This test was meant to replicate on-the-job situations in which the officer has the time and latitude to consult or study policy or procedures statements prior to taking action.

A fourth test presented in video format a series of situations, along with visually enacted alternative courses of action. The candidate was asked to rate the effectiveness of each alternative course of action. This test could be best characterized as a situational judgment test (Motowidlo, Dunnette, & Carter, 1990). Reading is not required of the participants in the actual on-the-job solution to these problems, and it was not required to answer the items in this test. The scenarios used to develop the situational judgement test came from critical incidents identified by Nassau County police personnel. One critic (Russell, unpublished manuscript) referred to our approach in job analysis and test development as a stamp collecting exercise (Landy, 1986). Quite the contrary--we went well beyond the usual job analysis in trying to measure job-related skills in ways that actually replicated job tasks as closely as possible, given the practical constraints of testing and scoring between 25,000 and 30,000 people at once. Russell offered no alternative.

Gottfredson complained that our approach to assessing cognitive skills "watered down" our assessment of those skills. This is not true. We already presented the correlation (observed and corrected) between the Map Reading Test and the Understanding Written Materials Exercise. Further, in at least one previous study (Hattrup, Schmitt, & Landis, JAP, 1992) comparing tests with face and content validity with more traditional cognitive ability tests (in that case, the Differential Aptitude Test Battery), the constructs assessed in these two modes were found to be virtually identical. The plain simple fact is that cognitive ability was assessed in this test battery. It did not prove to be related to assessed job performance, except in the instance of the Understanding Written Materials Exercise.

Are these largely negative results regarding the validity of cognitive ability in police selection inconsistent with the literature? Gottfredson (E-mail communication) has stated that "anyone familiar with the employment literature" would include a cognitive ability test because of the overwhelming evidence of the validity of cognitive ability tests. Schmidt (unpublished communication) indicated that lack of consideration of the previous literature was a major liability of the Nassau County effort.

However, in the most thorough compilation on the validity of such tests of which we are aware (Hirsh, Northrop, & Schmidt, 1986), the average observed validity for cognitive ability tests across 125 studies of "police and detectives in public service" was .09, ranging from .05 for tests of memory to .13 for tests of quantitative skills. In that context, our validity for the Understanding Written Materials exercise seems quite good and the validities for our other measures of cognitive ability are not at all surprising. They are entirely consistent with the employment literature. Have Gottfredson, Schmidt, and Russell access to a more recent body of unpublished law enforcement validation data for cognitive tests that escaped our review?

In that same meta-analysis, Hirsh, Northrop, and Schmidt (1986) provided information on fifteen studies in which a composite of cognitive ability tests and "human relations" tests were used. The average observed validity across these 15 studies was .145, but no data were provided regarding the validities of the individual tests in that study. In addition, the meta-analysis of personality measures by Barrick and Mount (1991) indicated that conscientiousness (perhaps the personality construct most frequently associated with job performance) had the best validity for the selection of police personnel, as compared to other occupations.

Finally, Hough (1995a), as part of the Nassau County project, as well as in other contexts (Hough, 1994), provided a comprehensive examination of the validity of personality measures. Even Schmidt and his colleagues (Hirsh, Northrop, & Schmidt, 1986), after spending considerable space speculating as to why the correlations between cognitive ability and job performance were so low, noted that one "hypothesis for the low validities associated with job performance is that personality variables or interpersonal skills play a large (emphasis added) role in determining proficiency as a police officer or detective" (p. 417). That is exactly the conclusion our group reached. That is exactly why we broadened our frame of reference beyond cognitive ability. That is one of the reasons why we sought to develop measures of cognitive ability that more accurately reflected job demands than is true for many traditional tests of cognitive ability.

Another hypothesis Hirsh et al. (1986) proposed was that performance of police officers is not always observable by supervisors and that therefore the criterion might be responsible for the low validities of cognitive ability tests in police jobs. Consider that hypothesis in light of our effort to improve upon the performance ratings collected, and the fact that we collected performance ratings from both supervisors and peers. Finally, consider the last sentence in the Hirsh, Northrop, and Schmidt (1986) paper: "we recommend that additional validity studies be conducted on law enforcement occupations...." (p. 419). Was our balanced attention to personality and cognitive ability misguided in light of the research literature on employment testing? We think not. Neither, apparently, did Schmidt in 1986.

Personality Tests are Flawed Because They Can Be Faked

This has been a long-standing concern about personality measures when they are used to make personnel decisions in real-life applicant settings. The preponderance of evidence indicates that while people can distort their responses to personality measures, the impact on criterion-related validity is not significant. Hough, Eaton, Dunnette, Kamp, and McCloy (1990), for example, found that intentional distortion did not moderate the criterion-related validities of personality scales. Similarly, Christiansen, Goffin, Johnston, and Rothstein (1994) and Barrick and Mount (1996) found that, in an applicant setting, the criterion-related validities remain intact.

Ones, Viswesvaran, and Schmidt (1993) conducted a meta-analysis of the validities of integrity measures, both overt and personality-based, and found in predictive validity studies using applicants that the criterion-related validity of personality-based integrity measures is .29 for predicting broad, external criteria; exactly the same as for concurrent validity studies using employees. In a recent meta-analysis of the impact of social desirability on the usefulness of personality scales, Ones, Viswesvaran, and Reiss (in press) conclude that social desirability does not function as a suppressor variable or mediator variable and that removing the effects of social desirability from personality scales leaves the relationships between personality variables and job performance intact.

Hough (1995b) examined the impact of social desirability on the criterion-related validity of the personality scales used in the Nassau County project. She developed three types of score adjustments for each personality scale using an Unlikely Virtues scale. The Unlikely Virtues scale was designed to identify individuals whose responses to items indicating unusual levels of honesty or forthrightness indicate they might be faking. She compared a) zero-order validities, b) multiple regression-based validities, and c) moderated regression-based validities of the raw scale scores and each of the three types of adjusted scores. In the Nassau County study, the Unlikely Virtues scale correlated .11 with overall job performance. Thus the multiple regression-based validities (an individual personality scale combined with the Unlikely Virtues scale) were higher than the zero order correlations of the individual scales. However, the moderated regression validities were no higher than the multiple regression validities. She also found that removing the Unlikely Virtues scale variance from the personality scales reduced the zero-order validities of the personality scales.

In short, social desirability did not moderate the validities of the personality scales for predicting police officer performance in Nassau County. Moreover, removing the Unlikely Virtues variance from the personality scales reduced the criterion-related validity of the personality scales for predicting police officer performance.
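The comparison of zero-order, multiple regression, and moderated regression validities can be sketched as three nested least-squares models, where moderation is indicated only if adding the product term raises R-squared meaningfully beyond the additive model. The data below are synthetic and the effect sizes are arbitrary assumptions. Nothing here reproduces the Hough (1995b) analysis itself.

```python
import random

def ols_r2(columns, y):
    """R^2 from ordinary least squares of y on the given predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [v - f * w for v, w in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): additive effects only, no true interaction.
rng = random.Random(0)
n = 400
scale = [rng.gauss(0, 1) for _ in range(n)]   # a personality scale score
uv = [rng.gauss(0, 1) for _ in range(n)]      # an Unlikely Virtues-type score
perf = [0.2 * s + 0.1 * u + rng.gauss(0, 1) for s, u in zip(scale, uv)]
product = [s * u for s, u in zip(scale, uv)]  # moderator (interaction) term

r2_zero = ols_r2([scale], perf)                    # a) scale alone
r2_multiple = ols_r2([scale, uv], perf)            # b) scale + Unlikely Virtues
r2_moderated = ols_r2([scale, uv, product], perf)  # c) adds the product term

print(f"zero-order R^2: {r2_zero:.4f}")
print(f"multiple R^2:   {r2_multiple:.4f}")
print(f"moderated R^2:  {r2_moderated:.4f}")
```

Because the models are nested, R-squared cannot decrease from a) to b) to c) in the sample. The substantive question is whether the gain from b) to c) is more than trivial, and a negligible gain is the pattern described as "no moderation."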

Recently, Douglas, McDaniel, and Snell (1996) argued that the validity of non-cognitive measures decays when responses are faked. This study, however, involved students, more specifically, students who were directed to fake. As of now, we have no evidence that such results generalize to real-life applicants in real-life selection situations. It should be noted that the title of the Douglas et al. (1996) study refers to applicants, rather than to the student sample actually used.

Inappropriate Use of Understanding Written Materials Test

Gottfredson criticized our dichotomized use of the Understanding Written Materials Test scores. In particular, she criticized our cutoff, which was set at the first percentile of the incumbent officers in our validation study. Our group is well aware of the fact that some of the utility of a test is lost when it is used in dichotomized fashion. However, we were also aware of other facts. First, the Nassau County police department requires 32 semester hours of college credits on entrance to the academy. Moreover, the educational levels of the majority of the incumbent officers are much higher. Among 1209 (of approximately 1700 total) police officers who took a sergeant's promotional examination in 1994, 47.3 percent had college degrees; another 32.3 percent had two or three years of college.

Second, the incumbent officers in our validation study had successfully passed the training academy, arguably a cognitively demanding six months of training. They had all functioned successfully as police officers for between two and ten years.

Third, we knew that use of this test in dichotomized fashion resulted in scores of zero for approximately eight percent of African-American applicants, three percent of Hispanic Americans and two percent of all other applicants, while all others received scores of 1.

What effect does the use of a test in this fashion have on the validity of the test battery? As the Nassau County validation report documents (see Exhibit 69), the "full" set of predictors, validated against one or more major criterion elements, produced an observed multiple correlation of .24.

Removing the two cognitive ability tests (i.e., Situational Judgment and Understanding Written Materials) produced a multiple correlation of .22. Using the Understanding Written Materials Exercise in dichotomized fashion as described above yielded an observed multiple correlation of .23. These are values of the observed multiple correlations; no corrections have been applied to these coefficients. Note also that the reduction in validity from the full battery to our recommended battery was due to the removal of the Situational Judgment Test, as well as to the dichotomization of the Understanding Written Materials Test.

Now, what effect do these various uses of the test battery have on adverse impact? The data in Exhibit 68 of our report reflect a selection ratio of 22 percent, which was the best projection of the proportion of applicants that would be selected if a top-down list of candidates were used to make selections for the life of the list. For the "full" set of predictors, the African-American versus Other adverse impact ratio was .62. For the set of noncognitive tests only, the adverse impact ratio was .82. For the battery we recommended, the adverse impact ratio was .77. The tradeoff is clear. We sacrifice .01 (.24 versus .23) in observed validity for a change in adverse impact ratio of .15 (.77 vs. .62). Contrary to the Gottfredson commentary on this test battery, the final recommended battery approved by the court does not meet the usual "four-fifths" rule. There is, indeed, an adverse impact against African-American applicants.
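The comparison above can be laid out as a short calculation. The following is only an illustrative sketch: it uses the observed multiple correlations and adverse impact ratios quoted in the text, pairs the .22 correlation with the noncognitive-only ratio of .82 on the assumption that both describe the same battery, and takes .80 as the conventional "four-fifths" threshold. The function and variable names are ours and do not come from the validation report.

```python
FOUR_FIFTHS = 0.80  # conventional threshold for the "four-fifths" rule


def adverse_impact_ratio(focal_rate, reference_rate):
    """Selection rate of the focal group divided by that of the reference group."""
    return focal_rate / reference_rate


def meets_four_fifths(ratio, threshold=FOUR_FIFTHS):
    """True when the adverse impact ratio is at or above the threshold."""
    return ratio >= threshold


# Figures quoted in the text: (observed multiple R, adverse impact ratio).
# Pairing .22 with .82 assumes both refer to the noncognitive-only battery.
batteries = {
    "full": (0.24, 0.62),
    "noncognitive only": (0.22, 0.82),
    "recommended": (0.23, 0.77),
}

full_r, full_air = batteries["full"]
for name, (r, air) in batteries.items():
    print(
        f"{name}: R = {r:.2f}, adverse impact ratio = {air:.2f}, "
        f"meets four-fifths: {meets_four_fifths(air)}, "
        f"change vs. full battery: R {r - full_r:+.2f}, ratio {air - full_air:+.2f}"
    )
```

Laid out this way, the recommended battery gives up .01 in observed validity relative to the full battery while moving the ratio by .15, and it still falls short of the .80 threshold.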

In our judgment, the tradeoff outlined above was appropriate. In fact, we believe that the current situation requires it. Why? First, there are a variety of other ways in which cognitively able police officers are selected (i.e., the entry-level educational requirement and the training academy regimen). Second, the validity of the cognitive ability test was not high. This was consistent with other meta-analytic data collected in public sector police jobs (Hirsh, Northrop, & Schmidt, 1986). Third, we felt then (and still do) that police departments cannot function effectively in minority neighborhoods when virtually all police officers are white males. This concern is one of organizational effectiveness rather than individual effectiveness. Our traditional studies, which focus on individual performance, do not consider such issues.

Some Excellent Candidates Failed and Poor Candidates Passed

In her Wall Street Journal piece, Gottfredson cites "close observers" who claim that some of the top scorers have poor academic records as well as outstanding arrest warrants. She claims to know that some poor scorers are "well qualified" as indicated by years of experience as probation officers or cops in other jurisdictions. We could argue as to whether all of these criteria attest to these persons' competence or incompetence.

More importantly, Gottfredson's argument plays only to the uninformed reader and the statistically naive. Those familiar with the employment literature know that in 1939, Taylor and Russell published a set of tables that describe the outcomes achieved with employment tests of a given validity with a particular selection ratio and base rate of success. One minute studying these tables would inform anyone that, with 25,000 applicants, a projected selection ratio of .22, for base rates in the mid range, and a validity in the range we report, the numbers of false positives and false negatives will almost certainly exceed 10,000. Further, even with the most optimistic levels of validity (say, .70), the numbers of false positives and false negatives will be in the thousands. Only when one has perfect knowledge of performance or a test with validity of 1.00 will one be unable to cite the types of examples cited by Gottfredson. Unfortunately, our group worked in a situation that was much less ideal.
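The order of magnitude claimed above can be checked with a rough calculation. The sketch below is not a reproduction of the Taylor and Russell tables; it assumes the bivariate-normal model on which such tables rest, takes a base rate of .50 as one reading of "mid range," and uses .23 as the validity "in the range we report." The function name and the numerical integration scheme are our own choices, and the function assumes a validity strictly between -1 and 1.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def selection_outcomes(validity, selection_ratio, base_rate, n_applicants, steps=4000):
    """Expected decision counts when test score and job performance are
    standard bivariate normal with correlation `validity` (-1 < validity < 1)."""
    x_cut = STD_NORMAL.inv_cdf(1.0 - selection_ratio)  # test score cutoff
    y_cut = STD_NORMAL.inv_cdf(1.0 - base_rate)        # performance cutoff
    resid_sd = sqrt(1.0 - validity ** 2)

    def integrand(x):
        # Density of test score x times P(successful performance | test score x)
        p_success = 1.0 - STD_NORMAL.cdf((y_cut - validity * x) / resid_sd)
        return STD_NORMAL.pdf(x) * p_success

    # Composite Simpson's rule from the cutoff to 8 SD (steps must be even)
    upper = 8.0
    h = (upper - x_cut) / steps
    total = integrand(x_cut) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(x_cut + i * h)
    true_pos = total * h / 3.0  # proportion selected and successful

    proportions = {
        "true_positives": true_pos,
        "false_positives": selection_ratio - true_pos,  # selected, unsuccessful
        "false_negatives": base_rate - true_pos,        # rejected, would succeed
        "true_negatives": 1.0 - selection_ratio - base_rate + true_pos,
    }
    return {name: p * n_applicants for name, p in proportions.items()}


# Applicant pool and selection ratio from the text; the base rate is an assumption.
for r in (0.23, 0.70):
    out = selection_outcomes(r, 0.22, 0.50, 25_000)
    errors = out["false_positives"] + out["false_negatives"]
    print(f"validity {r:.2f}: about {errors:,.0f} misclassified applicants")
```

Under these assumptions, the misclassified count comes out above 10,000 at a validity of .23 and remains in the thousands at .70, in line with the argument above. Different base rates will shift the exact figures.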

In addition, it is always possible that applicants who had previous arrest records will pass typical selection tests. Nassau County, like most other police jurisdictions of which we are aware, conducts extensive background checks prior to the selection of candidates. That is how such candidates are identified. It was never our intent to identify such individuals and, given the background check, it is unnecessary.

Overpromotion of the Nassau County Test

One concern on the part of some is that the Nassau County test has been promoted by the Department of Justice. The TDAC conducted a validity study for two clients, Nassau County and the Department of Justice. We do not speak for or negotiate for the Department of Justice concerning test use in other jurisdictions. All members of the TDAC understand that, even if the Nassau County test worked perfectly in this one situation, it is not portable. Any reading of SIOP's Principles would make it obvious that one validity study rarely makes any test battery totally portable.

In addition, the final outcomes of the Nassau County test make it clear that further test development and further validity studies need to be accomplished. We believe we grappled with difficult circumstances and provided a competently executed study. We also think we made contributions to the knowledge base, especially as regards the use of personality tests. We believe it is also the case that the job analyses and criterion development work accomplished in this project should prove useful in doing further work. But we also wish that, given the amount of time and effort, we could have accomplished more in terms of the level of validity of the final selection battery.

The TDAC as a group no longer exists. We were only invited as a group to work on this one project. A few persons involved in the original TDAC team are attempting to develop new tests based on what was learned in Nassau County. Some commentators have questioned our proprietary interests in the tests developed. In the Nassau County project, we did use a few tests that had been developed preceding the project. In those cases in which individuals or organizations had proprietary interests, they obviously maintained them while allowing us to use their products.

Conclusion

We have attempted to respond to the criticisms of the Nassau County project in detail. We realize that others may have other questions and desire to see data other than those reported in our final report. Because of the high profile this project has attained, we will try to respond positively and quickly, within reason, to questions regarding technical aspects of the study. We hope you will examine the report, ask questions, and consider your criticisms carefully in light of the situation with which we dealt.

We don't think it serves our profession well to air complaints in the newspapers before even commencing a dialogue among colleagues. Various members of our profession and SIOP as an organization have been concerned with our perceived low degree of influence among policy makers and heads of corporations. A letter such as the one that appeared in the Wall Street Journal can destroy, and probably has destroyed, progress we have made in this arena. A simple telephone call would have produced all the information above and more. We feel that there should always be room for debate in our profession over the best procedures to be used and the best techniques for advancing our field. Public attacks regarding the conduct and motivation underlying research, without any attempt to ascertain the facts of the case, are inappropriate and harmful.


Table 1. Zero Order and Partial Correlations between Predictors and Composite Criterion

                                                     Partial Correlation
                            Zero Order          Criterion    Both Predictor &
Predictor                   Correlation (N=505) (N=500)      Criterion (N=500)

Situational Judgment              .07              .08             .07
Remembering/Using Info.           .01              .00             .01
Lrng./Apply Info.                 .04              .05             .05
Understanding Wrt. Mat.*          .09              .07             .08
Reading Maps                      .03              .01             .01
Ach. Motivation*                  .11              .12             .12
Responsibility*                   .04              .06             .07
Non-Delinquency*                  .03              .06             .06
Emotional Control                -.01              .02             .03
Influence*                        .11              .11             .12
Sociability                       .02              .04             .03
Cooperativeness                  -.06             -.03            -.03
Interpersonal Perc.               .02              .02             .02
Adaptability*                     .09              .12             .13
Tolerance                        -.04             -.02            -.02
Fate Control                      .03              .04             .04
Att'n. to Detail*                 .08              .08             .10
Realistic Interests               .03              .02             .03
Practical IQ                     -.02             -.01            -.01
Authoritarianism                  .06              .07             .06
Self esteem                      -.03             -.04            -.04
Emotional Stability*              .06              .08             .08
Agreeableness                    -.07             -.07            -.07
Conscientiousness                -.08             -.10            -.10
Openness*                         .12              .13             .13
Overall                           .02              .03             .03

*Indicates the test was used in the final battery.

References

Barrick, M. R., & Mount, M. K. (1996). Effects of impression management and self-deception on the predictive validity of personality constructs. Journal of Applied Psychology, 81, 262-272.

Cattin, P. (1980). Estimation of the predictive power of a regression model. Journal of Applied Psychology, 65, 407-414.

Christiansen, N. D., Goffin, R. D., Johnston, N. G., & Rothstein, M. G. (1994). Correcting the 16PF for faking: Effects on criterion-related validity and individual hiring decisions. Personnel Psychology, 47, 847-860.

Douglas, E. F., McDaniel, M. A., & Snell, A. F. (1996). The validity of non-cognitive measures decays when applicants fake. Academy of Management Proceedings.

Hattrup, K., Schmitt, N., & Landis, R. S. (1992). Equivalence of constructs measured by job-specific and commercially-available aptitude tests. Journal of Applied Psychology, 77, 298-308.

Hirsh, H. R., Northrop, L. C., & Schmidt, F. L. (1986). Validity generalization results for law enforcement occupations. Personnel Psychology, 39, 399-420.

Hough, L. M. (1994). Personality at work. Presented at Bowling Green Conference on Alternative Selection Procedures. Bowling Green, Ohio.

Hough, L. M. (1995a). Interim Report on Personality Variables. Minneapolis, MN: Personnel Decisions Research Institutes, Inc.

Hough, L. M. (1995b). Applicant self description: Evaluating strategies for reducing distortion. In F. L. Schmidt (Chair), Response Distortion and Social Desirability in Personality Testing for Personnel Selection. Symposium conducted at the 10th Annual Convention of the Society for Industrial and Organizational Psychology, Orlando.

Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. A. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology, 75, 581-595.

Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183-1192.

Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.

Ones, D. S., Viswesvaran, C., & Reiss, A. D. (in press). Social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology.

Personnel Designs, Inc. (1988). Development and validation of a police officer selection testing program. Final report prepared for Nassau County, New York.

Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322-327.

Schmidt, F. L., Hunter, J. E., & Outerbridge, A. N. (1986). Impact of job experience and ability on job knowledge, work sample performance, and supervisory ratings of job performance. Journal of Applied Psychology, 71, 432-439.

Schneider, B., & Schmitt, N. (1986). Staffing organizations. Glenview, IL: Scott-Foresman.

Society for Industrial and Organizational Psychology, Inc. (1987). Principles for the validation and use of selection procedures. College Park, MD: Author.