TDAC'S DEFENSE OF ITS NASSAU COUNTY POLICE EXAM MAKES MY POINT
Linda S. Gottfredson
Department of Educational Studies
University of Delaware
Newark, DE 19716
(302) 831-1650
Fax: (302) 831-6058
gottfredson@udel.edu
February 28, 1997
The Technical Development and Advisory Committee (TDAC), which had primary responsibility for developing and validating the 1994 Nassau County police entrance exam, responded in three documents to my criticisms of that project. I comment here on TDAC's three responses (12/31/96 letter submitted to the WALL STREET JOURNAL, 1/4/97 letter to me on HRNET, and 1/12/97 "Response to Criticisms of Nassau County Test Construction and Validation Project" on the IPMAAC web page, http://www.ipmaac.org/nassau/zedeck.html). The three responses come from nine of TDAC's ten members.
I draw below on a 2/6/97 journal manuscript in which I provide a history of the Nassau County test, describe the law and enforcement policies that encourage personnel psychologists to sacrifice validity in order to increase minority hiring, and outline techniques for reducing disparate impact that either lower or artificially cap test validity. That manuscript, "Racially Gerrymandering the Content of Police Tests to Satisfy U.S. Justice Department: A Case Study," is available at the IPMAAC web site (http://www.ipmaac.org/nassau/gottfredson3.html) or from me. The paper has been submitted for a special issue of the journal PSYCHOLOGY, PUBLIC POLICY, AND LAW. It is still in the review process.
PERSONAL CRITICISMS
It is unfortunate that the debate over the Nassau County test has been marred by ad hominem criticism and innuendo. I answer such criticism below in the spirit of getting beyond it, for it only distracts attention from the substantive issues at hand.
INSINUATION: GOTTFREDSON ACTED ON POLITICAL OR FINANCIAL CONSIDERATIONS, NOT PROFESSIONAL ONES.
Contrary to what the U.S. Justice Department has suggested, I was not paid by any test developer to look into this matter. Nor did I act "at the behest" of anyone. I have never had any financial interest in any testing enterprise.
How did I become involved? After the District Court approved the Nassau test for use in Nassau County, the Justice Department began pressing other police departments around the country to switch to it. Psychologists associated with one such department became very concerned with the pressure and asked Frank Erwin, President of Richardson, Bellows, and Henry (RBH), to look at the Nassau technical report. Justice had some time earlier sent Erwin a copy of it for review. (RBH had developed its own police test a decade ago at the request of the Justice Department, which later became disenchanted with Erwin for resisting its pressure to reduce the test's cognitive component.)
In turn, Erwin (whom I had not met before) asked me and several other SIOP members to provide independent evaluations of the report (from which he had expunged all indications of authorship). In addition to studying the report for several months, I looked into the history of litigation, test development, and test use in Nassau County as well as into how the Justice Department was using the new test to intimidate other jurisdictions. I published nothing until I had checked my facts and conclusions with relevant experts, including others who had read the technical report.
My pursuit of the issue should not surprise anyone who knows anything about my professional interests. I have focused in my career on, among other things, the interplay between science and politics, especially as it involves difficult racial issues (for example, "Science and Politics of Race-Norming," AMERICAN PSYCHOLOGIST, 1994, 49, 955-963). Indeed, it was precisely for such work that I was elected a SIOP Fellow in 1994. In my view, the Nassau case illustrates the unreasonable legal and political pressure that personnel psychologists are under to do the impossible--to produce tests of essential skills that do not have adverse impact upon groups who possess fewer of those skills. That the Nassau case turned out to involve some people I respect highly only increased my concern about such pressure.
CRITICISM: GOTTFREDSON FAILED TO QUERY NASSAU CONSULTANTS (TDAC) BEFORE PUBLISHING HER CRITICISMS.
First, professional testing standards and federal guidelines both require that technical reports such as TDAC's provide enough information for others to conduct an independent technical review. The authors claim in their report that they provided it. Scientists and academics, like movie reviewers and food critics, are neither expected nor obliged to contact the individuals whose work they are reviewing. The work is supposed to stand on its own. The authors' complaint amounts to special pleading.
Second, the only protection I could provide my informants was to publish the story before the Justice Department or others could take reprisals against them or their agencies. Providing such protection seemed far more important than extending the authors special courtesies.
COMPLAINT: GOTTFREDSON ENGAGED IN "PUBLIC ATTACK WITH POLITICAL INNUENDO AND LACK OF PROFESSIONALISM [WHICH] ARE VERY UNHEALTHY FOR OUR FIELD" (TDAC, 1/12/97, P. 2).
There is no innuendo involved because I say it explicitly. I believe that the Justice Department has been pursuing its own covert political agenda in the guise of improving merit hiring. I also believe that, wittingly or not, TDAC provided scientific cover for that exercise of political will. TDAC members may feel that I should have evaluated their work more tactfully or less publicly. However, they have yet to demonstrate that the criticisms of their work--or of the Justice Department's use of it--are unwarranted. Attempting to demonize me, as have both a TDAC member and the Justice Department's lawyer in the case, only gives the impression that TDAC prefers to evade rather than confront the criticism.
As for professionalism and health of the field, let us debate what they require. Both are undermined by the Justice Department's relentless pressure on personnel professionals to reduce or restrict test validity. Both would be enhanced by considering how to relieve or resist that pressure to degrade employment testing. One may, of course, legitimately argue that some validity should be sacrificed for greater minority hiring. My point is simply that such political decisions should be debated openly and not be disguised as scientific matters best left to technical experts.
SUBSTANTIVE ISSUES
I take up TDAC's points in the order they are presented in its January 12, 1997, "Response to Criticisms." The headings are TDAC's.
ZERO-ORDER CORRELATIONS BETWEEN PREDICTORS AND CRITERIA NOT PROVIDED IN REPORT.
TDAC protests that it "simply never thought in terms of the zero order correlations" after deciding on the partial correlations. However, those correlations are required by all three major sets of employment testing standards. As Table 2 in the "Gerrymandering" manuscript shows, TDAC's technical report failed to provide many of the most essential data that those guidelines require (e.g., means, standard deviations, and correlations among variables; regression weights for tests in the battery).
TDAC has offered to make such data available "if that is important to one's examination of the[ir] study" (p. 2). I appreciate the offer, so I repeat here my unanswered e-mail and phone requests for such data. Table 2 in the "Gerrymandering" manuscript lists the categories of missing information that readers would find particularly useful. As explained there, that information includes the 1988 technical report for the 1987 Nassau exam, also developed by HRStrategies. TDAC could put the missing data and directions for ordering the 1988 report on IPMAAC's web page.
INAPPROPRIATE USE OF TENURE CORRECTED VALIDITY COEFFICIENTS.
TDAC maintains that it was appropriate to partial tenure out of both predictors and criteria and that, in any case, it made no difference whether it did so or not. TDAC is wrong on both counts.
APPROPRIATENESS OF DOUBLE PARTIALLING. I had written that it was OK for TDAC to partial tenure out of the criteria (that it was "not unusual" to do so) but that there is no justification (and TDAC gave none) for partialling tenure out of the predictors. TDAC (1/12/97) seems to have taken me to task for both statements by arguing that (1) one seldom finds in the literature that tenure has been partialled out of EITHER criteria or predictors and (2) when tenure was controlled out of both in other studies, the validity for cognitive tests went up, not down as TDAC said I "claim[ed] must be the case."
Now, their first rebuttal would seem to weaken rather than strengthen their case. As for the second, I made predictions only for the NASSAU data based on the correlations (taken from the technical report) of tenure with the predictors and criteria in the validation sample. Now that TDAC has released the pertinent validities, we see that the correlations for its two "cognitive" predictors (situational judgment and understanding written materials) did NOT go up. On average, they went down: .08 when no partialling and .075 when partialling tenure out of either the criterion alone or both the criterion and the predictors. But more on that later.
The rationale TDAC offers for its double partialling procedure--that hiring standards have changed--makes no sense. One might (as I suggested in my 9/17/96 "Hollow Shell" analysis) argue that changes in hiring might justify partialling tenure out of the CRITERIA but certainly not out of the predictors too. However, I have since come to believe that partialling tenure out of even the criterion actually OVERcontrols for the effects of experience by partialling out some valid COvariance between cognitive ability and job performance. I explain this in endnote 2 of "Gerrymandering."
Partialling tenure out of the PREDICTORS removes yet more of the valid covariance (i.e., validity) for the Nassau COGNITIVE tests. The reason is that the cognitive tests are POSITIVELY correlated with tenure (later recruits were hired under lower cognitive standards). On the other hand, the partialling procedure could be expected to BOOST the correlations between the criterion and the NON-cognitive predictors because the latter are NEGATIVELY correlated with tenure. In other words, partialling tenure out of the criteria would artificially lower the validities of the Nassau cognitive tests relative to the non-cognitive ones, and partialling tenure out of the predictors too would bias the validities yet further in favor of the personality tests. (See endnote 2 in "Gerrymandering.")
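The direction of these effects can be sketched with the standard first-order formulas for a semipartial correlation (tenure removed from the criterion only) and a partial correlation (tenure removed from both predictor and criterion). The input correlations below are hypothetical, chosen only for illustration; they are not taken from the technical report, and the sketch assumes that the criterion is positively correlated with tenure.

```python
from math import sqrt

def semipartial_r(r_xy, r_xt, r_yt):
    """Correlation of predictor x with criterion y after tenure t is removed from y only."""
    return (r_xy - r_xt * r_yt) / sqrt(1 - r_yt ** 2)

def partial_r(r_xy, r_xt, r_yt):
    """Correlation of x with y after tenure t is removed from both x and y."""
    return (r_xy - r_xt * r_yt) / sqrt((1 - r_xt ** 2) * (1 - r_yt ** 2))

# Hypothetical inputs (assumptions, NOT values from the Nassau report).
R_XY = 0.08   # zero-order validity of the predictor
R_YT = 0.20   # criterion correlated positively with tenure (assumed)
CASES = {
    "predictor correlated +.15 with tenure": 0.15,
    "predictor correlated -.15 with tenure": -0.15,
}

for label, r_xt in CASES.items():
    sp = semipartial_r(R_XY, r_xt, R_YT)
    pr = partial_r(R_XY, r_xt, R_YT)
    print(f"{label}: zero-order {R_XY:.3f}, criterion only {sp:.3f}, both {pr:.3f}")
```

With these made-up inputs, both adjusted coefficients fall below the zero-order value for the predictor that is positively correlated with tenure and rise above it for the predictor that is negatively correlated with tenure. The actual magnitudes depend on the real correlations in the validation sample.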
IMPACT OF DOUBLE PARTIALLING. TDAC argues that its data refute my prediction because "it didn't matter which approach was used." More specifically, (1) "the average validity of the zero-order correlations for non-cognitive measures was .01 less than the two sets of partial correlations" (.01 vs. .02 and .02) and (2) the "average of the two sets of partial correlations were almost exactly the same" (.02). In some sense TDAC is right in claiming that "tenure played a trivial role in the estimate of these validities," because the validities themselves turn out to be so trivial.
However, the more pertinent comparisons concern the ten tests that TDAC retained in trying out alternative batteries. Those comparisons are presented in the table below. The last panel of the table gives the lie to TDAC's claim that partialling had no impact.
                            NO PARTIAL   ONLY CRITERION   BOTH
2 COGNITIVE TESTS
  situational                  .07           .08          .07
  written materials            .09           .07          .08
8 PERSONALITY TESTS
  achievement motiv.           .11           .12          .12
  responsibility               .04           .06          .07
  non-delinquency              .03           .06          .06
  influence                    .11           .11          .12
  adaptability                 .09           .12          .13
  attention to detail          .08           .08          .10
  emotional stability          .06           .08          .08
  openness to experience       .12           .13          .13
AVERAGES
  2 cognitive                  .08           .075         .075
  8 personality                .08           .095         .1015
  all 10                       .08           .092         .096
The cognitive and non-cognitive (personality) tests have equal zero-order validities on the average (.08). However, partialling tenure out of the criterion produces validities that are 27% larger for the latter (.095) than the former (.075). The difference increases to 35% (.1015 vs. .075) when the double partialling is done. Double partialling did tip the scales in favor of the personality tests.
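As a check on the arithmetic, the averages and ratios can be recomputed from the individual validities printed in the table. This is only a sketch: it uses the two-decimal entries as printed, so a recomputed average can differ in the third decimal from the averages shown in the table, which may have been computed from unrounded values.

```python
from statistics import mean

# Validities as printed in the table: (no partialling, criterion only, both).
cognitive = {
    "situational": (0.07, 0.08, 0.07),
    "written materials": (0.09, 0.07, 0.08),
}
personality = {
    "achievement motiv.": (0.11, 0.12, 0.12),
    "responsibility": (0.04, 0.06, 0.07),
    "non-delinquency": (0.03, 0.06, 0.06),
    "influence": (0.11, 0.11, 0.12),
    "adaptability": (0.09, 0.12, 0.13),
    "attention to detail": (0.08, 0.08, 0.10),
    "emotional stability": (0.06, 0.08, 0.08),
    "openness to experience": (0.12, 0.13, 0.13),
}

def column_means(tests):
    """Average each of the three columns across the given tests."""
    return [mean(column) for column in zip(*tests.values())]

cog = column_means(cognitive)
per = column_means(personality)

for label, c, p in zip(("no partialling", "criterion only", "both"), cog, per):
    # The ratio shows how much larger the personality average is than the cognitive one.
    print(f"{label:15s} cognitive {c:.4f}  personality {p:.4f}  ratio {p / c:.2f}")
```

The ratios for the two partialled columns come out near 1.27 and 1.35, which correspond to the 27% and 35% differences cited above.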
However, the most important effect of the double partialling procedure, which no reader could have known without the recently-released data, may have simply been to allow TDAC to eke a bit more validity out of a pathetically weak set of predictors. TDAC says that my criticism about partialling was "much ado about absolutely nothing." However, the phrase would seem to fit better the high praise that TDAC's technical report gave its weak new test battery.
INAPPROPRIATE CROSS-VALIDITY CORRECTIONS.
Schmidt identified three statistical errors in TDAC's estimation of true validities, which I explain in the "Gerrymandering" manuscript. Two of the errors involve corrections for capitalization on chance: using the wrong shrinkage formula and applying it to the wrong multiple R. Together the two errors had the effect of inflating the apparent validity of the different trial batteries by more than 100%.
TDAC's response (1/12/97) is oblique and confusing. However, it seems to concede the first error and deny the second. TDAC's response can be understood better by creating the following table, which illustrates three decisions in using a shrinkage formula: the shrinkage formula, the value to be entered into the formula for the number of predictors, and the multiple R to be shrunk. According to Frank Schmidt, TDAC used the wrong formula (Wherry), the correct number of predictors in the shrinkage formula (25), and the wrong multiple R (for 25 rather than 9 variables). TDAC's result, as seen below, is an estimated shrunken R of .20.
Three decisions
                                    Formula:            R for:      N for:
                        Estimate    Cattlin  Wherry     9     25    9     25
Schmidt's minimum         .05          X                X                 X
Schmidt's maximum         .14          X                      X           X
TDAC's "conservative"     .14          X                      X           X
TDAC's "liberal"          .17          X                      X     X
1995 tech report          .20                   X             X           X
Correcting for TDAC's two errors, Schmidt provides a more accurate result by calculating minimum and maximum estimates and taking the average. He notes that a lower bound estimate of the validity of a battery selected ex post facto can be obtained by entering the appropriate multiple R (for the 9-test battery) into the Cattlin formula. His estimate, shown in the table, is .05. He provides the upper bound estimate by using the multiple R for the full set of 25 tests, which produces an estimate of .14.
TDAC tacitly concedes that it used the wrong formula by producing "revised" estimates using the Cattlin formula. However, TDAC apparently sticks to its second mistake, which allows it to take Schmidt's MAXIMUM estimate (.14) as TDAC's new minimum, or "conservative" estimate. TDAC's second mistake had been to shrink the multiple R for all 25 variables (.30) rather than the much smaller R for the test batteries actually tried out (average R of .228). How does TDAC defend the decision? It says in its 1/12/97 response that it (correctly) used the number 25 in the shrinkage formula (for the number of predictors), so "we reasoned that we should also use the Multiple R that would result from use of the full set of 25 as the starting point for estimating shrinkage." This is a complete non sequitur, as if a superficial parallelism could trump proper statistical reasoning.
TDAC then attempts to recover half the validity it lost in switching (correctly) to the Cattlin formula by now making the third possible mistake, which is to use the number 9 rather than 25 in the shrinkage formula for the number of predictors. This produces TDAC's "liberal" estimate of validity--.17. Why this new self-serving mistake? TDAC argues that it "is conventional in personnel selection applications" to use the lower value for the number of predictors "likely because that is the value printed by various computer programs."
However, errors are no less mistaken merely because they are conventional. The SIOP PRINCIPLES (p. 15) state that "one should not choose a data analysis method simply because the computer package for it is readily available." Moreover, the PRINCIPLES specifically state (p. 17) that "where a smaller number of predictors is selected for use based on sample validity coefficients from a larger number included in the study [as was the case in Nassau County], shrinkage formulas can be used only if the larger number is entered into the formula as the number of predictors." They also refer readers (p. 17) to Cattlin for "the appropriate shrinkage formula."
If anything, TDAC's defense supports my criticisms by illustrating just the sort of illogic, obfuscation, and self-serving technical decisions that characterized its 1995 technical report.
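How much the decisions about the multiple R and the number of predictors matter can be sketched with a shrinkage calculation. The sketch below uses the Wherry adjustment only because its form is simple and widely known; it is not the Cattlin formula that the PRINCIPLES call appropriate, so its outputs are not estimates of the Nassau battery's validity. The sample size is a hypothetical placeholder (the actual validation sample size is not given here); the multiple Rs of .30 and .228 are the ones cited above.

```python
from math import sqrt

def wherry_shrunken_r(multiple_r, n_cases, n_predictors):
    """Wherry-adjusted multiple R: sqrt(1 - (1 - R^2)(N - 1)/(N - k - 1)), floored at 0."""
    r_squared = multiple_r ** 2
    adjusted = 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)
    return sqrt(max(adjusted, 0.0))

N_CASES = 500  # hypothetical sample size, an assumption for illustration only

# (multiple R entered, number of predictors entered)
choices = [
    (0.30, 25),   # R for all 25 tests, 25 predictors entered
    (0.228, 25),  # average R for the batteries tried out, 25 predictors entered
    (0.228, 9),   # same R, but only 9 predictors entered
]

for multiple_r, k in choices:
    shrunk = wherry_shrunken_r(multiple_r, N_CASES, k)
    print(f"R = {multiple_r:.3f}, predictors entered = {k:2d} -> shrunken R = {shrunk:.3f}")
```

Whatever the sample size, entering the larger R or the smaller number of predictors yields a larger shrunken value, which is why each of those choices inflates the apparent validity.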
INAPPROPRIATE CORRECTIONS DUE TO UNRELIABILITY IN THE CRITERION
I never claimed that TDAC made any errors in correcting for unreliability in the criterion.
INAPPROPRIATE CORRECTIONS FOR RESTRICTION IN RANGE
TDAC never explicitly answers Schmidt's and my claim that it made an (again self-serving) error in correcting for restriction in range for its recommended battery. Instead, TDAC creates the impression that Schmidt is mistaken by transmogrifying an accurate observation into a minor irrelevant error which it can then criticize, thus creating the general but false aura that Schmidt is not credible.
According to TDAC, "he based his belief [that there must be some sort of error] on the fact that the battery that included the dichotomized cognitive ability test yielded a larger corrected validity than did the battery that included the continuous version of the same score." It continues, implying a mistake on Schmidt's part: "No such direct comparison was possible with the data presented in that table." Why was no direct comparison possible? According to TDAC, "the battery that included the continuous version of the test also included OTHER tests that were not included in our final battery" (emphasis added).
Now, this last statement actually SUPPORTS, not refutes, Schmidt's point. Batteries that (1) contain fewer of the same tests and (2) score some of them in a less efficient (pass-fail) manner (3) should, for BOTH reasons, have the lower validities. TDAC's technical report reported the opposite and impossible finding, which is what led Schmidt and others to suspect an error. TDAC attempts to defend itself by suggesting (falsely) that Schmidt failed to see that BOTH conditions (1) and (2) held and therefore that his conclusion, (3) above, must be flawed (although either condition alone would suffice to support it).
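Why pass-fail scoring is "less efficient" can be illustrated with the textbook attenuation factor for dichotomizing a normally distributed predictor. This is a minimal sketch that assumes bivariate normality between the test and the criterion; the function name and the example cut points are illustrative and do not come from the technical report.

```python
from math import sqrt
from statistics import NormalDist

def dichotomization_factor(fail_proportion):
    """Factor by which a correlation shrinks when a normal predictor is scored
    pass-fail with the given proportion failing (bivariate normality assumed)."""
    standard_normal = NormalDist()
    z_cut = standard_normal.inv_cdf(fail_proportion)   # cut score in z units
    ordinate = standard_normal.pdf(z_cut)              # normal density at the cut
    return ordinate / sqrt(fail_proportion * (1 - fail_proportion))

for p in (0.50, 0.10, 0.01):
    factor = dichotomization_factor(p)
    print(f"fail proportion {p:.2f}: pass-fail r is about {factor:.2f} x the continuous r")
```

Under these assumptions a split at the median retains roughly 80% of the continuous correlation, and a cut that fails only 1% retains roughly 27%, so a pass-fail version of a test should not be expected to show the higher validity.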
Once again, TDAC has used illogic to distract attention from the main issue. The question remains: Does TDAC agree that it made an error in correcting for restriction in range?
LACK OF DATA ON THE RELATIONSHIP WITH THE MAP READING TEST
TDAC's 1995 technical report says that the Map Reading test from the 1987 Nassau battery was included in 1994 as a "benchmark" against which to compare the new test and applicant groups with the prior ones. However, the report provides no such comparisons. If "considerable analyses of this test were conducted," as TDAC now says, why weren't any of them reported in the 1995 technical report? TDAC's 1/12/97 reply reports one correlation (Map Reading with Written Information), but nothing else. What were the "significant time-related changes in the applicant pool between its initial use in 1987 and its re-use in 1994" that the test was meant to gauge? And, will TDAC make available the 1988 technical report that it cites for the Map Reading test?
OVER-CONCERN FOR ADVERSE IMPACT
We can all agree that disparate impact is a problem. The disagreement comes in how to deal with it. My concern, like Frank Schmidt's and Craig Russell's, was that TDAC made a series of impact-driven technical decisions which sacrificed potential validity to reduce impact. Some of these decisions concerned which kinds of tests to include in the experimental test battery in the first place; others concerned procedures for winnowing, scoring, and validating the tests that were tried out. TDAC's response touches explicitly only on the latter type of decision. I respond only briefly below as these issues are discussed at length in the "Gerrymandering" manuscript. The important point is that all of TDAC's decisions, minor or major, worked in the same direction--to cap or reduce validity in favor of reducing impact. It is this pattern, not any single decision, that most clearly reveals the priority TDAC gave to reducing disparate impact.
WINNOWING PROCEDURE. TDAC lists in its response the three criteria it used to winnow the 25 tests to 10 for final validation trials. As I discuss in the "Gerrymandering" manuscript, it's not clear how this convoluted process prevents or corrects for capitalization on chance, but it's obvious how it would allow bias to enter the test winnowing process.
TESTING APPLICANTS BEFORE WINNOWING AND VALIDATING THE BATTERY. TDAC says in its 1/12/97 reply that it reversed the usual sequence of events (i.e., by testing applicants before validating the test) in order "to ensure that the content of the test was not compromised in any way before the test was given to applicants." However, never once did the report give this rationale for what the report characterized as its "unique" reversal of standard procedure. Instead, it explained--more than once--that the order was reversed because that "would afford noteworthy research advantages with regard to exploring and creating a potentially 'less adverse alternative' selection device" (p. 119). See the "Gerrymandering" manuscript for another quotation to this effect.
FIDELITY. As I note in that manuscript, the technical report emphasizes that TDAC sought tests of high "fidelity" because it thought that they would lower disparate impact at the same time as raising validity. TDAC's 1/12/97 response makes the same point by describing its five putative tests of cognitive ability (only one of which remained, in pass-fail form, in the final battery). As I explain at length in the "Gerrymandering" manuscript, "fidelity" is but one of several unproven hypotheses or "innovations" that TDAC extolled while denigrating traditional cognitive tests of proven value--all in the name of reducing impact.
LITERATURE SUPPORTS MINIMIZING ROLE OF COGNITIVE TESTS IN FAVOR OF PERSONALITY TESTS (WHICH HAS THE EFFECT OF MINIMIZING DISPARATE IMPACT). TDAC also seems to be making the foregoing claim, which is consistent with what it actually did in Nassau County. TDAC presses the point by citing Hirsch, Northrop, and Schmidt (1986), who found very small criterion-related validities for cognitive tests in their meta-analyses of police validation research. However, Schmidt has pointed out that these results are anomalous, because cognitive ability has been shown to be important in other moderate complexity work (but where performance could be more readily observed by supervisors); that cognitive ability is important in police training; and that job performance is contingent on job knowledge, which in turn depends heavily on cognitive ability. TDAC's own job analysis provided ample evidence that complex cognitive skills are important for good police work in Nassau County. One has to wear blinders to ignore all the pertinent data on the importance of cognitive ability in police work. As I note in the "Gerrymandering" manuscript, David Jones lectured SIOP last April on the dangers of just such a constricted view of validity. As he said then, "the touchstone [of validity] is always back to the job analysis. What's in the battery ought to make sense in terms of job coverage, not just the statistics that come out of the...study."
Neither Schmidt nor I are against a "balance" of cognitive and non-cognitive tests in a test battery, as TDAC seems to imply. There is considerable evidence for the predictive validity of various personality traits, as Schmidt himself has shown. The problem with the Nassau battery is that it leaves cognitive ability almost completely out of the balance.
One of the most striking impact-reducing decisions that TDAC made (but which it does not discuss at this point in its reply) was to rescore the Written Materials test pass-fail with the passing level set at the first percentile of the incumbent sample. I deal with that issue further below.
PERSONALITY TESTS ARE FLAWED BECAUSE THEY CAN BE FAKED
Although TDAC had much to say in its technical report about the supposed flaws of valid cognitive tests, it said not a word there about faking despite its being, in TDAC's more recent words, a "long-standing concern about personality measures." The technical report did provide data which it failed to explain but which could be expected to raise concerns about faking: the applicants in Nassau County scored better, often substantially so, than the incumbents on the (fakable) personality tests in its final battery but considerably worse on the (non-fakable) Written Materials test. (See Table 3 in "Gerrymandering.")
As an aside, I infer that more applicants than incumbents passed the final battery because it is dominated by the non-cognitive tests on which applicants outscored incumbents. The technical report does not, however, provide the required data that would either confirm or disconfirm this inference.
TDAC implicitly justifies its failure to mention the faking issue in its technical report by arguing in its 1/12/97 response that (1) recent research has shown that faking does not significantly distort or moderate the criterion-related validities of personality tests, even in applicant samples, and (2) analyses of the impact of social desirability in the Nassau study (none of which are reported or cited in the technical report) showed that faking did not moderate validity (in the incumbent sample). However, neither of these sorts of evidence is compelling. Taking the second type of evidence first, the analyses in Nassau County are for the incumbent sample, who had no incentive at all to "fake good." The applicants had special reason to fake, however, which also casts doubt on the generalizability of the first type of evidence that TDAC cites. Police work pays extraordinarily well in Nassau County. In 1995 the base salary for a patrol officer with two years of experience was $45,512. At the usual rate of increase, the figure would be about $50,000 for two years of experience in 1997. (The police union's 1992-95 contract also shows that personnel benefits are generous.) Many officers earn between $80,000 and $100,000 a year, which is why people of all educational levels (including a fair number of lawyers) avidly seek work as police officers in Nassau County.
TDAC criticizes one conflicting study on faking by noting that it studied students and not applicants. TDAC added that "it should be noted that the title of the...study refers to applicants, rather than to the student sample actually used." This selective finger-pointing could as well be directed back at TDAC itself, for neither the 1990 Hough et al. nor the Nassau research TDAC cites for support included applicants (the "applicants" in the former study were actually just-inducted military recruits).
INAPPROPRIATE USE OF UNDERSTANDING WRITTEN MATERIAL TEST
TDAC defends its incumbent first percentile cutoff for passing the reading comprehension test by arguing that (1) the Nassau County police department requires 32 college credits and many incumbents have more than that (the example TDAC gives is for individuals taking the sergeants exam), (2) incumbent officers "had successfully passed the training academy," (3) dichotomized scores produced zero (i.e., failing) scores for 8% of blacks, 3% of Hispanics, and 2% of others, (4) cognitive tests, dichotomized or not, added little to observed validity, and (5) setting a low minimum reading score greatly reduced disparate impact. Stated another way, TDAC seems to be saying that non-test requirements can make up for lowering the test's own cognitive standard, and, in any case, lowering that standard didn't really hurt test validity but did substantially reduce disparate impact. None of these justifications obviates the fact that the test standards require such cut scores to be justified. TDAC's reply would have been more informative had it provided directly pertinent data, for example, the reading competence level actually represented by the low reading cutoff.
None of TDAC's five claims suffices to justify the low cutoff. As for reason (1), Nassau County can require 32 credits ONLY if they do not have disparate impact. Nor would this requirement assure that all incumbent officers are competent readers, as I describe in the "Gerrymandering" manuscript. Similarly, with regard to (2), passing the training academy does not assure that all officers will be good performers. (I doubt that TDAC would even claim so except in the current circumstance.) Some incumbent officers, including ones with more than 32 college credits, are incompetent at essential duties requiring reading and writing. Nor is it appropriate to expect the training academy to function as a fall-back cognitive selection test. Even if the academy can maintain its standards in the face of a big influx of cognitively weak trainees, the costs it incurs will be high both in financial terms (trainees earn half salary) and in damaged morale among recruits and instructors alike.
I don't see (and TDAC does not explain) the relevance of (3), which relates solely to the percentages of minorities failing the low reading cutoff. Justifications (4) and (5) boil down to the claim that minimizing cognitive demands is a "no-brainer" because even totally eliminating the cognitive tests has little effect on the Nassau test's criterion-related validity. As already discussed, the failure of denuding a test of essential cognitive content to register any effects on validity casts doubt on the credibility of the validity statistics, not of the cognitive demands themselves. The only no-brainer here is the Nassau test.
TDAC concludes by adding another reason that tradeoffs (that is, some sacrifice in validity) are appropriate: "We felt then (and still do) that police departments cannot function effectively in minority neighborhoods when virtually all police officers are white males." There was no mention of this reasoning in the technical report. Indeed, it was outside the stated scope of the study to consider such matters. However, TDAC's reliance on it here only confirms what I have suggested based on TDAC's pattern of impact-driven decisions: it responded to political preferences. Opting for racial representation at the cost of validity is a strictly political decision. It is not the province of test developers and validators to make those political decisions as if they were technical ones.
TDAC might also examine its tacit premise that police force performance is enhanced by greater racial representativeness. From what I have been able to discern, that claim is without empirical foundation but has been popularized by advocates of affirmative action. (If I am wrong, show me the evidence.) There is evidence, however, that "problem-solving" policing enhances life in minority communities--but that new form of policing requires cognitive standards to be maintained or increased. Black communities don't need black police officers. White communities don't need white officers. What all communities need is good police officers.
SOME EXCELLENT CANDIDATES FAILED AND POOR CANDIDATES PASSED
I am well aware of the Taylor-Russell tables and the high rate of false positives and negatives there will be with most tests. My point was that the background checks in Nassau County were revealing what appears to be an UNUSUALLY high rate of obvious selection errors. For example, the county is not used to finding so many borderline illiterates among its top scorers.
I am also aware that the background checks and not the entrance exam are used to identify individuals with arrest records. (I would note that only felony convictions, not previous arrests, are grounds for actually rejecting candidates.) However, it is troubling--especially with the presence of a "non-delinquency" scale in the Nassau test--to hear of an increase in the proportion of top scorers with traits normally considered antithetical to good police work. Would TDAC disagree?
OVERPROMOTION OF THE NASSAU COUNTY TEST
TDAC notes quite correctly that the Nassau test "is not portable" without further research. But did it tell that to the Justice Department, which quickly began trying to intimidate jurisdictions nationwide into adopting it?
Justice has now switched horses to the newest HRStrategies/AON test (developed for the Louisiana state police) in its pursuit of no-impact police tests. That test, developed by half the original TDAC members, was said to be based on insights from the Nassau test. Perhaps so, because the Louisiana test bears scant resemblance to its failed Nassau predecessor. But regardless of whether or not the Louisiana test is any better, it is no more portable (indeed, its validation sample is much smaller). Justice nonetheless seems to be promoting it about as aggressively.
As I note in the "Gerrymandering" manuscript, the Nassau case points up some difficult ethical questions, not all of which have clear answers. For example, TDAC speaks in its response of having worked for "two clients" (Nassau County and the U.S. Justice Department), which, I would note, are actually legal adversaries but with one being infinitely more powerful than the other. What are the ethical issues in working simultaneously for two opposed clients? Court papers actually indicate that about half the TDAC members worked for Justice and half for Nassau. Should consultants in such positions be expected to advocate for the interests of their separate clients? What if, as was the case here, some of the consultants have their own financial interests at stake, for example, by owning tests that TDAC decided to use or that Justice decided to "overpromote"?
CONCLUSION
TDAC concludes by suggesting that it is anxious to make available information to get the truth out but that I, in contrast, have been uninterested in "ascertaining the facts of the case." Let readers judge for themselves who has most assiduously pursued the truth. TDAC worries that I have damaged the field's influence among "policy makers and heads of corporations." I, in contrast, am concerned that TDAC members' reputations and expertise, and by extension the field's too, may have been prostituted to Justice's political interests. I believe that such political exploitation of technical expertise promises to hurt the field (and society) more than does its exposure.
