
SIOP-St. Louis-1997 and The Nassau County Police Test Symposium

Frank W. Erwin


From the perspective of one whose first convention was in 1968 and who's been away from recent conventions for a number of years, the 1997 edition was just exceptional. Sessions and poster reports were excellent. And how different the people were from the group of the 60's and 70's! A disproportionate number of those "old-timers" were hard-drinking, hard-smoking, poker-playing, almost all employed-in-industry white males (attending females were mostly wives). Wine definitely was not the in thing and one could count on a hospitality suite's much more debilitating supplies being depleted more than once before convention's end. Female industrial psychologists were few and far between, but they were carving out the wide space that their successors enjoy today. SIOP today is younger and much, much more diverse and that can be all to the good.

There was, however, one event--an "invited" symposium on the Nassau County Police Test--which wasn't really invited or a symposium at all. It was a one-sided gangup instead of the debate it could and should have been and Linda Gottfredson, who was alone on the dais in terms of position, had been denied the debate she requested, or even a more balanced group. She had asked, for example, that Frank Schmidt be on the panel and even offered to give him half of her time instead of additional time, but those involved in developing the Nassau work threatened to withdraw if he were to be included.

Although Dr. Gottfredson held her own on her view and the view of many other leading professionals on the Nassau project, there have been a number of demonizings and misleading representations made about Dr. Gottfredson and the Nassau project which could have been cleared up had those attending been given a more balanced presentation. Because she has objected to what was done in Nassau, Linda Gottfredson has been called a racist, she has had previous views misrepresented, she has been castigated in terms of funding she's received on other work (as if the Justice Department's money, support or coercion was angelic) and she has been falsely and repeatedly accused of having been paid to review the Nassau work. She's also been chastised by the developers for not asking for the Nassau report's missing data before she wrote her Wall Street Journal article. There was no assurance at all, however, that anyone could have gotten the missing data and one should not have to ask for data which already should have been in the report out of respect for colleagues and in response to the requirements of professional standards and the Uniform Guidelines. For the record, too, those data still have not been completely supplied and were described in the symposium as belonging to the project's clients, one of which is the Department of Justice, which requires the rest of the world to report such things or face its wrath.

As to the project itself, and with all due respect to the developers, had there been the desired debate, those attending the session would have heard not only that the project was one devoted principally to reducing impact with a stated commitment to only "maintain" validity without ever reporting what that validity was. They also would have been given a much fuller portrait of this Department of Justice co-sponsored and preference-driven work which ended up being so seriously flawed that it is difficult to claim any validity for it at all.

What follows, therefore, has no racist motive, as some may claim, it has no competitive motive, as also some may claim, and it is not decimal dust minutiae as some may attempt to belittle it to be. It is a set of facts about the Nassau project which need to come out into the light of professional, administrative, judicial and congressional scrutiny.

1. The mandate given by the District Court was to eliminate adverse impact as that term is defined, or to meet the requirements of the Uniform Guidelines. In fact, the project actually did neither, and the impact reduction it did achieve was at an enormous sacrifice in validity.

2. The primary focus of the debate, of course, is the extent to which general mental ability was essentially eliminated from the final Nassau product so as to reduce impact. Despite its extraordinarily detailed job analysis which confirmed that mental ability is a critical facet of police officer job requirements, the project opted not to include already-existing and proven mental ability tests in its experimental battery. It instead created weaker, untested versions of its own. Under the umbrella of innovation it also created video versions of several of those and, finally, on the non-video mental ability tests, it made materials containing actual test content available to candidates 30 days prior to the test date. In contrast, the same materials were made available to the research sample only one week before their test date, but more about this later.

3. The next step in the Nassau project was to administer all 25 experimental tests to over 25,000 candidates, each of whom had paid a non-refundable $50.00 to take a test which they had been led to believe was an already-established, valid predictor of Nassau police officer performance. In fact, no validity evidence was available at that time and 16 of the 25 tests the candidates took would later end up on the cutting room floor.

4. By the time the subsequent validation research step was completed, the only surviving mental ability test was a paper and pencil Understanding Written Materials reading test to be used as a pre-screen to eight personality tests. It really wasn't a pre-screen either, since its cut-off score had been set based on the scores of the five poorest readers in the total research sample of over 500 officers. To those who have questioned such a unique, really bizarre, action, the response has been that all of the research sample officers had passed entrance tests and the training academy and were successful performers on the job. But job success, like validity level, is not a dichotomy. If all officers in the research sample truly were equally successful performers, then validity for all predictors would be just as equally absent. The fact of the matter is first, that a reading test is far from the best of all mental ability tests, and second, that the cut score decision reduced its value even further, essentially ignoring the validity of its scores. This combination then was taken into and approved by the Nassau Court in a non-adversarial hearing in which the Department of Justice's representative, after making an erroneous comparison with Nassau's previous test, described the new test by saying, "...I've got to point out to the Court, I have not seen a test of this type that is honest and valid, having less adverse impact than this examination has upon blacks." Despite that having been the greatest misrepresentation of them all, had the Nassau project ended there, the furor which followed may have been more limited.

5. What followed, however, was federal arm-twisting marketing of the most egregious kind. Justice Department representatives and their surrogates fanned out across the country and a major new program of federal Civil Rights Division coercion and intimidation was underway. Previously approved tests began being challenged, new data demands began being made and the Nassau product was being heralded as the Justice Department's suitable alternative grail. More than anything else, it was the abusive marketing and the extraordinary amount and nature of the data missing from the Nassau report which brought on a closer review of what had been done in Nassau and the racial gerrymandering charges which followed. It also was clear for reasons over and above the mental ability removal that the Nassau work wasn't even close to what the Nassau Court had been told or what the Justice Department was out selling.

6. Before proceeding to other aspects of the Nassau work, it may be helpful to itemize the Nassau report's missing data which contributed in large part to the questioning of what really had been done there. To summarize, therefore, there was no information on the previous test(s) used or the validity which was to be maintained. There were no zero-order r's for the 25 experimental tests, no intercorrelations among scores on the 25 tests or the batteries considered, no regression weights for the full 25 test battery or any of the smaller extractions, no score means or standard deviations for the batteries being evaluated and no separate data for whites (whites had been lumped together with all non-black, non-Hispanic minorities into a group called "Other"). Given their number and fundamental nature and the fact that all of these items are required by Uniform Guidelines Sections 15B(8) and 15B(10) and professional practice, a suspicion that something preference-driven had taken place cannot easily be dismissed as due to philosophical differences. In addition, Department of Justice representatives already were attempting to coerce other jurisdictions into joining in or adopting the Nassau work which made waiting for someone to get around to maybe providing the missing data an unaffordable luxury. As it developed, after being bounced from Justice to the developer or vice versa, others seeking the data finally were able to obtain only the original report and only if they were willing to pay for it. As indicated, however, most of these missing data still have not been made available and in a clear lack of understanding of the nature of Uniform Guidelines reporting requirements, the developer has since declared them to be the client's property and not therefore his to provide! In fact, what is to be reported is not subject to client option, no matter who that client happens to be. It's also unclear as to why Mr. Gadzichowski and Dr. Goldstein, the Justice Department's representatives, went into Court without it.

7. Unfortunately, too, the Justice-caused review of the Nassau work by a number of leading professionals other than Dr. Gottfredson revealed a number of major flaws over and above the mental ability removal, flaws which have raised serious doubt about the Nassau product having any interpretable score validity, let alone what was reported.

(a) One of those in early agreement with Linda Gottfredson's position on the virtually complete removal of general mental ability from the Nassau product was Frank Schmidt. He also noted, however, that the multiple R used in the Nassau corrected estimates was that for the full 25-test battery and not for the smaller battery of retained tests. He advised further, as Neal Schmitt had pointed out in 1986, that use of the Wherry shrinkage estimate formula in a situation such as Nassau was incorrect. Schmidt then went on to show first that a multiple R of .228 (instead of .30) should have been used, that applying the correct capitalization on chance formula (Cattin's equation 8) to that R yielded a value of .05 instead of the .20 reported, and that correcting the .05 for criterion reliability and range restriction would yield a lower bound validity of .08. He also computed an upper bound validity by entering the multiple R of .30 observed for the scores of the full battery of 25 tests and again using Cattin's equation 8 formula to yield a shrunken validity value of .14 corrected to .20. He then estimated the true, or operational validity of the full Nassau battery to be the average of the upper and lower bound estimates, or .14. He concluded that the Nassau report had therefore overstated validity for the full 25-test battery by over 100 percent. Finally, and while questioning the accuracy of its restriction of range correction results, Schmidt estimated the lower bound true validity of the Nassau final use battery to be .09, the upper bound to be .24 and the best estimate .17, again one-half of what was presented to the Nassau Court. (Note: In a "Draft" response, the developers agreed with Schmidt's major point, but claimed a cross-validity range of .14 to .17 and a post-adjustment range of .20 to .27 across the different smaller batteries. Even with this admission, an additional question still remains regarding correct formulas and treatments for entering partial r's in a regression equation, which appears to have been done here.)
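For readers who want to follow the arithmetic, a minimal sketch is given here. It uses the common textbook form of the Wherry adjustment, which is the formula Schmidt said was the wrong one for this situation; Cattin's equation 8 is not reproduced. The sample size of 500 is an assumption (the text above says only "over 500 officers"), and the function and variable names are illustrative. Under that assumed N, the Wherry-adjusted value for an observed R of .30 with 25 predictors comes out near .20. The midpoint function shows how the best estimates are obtained from the lower and upper bounds quoted above.

```python
import math

def wherry_adjusted_r(r, n, k):
    """Textbook Wherry adjustment: R2_adj = 1 - (1 - R^2)(n - 1)/(n - k - 1).

    r: observed multiple R, n: sample size, k: number of predictors.
    Returns the adjusted R (square root of the adjusted R^2, floored at 0).
    """
    r2_adj = 1.0 - (1.0 - r ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(r2_adj, 0.0))

def midpoint(lower, upper):
    """Average of a lower and an upper bound estimate."""
    return (lower + upper) / 2.0

# ASSUMPTION: n = 500 is illustrative; the article says only "over 500".
ASSUMED_N = 500

# Observed R of .30 for the full 25-test battery, as quoted in the text.
full_battery_wherry = wherry_adjusted_r(0.30, ASSUMED_N, 25)

# Bounds quoted in the text for the full battery and the final-use battery.
best_full = midpoint(0.08, 0.20)
best_final = midpoint(0.09, 0.24)

print(f"Wherry-adjusted R (assumed N=500, k=25): {full_battery_wherry:.3f}")
print(f"Midpoint, full battery bounds:  {best_full:.3f}")
print(f"Midpoint, final battery bounds: {best_final:.3f}")
```

The sketch covers only the shrinkage adjustment and the averaging step; the corrections for criterion reliability and range restriction depend on values not given here and are left out.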

(b) Quite apart from Schmidt's challenging the procedures used to estimate true validity, a second Nassau question warranting discussion is related to basic standardization requirements. In a 1985 Nassau follow-up study on the ETS NCPOST test, Dr. Jones and Dr. Prein quite appropriately dropped 75 of that test's 165 items from further consideration because, in their words, "A Pre-Examination Study Booklet with unknown influence on individual test performance was used, thus compromising standardization of a significant portion of this test."

In the 1994 study, however, as part of the project's attempt to reduce the impact caused by mental ability tests, portions of the project's mental ability tests were made available (as opposed to given) to candidates 30 days prior to their test date, while research sample members were given the same material a week before theirs. As a consequence, no determination can be made as to which candidates actually picked up the materials, or for how long those who did so had them. One also can't determine which of those who had the materials actually studied them or how much time those who did study them spent on it, or whether they studied them alone or worked with one or more family members or friends, or what effect making test content available had on scores. In addition, no reasonable inference can be drawn about the true relationship between incumbent and candidate scores or about what the validity of incumbent scores would have been had standardized conditions been adopted. Whether it be with items or content, therefore, this kind of procedure does compromise standardization and does violate Uniform Guidelines Section 5E ("Selection procedures should be administered and scored under standardized conditions."). The standardization of the Understanding Written Materials test's administration had been compromised and if the 1985 rules had been carried into 1994, as they should have been, the test should not have been included in the final battery, no matter where the cut-off score was set.

(c) The 1994 project's Work Readiness and Adaptation Profile (WRAP) instrument presents pairs of self-descriptive items to examinees and asks them to choose from each pair the one response that they feel best describes themselves. The WRAP is thus a "forced-choice" instrument which produces "ipsative" scores; that is, the level of each score is a function of the choices presented and high scores-strengths on one set of scales are by nature related to low scores-weaknesses on others. The scores are interdependent, not independent, and the frame of reference is the individual, rather than some normative group. Owing to their non-normative nature, therefore, and whether it be one or more than one scale, it is not technically acceptable to analyze ipsative scores using the usual correlational procedures. It appears therefore that the WRAP scores used in the final 1994 battery were improperly analyzed and combined and that these, too, should not have been a part of the final Nassau battery.
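The interdependence of ipsative scores can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of the WRAP itself: the four scale names, the number of item pairs and the random choices are all hypothetical. It shows that when every item forces a choice between two scales, each examinee's scale scores sum to the same constant, so a scale's covariance with the sum of the other scales is exactly the negative of its own variance.

```python
import random
from itertools import combinations

# ASSUMPTION: four hypothetical scales; these are not the WRAP's scales.
SCALES = ["A", "B", "C", "D"]
REPEATS = 5  # hypothetical: each pairing of scales appears five times

def simulate_examinee(rng):
    """Return one choice per forced-choice pair (picked at random here)."""
    pairs = list(combinations(SCALES, 2)) * REPEATS
    return [rng.choice(pair) for pair in pairs]

def score_forced_choice(choices):
    """Ipsative scoring: count how often each scale's statement was picked."""
    scores = {s: 0 for s in SCALES}
    for picked in choices:
        scores[picked] += 1
    return scores

def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
profiles = [score_forced_choice(simulate_examinee(rng)) for _ in range(200)]

# Every examinee's scores add up to the number of pairs presented.
totals = {sum(p.values()) for p in profiles}

# Scale A against the sum of the remaining scales.
a = [p["A"] for p in profiles]
rest = [p["B"] + p["C"] + p["D"] for p in profiles]
var_a = covariance(a, a)
cov_a_rest = covariance(a, rest)

print("distinct totals:", totals)
print("var(A):", var_a, "cov(A, B+C+D):", cov_a_rest)
```

Because the constraint is built into the scoring, the negative relationships among scales appear even when the choices are random, which is the reason the usual correlational procedures are questioned for scores of this kind.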

Having listed here that which the symposium's structure denied, and since all bad things and things about bad things must come to an end, I would say that the symposium's most cogent observation, warning if you will, came from Frank Landy. It was he who pointed out that the real cause for concern in this matter is the role being played in Nassau and elsewhere by the Civil Rights Division of the Department of Justice. Historically, that role could include demanding at any time, usually on the basis of complaints not difficult to generate, selection and promotion-related data from any jurisdiction it targeted, including those operating under consent decrees, some dating back almost twenty years, maybe more. It further could choose to make its demands in terms of single jobs or even single selection or promotion actions so as to increase the probability of its being able to make a finding. Having done that, a demand could be made for job-relatedness data, and if it existed, to demand it in raw data form so that the Department could overlay its own analysis and value judgments on what had been provided (for example, the Department of Justice has steadfastly refused to accept training results as a criterion. It also has re-analyzed data so as to produce rather bizarre and uncross-validated, impact-reducing methods of use. In Nassau, it simply kept rejecting criteria and studies, or remodeling use methods until it got what it wanted--the 1994 result). The conclusion of the process, where a consent decree hadn't already existed, could usually be the establishment of one which likely would include a number of make-well monetary or other provisions, as well as the presence of the Division's heavy hand in jurisdiction affairs for years to come.

Nassau and its progeny, however, represent a new Civil Rights Division strategy. It still is making its molecular data demands. It still is paying experts to evaluate and/or ransack existing job-relatedness data, but now it also is paying those same experts to simultaneously oversee the development of preference-driven tests designed to replace the tests they are evaluating. It further is advising jurisdictions, upon pain of restraining orders or burial under the weight of additional data or material demands, not to use specific other test developers. It then simultaneously offers them a "safe haven" from such threatened federal harassment and subsequent public exposure. The safe haven, of course, just happens to be the Justice-sponsored test, or participation in yet another Nassau-like venture. Now that the Nassau project has been more publicly challenged, however, it appears that its final battery has been abandoned. A Nassau II study conducted in Louisiana is now the favorite in some confrontations, but it also has bizarre characteristics, suffers from its own set of deficiencies, and, again, has missing data. What the next favored test will be is unknown at this time.

In summary, the Civil Rights Division is not pursuing score adjustments based on group membership, which is now forbidden by the Civil Rights Act of 1991. It instead is spending federal tax dollars to hire experts to savage existing and often satisfactorily job-related tests while it is spending additional tax dollars to use the same experts to simultaneously oversee the development of preference-driven, content adjusted and questionably effective test batteries. It then is using even more tax dollars to intimidate and coerce jurisdictions around the country to avoid other developers, or to drop their own, often far more effective procedures and to adopt whatever the Division's favorite at the time happens to be. Left without administration and/or congressional oversight, the net effect, in addition to an already existing subversion of the Clinton Administration's C.O.P.S. and PoliceCorps quality improvement programs, will be a substantial reduction in the quality of every police force the Division touches. It is irresponsibly abusing its power to tilt police selection so as to favor impact reduction over quality improvement and it is corrupting a science in the process. As Dr. Gottfredson so aptly put it in her May 20, 1997 testimony before the Constitution Subcommittee of the House Judiciary Committee, "DOJ has no business entering the test development business and coercively marketing the test it helps develop. Its interference in that market is comparable to the FDA working jointly with a major drug company to manufacture new drugs, ignoring its own regulations for assessing their quality, creating new rules that only that company's drugs can pass, punishing whistle-blowers who expose disguised hazards and inflated claims to the drugs' efficacy, and then coercively marketing the flawed products to hospitals coast to coast." 
If the SIOP membership doesn't comprehend what is going on and the threat it represents to scientific principles, then its days as a credible, scientific body are numbered and the-client-made-me-do-it, or my-social-values-made-me-do-it philosophies won't save it.