Some Comments on the Nassau County Police Validity Study

Frank Schmidt

R.L. Sheets Professor of Human Resources
University of Iowa

July 2, 1996


Although it is apparent that a great deal of work went into this research, there are technical and conceptual problems in the study. The most glaring conceptual problem is the complete failure to draw on the cumulative scientific literature in any way. This report assumes that the only scientific evidence relevant to the hiring of police officers is that produced by this study. It does not even mention the meta-analytic literature on validity, nor does it mention the findings of other large-scale studies of police selection. To understand the implications of this, consider a medical doctor deciding whether to prescribe a certain antibiotic for patients who have a certain illness. Suppose the doctor ignored all the previous research studies evaluating the effectiveness of this antibiotic and instead decided to conduct his or her own study of how well it worked among his or her own patients. This would rightly appear irrational to most people (and all doctors), but this is exactly what was done in this study.

Another problem in this study is its treatment of the various cognitive measures. We see in Exhibits 25 and 26 that the job analysis finds a strong link between the cognitive measures Reasoning/Judgment and Reading Comprehension and the performance of many important police tasks. In Exhibit 61, we see that both the Situational Judgment test (the measure of reasoning and judgment) and the Reading Comprehension test are found in this study to be empirically valid for predicting important police performances. These findings are consistent with previous research in the literature--ignored by the report, as noted above--showing the importance of cognitive abilities to job performance. Despite all these facts, the final selection battery recommended by the report contains no cognitive components--except for the requirement that the applicant be above the 1st percentile of incumbents in reading comprehension. It is hard to see how this recommendation can be justified except on the basis of an overriding emphasis on reduction of group mean differences on the battery. Based on past research and experience, it can be predicted that use of this battery for hiring will lead to severe performance problems in the police academy, higher flunk-out rates, and lower levels of job performance for those who do get through the academy. Some of these job performance problems could well endanger public safety.

There are other such problems, and I trust that others with more time than I presently have will point them out. In the remainder of this comment, I want to focus on a technical (statistical) problem in the data analysis that results in a substantial overestimate of the validity of all the test batteries examined in the report. These overestimates result from erroneous corrections for the inflationary effects of capitalization on chance on the estimates of validity.

The report correctly recognizes (see p. 137) that if one examines the validities of a number of tests in one's sample and selects some of these tests for one's final battery based on these validities, all estimates of the validity of that battery derived from that same sample of people will be inflated. The usual solution for this problem is to have an independent cross-validation group (not used in selecting the battery) and to estimate the battery validity on this group. This was not done in this study.

It is sometimes stated that an alternative is to use a statistical formula (shrinkage formula) to adjust for this inflation of validity. However, these formulas were not derived for cases in which one selects only some of the initial predictors for retention; they were derived for the case in which all the original predictors are retained. (In such a case, all the capitalization on chance is due to the fitting of the regression weights.) Nevertheless, these formulas can be used to provide a (lower bound) estimate of the validity of a battery selected ex post facto if three conditions are met:

1. The value for the number of predictors entered into the equation is the original number, not the smaller number of predictors retained. This study meets that condition.

2. The multiple correlation entered into the equation is that for the smaller battery of retained tests. This was not done here. We see on p. 137 that the multiple R of .30 used was that for the full battery of 25 tests. This means that the corrected estimates apply not to the battery actually used, but to the optimally weighted combination of all 25 tests--a battery not actually used or recommended for use.

3. Finally, the correct formula must be used. This study used the Wherry formula, which is not the correct one. The Wherry formula estimates how well the battery would work with the unknown population regression weights, not with our estimates of those weights. The correct formula, by contrast, estimates how well the tests will do given that we have to use our (imperfect) estimates of the optimal weights. These matters are discussed and the correct formulas are given in Cattin (1980) and in Schneider and Schmitt (1986).
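The difference described in condition 3 can be illustrated with a short sketch. It uses the standard Wherry adjustment and, as a stand-in for the cross-validity formula, the estimator usually attributed to Browne (1975); treating that estimator as equivalent to Cattin's equation 8 is an assumption made here, not something stated in this comment. The sample size N = 481 is a placeholder chosen only for illustration (the study's N is not given in this comment), and the function names are hypothetical.

```python
import math

def wherry_adjusted_r2(r2, n, k):
    """Wherry adjustment: estimates the squared multiple correlation that
    would be obtained with the (unknown) population regression weights."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def cross_validity_r2(r2, n, k):
    """Browne-type estimate (assumed stand-in for Cattin's equation 8):
    squared validity expected when the sample-estimated weights are
    applied to new applicants. Uses the simple plug-in rho^4 = (rho^2)^2."""
    rho2 = max(wherry_adjusted_r2(r2, n, k), 0.0)  # clamp negative estimates to 0
    numerator = (n - k - 3) * rho2 ** 2 + rho2
    denominator = (n - 2 * k - 2) * rho2 + k
    return numerator / denominator

N = 481   # placeholder sample size, NOT taken from the report
K = 25    # original number of predictors (condition 1)
R = 0.30  # multiple R for the full 25-test battery

wherry_r = math.sqrt(max(wherry_adjusted_r2(R ** 2, N, K), 0.0))
cross_r = math.sqrt(cross_validity_r2(R ** 2, N, K))

print(f"Wherry-adjusted R:   {wherry_r:.3f}")  # roughly 0.20 with these inputs
print(f"Cross-validity R:    {cross_r:.3f}")   # roughly 0.13 with these inputs
```

With these placeholder inputs the cross-validity estimate falls well below the Wherry value, which is the direction of the error described above; the exact figures depend on the actual sample size.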

The key estimates of battery validity are given in Exhibit 67 on p. 186. All batteries have nearly equal observed (uncorrected) validities, and Exhibit 67 gives .20 as the value of the validity for all these batteries after correction for capitalization on chance. However, as noted above, the wrong multiple R was used: instead of .30, it should be .228 (the average of the observed battery validities). In addition, the correct formula (Cattin's equation 8) should be used. Making these two changes yields a value of .05 instead of the .20 reported. Corrected for criterion unreliability and range restriction, this becomes .08. The validity values reported in Exhibit 67 are .29 and .30--which are over 3.5 times larger than the .08 value.

However, the .08 estimate is a lower bound (conservative) estimate. When there has been ex post facto selection of predictors, there is no way to use shrinkage formulas to get a completely unbiased estimate of battery validity. However, in addition to the lower bound (conservative) estimate, we can obtain an upper bound estimate. This is obtained by entering for the multiple R value the value of .30 obtained for the complete battery of 25 predictors. In addition, we must use the correct formula rather than the Wherry formula used in this study. These calculations yield a shrunken validity value of .14. Corrected for criterion unreliability and range restriction, this value is .20. The values of about .30 reported in Exhibit 67 are 50% larger than this upper bound value. So they are considerably inflated.

Probably the best estimate of true or operational validity here would be the average of the upper and lower bound estimates. This average is .14, which is less than half the erroneous validity values reported in Exhibit 67. Hence, because of statistical error in the report, the report overstates validity by over 100%.
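The arithmetic behind these comparisons can be checked directly. The sketch below uses only the figures quoted in this comment (.08, .20, and the roughly .30 reported in Exhibit 67); the variable names are illustrative.

```python
# Figures quoted in this comment
lower_bound = 0.08   # corrected lower bound estimate
upper_bound = 0.20   # corrected upper bound estimate
reported = 0.30      # approximate validity reported in Exhibit 67

# Best estimate: midpoint of the two bounds
best_estimate = (lower_bound + upper_bound) / 2

# How far the reported value exceeds each benchmark
ratio_to_lower = reported / lower_bound        # "over 3.5 times larger"
excess_over_upper = reported / upper_bound - 1  # "50% larger"
overstatement = reported / best_estimate - 1    # "over 100%"

print(f"best estimate:            {best_estimate:.2f}")
print(f"reported / lower bound:   {ratio_to_lower:.2f}")
print(f"excess over upper bound:  {excess_over_upper:.0%}")
print(f"overstatement vs. best:   {overstatement:.0%}")
```

The midpoint is .14, the reported value is 3.75 times the lower bound and 50% above the upper bound, and it exceeds the best estimate by about 114%.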

Although an operational validity of .14 is not useless, and can have value in comparison with random selection, it would generally be considered a small validity. A validity this small would often, perhaps typically, lead to an attempt to develop a more valid procedure.

Finally, there is one battery in Exhibit 67 for which the claimed validity is even higher than the .30 reported for the other batteries. A true validity estimate of .35 is reported for the "Non-cognitive plus Minimum Reading Standard at First Percentile" battery, the battery the report ultimately recommends for operational use. However, even ignoring the errors in validity estimation discussed above, this value of .35 almost certainly reflects a computational error. Looking at the previous column in the Exhibit, we see that the validity of this battery adjusted for criterion unreliability is .25, the same value as reported for all the other batteries. The jump from .25 to .35 results solely from the correction for range restriction. The only difference between this battery and the "Non-Cognitive" battery is the addition of the requirement of being at or above the 1st percentile in reading comprehension. It is highly unlikely that this simple addition could cause a 100% increase in the correction for range restriction (from about .05 to .10). Although insufficient information is presented in the report to confirm this by checking the calculations, it is highly likely that there is a computational error here.

However, even if we were to accept as correct a range restriction correction of this magnitude, the estimated validity of this test battery would still not be high. These estimates are as follows:

Lower Bound = .09
Upper Bound = .24
Best Estimate = .17

Again, the best estimate is an average of the upper and lower bound estimates.
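For the recommended battery, the same midpoint calculation can be written out as a quick check. Only the bounds listed above and the claimed .35 are used; the variable names are illustrative.

```python
# Bounds listed above for the recommended battery
lower_bound = 0.09
upper_bound = 0.24
claimed = 0.35  # value reported in Exhibit 67 for this battery

# Midpoint of the bounds; 0.165, reported above as .17
best_estimate = (lower_bound + upper_bound) / 2

# Ratio of the claimed validity to the midpoint estimate
ratio = claimed / best_estimate

print(f"best estimate: {best_estimate:.3f}")
print(f"claimed / best estimate: {ratio:.2f}")
```

The midpoint is .165, so the claimed .35 is roughly twice the best estimate even when the questioned range restriction correction is accepted.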

So once again, it is apparent that none of these test batteries has high validity. A validity of .17 is almost certainly an overestimate. A more accurate estimate is .14. Although a validity of .17 is not useless, and in fact could have substantial practical value if the alternative were a method with zero validity, this level of validity is not impressive.

In summary, a major problem with this report stems from technical errors that result in inflated final estimates of operational validity for all the test batteries considered in the study. Actual operational validities are overstated by 100% or more.

Cattin, P. (1980). Estimating the predictive power of a regression model. Journal of Applied Psychology, 65, 407-414.

Schneider, B. & Schmitt, N. (1986). Staffing the Organization.