
The Use of Simulated Pretests

Dennis Doverspike, Gerald V. Barrett, and Winfred Arthur, Jr.


Testing principles, as found in textbooks and professional guidelines (e.g., Nunnally, 1978; SIOP, 1987), suggest that where feasible a test battery developed based on a content validity strategy should be pretested. Given the frequent use of content valid test strategies in the public sector, this would suggest that pretesting should be used regularly in public sector personnel assessment.

However, the suggestion to use pretesting can be traced to a psychometric tradition which assumes the availability of extremely large samples. It is not uncommon to find the suggestion in the same literature that the number of pretest subjects should be 5 times the number of items. This would mean that for a 200-item test one would need 1,000 pretest subjects. Obviously, for most public and private sector agencies, a pretest sample of that size would be impossible to obtain.

There are also additional concerns which arise with the use of pretests in the public sector, including issues of security, cost, and time. Pretesting with incumbents creates the potential for the leakage of information on the test and test items. This can lead to severe security problems and public relations disasters. Pretesting large numbers of incumbents is also costly, in that it takes employees away from their jobs. Thus, while the use of a pretest is certainly a desirable, albeit costly, technique, relatively little research has addressed either the generalizability of pretests or alternatives to pretests.

Information Gained from the Pretest

One critical piece of information obtained from the pretest is the set of estimated item difficulties. When the test is designed to measure one dimension and consists of dichotomously scored items, the test and item means and variances depend on the item difficulties. Item difficulties are also important in the calculation of item bias statistics, and play a pivotal role in so-called "Golden Rule" approaches. Thus, the estimation of item difficulties is a critical step in the development of a content valid test, and one way to obtain this information is through a pretest. However, as we will suggest later, there are alternative means of obtaining the same information.
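As a minimal sketch of how item difficulties relate to item variances and the test mean for dichotomously scored items, the following Python example may help. The response matrix and the function name `item_difficulties` are hypothetical and serve only as an illustration.

```python
def item_difficulties(responses):
    """Return the proportion of examinees answering each item correctly.

    `responses` is a list of examinee rows; each row holds 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees
            for j in range(n_items)]


# Hypothetical data: 4 examinees x 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

p_values = item_difficulties(responses)

# The variance of a 0/1 item is p * (1 - p).
item_variances = [p * (1 - p) for p in p_values]

# The mean total score equals the sum of the item p values.
expected_test_mean = sum(p_values)
observed_test_mean = sum(sum(row) for row in responses) / len(responses)
```

In this toy example the two test means agree, which is the sense in which the test mean follows directly from the item difficulties.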

A second critical piece of information obtained from the pretest is item-total correlations. Item-total correlations are important in that they are the basis for the calculation of measures of internal-consistency reliability.
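A minimal sketch of these two statistics follows, assuming dichotomous items and using KR-20 as one common internal-consistency measure (the choice of KR-20 is ours for illustration, as are the data and function names). The item-total correlations here are uncorrected, meaning each item is included in the total it is correlated with.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate each 0/1 item with the total score (item included)."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals)
            for j in range(n_items)]


def kr20(responses):
    """KR-20 reliability; assumes the total scores are not all equal."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical data: 5 examinees x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

r_item_total = item_total_correlations(responses)
reliability = kr20(responses)
```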

Use of Expert Judges

One alternative to the pretest is the use of expert judges to generate estimated item statistics. This method has a long tradition of use in the establishment of cutoffs, although we have not seen it proposed as an alternative to a pretest. Expert judges can provide estimates of the difficulty, or p value, of each item and can also inspect items for other attributes. Of course, with the Angoff method judges make item difficulty judgments and not judgments of item-total correlations. But there would be nothing to stop judges from making estimates of item-total correlations as well.
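One simple way to turn such judgments into item statistics is to average the judges' estimates for each item. The sketch below assumes that each judge supplies an estimated probability of a correct answer for every item; the ratings and the function name `pooled_judge_estimates` are hypothetical.

```python
def pooled_judge_estimates(ratings):
    """Average the judges' estimated p values for each item.

    ratings[j][i] is judge j's estimated probability that item i
    is answered correctly.
    """
    n_judges = len(ratings)
    n_items = len(ratings[0])
    return [sum(judge[i] for judge in ratings) / n_judges
            for i in range(n_items)]


# Hypothetical ratings: 2 judges x 3 items.
ratings = [
    [0.50, 0.75, 1.00],
    [0.50, 0.25, 0.50],
]

estimated_p = pooled_judge_estimates(ratings)

# Estimated mean test score, expressed as a percentage.
estimated_mean_percent = 100 * sum(estimated_p) / len(estimated_p)
```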

The use of expert judges does not, of course, eliminate all security risks. If expert judges are recruited from job incumbents or supervisors, then the potential for information leakage is still present.

Use of Simulated Pretests

We have found that many times in public sector testing, especially with public safety personnel, holding a pretest with job incumbents is not a feasible option. As a result, we have developed a "simulated pretest method." In the simulated pretest methodology, an available sample is recruited, usually students, and asked to serve as simulated applicants. For entry level tests this is a very straightforward procedure, and the simulated applicants need only be administered the test.

The situation is more complicated when the test is a job knowledge test used as part of a promotional examination. For a promotional examination, it is quite common for there to be a set of source documents which are supposed to be studied before the exam. One would prefer to have all of the pretest subjects study all of the source materials, but expecting the simulated applicants to study all of the source documents would, of course, be unreasonable. A solution is to divide the students into subgroups and have each subgroup spend several hours studying a subset of the study materials. Each group of simulated applicants then receives the subset of test questions corresponding to the material it has studied. Done properly, this simulated pretest can minimize cost and time demands. Furthermore, keeping pretest subjects isolated from potential examinees minimizes the risk of breaches of test security.
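The bookkeeping for such a design can be sketched as follows. This is only an illustration under assumptions of our own: students are dealt into subgroups in round-robin order, there is one subgroup per source document, and each item is keyed to a single source. All names and data are hypothetical.

```python
def assign_subgroups(students, sources):
    """Deal students round-robin into one subgroup per source document."""
    groups = {source: [] for source in sources}
    for index, student in enumerate(students):
        groups[sources[index % len(sources)]].append(student)
    return groups


def items_for_source(item_sources, source):
    """Return the item numbers written from a given source document."""
    return [item for item, src in item_sources.items() if src == source]


# Hypothetical students, source documents, and item-to-source key.
students = ["S1", "S2", "S3", "S4", "S5", "S6"]
sources = ["Source A", "Source B", "Source C"]
item_sources = {1: "Source A", 2: "Source B", 3: "Source A", 4: "Source C"}

groups = assign_subgroups(students, sources)

# Each subgroup receives only the items drawn from the source it studied.
booklets = {source: items_for_source(item_sources, source)
            for source in sources}
```

Note that under such a design each item's p value rests only on the subgroup that studied the corresponding source, so the per-item sample is smaller than the full pretest sample.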

Although, in the case of the job knowledge test, it may not seem to make much sense to have simulated applicants take the test without studying, we have found this to be an effective procedure and a source of valuable information. By having simulated applicants take the test without studying, we can gather information on the degree to which the test may be subject to "test wiseness." That is, we can determine whether there are items that subjects can answer correctly based on the test or item format, without underlying knowledge of the subject matter.
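One rough way to screen for such items is to compare each item's p value in the unstudied group with the level expected from blind guessing. The sketch below assumes multiple-choice items with four options and uses an arbitrary margin above chance; both values, the data, and the function name are hypothetical, and with small samples a formal significance test would be more defensible than a fixed threshold.

```python
def flag_test_wise_items(p_unstudied, n_options=4, margin=0.25):
    """Return indices of items answered well above chance without study.

    n_options and margin are illustrative assumptions, not recommended
    values.
    """
    chance = 1 / n_options
    return [index for index, p in enumerate(p_unstudied)
            if p >= chance + margin]


# Hypothetical p values from simulated applicants who did not study.
p_unstudied = [0.25, 0.80, 0.30, 0.55]

flagged = flag_test_wise_items(p_unstudied)
```

Flagged items would then be candidates for review of their stems and distractors.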

Adequacy of Expert Judges and Simulated Pretests

The question could be asked as to whether expert judges and simulated pretests provide adequate estimates of item statistics. Of course, the same question could be asked of pretests with incumbents. Unfortunately, as with many other topics involving content validity, there is not much research dealing with this question. Most of the research available deals with the adequacy and reliability of Angoff judgments.

While there is a need for additional studies on the question of simulated pretests, we were able to conduct a field study looking at the adequacy of estimates provided by expert judges and simulated pretests. The study described here involved the development of a content-valid promotional test for police officers. In developing the test, we wanted to estimate the item difficulties of test items. Pretesting using incumbent officers was not a feasible option, as it involved substantial cost, inconvenience, and the possibility of endangering the security of the exam. One option then was to run a simulated pretest.

The study investigated two basic questions. First, could the item difficulties from the administration of a test to applicants be predicted from a simulated pretest sample? Second, how would these predictions compare to the estimates from judges?

The job knowledge test was a 180-item test which was administered to 96 police officers who had applied for promotion to sergeant. Several months prior to the test, police officers were given a list of readings. The police officers were administered the 180-item final version of the test in a group administration.

The simulated pretest was based on 16 students. The students were assigned specific parts of the reading list or sources and were subsequently administered only the part of the test corresponding to the assigned reading material.

There were 18 expert judges. All judges had previous experience in consulting projects involving the development of tests for police work. They were given the test and asked to estimate the probability that an item would be answered correctly.

The basic data used in this study were the p values for the items, that is, the percentage of people answering each item correctly. Unfortunately, item-total correlations were not available for this study.

For the applicant sample (police officers taking the test for consideration for promotion), the mean percentage test score was 69.36%. (Note: All scores are presented as percentages to aid in interpretation.) Based on the simulated pretest sample, the mean test score was estimated to be 56.63%. Thus, the mean p value for the simulated pretest sample was almost 13 percentage points lower. However, the correlation between actual item difficulties based on the police officer sample and item difficulties estimated from the pretest sample was .57, which was significant beyond the .01 level. Thus, the simulated pretest appeared to provide a better estimate of the relative item difficulties than of the absolute p values.

The judges' estimate of the mean percentage score was 71.35%, which was within 2 percentage points of the actual obtained value of 69.36%. The correlation between actual item difficulties based on the police officer sample and item difficulties estimated by the judges was .39, which was significant beyond the .01 level.

Both the judges' estimates and the simulated pretest estimates of item difficulty were moderately correlated with the actual results, with the simulated pretest estimates being more highly correlated. Thus, the simulated pretest results were a better measure of the relative item difficulty than were the judges' estimates. However, the judges' estimates reflected more accurately the mean item difficulty of the test.
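The distinction between relative and absolute accuracy can be illustrated with a short Python sketch. The p values below are invented for illustration and are not the study data; the helper names are likewise hypothetical. The correlation captures how well a set of estimates reproduces the ordering and spacing of the item difficulties, while the difference in means captures how well it reproduces their overall level.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_percent(p_values):
    """Mean item p value expressed as a percentage."""
    return 100 * sum(p_values) / len(p_values)


# Invented p values for four items (not the study data).
actual = [0.90, 0.70, 0.60, 0.40]      # applicant sample
simulated = [0.75, 0.60, 0.45, 0.30]   # simulated pretest: lower overall
judges = [0.70, 0.65, 0.70, 0.55]      # judges: similar mean, narrow range

# Relative accuracy: correlation with the actual item difficulties.
r_simulated = pearson(actual, simulated)
r_judges = pearson(actual, judges)

# Absolute accuracy: distance of the estimated mean from the actual mean.
gap_simulated = abs(mean_percent(simulated) - mean_percent(actual))
gap_judges = abs(mean_percent(judges) - mean_percent(actual))
```

In this invented example, as in the pattern reported above, the simulated pretest estimates correlate more highly with the actual difficulties, while the judges' mean lies closer to the actual mean.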

Discussion

The results were encouraging in that the simulated pretest provided estimates of p values which were moderately correlated with the p values obtained from the actual test administration. This was in spite of the fact that the simulated pretest sample (i.e., students) differed substantially from the sample used in the actual test administration (i.e., police officers). In addition to the samples being very different from a demographic perspective, the police officers had an extensive opportunity to study for the exam, while the students received only a short study period. The simulated pretest, although involving a sample from a different population under very different conditions, still predicted the item difficulties on the actual test administration. Thus, if one's primary interest is in the relative difficulty of items, the simulated pretest provides valuable information.

The estimates by the judges were much more restricted in range and not as highly correlated with the applicant data. However, the judges' estimates were closer to the overall mean p value.

In conclusion, the results suggest that a simulated pretest, even one conducted under quite different conditions, will provide valuable and meaningful information on the difficulty of items. In addition, expert judges can also produce valid estimates of the item difficulties, although in this study their estimates, compared to those from the simulated pretest, were not as highly correlated with the applicant sample. One possibility would be to combine the two processes and use expert judges along with a simulated pretest. This would appear to give maximum information with minimal time and effort.

While pretests may be a valuable source of psychometric information, their use in the public sector is often problematic. Two alternatives are the use of expert judges and the use of simulated pretests. While more research is needed on both alternatives, they do seem to provide a solution to the pretest problem in the public sector.

References

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Society for Industrial and Organizational Psychology, Inc. (1987). Principles for the validation and use of personnel selection procedures (3rd ed.). College Park, MD: Author.

Dennis Doverspike & Gerald V. Barrett are from the University of Akron and Barrett & Associates, Inc.; Winfred Arthur, Jr. is from Texas A&M University.


© Copyright 1998 by the IPMA Assessment Council. All rights reserved.