Implications of the Revised Testing Standards for Personnel Testing Practice

Dr. Wayne Camara
Research and Development Director
The College Board


Doctor Camara reviewed major additions and changes in the revised joint testing standards and discussed their implications for personnel testing and assessment. At the time of this presentation to MAPAC (9/30/99), the final version of the new standards had been endorsed by the three sponsoring organizations (AERA, APA, and NCME) but had not yet been published; Doctor Camara had access to the final version and had reviewed it. The new testing standards are expected to be available for purchase from the American Educational Research Association in late October 1999 (see www.aera.net and http://www.apa.org/science/standards.html). Following are some highlights of Doctor Camara's presentation to MAPAC. More details are provided in a handout available from Doctor Camara (email address: wcamara@collegeboard.org).

The history of previous testing standards was reviewed, and the various versions of the standards were briefly compared. Standards were published in 1955, 1966, 1974, and 1985, and new standards will be published in 1999. One major change in the 1999 standards compared to most previous versions is that standards are no longer categorized by level (e.g., primary, secondary). Some 1999 standards have conditional statements attached to them, but most apply in all instances.

Structurally, the 1999 standards are much more extensive than the previous versions. For example, there is a 47% increase in the number of standards (264 vs. 180) and a 33% increase in the number of paragraphs of text (240 vs. 180) in the 1999 standards compared to the 1985 standards. A new chapter was added to the 1999 standards on the topic of test fairness, and policy issues are emphasized. The fairness chapter represents an area of large, substantive change in the standards. The new standards have substantial sections on bilingual testing accommodation, testing of persons with disabilities, and other hot public policy topics. There is more advocacy of positions on public policy issues in this set of standards than in any previous set of standards.
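The percentage increases quoted above follow directly from the counts given in the presentation. As a quick check, here is a minimal sketch; the function name `percent_increase` is illustrative only and not part of any source material.

```python
def percent_increase(old: int, new: int) -> float:
    """Return the percentage increase from old to new."""
    return (new - old) / old * 100


# Counts as reported in the summary: 1985 version vs. 1999 version.
standards_increase = percent_increase(180, 264)   # number of standards
paragraphs_increase = percent_increase(180, 240)  # paragraphs of text

print(f"Standards: {standards_increase:.0f}% increase")
print(f"Paragraphs: {paragraphs_increase:.0f}% increase")
```

Rounded to whole numbers, these match the 47% and 33% figures cited.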

There were 103 definitions added to the glossary of the 1999 standards. The definitions added and deleted reflect trends in assessment during the past 15 years. Added definitions include scoring rubrics, holistic scoring, defining proficiency levels, matrix sampling, bias, meta-analysis, construct (defined very broadly), and DIF (differential item functioning).

The stated purposes of the 1999 standards are to provide criteria for the evaluation of tests, testing practices, and the effects of test use, and to promote sound and ethical use of tests. They encourage adoption of the standards by test developers, sponsors, and users. A test is defined broadly as an evaluation device or procedure in which a sample of behavior is evaluated and scored using a standardized process.

Doctor Camara reviewed some key provisions of selected chapters in the standards.

Chapter 1 on Validity describes validity as a unitary concept and refers to validation as an ongoing process of accumulating evidence. Validity refers to the interpretation of scores by users, not to the test itself. Validation is a joint responsibility of the test developer and user. The chapter contains increased emphasis on validity evidence based on the response process, validity generalization, and the consequences of testing. New requirements include evidence that practice and coaching do not affect test scores, evidence to support claims of the benefits of testing, and investigation of unintended consequences of test use. Some standards on criterion-related validation studies that appeared in the 1985 standards were deleted from this chapter.

Chapter 2 is on Reliability and Measurement Error. Doctor Camara did not discuss this chapter, but he stated that it is the most difficult chapter in the standards to understand, and that many I/O psychologists will not be familiar with its concepts.

Chapter 3 is on Test Development. It provides a step-by-step description of test development and scoring. Considerable attention is given to performance assessment. It includes a requirement for sensitivity reviews of tests by a diverse panel. Test content should assure that inferences for all groups are valid. More documentation is required. Test users must train raters and ensure adequate reliability.

Chapter 4 on Scales, Norms, and Score Comparability contains increased emphasis on linkages, adaptive testing, cut scores, and standard setting. One new requirement is to examine item context effects when changing item order in multiple forms of the same test. Regular re-examination of norms is required. Criterion-referenced cut scores are endorsed.

Chapter 5 is on Test Administration, Scoring and Reporting. It states that persons of different backgrounds, ages or familiarity with testing may need nonstandard modes of test administration. This led to a discussion of a possible need to give diagnostic tests to determine if candidates are prepared to take "the" test. When test data are stored, you must preserve the test protocol.

Chapter 6 covers Supporting Documentation. It emphasizes clearly written and understandable materials for test users. It contains the "consequences" concept: the higher the consequences of the test, the more analysis and documentation are required. Considerable detail is provided on what to document in test manuals.

Chapter 7 on Fairness emphasizes the importance of fairness across all aspects of testing and provides a context for the standards. Nearly all of the standards in this chapter are new or substantially revised. Fairness is to be judged in the context of feasible test and nontest alternatives. The chapter contains four definitions of fairness; however, no one definition is endorsed. The chapter emphasizes sensitivity review panels, balanced content/rubrics, the use of multiple measures, and the strengths and limits of tests. Studies of fairness are required when research shows differences in item functioning, score meaning, or effects of construct-irrelevant variance. Studies are required to ensure that mean score differences do not result from construct underrepresentation or irrelevant variance. The chapter emphasizes the responsibility of informing policymakers of the likely consequences of using a test. Your decision on what to measure may be a fairness issue.

Chapter 8 is on the Rights and Responsibilities of Test Takers. Test takers are to be provided, in advance of testing, where appropriate, information about the nature of the test, use of scores, and confidentiality. Test taker responsibilities are described.

Chapter 9 on Testing Individuals of Diverse Linguistic Backgrounds was described as the "worst chapter." The implication of the provisions of this chapter is that language problems are so enormous and prevalent that testing should sometimes not be conducted, and individual accommodations are needed depending on language dominance. Testing in a person's primary language is recommended, but the chapter identifies many problems and issues with translating and equating tests. The chapter requires collecting validity data for each linguistic subgroup and determining language proficiency before testing.

Chapter 10 is on Testing Individuals with Disabilities. This chapter is much more instructional and descriptive than the corresponding material in the 1985 standards. It includes a provision on the use of professional judgement in test modification, and a provision on the use of multiple sources of information.

Chapter 14 on Testing for Employment/Credentialing was described as "a good primer on employment testing" and the best chapter in the new standards. It is the longest chapter. The introduction to the standards in this chapter contains 39 paragraphs. The chapter puts the burden for conducting research on sample size rather than organizational size. It contains an informative graphic and narrative model of the validation process. Additions to these standards compared to the 1985 version include stating the objective of testing, stating that cut scores should not be regulated by a desire to reduce the number of persons passing, and defining the role of each test when a composite predictor is used. Whether local validation studies are conducted depends on their feasibility, and existing validation evidence should be considered when interpreting the results of local studies.

The 1999 Testing Standards include 264 separate standards. Doctor Camara advised that, for documentation purposes, anyone involved in "high stakes testing" should list each standard and comment on how their test development and research meets each standard, or why meeting the standard is not relevant or feasible for the particular assessment process.
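The standard-by-standard documentation described above could be kept in many forms. The following is only an illustrative sketch of one possible record structure, not a method presented by Doctor Camara; the class names, status labels, standard identifiers, and comments are all hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical labels mirroring the three outcomes described above.
    MET = "met"
    NOT_RELEVANT = "not relevant"
    NOT_FEASIBLE = "not feasible"


@dataclass
class StandardEntry:
    standard_id: str  # placeholder identifier, not an actual standard number
    status: Status
    comment: str      # how the standard is met, or why it is not


def undocumented(entries, all_ids):
    """Return the ids that have no entry, or an entry with an empty comment."""
    documented = {e.standard_id for e in entries if e.comment.strip()}
    return [sid for sid in all_ids if sid not in documented]


# Placeholder data for illustration only.
all_ids = ["A", "B", "C"]
entries = [
    StandardEntry("A", Status.MET, "Placeholder: see validation report."),
    StandardEntry("C", Status.NOT_FEASIBLE, "Placeholder: sample too small."),
]
missing = undocumented(entries, all_ids)
print(missing)
```

A check such as `undocumented` makes it easy to confirm that every one of the listed standards has an explicit comment before the documentation is finalized.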

Summary prepared 10/12/99 by Charles F. Sproule, Pennsylvania State Civil Service Commission


© Copyright 1999 by the IPMA Assessment Council. All rights reserved.