Statistics Tips and Enigmas


Statistics Tip: Finding Reliable Reliabilities

Test reliability estimates can be computed using any of several procedures. The procedures use different data, measure different sources of error, or accept different assumptions. As a result, a test has as many reliabilities as there are methods for computing reliability. Reliability coefficients computed under different conditions are the same only by chance and may differ widely. The standard error of measurement varies inversely with the reliability estimate, and so can take an equally large number of values.
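
The inverse relationship can be made concrete with the usual classical test theory formula, SEM = SD * sqrt(1 - reliability). The sketch below is only an illustration: the function name, the score standard deviation of 10, and the two reliability values are assumptions, not figures from any real test.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical reliabilities for one test, estimated by two different methods.
# The score SD of 10 is assumed for illustration.
estimates = {
    "internal consistency": 0.90,
    "parallel forms on different occasions": 0.80,
}

for method, reliability in estimates.items():
    sem = standard_error_of_measurement(10.0, reliability)
    print(f"{method}: reliability={reliability:.2f}, SEM={sem:.2f}")
```

The same test and the same scores yield a different standard error for each reliability estimate, so a reported standard error is interpretable only when the method behind the reliability is known.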

Sources of variance include differences in settings, administrators, condition of the examinees, specific test content, scoring processes, and variability of scores within the sample. Different sources of error are tapped depending on whether data are collected on one occasion or more than one, on the same or different test material, and by a single scorer or several. In addition, when the reliabilities of different tests are assessed with different populations, the reliabilities are not comparable.

Computer test analysis programs often provide reliability estimates based on the Kuder-Richardson 20 coefficient, which is the special case of Cronbach's alpha for test items scored 0 or 1 (wrong or right). Data are collected at one time on one set of test material with a single scoring process. In this internal consistency method, the only source of error considered is variability of test content. In addition, this procedure assumes the test measures only one factor. It yields an underestimate if the test covers more than one factor.
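
As a rough illustration of why KR-20 is described as a special case of alpha, the sketch below computes both from the same small matrix of right/wrong item scores. The function names and the response data are made up for this example. Both functions use population variances (dividing by the number of examinees), which is the convention under which the variance of a 0/1 item equals p(1 - p).

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Alpha from a list of examinee rows, each a list of item scores."""
    k = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def kr20(scores):
    """KR-20 for items scored 0 (wrong) or 1 (right)."""
    n = len(scores)
    k = len(scores[0])
    # p is the proportion answering each item correctly; p * (1 - p) is its variance.
    proportions = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(p * (1 - p) for p in proportions)
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Made-up responses: 6 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]

print(f"KR-20: {kr20(responses):.4f}")
print(f"alpha: {cronbach_alpha(responses):.4f}")
```

Both numbers come from a single administration of a single set of items, which is why neither reflects error due to occasions or scorers.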

Correlating scores from parallel forms given at different times covers more sources of error. The different occasions allow variability in the setting and the condition of examinees. Parallel forms contain different test material. The contribution of several sources of error leads to a lower reliability estimate than if only one of those sources is considered.

Assessing reliability of scores requires a repeated measures analysis of variance design, which produces an intraclass correlation. Descriptions of generalizability theory discuss assessment of various sources of error, based on an analysis of variance procedure.
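
The text does not say which intraclass correlation is intended, so the following is a sketch under assumptions rather than a prescribed procedure. It takes a complete table of ratings (examinees in rows, scorers in columns), computes the mean squares of a two-way analysis of variance, and returns two common single-rating coefficients: a consistency form, often labeled ICC(3,1), and an absolute agreement form, often labeled ICC(2,1). The function name and the ratings are illustrative only.

```python
def intraclass_correlations(ratings):
    """Return (consistency, agreement) ICCs for a complete ratings table.

    ratings: list of rows, one per examinee, each holding one score per scorer.
    """
    n = len(ratings)       # number of examinees
    k = len(ratings[0])    # number of scorers
    grand_mean = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares with one observation per cell.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Consistency ignores mean differences among scorers; agreement counts them as error.
    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return consistency, agreement


# Illustrative ratings: 6 examinees (rows) scored by 4 scorers (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

consistency, agreement = intraclass_correlations(ratings)
print(f"consistency ICC: {consistency:.3f}")
print(f"agreement ICC:   {agreement:.3f}")
```

The two coefficients differ because the agreement form treats differences in scorer leniency as a source of error while the consistency form does not. This is one more way the same data can yield more than one reliability.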

Statistics Enigma: Higher Validities Are Not Always Better

The test with the highest validity coefficient may not be the most valid. Many circumstances besides the strength of the relationship between the variables affect the size of a correlation coefficient. Principal among these influences are differences across samples in the reliability and the variability of the measures.

Unreliability of the criterion will lower validity. Reliability can be defined as the correlation of a variable with itself. A variable that does not correlate with itself cannot be expected to correlate with anything else. If a criterion is measured less accurately for one group than for another, the lower accuracy will attenuate the correlation. A validity coefficient may also be lower if the test is correlated with a less relevant criterion. Equally relevant criteria may also differ in reliability or variability, leading to disparate validities.
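
The size of this effect can be sketched with the classical attenuation relationship, in which the observed correlation equals the correlation between true scores multiplied by the square root of the product of the two reliabilities. The function name and every number below are hypothetical; they show only how the same underlying relationship can produce different validity coefficients in two groups.

```python
import math


def attenuated_validity(true_validity, test_reliability, criterion_reliability):
    """Observed validity expected when both measures contain random error."""
    return true_validity * math.sqrt(test_reliability * criterion_reliability)


# Same assumed true-score relationship (.50) and test reliability (.90) in both
# groups; only the accuracy of the criterion measure differs.
for group, criterion_reliability in (("group A", 0.80), ("group B", 0.50)):
    observed = attenuated_validity(0.50, 0.90, criterion_reliability)
    print(f"{group}: criterion reliability={criterion_reliability:.2f}, "
          f"observed validity={observed:.2f}")
```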

A pre-selected group has less variability than a general group. In a multiple-hurdle selection process, applicants who have passed the first hurdle will likely be homogeneous on many variables relevant to selection. A validity coefficient based on those who passed the first screen will be lower than a validity coefficient computed from the scores of all applicants. When validity coefficients for two groups are different, the difference may be attributable to differences in variability.
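
How much a validity coefficient shrinks depends on how much the selection narrows the spread of test scores. A minimal sketch, assuming selection is made directly on the test and that the regression of the criterion on the test is linear with constant error variance, is given below. The function name and the standard deviation ratios are illustrative assumptions.

```python
import math


def restricted_validity(unrestricted_r: float, sd_ratio: float) -> float:
    """Validity expected in a group selected directly on the test.

    sd_ratio is the test score SD in the selected group divided by the SD
    among all applicants; values below 1 mean the group is more homogeneous.
    """
    r, u = unrestricted_r, sd_ratio
    return r * u / math.sqrt(1.0 - r * r + r * r * u * u)


# A validity of .60 among all applicants, viewed in increasingly
# homogeneous selected groups.
for sd_ratio in (1.0, 0.8, 0.6, 0.4):
    print(f"SD ratio {sd_ratio:.1f}: "
          f"validity {restricted_validity(0.60, sd_ratio):.2f}")
```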

Suppose you use a multiple-choice test as a screen and test the top ten percent with a performance test. For the whole group, the validity of the screen is .60; for those with passing scores, its validity is .20. For the latter group, the performance test validity is .25. Which test is more valid? You can't tell from these validity estimates. The .25 validity for the performance test cannot be compared with either the .60 or the .20.

Although the same people are involved in the computation of the .20 and the .25, the two coefficients cannot be considered to be based on equivalent samples. Tossing out all low scores on the screening test explicitly restricted its variance. If the people were retested, some would receive below-passing scores on the screening test. The performance test variance is restricted only to the extent that it correlates with the screening test. Therefore, both validities are restricted, the multiple-choice validity more so than that of the performance test. The best comparison of validities would come from administering both tests to a new sample whose selection is independent of either test.
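
A small simulation can illustrate the difference between explicit and incidental restriction. Everything in this sketch is assumed for the sake of the example: the criterion is a standard normal variable, the screen correlates .60 with it among all applicants, the performance test correlates .50 with it, the two tests are related only through the criterion, and the top ten percent on the screen pass. The function names and the seed are arbitrary.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_applicants=20000, screen_validity=0.60,
             performance_validity=0.50, pass_fraction=0.10, seed=1996):
    """Return (screen validity, performance validity) for each group."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_applicants):
        criterion = rng.gauss(0, 1)
        screen = (screen_validity * criterion
                  + math.sqrt(1 - screen_validity ** 2) * rng.gauss(0, 1))
        performance = (performance_validity * criterion
                       + math.sqrt(1 - performance_validity ** 2) * rng.gauss(0, 1))
        rows.append((screen, performance, criterion))

    # Explicit selection: keep only the highest scorers on the screen.
    rows.sort(key=lambda row: row[0], reverse=True)
    passed = rows[: int(n_applicants * pass_fraction)]

    def validities(group):
        screen, performance, criterion = zip(*group)
        return pearson(screen, criterion), pearson(performance, criterion)

    return {"all applicants": validities(rows), "passed screen": validities(passed)}


results = simulate()
for group, (screen_r, performance_r) in results.items():
    print(f"{group}: screen validity={screen_r:.2f}, "
          f"performance test validity={performance_r:.2f}")
```

Under these assumptions the screen is the better predictor among all applicants, yet among those who passed it tends to look worse than the performance test, because its variance was restricted directly while the performance test's variance was restricted only indirectly.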

Statistical adjustments may make validities more comparable. Formulas that correct for reliability or range restriction add error as they adjust values. Each statistic involved in the correction formula is a fallible measure. That is, it contains random error. The corrected coefficient compounds the random errors from the various statistics. However, correcting statistics makes them more comparable, in spite of their being less reliable.
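
The trade-off can be seen in a rough Monte Carlo sketch. It assumes a true validity of .60 among all applicants, selection of the top ten percent of 500 applicants on the test, and the common correction for direct range restriction on the predictor (often called Thorndike's Case II). The function names, sample sizes, and seed are illustrative choices, not part of any standard procedure.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correct_for_range_restriction(restricted_r, restricted_sd, unrestricted_sd):
    """Estimate the unrestricted validity after direct selection on the test."""
    k = unrestricted_sd / restricted_sd
    r = restricted_r
    return r * k / math.sqrt(1 - r * r + r * r * k * k)


def one_study(rng, n_applicants=500, true_validity=0.60, pass_fraction=0.10):
    """Simulate one validity study; return (observed, corrected) validity."""
    applicants = []
    for _ in range(n_applicants):
        test = rng.gauss(0, 1)
        criterion = (true_validity * test
                     + math.sqrt(1 - true_validity ** 2) * rng.gauss(0, 1))
        applicants.append((test, criterion))
    applicants.sort(reverse=True)  # highest test scores first
    selected = applicants[: int(n_applicants * pass_fraction)]
    tests, criteria = zip(*selected)
    observed = pearson(tests, criteria)
    # Both standard deviations are sample estimates, so they carry error too.
    corrected = correct_for_range_restriction(
        observed,
        statistics.stdev(tests),
        statistics.stdev([test for test, _ in applicants]),
    )
    return observed, corrected


rng = random.Random(30)
studies = [one_study(rng) for _ in range(300)]
observed_values = [observed for observed, _ in studies]
corrected_values = [corrected for _, corrected in studies]

print(f"observed:  mean={statistics.mean(observed_values):.2f}, "
      f"SD across studies={statistics.stdev(observed_values):.2f}")
print(f"corrected: mean={statistics.mean(corrected_values):.2f}, "
      f"SD across studies={statistics.stdev(corrected_values):.2f}")
```

In runs like this, the corrected coefficients sit much closer to the assumed .60 on average but scatter more widely from study to study than the uncorrected ones, which is the sense in which a corrected coefficient is more comparable yet less reliable.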


© Copyright 1996 by the IPMA Assessment Council. All rights reserved.