A comparison of some recently proposed procedures for detecting the presence of biased test items

Matthew Thomas Schultz, Fordham University


Considerable research effort has been focused on the question of how best to determine the extent to which biased items contaminate multiple-choice exams. Bias has been defined as being present when individuals from different ethnic, racial, or gender groups with similar levels of ability (typically quantified as similar test performance) perform differently, with one group's performance clearly inferior to the other's. Efforts to ensure that exams are free from bias have focused both on removing items prior to test administration and on identifying biased items once an exam has been administered. The present study focused on the latter, specifically the performance of several statistical bias detection procedures under a variety of conditions. A number of statistical procedures have been developed to detect the presence of biased items. These include Item Response Theory (IRT) based methods, item difficulty indices, and variants of the chi-square statistic. Research examining the performance of these methods has noted that, while IRT procedures seem best, their use is limited by the need for large sample sizes and sophisticated computer facilities. The present study examined procedures more widely applicable than IRT due to their relative simplicity and lack of statistical assumptions: Mantel-Haenszel (MH), logit, partial correlation (R), and Transformed Item Difficulty (TID). Research has suggested that in many settings these procedures perform similarly to IRT methods. Majority (White) examinees were contrasted with Minority (Hispanic and African-American) examinees while manipulating sample size and the criterion used for equating on ability (test score versus HS GPA). The iterative use of MH and logit was also assessed. Results suggested that as sample sizes decreased, the number of items flagged as biased also decreased, suggesting that the procedures (except TID) may be sensitive to sample size.
MH and logit demonstrated high agreement across conditions, suggesting that practitioners may use either and obtain similar results. The results of using the procedures iteratively suggested that doing so offers little added precision. The search for an independent criterion on which to match examinees was not successful, though the results did suggest that the procedures performed as they should (demonstrated agreement) even when matching was not optimal. The need to take Type I error rates into account was also demonstrated and remains a subject for future research. The sample-size-dependent nature of the results was disturbing, and further research to delineate the nature of this problem is clearly warranted.
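The Mantel-Haenszel procedure named above can be sketched briefly: examinees are stratified on the matching criterion (e.g., total test score), a 2x2 group-by-correctness table is formed for each stratum, and a common odds ratio and chi-square statistic are pooled across strata. The sketch below uses standard MH formulas with hypothetical strata, not the study's data.

```python
# Minimal sketch of the Mantel-Haenszel DIF statistic: stratify on
# ability, form a 2x2 (group x correct/incorrect) table per stratum,
# and pool. All numbers below are hypothetical illustrations.

def mantel_haenszel(tables):
    """tables: list of (A, B, C, D) per score stratum, where
    A/B = reference group correct/incorrect counts and
    C/D = focal group correct/incorrect counts.
    Returns (common odds ratio alpha_MH, continuity-corrected MH chi-square)."""
    num = den = a_sum = e_sum = v_sum = 0.0
    for A, B, C, D in tables:
        T = A + B + C + D                 # stratum size
        n_r, n_f = A + B, C + D           # group sizes in stratum
        m1, m0 = A + C, B + D             # total correct / incorrect
        num += A * D / T
        den += B * C / T
        a_sum += A
        e_sum += n_r * m1 / T             # expected A under no DIF
        v_sum += n_r * n_f * m1 * m0 / (T * T * (T - 1))
    alpha = num / den                     # alpha_MH near 1 => no DIF
    chi2 = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum
    return alpha, chi2

# Hypothetical low/medium/high score strata for one item
strata = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
alpha, chi2 = mantel_haenszel(strata)
print(round(alpha, 2), round(chi2, 2))
```

An item is typically flagged when the chi-square exceeds its critical value (3.84 at the .05 level for 1 df) and alpha_MH departs notably from 1, which connects to the abstract's point that flagging rates depend on sample size through the pooled variance term.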

Subject Area

Psychological tests; Educational evaluation

Recommended Citation

Schultz, Matthew Thomas, "A comparison of some recently proposed procedures for detecting the presence of biased test items" (1991). ETD Collection for Fordham University. AAI9215356.