Proceedings of The Physiological Society

University College London 2006 (2006) Proc Physiol Soc 3, PC64

Poster Communications

Analysis of exams using certainty-based marking

A R Gardner-Medwin1

1. Physiology, UCL, London, United Kingdom.

Certainty-based marking (CBM) has been used at UCL in 17 summative medical exams (years 1 & 2), each with 250-300 True/False questions and >300 students. Students enter answers on OMR sheets (Speedwell Computing Services) with an index of certainty or confidence that each one is correct. The 3-point scale (C=1,2,3) corresponds to the marks given for correct answers, with penalties 0, -2, -6 for errors. This mark scheme is proper, in the sense that students gain by indicating low C when their probability of being correct is low and high C when it is high. The optimal threshold probabilities are 0.67 and 0.8 for C=2,3. Students were well practised through self-assessments and formative tests with detailed feedback. The aim is to encourage care in the justification of answers and to improve exam data.

CBM and conventional (number-correct: NCOR) scores were both scaled so that 0% = chance performance (at C=1) and 100% = maximum. CBM scores were linearised (raised to the power 0.6), so that the regression of CBM vs NCOR is typically close to the line of equality. Mean scores were CBM=55.0±12.6% SD and NCOR=53.3±12.8% SD.

A measure of exam reliability is Cronbach's alpha, indicating how well the combined data reflect a single variable ('ability') characteristic of the student. This was higher for CBM scores than for NCOR (92.4% vs 88.7%, difference 3.7±0.31% SEM, n=17, P<0.001%). A more intuitive way to view reliability is the correlation between scores from alternate questions: the sets with odd and even numbers. If the data are reliable, then the score on one set is a good predictor of the score on the other. The mean correlation coefficient (r) for CBM was 0.859±0.030 SD, significantly greater than for NCOR (0.814±0.030; difference 0.045±0.0042 SEM, P<0.001%). CBM scores were not only better predictors of CBM on the alternate set, but also better predictors of NCOR (CBM vs NCOR: r=0.829±0.030 SD, greater than NCOR vs NCOR by 0.015±0.0021 SEM, P<0.001%).
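The properness of the mark scheme and the stated thresholds (0.67 and 0.8) can be checked with a short sketch. This is purely an illustration of the arithmetic, not part of any marking software: the expected mark at each C level is a linear function of the probability p of being correct, and the best C changes at p = 2/3 and p = 0.8.

```python
# Expected CBM mark at each certainty level C, given probability p that
# the answer is correct. Marks for a correct answer are C (1, 2, 3);
# penalties for an error are 0, -2, -6, as in the exam scheme.

MARKS = {1: (1, 0), 2: (2, -2), 3: (3, -6)}  # C -> (mark if correct, penalty)

def expected_mark(p, c):
    correct, penalty = MARKS[c]
    return p * correct + (1 - p) * penalty

def best_certainty(p):
    """The C level that maximises the expected mark at probability p."""
    return max(MARKS, key=lambda c: expected_mark(p, c))

# The scheme is 'proper': the optimal C rises with p, with crossovers at
# p = 2/3 (C=1 vs C=2) and p = 0.8 (C=2 vs C=3).
assert best_certainty(0.60) == 1
assert best_certainty(0.70) == 2
assert best_certainty(0.85) == 3
assert abs(expected_mark(2 / 3, 1) - expected_mark(2 / 3, 2)) < 1e-9
assert abs(expected_mark(0.8, 2) - expected_mark(0.8, 3)) < 1e-9
```

Because each expected mark is linear in p, the crossover points follow directly: 4p - 2 > p gives p > 2/3, and 9p - 6 > 4p - 2 gives p > 0.8.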
Improvements were largest for the bottom third of each class, which is critical for standard setting and pass/fail decisions: NCOR alone r=0.428, CBM 0.560 (P<0.001%), NCOR vs CBM 0.460 (P<0.1%). Most students achieve percentages correct within the optimal ranges for each C level. Where students were over- or under-confident (too low or too high a percentage correct at a given C level), upward score adjustments (averaging 1.2%) were applied in the above analysis, calculated by re-assigning C to the optimal level. The proportion of papers where this adjustment exceeded 2% was just 3.1% for over-confidence and 18% for under-confidence. Though such compensation is perhaps generous, it ensures that no student can argue that a fail mark was due simply to poor calibration of confidence. Weak students benefit if they correctly identify their reliable answers, but do not lose out if they fail to do so.
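The calibration adjustment described above can be sketched as follows. This is a hedged reconstruction from the description "re-assigning C to the optimal level" (the function names, data layout, and the assumption that regrading is applied per C level and only when it raises the score are illustrative, not the actual UCL procedure): answers entered at each C level are re-marked at the C that would have been optimal for the fraction correct actually achieved at that level.

```python
# Hypothetical sketch of the upward calibration adjustment: regrade each
# C-level group of answers at the C that matches the student's actual
# fraction correct in that group, keeping only upward changes.

MARKS = {1: (1, 0), 2: (2, -2), 3: (3, -6)}  # C -> (mark if correct, penalty)

def mark(correct, c):
    return MARKS[c][0] if correct else MARKS[c][1]

def optimal_c(frac_correct):
    # Thresholds of the proper mark scheme: 2/3 for C=2, 0.8 for C=3.
    if frac_correct > 0.8:
        return 3
    if frac_correct > 2 / 3:
        return 2
    return 1

def adjusted_score(answers):
    """answers: list of (chosen_c, correct_bool). Returns (raw, adjusted)."""
    raw = sum(mark(ok, c) for c, ok in answers)
    adjusted = raw
    for c in (1, 2, 3):
        group = [ok for cc, ok in answers if cc == c]
        if not group:
            continue
        best = optimal_c(sum(group) / len(group))
        gain = sum(mark(ok, best) - mark(ok, c) for ok in group)
        if gain > 0:  # the adjustment is only ever upward
            adjusted += gain
    return raw, adjusted

# An under-confident student: 90% correct but always entering C=1.
# Regrading those ten answers at C=3 lifts the score from 9 to 21.
raw, adj = adjusted_score([(1, True)] * 9 + [(1, False)])
```

An over-confident student is handled symmetrically: if the fraction correct at C=3 falls below the C=3 range, regrading those answers at a lower C reduces the penalties incurred, again raising the score.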

Where applicable, experiments conform with Society ethical requirements