Examiner Variability in Clinical Assessments: Do Examiner Pairings Influence Candidate Ratings?



Abstract
Background: The reliability of clinical assessments is known to vary considerably, and inter-examiner variability is a key contributor. This may result in significant differences in scores between comparable candidates, a serious challenge in medical education. An approach frequently adopted to avoid this and improve reliability is to pair examiners and ask them to come to an agreed score. Little is known, however, about what occurs when these paired examiners interact to generate a score.

Methods: A fully crossed, quasi-experimental design was employed in which each participant examiner observed and scored candidates' performances in a mock clinical assessment; the observed scores were the dependent variable. The independent variables were the number of examiners, their demographics and their personality. Demographic and personality data were collected by questionnaire. A purposeful sample of medical doctors who examine in the Final Medical examination at our institution was recruited.

Results: Variability between scores given by examiner pairs (N=6) was less than the variability with individual examiners (N=12). Seventy-five percent of examiners (N=9) scored below average for neuroticism, and 75% also scored high or very high for extroversion. Two thirds scored high or very high for conscientiousness. The higher an examiner's personality score for extroversion, the lower the amount of change in his/her score when paired with a co-examiner, possibly reflecting a more dominant role in the process of reaching a consensus score.

Conclusions: While the variability between scores given by examiner pairs (N=6) was less than the variability with individual examiners (N=12), the reliability statistics for both assessments were comparable. Using paired examiners resulted in a more accurate and robust score than simply averaging two independent examiners' scores. These findings could have implications for the organisation and administration of clinical assessments. Further studies with larger numbers of participants might establish whether personality testing could be used to help pair examiners and reduce examiner variability.

Background
To become competent physicians, undergraduate medical students must be assessed not only on factual knowledge but also on communication and clinical skills. The reliability of the clinical assessments used to test these skills, however, is known to be compromised by high levels of variability, i.e. different results on repeated testing 1,2.
Candidate variability, case variability (case specificity) and examiner variability all contribute to the overall variability of a clinical assessment. Candidate variability reflects the difference between candidates and, in the absence of other variables (or error), represents the true variability. Case specificity refers to the phenomenon that a candidate's performance can vary from one case to the next due to differing levels of difficulty or content 2,3. Examiner variability refers to the fact that two examiners observing the same performance may award different scores. Many studies have shown that examiner variability is the most significant factor contributing to variability in clinical examinations 4,5 and may even exceed the variability accounted for by differences in candidates 6. The degree of inter-examiner agreement deemed acceptable is generally a minimum of 0.6, with 0.8 being the gold standard (where 0 shows no relationship between two examiners' scores and 1 is perfect agreement) 7.
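As a minimal sketch of how such an agreement coefficient can be checked against these thresholds (the scores are hypothetical, not study data, and a simple Pearson correlation stands in for the agreement index):

```python
import numpy as np

# Hypothetical marks from two examiners observing the same ten performances
examiner_a = np.array([62, 55, 70, 48, 81, 59, 66, 73, 51, 68], dtype=float)
examiner_b = np.array([60, 58, 72, 45, 79, 61, 63, 75, 49, 70], dtype=float)

# Pearson correlation as a simple index of inter-examiner agreement:
# 0 indicates no relationship between the scores, 1 perfect agreement
r = np.corrcoef(examiner_a, examiner_b)[0, 1]

acceptable = r >= 0.6      # commonly cited minimum
gold_standard = r >= 0.8   # "gold standard" threshold
```

In practice an intra-class correlation or kappa statistic is often preferred over a raw correlation, since correlation ignores systematic hawk-dove offsets between examiners.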
Variability in how examiners score candidates may be consistent, for example, an examiner who always marks candidates stringently (often referred to as a hawk) or an examiner who is consistently lenient (a dove) 3. This kind of consistent examiner behaviour can often be adjusted for when analysing results.
However, examiner variability may not always be so consistent and predictable.
Examiners in clinical assessments are subject to many forms of bias 8. The 'halo effect' refers to the phenomenon where an examiner's overall first impression of a candidate ("he seems like he knows his stuff") leads to a failure to discriminate between discrete aspects of performance when awarding scores 9. In addition, familiarity with candidates, the mood of the examiner, personality factors, and seeing information in advance have all been found to affect examiners' judgements 10,11,12. Variability may result in a borderline candidate achieving a score in the pass range in one assessment while the same candidate fails a comparable assessment testing the same or similar competencies. In high-stakes examinations, such as medical licensing examinations, this can have serious implications for the candidate, the medical profession and even society in general. Moreover, pass/fail decisions are now increasingly being challenged 13.
Efforts to reduce variability in clinical assessments have ranged from utilising higher numbers of stations in Objective Structured Clinical Examinations (OSCEs) to employing objective checklists 2,14. Many of these approaches have not been found to make any meaningful improvement to reliability 15. However, increasing the number of observations in an assessment (by involving more examiners in the observation of many performances) has been shown to improve reliability 16. In their evaluation of the mini-clinical exercise used in US medical licensing examinations, Margolis and colleagues stated that having a small number of raters rate an examinee multiple times was not as effective as having a larger number of raters rate the examinee on a smaller number of occasions, and that more raters enhanced score stability 6. Consequently, an approach frequently adopted to improve reliability and limit the impact of inter-examiner variability is to pair examiners and ask them to come to an agreed score for a candidate's performance. Little is known, however, about what occurs when these paired examiners interact to generate a score.

Summary of existing literature
Although the hawk-dove effect was described by Osler as far back as 1913 17, it remains relevant: a more recent study identified one examiner as a hawk 18. There was a significantly lower pass rate in the group of candidates assessed by this examiner compared with the remainder (46.3% and 66.0% respectively).
In 2006, an analysis of the reliability of the MRCP(UK) clinical examination that existed at that time, the Practical Assessment of Clinical Examination Skills (PACES), found that 12% of the variability in this examination was due to the hawk-dove effect 19. Examiners were more variable than stations.
In 2008, Harasym et al. 20 found an even greater effect due to the hawk-dove phenomenon in an OSCE evaluating communication skills. Forty-four percent of the variability in scores was due to differences in examiner stringency/leniency, over four times the variance due to student ability (10.3%).
As mentioned above, many types of rater bias are known to be at play when human judgement comprises part of any assessment process (the halo effect, the mood of the rater, familiarity with candidates, personality factors, etc. 8,9,10,11). Yeates and colleagues in 2013 proposed three themes to explain how examiner variability arises 21. They termed these: differential salience (what was important to one examiner differed from another); criterion uncertainty (assessors' conceptions of what equated to competence differed and were uncertain); and information integration (assessors tend to judge in their own unique descriptive language, forming global impressions rather than discrete numeric scores).
Govaerts suggests that some examiner variability may simply arise from individual examiners' peculiarities in approach and the idiosyncratic judgements made as a result of the interaction between social and cognitive factors 12.
Earlier reports had suggested that employing objective checklists would help overcome examiner variability by regulating subjectivity 2. More recently, however, several lines of evidence suggest that global judgements produce more reliable results than highly structured tools 4,14. Furthermore, measurement instruments have been shown to account for less than 8% of the variance in performance ratings 22.
Other proposals to improve reliability have involved increasing the number of items used per station.
However, Wilkinson et al. analysed examiners' marks over a four-year period in New Zealand and found that while items-per-station increased over the four years, there was no correlation between items-per-station and station inter-rater reliability 4.
The impact of examiner training has also been examined in many studies 23. Cooke et al. 24 found no significant effect, and while Holmboe et al. 25 showed that training produced an increase in examiner stringency, this increase was inconsistent.
In a recent literature review on rater cognition in competency-based education, Gauthier et al. 15 summarised the situation, stating: "attempts to address this variability problem by improving rating forms and systems, or by training raters, have not produced meaningful improvements".

Setting and characteristics of participants
The study population consisted of qualified medical doctors who examine in the final medical short-case examination at our institution. Participants were invited by email and each received a participant information leaflet, electronic consent form and demographic questionnaire.

Description of all processes, interventions and comparisons
In the final medical examination at our school, medicine and surgery are assessed together in a short-case examination. Each candidate is assessed over 6 short cases, a mixture of medical and surgical cases, each lasting 6 minutes and using a real or simulated patient. Candidates are observed by pairs of examiners, usually a surgeon paired with a physician. After each candidate's performance, the examiners discuss and come to an agreed score using a domain-based marking sheet. Our data collection exercise was set up to mimic this real-world examination scenario as closely as possible using recordings of simulated patients.
Participants were stratified to mimic the examiner pairings usually employed (a surgeon with a physician). The participants did not assess a real student's performance; instead we used video recordings of standardised student performances (using actors) that had previously been created for the purposes of examiner training. We selected 3 videos as follows: one example each of a weak, average and good performance. Examiners were not aware of what level of performance they would be watching.
Different case types were selected (one medical, one surgical and one general medical/surgical) to avoid one examiner being more familiar than the others with the content of the selected cases. Each participant viewed, initially on their own individual screen, the three recordings and graded them independently. The total possible score at each station was 50 marks, with ten marks allocated to each of five separate domains: attitude and professionalism, communication skills, clinical skills, knowledge and, lastly, management. Our school's OSCE Management Information System software (OMIS, Qpercom Ltd) was used to enter marks. Using this software, examiners were blinded to their individual scores for a given performance: when an examiner scored the performance across the five domains, the scores were entered on a slider and the examiner did not see the resultant overall mark produced by combining the 5 domains.
After the examiners had scored the videos independently there was a break for refreshments. Examiners then completed a validated 60-item personality questionnaire, the NEO Five Factor Index (NEO-FFI) 27. In this personality index, no single cut-off point separates those who "have" a particular personality trait from those who do not; rather, individual scores represent degrees of each of the five main personality traits: neuroticism, extroversion, openness to experience, agreeableness and conscientiousness. Results are usually expressed as a T score and can be further described as very low, low, average, high or very high for each of the domains. After completing the personality questionnaire, examiners were moved to a neutral location and paired up with another examiner to review and discuss the same three performances again, this time devising a joint mark which was entered on OMIS. The order of the videos when watched by individual examiners compared with when observed in pairs was counterbalanced to control for an order effect 28. Blinding the participants to the original overall scores given, and changing the order of videos from the previous observation, were particularly important to maintain internal validity. We looked for a correlation between the total amount of change in an examiner's marks from when they examined individually to when they examined in a pair, and their personality scores (see figure 1 for scatter plots).

Statistical analysis
Data collected on candidate scores were analysed using the OMIS OSCE management software and SPSS 24 (IBM Corp). Preliminary analyses confirmed that the data were not normally distributed and, therefore, non-parametric methods were employed in the statistical analysis. Descriptive statistics were generated using tables and charts. The OMIS OSCE management software allowed for psychometric analysis and provided support for generalisability analysis.
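A preliminary normality check of this kind can be sketched as follows; this is an illustrative re-creation in Python/SciPy rather than the SPSS procedure actually used, and the scores are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical overall scores awarded by twelve examiners to one performance
scores = np.array([44, 38, 52, 41, 47, 35, 49, 43, 40, 46, 72, 31], dtype=float)

# Shapiro-Wilk tests the null hypothesis that the sample came from a normal
# distribution; a small p-value is a signal to prefer non-parametric methods
stat, p = stats.shapiro(scores)
use_nonparametric = p < 0.05
```

With a small sample like this, non-parametric methods (e.g. the Wilcoxon signed-rank test used later) are a conservative default even when the normality test is inconclusive.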

Results
Fifty potential participants were contacted by email and invited to participate. Seventeen respondents accepted the invitation and twelve completed the study: 10 male and 2 female. They had an average of 13.6 years' experience examining in the final medical short-case examination at our institution. Two thirds were in posts that combined clinical and academic roles. Only two participants held formal qualifications in medical education.
Variability
Table 1 and figure 2 show the overall scores awarded by each examiner to the three candidates when examining alone and demonstrate considerable variability in examiners' scores. Table 2 and figure 3 show the overall scores awarded by examiners in pairs to the three candidates. The ranges and standard deviations reveal that the variability between scores given by examiner pairs is, as might be expected, less than that in the assessment using 12 individual examiners.
Generalisability analysis allows for a more in-depth analysis of the variance of our assessments, identifying the relative contribution of each of the components (or facets) of that assessment: the examiners (observations, O), the scenarios (S) and their interaction (SO). In the assessment using individual examiners, 87.1% of variance was found to be due to examiners while 12.9% was due to the interaction between the examiner and the scenario (table 3).
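The variance partition behind such a G-study can be sketched for a fully crossed examiner-by-scenario design with one score per cell; the score matrix below is illustrative, not the study's data:

```python
import numpy as np

# Illustrative score matrix: rows = examiners (O), columns = scenarios (S)
scores = np.array([
    [78.0, 50.0, 35.0],
    [70.0, 43.0, 28.0],
    [63.0, 36.0, 22.0],
    [84.0, 56.0, 43.0],
])
n_o, n_s = scores.shape
grand = scores.mean()

# Sums of squares for the two facets and their interaction (residual)
ss_o = n_s * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_s = n_o * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_total = ((scores - grand) ** 2).sum()
ss_os = ss_total - ss_o - ss_s

# Mean squares, then expected-mean-square estimates of variance components
ms_o = ss_o / (n_o - 1)
ms_s = ss_s / (n_s - 1)
ms_os = ss_os / ((n_o - 1) * (n_s - 1))
var_os = ms_os                            # interaction/residual component
var_o = max((ms_o - ms_os) / n_s, 0.0)    # examiner component
var_s = max((ms_s - ms_os) / n_o, 0.0)    # scenario component
```

The relative sizes of `var_o`, `var_s` and `var_os` correspond to the percentage contributions reported in table 3; in a design with one observation per cell the interaction is confounded with residual error.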

Reliability
Using classical test theory, Cronbach's alpha and intra-class correlation coefficients were calculated for the assessment using 12 single examiners and for the second assessment using 6 examiner pairs. The reliability statistics for the two assessments were in fact comparable (table 4).
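Cronbach's alpha for a candidates-by-raters score matrix can be computed directly; a minimal sketch in Python with illustrative numbers:

```python
import numpy as np

def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a (candidates x raters) matrix of scores."""
    x = np.asarray(score_matrix, dtype=float)
    k = x.shape[1]                         # number of raters ("items")
    item_vars = x.var(axis=0, ddof=1)      # variance of each rater's column
    total_var = x.sum(axis=1).var(ddof=1)  # variance of candidates' totals
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical marks: three candidates (rows) scored by three raters (columns)
alpha = cronbach_alpha([[72, 80, 68],
                        [45, 50, 55],
                        [30, 28, 35]])
```

With only three candidates, as here, alpha is unstable; the study's tables report it alongside intra-class correlations for that reason.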
Using generalisability theory, the G-coefficient of the assessment using 12 individual examiners was calculated as 0.95. The standard error of measurement (SEM) was 4.5% (see table 5), which means the candidate's true score lies within +/-4.5% of the observed score. This is quite a high margin, which would have significant consequences for marks around the pass/fail and honours/pass thresholds. However, our decision study (D-study), which indicates what happens to the reliability and SEM of an assessment if the number of scenarios is increased, showed that increasing the number of scenarios from 3 to 12 would reduce the SEM to a more acceptable level of 2% (see table 6).
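The relationship between the SEM, the G-coefficient and the number of scenarios in a D-study can be sketched as follows. The variance components below are assumptions, not values from the paper; they were chosen so that the three-scenario design reproduces the reported G-coefficient of 0.95 and SEM of 4.5%:

```python
import math

# Assumed variance components on a percentage-score scale (illustrative only)
var_candidate = 384.75   # true-score variance between candidates
var_error_one = 60.75    # error variance for a single scenario

def d_study(n_scenarios):
    """G-coefficient and SEM when scores are averaged over n scenarios."""
    error = var_error_one / n_scenarios   # averaging shrinks the error variance
    g = var_candidate / (var_candidate + error)
    sem = math.sqrt(error)
    return g, sem

g3, sem3 = d_study(3)     # the study's design: 3 scenarios
g12, sem12 = d_study(12)  # D-study projection: 12 scenarios
```

With these assumed components, moving from 3 to 12 scenarios halves the SEM, mirroring the direction of the reduction reported in table 6.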

Impact of pairing up on Candidates' score/outcome
We compared candidates' scores when they were examined by 12 individual examiners with their scores when they were examined by 6 examiner pairs (see tables 1 and 2). The 'good' performance was awarded an honour by all 12 individual examiners and all 6 examiner pairs. Similarly, the weak performance was failed by all examiners. However, when examined by individual examiners, the average performance was awarded 4 passes and 6 borderline results (between 40 and 49%) and was failed by 2 examiners. When assessed by examiner pairs, the average performance was not failed on any occasion but received 4 borderline marks and 2 passes. A Wilcoxon signed-rank test showed a statistically significant difference between mean scores for the average student (p = 0.0430).
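A paired comparison of this kind can be illustrated with a Wilcoxon signed-rank test; the score vectors below are hypothetical stand-ins for the study's data, not its actual marks:

```python
import numpy as np
from scipy import stats

# Hypothetical marks for the 'average' performance: each examiner's individual
# mark paired with the agreed mark from the same examiner's pairing
individual = np.array([44, 38, 52, 41, 47, 35, 49, 43, 40, 46, 39, 50], float)
paired     = np.array([46, 42, 51, 45, 46, 40, 50, 45, 44, 47, 43, 49], float)

# Non-parametric test for a systematic difference between the paired samples
stat, p_value = stats.wilcoxon(individual, paired)
significant = p_value < 0.05
```

The test ranks the absolute paired differences and compares the positive and negative rank sums, so it needs no normality assumption, which suits the small, non-normal sample described in the statistical analysis.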

How each examiner's marks changed when they were paired up
The marks given by each examiner when they examined singly were compared with the agreed mark given by the same examiner to each candidate when examining in a pair. The amount of change in each examiner's overall mark for the three candidates was calculated. Table 7 shows the change in examiners' marks and the direction of that change (a minus sign indicates their mark reduced when they paired up). The amount of change (regardless of whether positive or negative) for each examiner was then summed across the three candidates to derive a figure representing the total amount of change in marks per examiner.
There was a statistically significant negative correlation (-0.808) between extroversion and change in examiners' scores: the higher an examiner's score for extroversion, the lower the degree of change in his or her score when paired up with a co-examiner (p = 0.001) (see table 8).
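A correlation of this kind can be reproduced in outline with Spearman's rank correlation; both vectors below are invented for illustration (extroversion T-scores and the total absolute change in marks per examiner), not the study's data:

```python
import numpy as np
from scipy import stats

# Hypothetical data for twelve examiners
extroversion_t = np.array([65, 70, 58, 72, 55, 68, 60, 75, 52, 63, 57, 69], float)
total_change   = np.array([ 4,  2, 10,  1, 12,  3,  8,  0, 14,  6, 11,  2], float)

# Spearman's rho: a negative value means more-extroverted examiners changed
# their marks less when reaching a consensus score
rho, p = stats.spearmanr(extroversion_t, total_change)
```

Spearman's rho is a sensible choice here because it depends only on ranks, so it tolerates the small sample and makes no assumption about the shape of either distribution.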

Discussion
Our study shows that there is less examiner variability, and therefore improved reliability, when using examiner pairs. Using paired examiners resulted in a more accurate and robust score than simply averaging two independent examiners' scores. The average performance was passed by all examiner pairs; however, two examiners failed this candidate when examining individually (p = 0.0430). This has implications for candidate outcomes. The correlation between the degree of change in an examiner's mark and their score for extroversion suggests personality traits do have an impact on examiner behaviour and candidate outcomes.
Comparing the marks given by examiners in pairs to the marks they previously gave when examining alone proved revealing (see table 7). In no instance did the new mark simply equate to the mean of, or midpoint between, the two individual examiners' marks. Instead, in each case, the marks awarded by examiner pairs tended towards one examiner's original mark rather than the other's: the 'dominant' examiner, if you will. In 5/6 pairs this 'dominant' examiner was a physician. All of the physicians scored high or very high for extroversion, and we found a statistically significant correlation between change in examiner score and extroversion: the higher an examiner's score for extroversion, the lower the amount of change in his or her score when paired up (p = 0.001). This is perhaps not surprising, as extroverts are described as assertive and talkative, two characteristics which would certainly enable an examiner to "stand their ground", as it were.
Our sample confirmed the findings of previous studies that, in personality testing, doctors tend to score low for neuroticism and high for extroversion 29. We did not find any relationship between examiner personality and stringency, as was found in a previous study in our school 17.
Our findings support the opinion that the score of examiner pairs may be more accurate and robust than a simple average of two independent examiners' scores. This could have implications for the organisation and administration of clinical assessments. Further study with a larger number of participants might establish whether personality testing before choosing examiner pairs is warranted.
Limitations: Recruitment of participants proved difficult and so our sample was small, with only a small number of female participants. It could be argued that there was a learning or testing effect in the set-up of our mock examination, whereby the examiners assessed the same performances twice. Ideally, we would have used a larger number of video recordings to avoid compromising the internal validity of the study in this way; however, increasing the length of the process would have made recruitment even more difficult. Some investigators raised concerns about the recording of participants' discussions giving rise to the "Hawthorne effect", where the awareness of being observed impacts on research participants' behaviour 30; however, a review of the literature found very little empirical support for this effect in medical education 31.

Conclusions
Our study shows that the practice of using paired examiners in clinical assessments is to be recommended. While using paired examiners may require more resources, in the context of high-stakes assessments and an increasingly litigious society, grades awarded by examiner pairs follow robust discussion and can therefore be more easily defended in the case of appeals.

Figure 1: Scatter plots for the correlation between change in examiners' marks and personality
Variability of overall scores - Individual Examiners
Variability of overall scores - Paired Examiners