Assessment of examiner leniency and stringency ('hawk-dove effect') in the MRCP(UK) clinical examination (PACES) using multi-facet Rasch modelling
© McManus et al; licensee BioMed Central Ltd. 2006
Received: 24 March 2006
Accepted: 18 August 2006
Published: 18 August 2006
A potential problem of clinical examinations is known as the hawk-dove problem, some examiners being more stringent and requiring a higher performance than other examiners who are more lenient. Although the problem has been known qualitatively for at least a century, we know of no previous statistical estimation of the size of the effect in a large-scale, high-stakes examination. Here we use FACETS to carry out a multi-facet Rasch modelling of the paired judgements made by examiners in the clinical examination (PACES) of MRCP(UK), where identical candidates were assessed in identical situations, allowing calculation of examiner stringency.
Data were analysed from the first nine diets of PACES, which were taken between June 2001 and March 2004 by 10,145 candidates. Each candidate was assessed by two examiners on each of seven separate tasks, with the candidates assessed by a total of 1,259 examiners, resulting in a total of 142,030 marks. Examiner demographics were described in terms of age, sex, ethnicity, and total number of candidates examined.
FACETS suggested that about 87% of main effect variance was due to candidate differences, 1% due to station differences, and 12% due to differences between examiners in leniency-stringency. Multiple regression suggested that greater examiner stringency was associated with greater examiner experience and being from an ethnic minority. Male and female examiners showed no overall difference in stringency. Examination scores were adjusted for examiner stringency and it was shown that for the present pass mark, the outcome for 95.9% of candidates would be unchanged using adjusted marks, whereas 2.6% of candidates would have passed, even though they had failed on the basis of raw marks, and 1.5% of candidates would have failed, despite passing on the basis of raw marks.
Examiners do differ in their leniency or stringency, and the effect can be estimated using Rasch modelling. The reasons for differences are not clear, but there are some demographic correlates, and the effects appear to be reliable across time. Account can be taken of differences, either by adjusting marks or, perhaps more effectively and more justifiably, by pairing high and low stringency examiners, so that raw marks can be used in the determination of pass and fail.
An examiner for the MRCP(UK) clinical examination, PACES, in an informal, personal account of examining, wrote:
"Outside, seagulls, starlings, and sparrows, and the occasional blackbird, come and go. Inside, there are hawks and doves." 
Clinical examinations require, to a large extent, that judgements of candidates are made by experienced examiners. A potential vulnerability of any clinical examination is that examiners differ in their relative leniency or stringency. Traditionally this is known as the 'hawk-dove' effect, hawks tending to fail most candidates because of having very high standards, whereas doves tend to pass most candidates. Indeed so notorious is the problem that some individual examiners, such as Professor Jack D. Myers ("Black Jack Myers") in the United States, have become famous in their own right as notorious hawks . Although the problem of hawks and doves is easy enough to describe, finding an effective statistical technique for assessing it is far from straightforward.
The hawk-dove nomenclature has itself been criticised (although it must be said that the terms hawk and dove are well-known in the literature, e.g. [3–10]). Alternative suggestions have included 'stringent' and 'lenient', and from a different perspective there is a suggestion that examiners can either be 'candidate centred' (i.e. their sympathies are primarily with the candidates, of whom they wish to pass as many as possible) or 'patient centred' (i.e. their primary aim is to maintain clinical standards at a high level so that patients are protected and provided with competent doctors).
A slightly different approach to naming refers to 'examiner specificity', a candidate's marks depending on the particular examiner(s) they happen to see. The name suggests that this concept is similar to 'case specificity', in which, because candidates are not equally proficient at all clinical tasks, they have areas of weakness and strength, and hence can get lucky or unlucky in the particular cases they happen to see, sometimes seeing cases with which they are familiar and other times seeing cases with which, for a host of reasons, they are unfamiliar. Case specificity is said to be found in a wide range of assessment contexts (see e.g. [11–16]), although an important recent study suggests that much case specificity may actually be variance due to items within cases rather than cases per se. However, it is not clear that the parallel between case specificity and examiner specificity is in fact appropriate. The key feature of case specificity is that it is a variation in a candidate's ability across different types of case, and so examiner specificity should also refer to a variation in hawkish-dovishness according to the particular case. That, though, is not what we are referring to here (although it could be analysed); instead we are considering only an examiner's overall propensity for being strict or lenient (in the same way as the overall candidate effect looks at their overall propensity to be correct or incorrect). We will not therefore use the term 'examiner specificity'.
None of the terms is entirely satisfactory, but the hawk-dove nomenclature has the advantage of being in use for at least three decades, and being an effective and easy metaphor (and one which is used in several other areas of science as well, as for instance in game theory and evolutionary biology). Leniency and stringency are however somewhat less emotional descriptors, and we will therefore use the terms leniency and stringency while discussing statistical results, but will also use hawk and dove on occasion when they are useful metaphors in discussion. We must emphasise that when we use the latter terms they should be seen as extremes on a continuum, rather than as discrete classes of individuals (although one does occasionally see comments implying the latter, such as in a surgery examination where it was suggested that "the ratio of hawks to doves is said to be 9:1 or 8:2, so expect at least one examiner of the ten that you meet to appear as 'smiling death'", or in the phrase that, "comparing results across examiners shows that we tend to be either 'hawks' (marking hard) or 'doves' (marking easily)"). However, just as most people are neither extraverts nor introverts, and are neither tall nor short, but instead are somewhere in the middle of the range, so it is likely that most examiners are somewhere between the extremes of being a hawk or dove, and hence are in the mid-range of stringency-leniency. Of course, once stringency-leniency becomes measurable then the shape of the distribution becomes an empirical matter, and will be discussed below.
Although the problem of hawks and doves in medical examination is often mentioned, there are relatively few statistical analyses of the problem (although there is some work within medicine [21, 22] and elsewhere [23–25]). An early example of a statistical analysis looking at hawks and doves is to be found in a paper from 1974 which describes a previous major revision of the MRCP(UK) . It considered 10 examinations, taken by 2269 candidates and in whom the overall pass rate was 62.8%. Each candidate was seen by two examiners and together the two examiners produced an agreed mark. "Examiner X" had examined 367 candidates (with 10 different other examiners), and only 46.3% of those candidates had passed the exam, a highly significant difference from the 66.0% pass rate in the remaining candidates (assuming, as the paper says, that candidates were effectively allocated to examiners at random). The paper concludes, "There can be little doubt that X was a 'hawk' whose influence on his colleagues was such as to lower the pass rate for the candidates he examined substantially below the expected level" .
The statistical identification of hawks and doves is not straightforward. At first sight it might seem that examiners could be compared on the average marks they award, with those giving higher marks being classified as doves, and those giving lower marks being classified as hawks. That however assumes that all other things are equal, which is unlikely to be the case. Examiners do not all see the same candidates (and it is possible that candidates in some centres may be less competent than those in other centres). Stations can also differ in difficulty, and examiners do not examine an equal number of times on each station, so that examining more often on difficult stations might artefactually make an examiner appear to be more hawkish. In this paper we wish to describe a statistical analysis of a large number of candidates who have taken PACES, the clinical examination of the MRCP(UK), in which we use multi-facet Rasch modelling to identify examiner effects.
The examination for the Membership of the Royal Colleges of Physicians of the UK (MRCP(UK)) has always included a clinical examination. In the past the examination took a very traditional format of one long case, several short cases, and an oral examination . In June 2001 the examination was radically restructured into the Practical Assessment of Clinical Examination Skills (PACES) . Before taking the examination, candidates must have passed the Part 1 and Part 2 written examinations, which assess clinical knowledge and applied biomedical science. Selection, training and monitoring of examiners is provided, as described in a document provided by the Colleges [see Additional File 1].
Details of the examination are given in the Method section below, but here it will suffice to say that each candidate receives two separate marks on each of seven different clinical activities. The key to understanding the assessment of examiner stringency in the PACES examination is to realise that each candidate on each station is always seen by two examiners. The two examiners observe the identical clinical encounter at the same time: the candidate, the centre, the patient or simulated patient being seen, the clinical task, and the words spoken by the candidate and the examiners are all identical. The only thing that differs is the two examiners themselves. If one examiner is more stringent than the other then they will systematically tend to give a lower mark.
If examiners A and B assess together on a number of occasions then a comparison of their paired marks gives an index of their relative stringency. If subsequently B examines with C and then C examines with D, then the paired comparisons allow each of the four examiners to be placed in order, with estimates of the standard errors of their relative stringency. This design is, in effect, an incomplete paired comparison design, and the statistical analysis by the Bradley-Terry-Luce model has been explored for many years [29–31]. In the context of professional sport such models are routinely used for assessing the international ranking of tennis players and chess players based on who has played and beaten whom. The methods are also equivalent to the calculations used in the class of models developed by Georg Rasch (1901–1980), now known as Rasch models [32–34], and which are routinely used for assessing the performance of questions and candidates in a wide range of examinations. In general Rasch modelling is straightforward because each candidate will answer every examination question, and item and candidate scores can readily be calculated. That feature is not however necessarily present for assessing examiner effects.
A potential problem for applying Rasch models to examiner stringency is the concept of 'linkage' or 'relatedness'. In a particular diet of an exam, examiners A, B, C and D may have examined together as described above, because they were all working together in a particular centre. At another centre, examiners E, F, G and H may also be working together, and hence an estimate of their relative stringency can also be calculated. However geographical separation, coupled with the practicalities of a single examination at a single point in time, means that none of A, B, C and D ever examines with any of E, F, G and H. The two sets of results from the different centres are therefore not linked, and no estimates can be calculated of the relative stringencies of the entire set of examiners.
A solution to the problem of linkage is found if some examiners examine on several different diets at different centres. If on the next diet, E travels to the other centre and examines with examiner A, then the minimal condition is met for all eight examiners being linked, and a joint analysis of the two diets can rate the stringency of all examiners. The analysis described here considers the first nine diets of the PACES examination, and it will be shown that sufficient examiners have examined with enough other examiners for there to be linkage.
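The linkage condition can be illustrated with a short sketch. It assumes a simplified view in which only examiner pairings matter (FACETS itself tests connectedness across all facets), and the function name `linked_subsets` and the pairings are hypothetical, chosen to mirror the A to H example above.

```python
from collections import defaultdict


def linked_subsets(pairs):
    """Group examiners into connected subsets, given who has examined with whom."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for examiner in list(parent):
        groups[find(examiner)].add(examiner)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical pairings at two geographically separate centres.
centre_1 = [("A", "B"), ("B", "C"), ("C", "D")]
centre_2 = [("E", "F"), ("F", "G"), ("G", "H")]

# Within one diet the two centres form two disjoint subsets.
print(len(linked_subsets(centre_1 + centre_2)))

# One extra pairing (E examining with A) links all eight examiners.
print(len(linked_subsets(centre_1 + centre_2 + [("E", "A")])))
```

A single cross-centre pairing is the minimal condition for linkage; in practice more links would be expected to give more stable estimates of relative stringency.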
The statistical analysis described here uses the program FACETS  which carries out Rasch modelling for constructing linear measures from qualitatively ordered counts in multi-facet data. To summarise succinctly, the relationship between a conventional Rasch model (which is now commonly used to analyse the results of examinations) and FACETS, is similar to that of the relationship between simple regression and multiple regression. In simple regression one asks how an outcome variable, such as blood pressure, is related to a background (or independent) measure such as age, whereas multiple regression allows one to see how an outcome measure relates to several background variables, such as age, height, serum cholesterol, and so on. Similarly, while a Rasch model shows how the probability of answering an item correctly on an examination relates to the difficulty of an item and the ability of a candidate, with FACETS one can assess how the probability of answering an item not only relates to item difficulty and candidate ability, but also to a range of background factors, including characteristics of examiners and the nature of the assessment. FACETS, most simply, is therefore a multivariate generalisation of Rasch modelling. That can be seen more clearly in a formal mathematical model.
The Rasch model
The basic Rasch model considers only a group of n candidates, who each have an ability, $C_i$ ($i = 1, \dots, n$), and who each takes a set of m tests, each of which has a difficulty $T_j$ ($j = 1, \dots, m$). The probability of candidate i correctly answering test j, $P_{ij}$, is then estimated as:
$$\operatorname{logit}(P_{ij}) = \log\left(\frac{P_{ij}}{1 - P_{ij}}\right) = C_i - T_j \qquad (1)$$
Given a reasonable number of candidates taking a reasonable number of tests it is then possible to use maximum likelihood methods to calculate separately an ability measure for each candidate and a difficulty measure for each test item. In addition, a standard error can be calculated for each of these measures. A practical point to note in equation 1 is that it has used the conventional method of scoring in which higher candidate scores indicate higher ability (and hence a greater likelihood of the question being answered correctly), and difficult tests also have a higher score (and hence, because of the negative coefficient in equation 1, a lower probability of being answered correctly). Later this can be seen more clearly in the "yardstick" output from FACETS, where the various scores are placed side by side. A very competent candidate climbs high up the diagram, and therefore is successfully answering more difficult stations, and is also satisfying the more hawkish examiners. Rather like a high-jump exam, the better jumpers have a higher chance of clearing the higher jumps.
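Equation 1 can be made concrete with a minimal sketch. The function name and the ability and difficulty values below are hypothetical and serve only to show how the probability of a correct answer rises with ability and falls with difficulty.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct answer when logit(P) = ability - difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical logit values: three candidates attempting the same test.
for ability in (-1.0, 0.0, 1.0):
    p = rasch_probability(ability, difficulty=0.5)
    print(f"ability = {ability:+.1f}  P(correct) = {p:.3f}")
```

When ability equals difficulty the probability is exactly one half, which is what places candidates and tests on a common logit scale.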
The partial credit model
The basic Rasch model considers only items which are answered correctly or incorrectly. However on many forms of examination the examiners rate candidates on a ranking scale (e.g. as in PACES, 'Clear Fail', 'Fail', 'Pass' and 'Clear Pass'). Although conventionally scored as 1, 2, 3 and 4, there is no statistical basis for treating such judgements as being on an equal interval scale. Such ratings are readily incorporated into the Rasch model, and the size of the intervals can be assessed directly. Let candidates be assessed on a scale with r categories, so that each mark has its own difficulty, $M_k$ ($k = 1, \dots, r$). The partial credit model is then:
$$\operatorname{logit}(P_{ijk}) = \log\left(\frac{P_{ijk}}{1 - P_{ijk}}\right) = C_i - T_j - M_k \qquad (2)$$
where $P_{ijk}$ is the probability of candidate i on test j receiving a mark of k. Once again the negative coefficient for $M_k$ means that high scores for $M_k$ mean it is more difficult for a candidate to get a higher mark. The partial credit model allows the differences between the various points on a mark scale to be assessed. (Note, although we here refer to this model as the partial credit model, it is in essence identical to the rating-scale model [36, 37].)
The multi-facet Rasch model
A further extension of the Rasch model allows additional parameters to be estimated which take into account other factors in the design of the test and might account for variability. Although in principle there is no limit to such additional facets, here we will only consider the situation relevant to PACES, in which examiners also differ in their stringencies. Let there be p examiners, each of whom has a stringency, $E_l$ ($l = 1, \dots, p$), with a high stringency meaning that a candidate is less likely to receive a higher mark from that examiner than they are from a less stringent examiner. The equation then can be expressed as:
$$\operatorname{logit}(P_{ijkl}) = \log\left(\frac{P_{ijkl}}{1 - P_{ijkl}}\right) = C_i - T_j - M_k - E_l \qquad (3)$$
The probability of a candidate receiving a particular mark then depends on their own ability ($C_i$), the difficulty of the test ($T_j$), how high the mark is ($M_k$), and the stringency of the examiner ($E_l$). In this paper we will restrict ourselves to the model shown in equation 3. Although in theory it is straightforward to include in the model other facets which might affect the performance of candidates, that is not always easy in practice because the data in complex designs are not always 'linked' or 'connected', where connected is used in the technical sense used in graph theory, in that there is a path between all possible pairs of vertices. For further discussion of this see the FACETS manual, which also refers to the work of Engelhardt, and says that the algorithm for testing connectedness is an extension of that described by Weeks and Williams.
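The way the four terms of equation 3 combine can be sketched numerically. The sketch assumes the adjacent-category form usual for rating-scale models, in which each step difficulty governs the log-odds of receiving mark k rather than mark k-1. That form is an assumption here, since equation 3 is written compactly. All parameter values and function names are hypothetical.

```python
import math


def mark_probabilities(candidate, station, examiner, step_difficulties):
    """Probabilities of marks 1..r under an adjacent-category rating-scale model.

    step_difficulties[0] is the (assumed) difficulty of moving from mark 1
    to mark 2, step_difficulties[1] from mark 2 to mark 3, and so on.
    """
    log_weights = [0.0]  # mark 1 is the reference category
    for step in step_difficulties:
        log_weights.append(log_weights[-1] + candidate - station - examiner - step)
    peak = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def expected_mark(probabilities):
    """Mean mark on the 1..r scale."""
    return sum(mark * p for mark, p in enumerate(probabilities, start=1))


# Hypothetical logit values: one candidate, one station, two examiners.
steps = [-1.5, 0.0, 1.5]
dove = mark_probabilities(0.5, 0.0, -0.5, steps)
hawk = mark_probabilities(0.5, 0.0, +0.5, steps)
print(f"expected mark from dove: {expected_mark(dove):.2f}")
print(f"expected mark from hawk: {expected_mark(hawk):.2f}")
```

Because examiner stringency enters with the same sign as station difficulty, the same candidate on the same station has a lower expected mark from the more stringent examiner.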
The primary interest of the present paper will be in the differences which occur between examiners, and in how these may be estimated, and in ways in which they may be corrected for in the marking of the examination. Examiner variation can reduce the validity of an examination since the likelihood of a candidate passing depends not only upon the candidate's own ability, but also upon whether they got lucky or unlucky in their particular choice of examiners. Although examiners are to a first approximation randomly allocated to candidates, there can also be systematic biases (and in the case of the PACES examination it is known that candidates sitting in centres outside the UK often have a lower overall pass rate than those taking the examination in UK centres, and it is also the case that the most experienced UK examiners are also the ones who collaborate with local examiners in examining at centres outside the UK). We will also try to assess what demographic factors characterise stringent or lenient examiners, and whether examiners vary in their stringency in different tests of the examination, or with particular types of candidate.
The results of the first nine diets of PACES (2001 to 2004) were analysed. In addition and where possible, data were collected on the demography of examiners and candidates, in order to assess their contribution to variation in examiner stringency.
The PACES examination
At each station the candidate is assessed by two examiners, each of whom marks entirely independently of the other examiner, and there is no conferring or discussion after the candidate has left the room. Marking takes place on a proforma on which examiners indicate the candidate's proficiency on a number of sub-scales, and then they make an overall judgement of the candidate. The overall judgement is implicitly criterion-referenced, and has four categories (Clear Pass, Pass, Fail, Clear Fail), with anchor statements describing the performance at each level. It is intentional that there is no judgement between the marks of Pass and Fail, so that examiners explicitly have to make a decision about each candidate, relative to the standards expected of a just-passing candidate taking the examination.
The four categories of Clear Pass, Pass, Fail, Clear Fail receive numerical marks of 4,3,2 and 1. Since each candidate receives a total of fourteen marks, the total mark is between 14 and 56. For various historical reasons, and after development and piloting, the pass mark at the first diet of PACES was set at 41, and has been maintained at that level for the first nine diets, which are reported here. An additional rule, which applies to only a tiny percentage of candidates, is that any candidate who receives three Clear Fail marks from three different examiners will automatically fail the examination; in practice most candidates meeting this criterion would have failed the examination anyway as their total mark is below 41.
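The scoring rule just described can be written out as a short sketch. The function name and examiner identifiers are hypothetical, and the additional rule is read here as "Clear Fail marks from at least three different examiners", which is an assumption about the exact wording of the regulation.

```python
PASS_MARK = 41
MARK_VALUES = {"Clear Pass": 4, "Pass": 3, "Fail": 2, "Clear Fail": 1}


def paces_outcome(judgements):
    """Return (total mark, passed) for fourteen (examiner_id, category) pairs."""
    if len(judgements) != 14:
        raise ValueError("expected fourteen marks per candidate")
    total = sum(MARK_VALUES[category] for _, category in judgements)
    # Assumed reading of the additional rule: three distinct examiners.
    clear_fail_examiners = {
        examiner for examiner, category in judgements if category == "Clear Fail"
    }
    automatic_fail = len(clear_fail_examiners) >= 3
    return total, total >= PASS_MARK and not automatic_fail


# Hypothetical candidates; examiner identifiers are illustrative only.
steady = [(f"E{i}", "Pass") for i in range(1, 15)]
erratic = [(f"E{i}", "Clear Pass") for i in range(1, 12)] + [
    (f"E{i}", "Clear Fail") for i in range(12, 15)
]
print(paces_outcome(steady))   # fourteen Pass marks
print(paces_outcome(erratic))  # high total, but three Clear Fails
```

The second candidate shows why the additional rule rarely changes an outcome: a candidate needs eleven Clear Pass marks elsewhere to reach 41 despite three Clear Fails.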
Meaning of 'stations'
It should be noted that although there are five physically separate stations in the examination proper, for the remainder of this paper the term 'station' will be used rather more conveniently to refer to each of the seven separate assessments made of a candidate, rather than to the five twenty-minute sessions within which those seven assessments are made.
The first nine diets of the PACES examination were taken between June 2001 and March 2004, with two diets in 2001, three in each of 2002 and 2003, and one in 2004. The total number of candidates taking the examination was 10,145, an average of 1,127 on each diet (range 784–1,355). Some candidates took the exam on more than one occasion, and for the present analysis they have been treated as if at each time they were separate candidates (since it was to be expected that their performance may have improved across diets). On each diet the pass mark was set at 41 (for explanation see Dacre et al). Overall the pass rate was 46.6% (4,724/10,145). Overall the analysis considered a total of 142,030 marks.
6834 candidates were male (67.4%) and 3311 were female (32.6%). 4483 (44.2%) of the candidates were graduates of UK medical schools, and 8916 (87.9%) of the candidates were taking the examination in centres based in the UK.
Overall 1259 examiners took part in the 9 diets of PACES, with each examiner being paired in every case with a second examiner. Each examiner had assessed an average of 113 candidates (SD: 83, range 1–593; quartiles = 50–153; median = 96; mode = 40). 1037 examiners (82.4%) only examined candidates in the UK, 119 examiners (9.5%) only examined candidates outside the UK, and 103 examiners (8.2%) examined candidates in both the UK and elsewhere. The latter group had more overall experience of examining in PACES (mean candidates examined = 238; SD = 104) than those examining only in the UK (mean candidates examined = 105; SD = 73), or those examining only outside the UK (mean candidates examined = 73; SD = 44). The average year of birth of examiners was 1951.4 (SD = 6.6, range = 1931–1968). 1042 examiners were known to be male, 123 were known to be female, and the database did not have information on the other 94 examiners (60 of whom examined only outside the UK and 34 of whom examined only in the UK).
Multi-facet Rasch modelling
Separating the effects of candidate ability, test (station) difficulty, examiner stringency and the marking scale
A three-facet Rasch model was run on all 142,030 examination marks, using the model described in equation 3 above. Of particular importance was that FACETS reported that subset connection was "OK", meaning that the data were connected and that linkage had occurred satisfactorily, so that examiner stringency could be compared on a common scale across all examiners.
Since subset connection is so important to FACETS, we investigated the extent to which it was achieved by smaller data sets. We tried three different sets of just three diets (1,2 and 3; 7, 8 and 9; 1, 4 and 9) and in each case connection was OK. When however we tried just two diets (8 and 9; 1 and 9) we found that while the latter was connected, the former showed 19 disjoint subsets. It seems likely that three diets is necessary for connection to be adequate. We also attempted to run a model using data from just one station, but that analysis failed with multiple disjoint subsets, showing that although case connection is satisfactory for the entire data set, it is vulnerable as soon as only a part of the data is used.
Station (test) differences
Table: Differences between the seven stations in average mark, and effect estimated by FACETS (columns include the station (test) effect and the SE of the effect; rows include Communication and Ethics). The final column shows the average adjusted mark, after taking examiner and candidate differences into account.
The yardstick in Figure 2 shows that examiners are more variable than are stations, the standard deviation being 0.33 for examiners but only 0.10 for stations. However it should also be noted that the SD for candidates is 0.87, meaning that the spread of candidates is 2.64 times that of examiners (and hence the candidate variance is 6.97 times as large as the examiner variance). On the basis of those figures, 87% of the systematic variance in the marks is due to differences in candidates, 12% due to differences in examiners, and 1% due to differences in station type. In interpreting these results it should be noted that the FACETS analysis cannot take into account examiner-by-station and examiner-by-candidate variance, and hence any calculation of reliability or similar figures is likely to be inflated relative to the true value. However, these statistics are not the main interest of the present paper, so that the problem is not a serious one.
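The partition of systematic variance can be approximately reproduced from the three standard deviations quoted above. This is a sketch of the arithmetic only, under the assumption that the shares are the squared SDs expressed as proportions of their sum. With the rounded SDs the candidate share comes out slightly below the quoted 87%, presumably because of rounding in the published values.

```python
# Standard deviations (in logits) as quoted in the text.
sd = {"candidates": 0.87, "examiners": 0.33, "stations": 0.10}

variance = {facet: value ** 2 for facet, value in sd.items()}
total_variance = sum(variance.values())
share = {facet: value / total_variance for facet, value in variance.items()}

for facet, fraction in share.items():
    print(f"{facet:>10}: {fraction:.1%} of main-effect variance")

# Spread of candidates relative to examiners.
spread_ratio = sd["candidates"] / sd["examiners"]
print(f"candidate SD is {spread_ratio:.2f} times the examiner SD")
```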
The yardstick of figure 2 shows that the extremes of examiner stringency are of a similar size to the difference between the Pass and Fail borderlines. The distribution of examiner stringency estimates in the yardstick also makes clear that, to a good first approximation, stringency is normally distributed, and refutes any simple differentiation of examiners into two separate classes who can be called hawks and doves.
An important question concerns the factors which differentiate between stringent and lenient examiners. Information was available on the sex of examiners and their year of birth, the number of candidates they had examined, and the proportion of candidates who were examined in the UK. These will be considered in turn.
The 1040 male examiners had a slightly higher stringency score (mean = .002; SD = .326) than the 123 female examiners (mean = -.0537; SD = .359), although the difference was not statistically significant (t = 1.774, 1161 df, p = .076).
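The comparison can be checked from the summary statistics alone. The sketch below assumes an equal-variance (pooled) two-sample t-test, which is consistent with the 1161 degrees of freedom reported; the function name is hypothetical.

```python
import math


def pooled_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Two-sample t statistic and degrees of freedom, assuming equal variances."""
    df = n_1 + n_2 - 2
    pooled_variance = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / standard_error, df


# Male versus female examiners, using the figures quoted in the text.
t_value, df = pooled_t(0.002, 0.326, 1040, -0.0537, 0.359, 123)
print(f"t = {t_value:.2f} on {df} df")
```

The result is close to the reported t of 1.774; small discrepancies are to be expected because the published means and SDs are rounded.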
Year of birth
There was no significant correlation of stringency with examiner year of birth (r = -.028, n = 1162, p = .336). Scattergrams showed no evidence of curvilinearity, and neither did quadratic regression show any significant effect.
Number of candidates examined
Proportion of candidates examined in the UK
Ethnic origin of examiners
Self-reported ethnic origin was available for 955 examiners, of whom 84 (8.8%) were non-European and the remainder were European. Ethnic origin was available for only 5 of the examiners who examined entirely outside of the UK, and all 5 were non-European. The 84 non-European examiners had a higher stringency score (mean = .075, SD = .326) than the 871 examiners of European ethnic origin (mean = -.0187, SD = .326), the difference being statistically significant (t = 2.509, 953 df, p = .012).
Many of the background factors describing examiners were confounded (e.g. female examiners tended to be younger, and to have examined fewer candidates). The simultaneous effects of sex, year of birth, number of candidates examined, and proportion of candidates examined in the UK were examined by a backwards elimination multiple regression. Missing values were handled by mean substitution. Two effects were independently significant. Examiners who had examined more candidates were more hawkish (beta = .089, p = .005), and examiners of non-European ethnic origin were more hawkish (beta = .079, p = .014). There was no significant sex difference in the multivariate analysis.
Candidate differences: reliability of candidate marks
As well as scale, station and examiner differences, the yardstick of figure 2 also shows scores for candidates. It is clear that there is a wide variation in candidate ability, as might be expected. The standard deviation of the candidate logit scores is 0.87 (and the adjusted standard deviation is 0.78), with logit scores in the range of -4.04 to +5.10. The average standard error (root mean square) of the logit scores is 0.37. As a result the reliability of the estimates of candidate ability is 0.82. The standard error on the logit scale for candidates at the pass mark on the raw scale of 41 is about 0.35, which is equivalent to 3 marks on the raw mark scale. That means that an individual candidate scoring 41 has a 95% chance of their true score being within two standard errors either side of their actual mark, that is, in the range 35 to 47. As mentioned above, FACETS cannot take into account examiner-by-station and examiner-by-candidate variance, and hence estimates of reliability may be inflated relative to the true value.
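The reliability figure follows from the observed spread and the average standard error. The sketch below assumes the usual Rasch separation reliability, in which error variance is subtracted from observed variance to give an adjusted ("true") variance. From the rounded inputs the adjusted SD comes out at about 0.79 rather than the 0.78 quoted, presumably a rounding difference.

```python
import math

observed_sd = 0.87   # SD of candidate logit scores
rms_error = 0.37     # root mean square standard error of those scores

adjusted_variance = observed_sd ** 2 - rms_error ** 2
adjusted_sd = math.sqrt(adjusted_variance)
reliability = adjusted_variance / observed_sd ** 2
print(f"adjusted SD = {adjusted_sd:.2f}, reliability = {reliability:.2f}")

# Approximate 95% interval for a candidate scoring exactly at the pass mark.
raw_mark = 41
standard_error_in_marks = 3
interval = (
    raw_mark - 2 * standard_error_in_marks,
    raw_mark + 2 * standard_error_in_marks,
)
print(f"approximate 95% range: {interval[0]} to {interval[1]}")
```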
The conventional pass mark for the examination based on summed raw marks is 41 (although it is probably better described as being 40.5 since candidates with a mark of 41 pass whereas those with a mark of 40 fail, and the true pass mark is somewhere between those two bounds). The vertical and horizontal dashed lines in figure 8 are therefore set at 40.5, and indicate those candidates who would pass or fail using the raw or the adjusted mark. Of the 10,145 candidates, 4568 would pass using either criterion, and 5158 would fail using either criterion. However 263 candidates (2.6%) who failed using the raw mark criterion would have passed using adjusted marks, and 156 candidates (1.5%) who currently have passed the examination would have failed using adjusted marks. The use of adjusted marks would therefore have increased the pass rate from 46.6% to 47.6%, a change of one percentage point in the proportion of candidates passing the examination.
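The four counts form a two-by-two classification of raw against adjusted outcomes, and the quoted percentages can be recovered from them directly. The variable names below are illustrative.

```python
# Counts of candidates by outcome under raw and adjusted marks (from the text).
pass_both = 4568
fail_both = 5158
fail_raw_pass_adjusted = 263
pass_raw_fail_adjusted = 156

total = pass_both + fail_both + fail_raw_pass_adjusted + pass_raw_fail_adjusted

raw_pass_rate = (pass_both + pass_raw_fail_adjusted) / total
adjusted_pass_rate = (pass_both + fail_raw_pass_adjusted) / total
unchanged = (pass_both + fail_both) / total

print(f"candidates:          {total}")
print(f"raw pass rate:       {raw_pass_rate:.1%}")
print(f"adjusted pass rate:  {adjusted_pass_rate:.1%}")
print(f"outcome unchanged:   {unchanged:.1%}")
```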
Testing the assumptions of the model
Like all statistical models, the multi-facet Rasch model is a simplification of the subtleties of real data. It is useful as a model in so far as a relatively small number of parameters can explain most of the variation present in a complex data set. In the present case, 142,030 data points are being explained by a total of 11,412 parameters (10,145 for the 10,145 candidates, 1,258 for the 1,259 examiners, 6 for the seven stations, and 3 for the four scale points), a 92% reduction in information (and equivalent to roughly one fourteenth of the total data).
FACETS provides a number of statistics for diagnosing the quality of the fit of the model to the data, of which the manual states that "If mean-squares indicate only small departures from model-conditions, then the data are probably useful for measurement". The manual also says that mean-square statistics in the range 0.5 to 1.5 are desirable, and that those over 2 should be treated with care. There have also been other criticisms of goodness-of-fit statistics derived from Rasch models .
Examiner goodness of fit statistics
Candidate goodness of fit statistics
The long-term reliability of examiner stringency measures
We also assessed long-term stability using firstly diets 1–3 and then diets 7–9, since the latter diets were all separated from the former by at least a one-year interval. The within-period reliabilities reported by FACETS for these groups of diets were 0.63 and 0.61 (and of course they are lower than those calculated for all nine diets, reported earlier, because they are based on fewer data). The correlations of the stringency estimates across the two periods were 0.402 (n = 146, p < .001) for those examining more than 200 candidates in the entire data set, 0.442 (n = 309, p < .01) for those examining 100–199 candidates, 0.335 (n = 101, p < .001) for those examining 50–99 candidates, and 0.468 (n = 20, p = .037) for those examining 49 or fewer candidates overall. These between-period correlations are all compatible with the within-period reliabilities and confirm that stringency is stable across periods of a year or two within the limits of measurement.
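One way to read the statement about compatibility, offered here as an assumption rather than as the calculation performed for the paper, is through the classical attenuation bound: the correlation between two fallible measures is not expected to exceed the square root of the product of their reliabilities. The Python sketch below applies that bound to the figures quoted above; the variable names are illustrative.

```python
import math

# Within-period reliabilities reported by FACETS for diets 1-3 and 7-9.
reliability_early, reliability_late = 0.63, 0.61

# Between-period correlations quoted in the text, keyed by the number
# of candidates examined overall.
observed = {">200": 0.402, "100-199": 0.442, "50-99": 0.335, "<=49": 0.468}

# Classical test theory: the largest correlation expected between two
# measures with these reliabilities if true stringency were perfectly stable.
ceiling = math.sqrt(reliability_early * reliability_late)

# Correlations corrected for attenuation (Spearman's formula).
disattenuated = {group: r / ceiling for group, r in observed.items()}

for group, r in observed.items():
    print(f"{group:>8}: observed r = {r:.3f}, corrected r = {disattenuated[group]:.2f}")
print(f"ceiling = {ceiling:.2f}")
```

On this reading, an observed correlation at or below the ceiling is what would be expected when both sets of estimates carry measurement error.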
Differences between 'communication' and 'examination' stations
An important innovation of the PACES examination in the context of the MRCP(UK) was the introduction of two stations which assessed 'communication' rather than 'clinical examination' skills, one assessing history taking and the other assessing how the candidate handled difficult communication and ethical situations. Examiners sometimes express concern that they feel less confident in assessing these stations, because of a relative lack of experience in contrast to the 'examination' stations, which assess skills which they all use and assess on a daily basis, and in which they have been proficient for many years. It is therefore of interest to compare the performance of the two communication stations with the five examination stations.
The first question concerns whether the communication stations differ in their difficulty or discrimination as compared with the other stations. The Rasch model used for the multi-facet modelling is a one-parameter Rasch model, and it therefore only allows stations to differ in their overall difficulty (and the analyses reported earlier suggest that there are only relatively small differences in overall difficulty). The Rasch model used by FACETS in fact assumes that not only stations, but also examiners, candidates and marks differ only in their difficulty. That assumption cannot be tested directly with FACETS, as it does not allow discrimination to differ between the various components of the examination. However, two-parameter item response theory (2-IRT) models do allow differences in discrimination between stations [45, 46] (although 2-IRT models can only fit a single facet to the data). Here the program Xcalibre is used to fit a 2-IRT model to the marks at each station in order to assess the difficulty and the discrimination of each station.
The 2-IRT model fitted by Xcalibre has two parameters for each component in the test: the difficulty, which is equivalent to the single difficulty parameter fitted in the Rasch model, and the discrimination, which allows the slope of the item response curve to differ between stations. A partial credit model was also fitted by fitting three separate binary response measures to each judgement, one for the Clear Fail-Fail borderline, one for the Fail-Pass borderline, and one for the Pass-Clear Pass borderline. For each candidate there were therefore three binary marks derived from each of the fourteen scaled marks, making 42 items altogether. Because each candidate was assessed on each station by two examiners, two marks were analysed for each station, although each was fitted separately. However, because the two marks which came from each examiner were effectively randomly allocated, the two parameters were expected to be similar, and indeed that was exactly the case. The two parameters from each station at each mark have therefore been averaged for the present analysis.
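The recoding of each four-point mark into three binary items can be sketched as follows. The text does not give the exact coding, so the cumulative scheme used here (an item scores 1 when the mark lies above the corresponding borderline) is an assumption, as are the function and variable names.

```python
MARKS = ["Clear Fail", "Fail", "Pass", "Clear Pass"]


def to_binary_items(mark):
    """Split one four-point mark into three binary items.

    Item 0 is the Clear Fail-Fail borderline, item 1 the Fail-Pass
    borderline and item 2 the Pass-Clear Pass borderline. Each item is
    1 if the mark lies above that borderline (assumed coding).
    """
    level = MARKS.index(mark)
    return [int(level > k) for k in range(len(MARKS) - 1)]


# Seven stations, two examiners per station, three borderlines per mark.
n_marks_per_candidate = 7 * 2
n_items = n_marks_per_candidate * (len(MARKS) - 1)

for mark in MARKS:
    print(mark, to_binary_items(mark))
print("binary items per candidate:", n_items)
```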
Item difficulty and discrimination parameters fitted by Xcalibre to the candidate by station data for PACES. The three separate borderlines for the four points on the mark scale are shown separately. Figures in brackets are standard errors of the estimates. The five examination stations are shown at the top, ranked from most to least difficult on the fail-pass criterion. The two stations at the bottom, in bold, are the communication stations.
Although a concern of examiners has been that they are uncertain whether they are discriminating well on the communication stations, the fact is that the discrimination parameters at the pass mark are higher in the two communication stations (0.73 and 0.79) than in any of the five examination stations (range = 0.58 to 0.70), and that is also the case at the Clear Fail-Fail boundary (communication stations, 0.81 and 0.89; examination stations, range = 0.65 to 0.80). Only at the Pass-Clear Pass boundary are discriminations of the two communication stations (0.61 and 0.64) similar to those in the examination stations (range = 0.56 to 0.70). Overall it can be concluded that examiners are discriminating somewhat better on the communication stations, although the differences are relatively small. Certainly there is no evidence that examiners are performing less effectively on the communication stations than on the examination stations.
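To give a feel for what a difference in discrimination of this size means, the Python sketch below evaluates a standard two-parameter logistic response function at two of the discrimination values quoted above. It is an illustration only: the logistic form without a scaling constant, the difficulty of zero and the chosen ability values are assumptions, and Xcalibre's internal parameterisation may differ.

```python
import math


def p_2pl(theta, difficulty, discrimination):
    """Probability of a positive response under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Discrimination values at the pass mark quoted in the text.
communication = 0.79   # higher of the two communication stations
examination = 0.58     # lowest of the five examination stations

difficulty = 0.0       # hypothetical borderline location, for illustration

for theta in (-1.0, 0.0, 1.0):
    print(
        theta,
        round(p_2pl(theta, difficulty, communication), 3),
        round(p_2pl(theta, difficulty, examination), 3),
    )
```

With a higher discrimination the curve is steeper around the borderline, so candidates just above and just below it are separated a little more sharply.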
Multi-facet Rasch modelling of communication and examination stations
Since the assumption of equivalent levels of discrimination across stations has been met to a reasonable degree in the different types of station, it is possible to use FACETS to examine scores of candidates on just the communication stations or just the examination stations. There are however five examination stations and only two communication stations, making it difficult to have a direct comparison of scores on the two, since inevitably there is less measurement error with five stations. As a result a composite of the two communication stations was compared with a composite of the respiratory and cardiovascular stations. There was somewhat more variation amongst candidates in the communication stations (SD = 1.83) than in the examination stations (SD = 1.43), and as a result the reliability for measurement of candidate ability was slightly higher in the communication stations (.77) than in the examination stations (.69). There was also a little more examiner variation in the communication stations (SD = .78) than in the examination stations (SD = .67), and the examiner reliability was also marginally higher in the communication stations (.70) than in the examination stations (.68). The boundaries between Clear Fail, Fail, Pass and Clear Pass were also slightly further apart in the communication stations (-2.60, -0.05 and 2.65) than in the examination stations (-2.12, 0.10 and 2.02). Taken overall, though, the picture is of remarkable similarity between the communication and the examination stations, with a slightly higher reliability and variability in the communication stations, both amongst candidates and amongst examiners.
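The effect of the wider spacing of the boundaries can be illustrated with a rating scale form of the Rasch model. The Python sketch below treats the three reported boundaries as thresholds between adjacent categories and computes the probability of each of the four marks at a given combined measure. This is a sketch under assumptions: the interpretation of the reported boundaries as adjacent-category thresholds, the function name and the measure of zero are not taken from the paper.

```python
import math


def category_probabilities(measure, thresholds):
    """Category probabilities under a Rasch rating scale model.

    measure is the combined logit measure (for example candidate ability
    minus station difficulty minus examiner stringency); thresholds are
    the boundaries between adjacent categories.
    """
    # Cumulative sums of (measure - threshold) give the log-numerators.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (measure - tau))
    numerators = [math.exp(v) for v in log_numerators]
    total = sum(numerators)
    return [n / total for n in numerators]


LABELS = ["Clear Fail", "Fail", "Pass", "Clear Pass"]
communication = (-2.60, -0.05, 2.65)   # boundaries reported in the text
examination = (-2.12, 0.10, 2.02)

for name, taus in (("communication", communication), ("examination", examination)):
    probs = category_probabilities(0.0, taus)
    print(name, [f"{label}: {p:.2f}" for label, p in zip(LABELS, probs)])
```

In this form of the model, a measure exactly at a boundary makes the two adjacent marks equally likely, and boundaries that lie further apart leave more of the scale to the two middle marks.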
This paper describes an analysis of 10,145 candidates taking the PACES examination over nine diets, when they were examined by a total of 1,259 examiners who awarded a total of 142,030 marks. The multi-facet Rasch model allowed the data to be broken down into three separate components of candidate, examiner and station, along with separate measures for the several components of the four-point marking scheme. Overall the model fitted the data reasonably well, particularly in the examiner facet, although there was some evidence that the model fitted less well for candidates, which may have been due to case-specificity in which candidates differed idiosyncratically, perhaps as a result of different training, on different stations. Nevertheless the latter effects are small, and not relevant to the purpose of the examination as a whole, which requires an overall pass across the different stations.
The multi-facet Rasch model has some limitations, which should be emphasised strongly in considering the current analysis. In particular, FACETS, unlike generalisability theory, cannot consider variance due to interaction effects, primarily because with, say, many candidates and many examiners, there is an extremely large number of degrees of freedom relating to interaction effects (and hence an extremely large number of dummy variables is needed), and it is unrealistic to attempt to estimate so many parameters. Generalisability theory approaches this estimation problem in a very different way, and such interaction effects can be calculated, given certain design constraints (which unfortunately are not applicable here), and the variance terms are often found to be significant and meaningful in examination situations. As a result the estimates provided in this paper of total variance, and the contribution of various facets, may be inaccurate, and should be treated with care. In particular, they probably should not be used for calculating the overall reliability of the examination, or similar statistics. However, and it is an important however, the major interest of this study is in differences between examiners in leniency-stringency, and those differences are primarily likely to be main effects, which FACETS can handle appropriately. There might also be additional variance consisting of interactions between examiners and other aspects of the examination (such as candidates or cases), and future work needs to look for such effects using different methodologies, but the main effects analysed and discussed here are unlikely to disappear in any such analyses. The FACETS analysis is therefore appropriate and adequate as a first approach to studying examiner effects on leniency and stringency.
The principal interest of this study is in differences between examiners. Humans differ in many behavioural attributes, and it is hardly surprising that examiners also differ in their propensity to pass or fail candidates. This study of hawks and doves amongst examiners found highly significant differences in examiner behaviour, which subsidiary analyses showed were consistent across time (within the limits of the reliability of the measures). Examiner variance accounted for about 12% of the systematic variance (as compared to only 1% depending on differences in difficulty of stations, and 87% depending on differences between candidates). Nevertheless these differences are meaningful, particularly to a borderline candidate for whom random allocation of examiners happens to mean that a majority of examiners assessing them could be construed as 'hawks'. FACETS allows for raw marks to be adjusted for differences in stringency between examiners. If the PACES examination is re-marked using adjusted marks then about 4% of candidates would change their result across the pass-fail boundary, slightly more going up than down, so that the overall pass rate would increase slightly from 46.6% to 47.6%.
Adjustment for examiner stringency becomes more reliable, and easier to apply, as more examiners assess more candidates across more diets. It is not technically possible to adjust the results of a single diet of PACES because one cannot obtain linkage across the various subsets of examiners (and even if it were possible, the result would be less reliable than a correction based on as much data as possible on the behaviour of examiners, based on all examinations in which they had taken part). In passing it should be said that it might be possible to obtain linkage within a single diet if linkage could be obtained across stations, perhaps by using a simulator or video station which was objectively marked and therefore of fixed difficulty either for all candidates or for large groups of candidates across examination centres. That is not, though, possible at present.
If examiner stringency can only be assessed reliably across multiple diets then correction for stringency does require that stringency is a stable characteristic of examiners. The comparison of examiners in diets 1–3 with those in diets 7–9 suggests that there is reasonable stability in stringency across a period of a year or two, although that needs to be further examined.
It is sometimes suggested that examiners who are 'hawks' or 'doves' should be given feedback about their propensity for marking low or high in order that they can then try to correct that tendency. The present analysis would in fact require the precise opposite. It is better, given the method of analysis, that examiners do not try to correct any differences in stringency, but instead continue to behave as they have always done. Biases of any sort which are fixed and unchanging can be corrected statistically, whereas biases which are varying are, of their very nature, difficult to predict and correction will be less reliable (and hence less valid and less justifiable).
There might be an argument for pairing examiners on the basis of their stringency, so that if a candidate sees one examiner known to have a high stringency then the other will have a relatively low stringency. Whether that would be practicable given the complex constraints of a real examination is not clear, but it might be worth investigating. The clear advantage would be that the marking of the examination could then be based on raw scores, which have a high degree of face validity and are easy to justify.
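A very simple version of such pairing can be sketched as follows: rank examiners by estimated stringency and pair the most lenient with the most stringent, the next most lenient with the next most stringent, and so on. This Python sketch ignores the practical constraints mentioned above (availability, centres, specialties); the function name and the stringency values are invented for illustration.

```python
def pair_by_stringency(stringency):
    """Pair the most lenient remaining examiner with the most stringent.

    stringency maps an examiner identifier to an estimated stringency
    (higher values meaning more hawkish). With an odd number of
    examiners the one in the middle of the ranking is left unpaired.
    """
    ranked = sorted(stringency, key=stringency.get)  # lenient to stringent
    half = len(ranked) // 2
    return [(ranked[i], ranked[-1 - i]) for i in range(half)]


# Hypothetical stringency estimates in logits, invented for illustration.
examiners = {"A": -1.2, "B": -0.4, "C": 0.3, "D": 1.1}
pairs = pair_by_stringency(examiners)

for dove, hawk in pairs:
    mean = (examiners[dove] + examiners[hawk]) / 2
    print(f"{dove} + {hawk}: mean stringency {mean:+.2f}")
```

The aim of such a scheme is that the mean stringency of each pair lies close to the overall mean, so that raw marks from different pairs are more nearly comparable.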
Although there seems little doubt that examiners do differ in their stringency, it is much less clear where those differences come from. Because our data include a large sample of more than a thousand examiners it is possible to assess the role of several background factors. Important negative results are that we could find no sex differences, and neither did there seem to be any relationship to age, older examiners not being more hawkish than younger examiners. Examiners who had examined more candidates were more hawkish, although whether that is the result of experience making them more hawkish, or more hawkish examiners choosing to examine more often, is not clear. Likewise our data suggest that UK examiners from minority ethnic groups are also more hawkish, and again we have no explanation for that, although we did find some evidence for a similar effect in a different analysis. An interesting and important analysis would be to assess how the ethnic origin of an examiner and a candidate interact, but as yet that analysis has not been possible for a host of technical reasons. We are however working on it.
The reasons for differences in examiner stringency could form the basis for a number of future studies. If, as seems possible, stringency is a relatively stable trait then it might be predicted that it would relate to other aspects of personality or behaviour, and in particular the Big Five, which have been shown to relate to many and varied aspects of human behaviour [48, 49]. We would hope to address this issue in a future study.
The use of FACETS has allowed a full analysis of the marks from nine diets of the PACES examination, and it has allowed the separate and independent estimation of effects due to candidate, examiner and station type. As a result it allows a fuller discussion of the origins of examiner effects, and of ways in which the examination might be revised. A point of some importance in the context of designing examinations is that we would not have been able to carry out the present analysis if each station had been assessed by only a single examiner. A recurrent suggestion within the literature on the design of clinical examinations, usually driven by analyses based on generalisability theory, is that when one is trying to maximise the reliability of an OSCE-style examination, "where rater availability is a limiting factor to increasing test length [due to scarcity and expense], more can be gained by using one rater per station and having more stations than using two raters per station". Although that seems a reasonable strategy, it has two potential problems. Firstly, it does assume that examiner behaviour remains unchanged when only one examiner is present rather than two. However the presence of another examiner, and the potential for cross-checking between independently given marks, may well encourage each of the examiners to be more careful in carrying out their task, and a lowered examiner reliability for examiners working individually may mean that the overall exam reliability does not increase as much as might be predicted from theoretical calculations. Secondly, and this is particularly relevant to the present analysis, the use of a single examiner at each station does not allow any statistical evaluation of hawk and dove effects, with the likelihood that such effects may well increase in the absence of effective monitoring.
The figurative description of behavioural differences by using animal names is nothing new. The use of hawk to describe "a person who advocates a hard-line ... policy" goes back to at least 1548, although intriguingly such hawks were contrasted with a range of animals including beetles (1824) and pigeons (1843), whereas the modern contrast with doves only came into use in 1962 at the time of the Cuban Missile Crisis, when "The hawks favored an air strike to eliminate the Cuban missile bases... The doves opposed the air strikes and favored a blockade." (Oxford English Dictionary online). The earliest usages of which we are aware in the context of medical education are both from 1974 [7, 26], with one of them concerning the MRCP(UK) examination. However, the problem of hawks and doves amongst examiners is not a new one, and has been described for a century or more in education, under a number of different names. Hawks and doves were described as 'the Vulture' and 'the Husbandman' by A C Hilton in a poem written in 1872, and variants of the Hawk were described as 'the Spider' and 'the Poultryman' in a 1904 poem by T C Dent, a surgical examiner. In 1913, Sir William Osler referred to 'Metallics', with their "aggressive, harsh nature and ... hard face", whose "expression sends a chill to the heart of the candidate, and it reaches his bone marrow [with] the first question...", to be contrasted with the 'Molluscoid', the "invertebrate examiner, so soft and slushy that he has not the heart to reject the man". Nevertheless, Osler recognised that "between the metallic and the molluscoid is the large group of sensible examiners". Despite the long-running awareness of the hawk-dove problem in medical examinations, we are not aware of any previous study which has used a rigorous statistical method to assess properly the stringency or leniency of large numbers of examiners, and to examine how background factors relate to stringency and leniency.
There is little doubt from these data that examiners do differ in their leniency or stringency, and the effect can be estimated using Rasch modelling. The reasons for the differences are not so clear, although there are some demographic correlates, and the effects appear to be reliable across time. Various ways are suggested by which account may be taken of differences, either by adjusting marks or, perhaps more effectively and more justifiably, by pairing high and low stringency examiners, so that raw marks can then be used in the determination of pass and fail. The performance of the PACES examination is under continual review by the Colleges, and the implications of these and other findings for the running of the examination are a part of that review.
FACETS: This is not an abbreviation or an acronym but the name of a computer program
MRCP(UK): Membership of the Royal Colleges of Physicians of the United Kingdom
OSCE: Objective Structured Clinical Examination
PACES: Practical Assessment of Clinical Examination Skills
UK: United Kingdom of Great Britain and Northern Ireland
We are grateful to various members of the MRCP(UK) PACES Board and the MRCP(UK) Research Committee for their comments on previous drafts of this manuscript, and in particular we thank Simon Williams for his assistance, and Brian Clauser for his detailed comments on an earlier version of the manuscript.
1. Douglas C: A day in the country. British Medical Journal. 2004, 328: 1573.
2. Weisse AB: The oral examination: awesome or awful?. Perspectives in Biology and Medicine. 2002, 45: 569-578.
3. Ferrell BG: A critical elements approach to developing checklists for a clinical performance examination. Medical Education Online. 1996, 1: 5.
4. Norcini J, Blank LL, Arnold GK, Kimball HR: Examiner differences in the Mini-Cex. Advances in Health Sciences Education. 1997, 2: 27-33. 10.1023/A:1009734723651.
5. Taylor A, Rymer J: The new MRCOG Objective Structured Clinical Examination – the examiners' evaluation. Journal of Obstetrics and Gynaecology. 2001, 21: 103-106. 10.1080/01443610020025930.
6. Harden RM, Cairncross RG: Assessment of practical skills: the objective structured practical examination (OSPE). Studies in Higher Education. 1980, 5: 187-196. 10.1080/03075078012331377216.
7. Stokes JF: The clinical examination – assessment of clinical skills. ASME Medical Education Booklet number 2. 1974, Dundee: Association for the Study of Medical Education.
8. Schubert A, Tetzlaff JE, Tan M, Ryckman JV, Mascha E: Consistency, inter-rater reliability, and validity of 441 consecutive mock oral examinations in anesthesiology: Implications for use as a tool for assessment of residents. Anesthesiology. 1999, 91: 288-298. 10.1097/00000542-199907000-00037.
9. Allison R, Katona C: Audit of oral examinations in psychiatry. Medical Teacher. 1992, 14: 383-389.
10. Newble D, Dawson B, Dauphinee D, Page G, Macdonald M, Mulholland H, Swanson D, Thomson A, van der Vleuten C: Guidelines for assessing clinical competence. Teaching and Learning in Medicine. 1994, 6: 213-220.
11. Ferrell BG, Thompson BL: Standardised patients: a long-station clinical examination format. Medical Education. 1993, 27: 376-281.
12. Solomon DJ, Ferenchick G: Sources of measurement error in an ECG examination: implications for performance-based assessments. Adv Health Sci Educ Theory Pract. 2004, 9: 283-290. 10.1007/s10459-004-4844-6.
13. Schuwirth LW, Van der Vleuten CP: The use of clinical simulations in assessment. Med Educ. 2003, 37: 65-71. 10.1046/j.1365-2923.37.s1.8.x.
14. Govaerts MJ, Van der Vleuten CP, Schuwirth LW: Optimising the reproducibility of a performance-based assessment test in midwifery education. Adv Health Sci Educ Theory Pract. 2002, 7: 133-145. 10.1023/A:1015720302925.
15. Wass V, van der Vleuten C, Shatzer J, Jones R: Assessment of clinical competence. Lancet. 2001, 357: 945-949. 10.1016/S0140-6736(00)04221-5.
16. Ferrell BG: Clinical performance assessment using standardized patients: a primer. Fam Med. 1995, 27: 14-19.
17. Norman G, Bordage G, Page G, Keane D: How specific is case specificity?. Medical Education. 2006, 40: 618-623. 10.1111/j.1365-2929.2006.02511.x.
18. Maynard Smith J: Evolution and the theory of games. 1982, Cambridge: Cambridge University Press.
19. Lovett B, Collins R, Lamparelli M: The intercollegiate exam. 2001, [http://www.frist.org/asit/meetings/yearbook2001/INTERCOLLEGIATE.pdf]
20. Teaching on the run tips 9: in-training assessment. Medical Journal of Australia. 2005, 183: 33-34.
21. Raymond MR, Viswesvaran C: Least squares models to correct for rater effects in performance assessment. Journal of Educational Measurement. 1993, 30: 253-268. 10.1111/j.1745-3984.1993.tb00426.x.
22. Houston WM, Raymond MR, Svec JC: Adjustments for rater effects in performance assessment. Applied Psychological Measurement. 1991, 15: 409-421.
23. Williams SM: Laterality newsletter. Laterality Newsletter. 1988, 2 (1).
24. Fitzpatrick AR, Ercikan K, Yen WM, Ferrar S: The consistency between raters scoring in different test years. Applied Measurement in Education. 1998, 11: 195-208. 10.1207/s15324818ame1102_5.
25. Raymond MR, Webb LC, Houston WM: Correcting performance-rating errors in oral examinations. Evaluation and the Health Professions. 1991, 14: 100-122.
26. Fleming PR, Manderson WG, Matthews MB, Sanderson PH, Stokes JF: Evolution of an examination, MRCP(UK). British Medical Journal. 1974, i: 99.
27. Gibberd FB: The MRCP(UK) examination. 1994, London, Edinburgh and Glasgow: Royal Colleges of Physicians.
28. PACES: Practical Assessment of Clinical Examination Skills. The new MRCP(UK) clinical examination. J R Coll Physicians Lond. 2000, 34: 57-60.
29. Bradley RA, Terry ME: The rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika. 1952, 39: 324-345. 10.2307/2334029.
30. Bock RD, Jones LV: The measurement and prediction of judgement and choice. 1968, San Francisco: Holden-Day.
31. David HA: The method of paired comparisons. 1963, London: Charles Griffin.
32. Rasch G: An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology. 1966, 19: 49-57.
33. Andrich D: Rasch models for measurement. 1988, Newbury Park: Sage.
34. Fischer GH, Molenaar IW, (editors): Rasch models: Foundations, recent developments, and applications. 1995, New York: Springer.
35. Linacre JM: FACETS Rasch measurement computer program. 2004, Chicago: Winsteps.com.
36. Linacre JM: The Partial Credit Model and the One-Item Rating Scale Model. 2006, Accessed 18th August 2006, [http://www.rasch.org/rmt/rmt191e.htm]
37. Anonymous: Polytomous Rasch model. 2006, Accessed 18th August 2006, [http://en.wikipedia.org/wiki/Polytomous_Rasch_model]
38. Engelhardt G: Constructing rater and task banks for performance assessments. Journal of Outcome Measurement. 1997, 1: 19-33.
39. Weeks DL, Williams DR: A note on the determination of connectedness in an N-way cross classification. Technometrics. 1964, 6: 319-324. 10.2307/1266048.
40. Anonymous: Examination Format. 1995, Accessed 18th August 2006, [http://www.mrcpuk.org/PACES/PacesFormat.htm]
41. Anonymous: User's manual for the XCALIBRE marginal maximum-likelihood estimation program. 1995, St Paul, Minnesota: Assessment Systems Corporation.
42. Dacre J, Besser M, White P: MRCP(UK) PART 2 Clinical Examination (PACES): a review of the first four examination sessions (June 2001 – July 2002). Clin Med. 2003, 3: 452-459.
43. Dewhurst NG, McManus IC, Mollon J, Dacre JE, Vale JA: Performance in the MRCP(UK) Examination 2003–4: Analysis of pass rates of UK graduates in relation to self-declared ethnicity and gender. 2006.
44. Karabatsos G: A critique of Rasch residual fit statistics. Journal of Applied Measurement. 2000, 1: 152-176.
45. Hambleton RK, Swaminathan H, Rogers HJ: Fundamentals of item response theory. 1991, Newbury Park: Sage.
46. Weiss DJ, Yoes ME: Item Response Theory. Advances in Educational and Psychological Testing: Theory and Applications. Edited by: Hambleton RK, Zaal JN. 1991, Boston, London, Dordrecht: Kluwer Academic Publishers, 69-95.
47. Huynh H: On equivalence between a partial credit item and a set of independent Rasch binary items. Psychometrika. 1994, 59: 111-119. 10.1007/BF02294270.
48. Matthews G, Deary IJ, Whiteman MC: Personality traits. 2003, Cambridge: Cambridge University Press, second edition.
49. McCrae RR, Costa PT: Personality in adulthood: A five-factor theory perspective. 2003, New York: Guilford Press.
50. Newble DI, Swanson DB: Psychometric characteristics of the objective structured clinical examination. Medical Education. 1988, 22: 325-334.
51. Anonymous: dove, n. 2006, Accessed 18th August 2006, [http://dictionary.oed.com/cgi/entry/50069159?query_type=advsearch&queryword=hawks+&first=1&max_to_show=10&search_spec=simple%3Afulltext&queryword2=cuban&search_spec2=simple%3Afulltext&logic_op=and&proximity_op=before&proximity_num=1&order=ab&return_set=entries&sort_type=alpha&result_place=1&control_no=50069159&search_id=PuYN-DpjOU5-527&side=M]
52. Jones JG, Zorab JSM: Hawks, doves and vivas; plus ça change ...?. Bulletin of the Royal College of Anaesthetists. 2003, Number 20 (July): 980-983.
53. Dent TC: The Spider and the Poulterer, a Yarn of the Spun. Annals of the Royal College of Surgeons of England. 1954, 15: 348.
54. Osler W: Examinations, examiners and examinees. Lancet. 1913, 1047-1050.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6920/6/42/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.