We have reported the results of a high stakes National Assessment Centre process to determine entry into a specialist (general practitioner) training program in which, for the first time, a combination of an observed Multiple-Mini-Interview (MMI) and a written Situational Judgement Test (SJT) was used. In our study, the MMI was observational, focussed on non-cognitive skills, and used behavioural type questions. The written SJT was also focussed principally on non-cognitive skills, and used situational type questions. In relation to the construct validity of the MMI in postgraduate settings, we have demonstrated that interviewers using the MMI can make moderately reliable and valid decisions about the non-cognitive characteristics of candidates for the purpose of selecting them for entry into general practice training. Our data confirm that, as in other MMI settings, the main source of error is interviewer subjectivity
[13, 33, 34], as opposed to context specificity, which is often a major source of bias in communication OSCEs
[35, 36]. We also demonstrated for the first time a relationship between the MMI and the SJT. We discuss these findings in more detail.
The finding that a substantial proportion of variance (28%) is related to the desired behaviours of a candidate resonates with Dore et al.’s study, which had a considerably smaller sample size.
 In examining the sources of interviewer subjectivity, a number of conclusions can be drawn. The largest source of variance was the view that interviewers held of particular candidates, arising from their particular perspective or preconception, which accounted for 40% of all the variance. This represents a large discrepancy between a candidate’s scores as a result of individual interviewer bias, or the snap judgments that interviewers make about a candidate. In our study there was a higher proportion of candidate variance than, for example, the 22% reported for a graduate entry situational style MMI question.
 This might reflect the greater certainty amongst interviewers in determining the trainability of a doctor as opposed to determining the aptitude of a student for undertaking a medical degree. Equally, it might reflect a lack of independence between the different MMI questions, i.e. they were testing very similar things. Interviewer training and tighter definitions in the marking criteria have traditionally been used to address interviewer subjectivity, particularly in situations where the interviewer pool, in our case GP supervisors, is finite and cannot be further diversified. However, neither strategy, alone or in combination, has resolved the persistent challenges of interviewer variability, and there is a need for novel evidence-based approaches, which have so far been discussed in the context of work-based assessment around rater cognition
[37–39]. Interviewers may have used different schemas in judging candidate performance, in a process that has similarities to clinical reasoning, in which some instant and intuitive decisions about candidates are based on pattern recognition and others are more considered and analytical decisions
. By investigating the perceptual and processing capacities of our interviewers and the schemas they operate by, and then aligning the scoring system, we may be able to demonstrate improved discrimination between candidates in future iterations of the MMI.
Although interviewer stringency/leniency accounted for 9% of the variance in our study, it is generally thought that this is a relatively stable characteristic of interviewers, and is not affected by training
. However, consideration could be given to adjusting candidates’ scores by using a measurement model
, which accounts for the stringency/leniency of whichever interviewers the candidate saw. Increasing the number of MMI questions is another way in which the reliability of the MMI may be improved, particularly as a comprehensive question bank is developed
, but can be problematic logistically. Our D study suggests that in order to achieve a reliability of 0.80, there would need to be 10 MMI stations, which was logistically impossible because candidates are required to sit the SJT on the same day. Although a minimum 6-station MMI with a G of 0.70 is recommended to ensure a balance between reasonable reliability and resources available, future flexibility in offering more MMIs might be afforded by developing on-line testing facilities for the SJT.
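The D study projection described above can be illustrated with a minimal sketch. It assumes, as a simplification that is not necessarily the model fitted in this study, that all variance other than the candidate component (28%) behaves as error that is averaged across stations, with a single interviewer per station. The function name and the variance values expressed as proportions are illustrative assumptions.

```python
def projected_g(candidate_var: float, error_var: float, n_stations: int) -> float:
    """Project the generalisability coefficient for an MMI with n_stations.

    Assumes every non-candidate variance component is averaged over
    stations (a simplification, labelled as an assumption).
    """
    if n_stations < 1:
        raise ValueError("n_stations must be at least 1")
    return candidate_var / (candidate_var + error_var / n_stations)


# Candidate variance of 28% is taken from the text; treating the
# remaining 72% as a single error term is an assumption of this sketch.
CANDIDATE_VAR = 0.28
ERROR_VAR = 1.0 - CANDIDATE_VAR

for n in (6, 8, 10, 12):
    g = projected_g(CANDIDATE_VAR, ERROR_VAR, n)
    print(f"{n} stations: projected G = {g:.2f}")
```

Under these assumptions the projection gives a G of 0.70 for six stations and approximately 0.80 for ten, which agrees with the figures reported above.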
The assessment blueprints guiding the content areas for the MMI and SJT had content validity because they were developed as fit for purpose by organisational psychologists specialising in selection focussed assessment. However, they were developed differently across two different formats, and the professional colleges (RACGP and ACRRM) would be advised to revisit the blueprints, focussing on the anticipated attributes of GP registrars. Expected relationships of the MMI with independent external variables, such as the SJT, provide some evidence to support the validity of its use in postgraduate settings
. There is also a pragmatic interest in the relationship, as it has been claimed that SJTs would be a more cost-efficient methodology than more resource intensive assessments of non-cognitive attributes, such as the MMI.
 The finding of a modest disattenuated correlation (r = 0.35) between the behavioural MMI and the SJT suggests that the two formats are testing differing non-cognitive aspects and should both be retained on the argument of divergent validity. One advantage of situational questions is that all interviewees respond to the same hypothetical situation rather than describing experiences from their past that are unique to them. Another advantage is that situational questions allow respondents who have had no direct job experience relevant to a particular question to provide a hypothetical response. Where feasibility and cost constrain the number of assessment formats that can be used, the question arises as to which format best predicts GP registrar performance, either in-training or in professional college examinations. This NAC principally focussed on non-cognitive characteristics of candidates. In postgraduate settings there has been international interest in offering some testing of clinical competence, particularly where many candidates have received their medical degrees and early training in multiple settings, some of which are of varying quality. For example, within the UK, the postgraduate selection community has favoured the combination of a cognitive test, the clinical problem-solving test (CPST)
 with the non-cognitive SJT to ensure a broader coverage of desired candidate attributes. To date, in Australia, candidates’ clinical competence is assumed to be represented by either an Australian primary medical degree or, for international medical graduates, a pass in an Australian Medical Council Accreditation Examination. Perhaps because of a lack of assessment in the intern and resident years, there has been sufficient concern about the clinical competence of borderline candidates that sections of the GP selection community have pushed for an element of clinical competence testing alongside the SJT and the MMI. The relationship between the MMI and the clinical knowledge section of the SJT is of interest in this context. Decisions about the best combination of selection formats will likely require validity studies of the MMI and SJT, individually and in combination, to determine what best predicts observed performance in practice. Further debate is required amongst stakeholders to ensure that the validity of the MMI continues to have relevance when considering logistically sustainable combined measures of the trainability of entrants into specialist training.
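The disattenuated correlation between the MMI and the SJT referred to above corrects an observed correlation for the unreliability of both measures. A minimal sketch of the standard correction for attenuation follows. The observed correlation and reliabilities used in the example are hypothetical values chosen for illustration only, and are not the values from this study.

```python
import math


def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for measurement error in both tests.

    r_true = r_observed / sqrt(rel_x * rel_y)
    """
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(rel_x * rel_y)


# Hypothetical inputs for illustration only (not study data).
r_corrected = disattenuate(r_observed=0.24, rel_x=0.64, rel_y=0.81)
print(f"disattenuated r = {r_corrected:.2f}")
```

Because the correction divides by a quantity no greater than one, the disattenuated value is never smaller in magnitude than the observed correlation, so a modest disattenuated value implies an even more modest observed one.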
NAC decision making
Developing a cut score for the combined NAC score that is both psychometrically robust and acceptable to all stakeholders is a complex process. However, it is important to provide data on which to base these on-going discussions. Although no formal standard setting procedure was used in the NAC, we modelled possible standard setting procedures for future iterations. We had anticipated that the MMI’s reported precision
[10, 13] would allow relative ranking of candidates. From Figure 3, the MMI contains enough precision to suggest concern that 23 (1.7%) candidates had failed the MMI with 95% confidence. However, the confidence interval crosses three quartiles giving less than 95% confidence that a candidate at the bottom of the top quartile might behave better than a candidate at the top of the 3rd quartile
. There needs to be acknowledgement that large-scale performance-based assessments are logistically complex and costly to run. Scores based solely on performance-based stations, such as the MMI, require extended testing time to achieve acceptable generalizability, to which would be added time for question development and training. Combining scores from performance-based formats and written formats may improve test generalizability, and methods to do this already exist
[42, 43]. It is possible that the combined NAC score was more generalizable than either of the two measures individually, and potentially a better use of resources. In considering the construct validity, generalizability and precision of the combination of the MMI and SJT, more data would need to be made available on the detailed scoring of the SJT in order to develop an acceptable methodology in which all stakeholders had confidence. Additionally, a method for providing a cut score for the SJT, for example a modified Angoff, would need to be provided.
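To make concrete how combining a performance-based and a written format can change overall reliability, the following is a minimal sketch of one classical approach, the reliability of a weighted composite assuming uncorrelated measurement errors. It is not necessarily the method of the cited references or the one used for the NAC. The MMI reliability of 0.70 echoes the six-station figure above, while the SJT reliability, the inter-format correlation, the weights and the standard deviations are hypothetical.

```python
def composite_reliability(weights, sds, reliabilities, correlations):
    """Reliability of a weighted composite of component scores.

    correlations is a square matrix of observed correlations between
    components, with ones on the diagonal. Measurement errors of the
    components are assumed to be uncorrelated.
    """
    k = len(weights)
    if not (len(sds) == len(reliabilities) == len(correlations) == k):
        raise ValueError("all inputs must describe the same components")
    # Observed variance of the weighted composite.
    composite_var = sum(
        weights[i] * weights[j] * sds[i] * sds[j] * correlations[i][j]
        for i in range(k)
        for j in range(k)
    )
    # Error variance contributed by each component.
    error_var = sum(
        (weights[i] * sds[i]) ** 2 * (1.0 - reliabilities[i]) for i in range(k)
    )
    return 1.0 - error_var / composite_var


# Hypothetical illustration: standardised MMI and SJT scores, equally weighted.
rel = composite_reliability(
    weights=[1.0, 1.0],
    sds=[1.0, 1.0],
    reliabilities=[0.70, 0.80],  # SJT value of 0.80 is an assumption
    correlations=[[1.0, 0.25], [0.25, 1.0]],  # 0.25 is an assumption
)
print(f"composite reliability = {rel:.2f}")
```

Whether the composite is more reliable than its best component depends on the weights, the component reliabilities and the correlation between formats, which is why detailed SJT scoring data would be needed before drawing conclusions.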
Limitations of the study
The strength of this study was that it evaluated a high stakes National Assessment Centre approach, with sufficient numbers to ensure adequate sampling of all the factors. However, the study was a secondary analysis of a process that was conducted naturalistically and was constrained by what was logistically possible. As is often the case in such settings, there was no fully formalized design that assigned specific interviewers or a specific set of items to each MMI circuit, or that determined which version of the SJT candidates sat. We had initially anticipated that ‘candidate*interviewer’ and ‘candidate*MMI question’ interactions would be confounded
 and included in the error term, because of the single interviewer within station design. However, the GLM procedure was able to provide estimates because there were enough degrees of freedom. We therefore assumed a partially crossed model of generalisability to best reflect this particular setting
. In this study we were unable to link interviewer demographics, and so could not provide additional analysis of the impact of rater characteristics on interviewer subjectivity, as we have done in previous studies