We found that the performance of practicing specialist physicians on complex, realistic simulated critical event scenarios involving teamwork is highly context-specific. Context specificity was identified for technical and communication skills in computer-based simulations for medical students in the 1980s [5, 6], and in assessment using standardised patients in 2004 [8]. Variation due to task sampling (context specificity attributable to the content of the scenario) is known to affect the validity and reliability (generalisability) of scores [17, 18]. However, the measurement properties of assessment scores have not been well characterized for the population of practicing anesthesiologists we studied, nor have they been well evaluated for the types of critical event scenarios we modeled. The simulation cases used in this study were constructed to reflect the timing of events in actual clinical practice, in scenarios that require both technical and behavioural expertise to manage the patient’s condition effectively. These cases were presented with as much realism as is achievable with current simulation techniques. Scenarios were developed to accurately reflect the types of challenges faced by practicing anesthesiologists in real-world emergency situations where the correct answer is not clear and the outcome is not predetermined. Thus, we believe that the performances elicited were likely a fair reflection of how the subjects would have acted in a real situation, although that cannot be known with certainty. Yet, even with some reservations about drawing strong conclusions from these data about the reliability of an assessment using this approach, we were surprised to find that a physician’s performance in one of two critical event simulation scenarios often did not predict their performance in the other. This was true for both behavioural and technical ability.
Previous studies have investigated the psychometric properties (reliability, validity) of scores from standardised patient encounters as well as other simulation-based assessments [19,20,21]. While reliable and valid scores, and the associated high-stakes competency decisions, can be obtained, doing so demands broad sampling of the domain and effective rater training [22]. High-stakes applications, such as the introduction of objective structured clinical examinations (OSCEs) into the primary board certification of anesthesiologists [23], require an evaluation of the sources and magnitude of measurement error to determine the number of scenarios needed to obtain sufficiently precise estimates of ability. The American Board of Anesthesiology (ABA) recently introduced OSCEs to assess two domains that “may be difficult to evaluate in written or oral exams - communication and professionalism and technical skills related to patient care” [24]. Those examinations comprise seven stations. Other certification bodies, including the Royal College of Physicians and Surgeons of Canada, also recognize that simulation-based assessment is uniquely able to evaluate domains not covered by traditional assessment techniques [25]. However, the types of assessment encounters administered can be highly context specific: because of the nature of the management task, the skills measured in one patient management problem may not generalise to another. This implies that numerous performance samples are needed to obtain sufficiently reliable ability estimates.
Despite the validity advantages of assessment based on real or realistic clinical encounters, the inconsistent performance of trainees across cases and the variability of assessment judgements have necessitated the use of simpler or more focused cases, which typically yield results similar to those of a cheaper and easier test such as a multiple-choice written examination [5]. Lengthy and expensive examinations are not considered valuable to practicing clinicians, and as van der Vleuten posits, “Assessment not accepted by staff or students will not survive.” But individual scores are not the only, nor even the most important, use of clinical assessment. Test results can be used for individual reflection, feedback for instructors, and quality monitoring of training programs. Moreover, the input of multiple assessors may capture different meaningful aspects of highly complex and nuanced performance within the same case or across a range of cases [26]. While inconsistent and unreliable scoring may be problematic for certification examinations, these cases may be highly valuable for participant growth and development.
The D study, although limited because each participant was evaluated in only three different scenarios, grouped into two pairs, suggests that more than 20 scenarios would be required to achieve a reliability of 0.8 (desirable for high-stakes assessments [16]). Controlling for the numbers of scenarios and raters, the estimated generalisability coefficients from our study were lower than those reported elsewhere [27, 28]. While the scenarios were modeled to present management challenges that all practicing, board-certified anesthesiologists should be able to handle, we found that some participants performed well on one scenario and poorly on the next. A similar observation was made in a recent analysis of anaesthesiology residents who were scored on four simulation scenarios [29]. In our analysis, this variation was seen in both technical and behavioural performance. The scenarios were developed to elicit nuanced performances that may have been more content specific because clear-cut management expectations were accompanied by ambiguous, real-world interactions with others embedded in various provider roles within the scenario. This result highlights the challenge of developing content-valid and practice-relevant simulation-based performance assessments for practicing physicians, especially if these are to be used for summative purposes.
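For readers less familiar with decision studies, the projection underlying such estimates follows the standard generalisability formula, written here in generic form under the assumption of a fully crossed participant × scenario × rater design (a nested design would modify the error terms); the variance components are symbolic and no study-specific estimates are reproduced here:

\[
E\rho^{2}(n'_{s}, n'_{r}) \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{ps}/n'_{s} + \sigma^{2}_{pr}/n'_{r} + \sigma^{2}_{psr,e}/(n'_{s}\,n'_{r})}
\]

where \(\sigma^{2}_{p}\) is the variance among participants, \(\sigma^{2}_{ps}\) and \(\sigma^{2}_{pr}\) are the participant-by-scenario and participant-by-rater interaction variances, \(\sigma^{2}_{psr,e}\) is the residual, and \(n'_{s}\) and \(n'_{r}\) are the numbers of scenarios and raters in the projected design. Because the participant-by-scenario term shrinks only in proportion to \(n'_{s}\), a large participant-by-scenario component (i.e., strong context specificity) pushes the required number of scenarios upward before the coefficient approaches 0.8.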
For most performance-based assessments, the variance attributable to the task and its associated interactions outweighs that associated with the rater [30]. In our study, variance attributable to the rater was less than that attributable to the task (scenario), but it was not zero for the second pair of scenarios. Even though rater training was quite stringent, individual evaluators still varied in how they used the scoring rubrics. Also, with longer scenarios, the raters had to aggregate holistic judgments over time, potentially leading to more variation between raters. Future studies could explore these potentially biasing effects by collecting performance ratings over time and comparing them with overall judgments. As it stands, at least for the types of complex scenarios modeled in our investigation, the ability estimates of the practitioners were highly dependent on the choice of scenarios and, to a lesser extent, on the choice of raters.
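As a purely illustrative sketch of how such a decision-study projection can be computed, the following Python fragment varies the number of scenarios and raters for a fully crossed design; the variance components are hypothetical placeholders chosen only to demonstrate the calculation and are not the estimates from this study.

# Illustrative D-study projection for a fully crossed
# participant x scenario x rater design. The variance components below
# are hypothetical placeholders, NOT the estimates from this study.

def projected_g(var_p, var_ps, var_pr, var_resid, n_scenarios, n_raters):
    """Projected generalisability coefficient for a candidate design."""
    relative_error = (var_ps / n_scenarios
                      + var_pr / n_raters
                      + var_resid / (n_scenarios * n_raters))
    return var_p / (var_p + relative_error)

# Placeholder components: participant, participant x scenario,
# participant x rater, and residual variance (arbitrary score scale).
var_p, var_ps, var_pr, var_resid = 0.10, 0.45, 0.02, 0.40

for n_s in (2, 5, 10, 20, 30):
    for n_r in (1, 2):
        g = projected_g(var_p, var_ps, var_pr, var_resid, n_s, n_r)
        print(f"scenarios={n_s:2d}, raters={n_r}: Ep^2={g:.2f}")

With placeholders of this kind, in which the participant-by-scenario component dominates, adding raters changes the projection far less than adding scenarios, mirroring the pattern described above.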
To improve reliability, the problem of context specificity can be addressed in a number of ways, including shortening the scenarios (to allow for the collection of more performance samples) and making scenario content more generic. A study of junior anaesthesiology trainees that used a behaviourally anchored rating system to score seven 15-min scenarios achieved a generalisability coefficient of 0.81 [11]. However, shortening the scenarios, while increasing the number of performance samples, could have a negative impact on validity. One of the strengths of longer scenarios is that they more accurately represent the clinical environment, thus allowing for the assessment of patient management strategies over a realistic, evolving event.
We intentionally scored behavioural and technical skills separately, hypothesizing that behavioural skills would be less content specific than technical skills. Our results did not support this hypothesis; behavioural performance was as scenario-specific as technical performance. While one might expect behavioural skills to be more generalisable across different patient encounters, communication skills have been found to be domain specific in other work-based assessments [26]. To address this, typical standardised patient scenarios that measure doctor-patient communication are focused and graded using a process-based checklist [31] to provide reliable assessment of particular skills. Our scenarios included communication with various providers, including a first-responder anesthesiologist, other physicians, and other healthcare professionals, and this likely affected the generalisability of the measurement of communication skills. It is likely that, in actual critical events, the context and criticality of the patient presentation, as well as the particular person or persons present, have a significant effect on both technical and non-technical skills.
Our results suggest that a robust simulation-based high-stakes performance assessment for practicing anesthesiologists would be challenging and, perhaps, impractical. We hesitate to make such a firm conclusion because of the limited number of samples for each subject in this analysis. Regardless of the practicality of simulation for high-stakes assessment, formative assessment of individual performance in these kinds of longer, more complex critical event scenarios still has considerable value for individuals as well as for learning how clinicians perform in general. Numerous studies have shown that simulation-based medical education fosters self-reflection and identification of performance gaps [32,33,34]. As part of ongoing professional improvement, providing feedback to individual physicians about their management of specific clinical emergencies is likely to have a positive impact on the quality of care they subsequently provide. Additionally, standardised, learner-specific technical and behavioural feedback would likely have a greater impact on a learner’s awareness of their knowledge and performance gaps for a particular event than self-assessment. This use of simulation could be initiated using the scenarios and assessment tools we have developed. Objective, specific feedback should have a positive long-term impact on the quality of patient care delivered by individuals who participate in these formative, simulation-based assessments [35].
Although there have been numerous changes in undergraduate medical education and residency training guidelines, “graduate medical education (GME) lacks a data-driven feedback system to evaluate how residency-level competencies translate into successful independent practices...” [36]. Simulation-based performance data from practicing clinicians could be aggregated to inform modifications of educational and training programs to address specific performance deficiencies across specialties. The impact of this approach on the profession and our patients might actually be greater than that of administering high-stakes summative examinations, because the goal would be to raise the performance of the entire profession rather than to identify and restrict the low performers from practicing.
Our study had a number of limitations, most importantly the small group of participants who agreed to be studied as the primary provider in two scenarios. To the extent that these participants are not representative of practicing anesthesiologists as a whole, the generalisability of our findings could be questioned. A larger-scale study, in which participants are required to manage more scenarios, would better quantify the effect of task sampling on the reliability of the scores. Although the order of the cases was not randomised specifically for this subset, it was also not prescribed, and neither of the two cases could be the first case of the day. Further, our study was limited to two independent ratings of each scenario. While rater effects should tend to cancel out with sufficient numbers of scenarios and raters, we were not able to investigate this adequately. For future studies specifically designed to assess the numbers of scenarios and raters needed to achieve adequate reliability for high-stakes assessment, it would be appropriate to incorporate a design in which more participants manage a larger number of encounters, each scored by more raters.
Second, the study was embedded within a required formative educational experience for board-certified anesthesiologists [33], and this affected the design of the scenarios, which were found to differ somewhat in difficulty. Although this may be attributable to the clinical problem being managed, it may also reflect a scenario that was not optimally designed or administered and hence was more difficult for the participants to interpret and manage. For example, the LAST case may have been more challenging than anticipated because of the unrealistic portrayal of seizures by manikins. Since the cases were primarily designed for formative education, their content, timing, and delivery may have been affected. Thus, our results may not fully generalise to a high-stakes assessment setting, where both individual factors (e.g. motivation) and environmental factors could be quite different.