Assessing the validity of an OSCE developed to assess rare, emergent or complex clinical conditions in endocrinology & metabolism

Background Assessment of emergent, rare or complex medical conditions in Endocrinology and Metabolism (E&M) is an integral component of training. However, data is lacking on how this could be best achieved. The purpose of this study was to develop and administer an Objective Structured Clinical Examination (OSCE) for E&M residents, and to gather validity evidence for its use. Methods A needs assessment survey was distributed to all Canadian E&M Program Directors and recent graduates to determine which topics to include in the OSCE. The top 5 topics were selected using a modified Delphi technique. OSCE cases based on these topics were subsequently developed. Five E&M residents (PGY4-5) and five junior Internal Medicine (IM) residents participated in the OSCE. Performance of E&M and IM residents was compared and results were analyzed using a Generalizability study. Examiners and candidates completed a survey following the OSCE to evaluate their experiences. Results The mean score of IM and E&M residents was 41.7 and 69.3 % (p < 0.001), respectively, with a large effect size (partial η2 = 0.75). Overall reliability of the OSCE was 0.74. Standard setting using a borderline regression method resulted in a pass rate of 100 % of E&M residents and 0 % of IM residents. All residents felt the OSCE had high value for learning as a formative exam. Conclusions The E&M OSCE is a feasible method for assessing emergent, rare and complex medical conditions and this study provides validity evidence to support its use in a competency-based curriculum.


Background
The shift towards competency-based medical education (CBME) in post-graduate medical education requires frequent assessment of physician competencies across various clinical contexts within each specialty [1]. However, there is significant variability in the cases that residents may encounter during their training. Because of this, residents may never be assessed on their ability to manage some rare, emergent or complex conditions which would be essential for becoming an expert in their field. If residents have had limited or no exposure to these cases, then their ability to manage patients with these conditions may negatively impact future patient outcomes. This leads to the question of how do we evaluate competencies in certain clinical scenarios that are not easy to access in the learning environment? Ideally, one would like to observe how trainees perform during real clinical encounters, however exposure to some conditions may be limited and so other assessment opportunities must be sought. In these cases, simulation in the form of Objective Structured Clinical Examinations (OSCEs) may be useful [2]. In a CBME model, it is important to demonstrate progression of clinical skills through frequent observational assessments. Within a programme of assessment, multiple methods of assessment can be combined to achieve an overall impression of competency within a specific domain [3]. Part of this design will involve an increasing number of workplace assessments to assess trainees' progression along a continuum towards expertise. This is possible in situations where a clinical problem is common, with several opportunities for hands-on advancement of skills. However, in clinical scenarios that are rare, emergent or complex, assessing competence in a real-life setting may not be consistently possible during their training period.
Introducing a formative OSCE to address progressive assessment across training years may be of benefit in a CBME model where assessment data from multiple methods contribute to competency decisions [3]. Furthermore, the use of OSCEs as progress tests have been shown to be useful in discriminating between levels of training within an Internal Medicine (IM) program [4], but has not been described in Endocrinology programs. Although it is known that OSCEs are resource-intensive, they can be viewed as a more suitable method for assessing cases that are neither easily accessible in the workplace (i.e., rare, emergent, or complex) nor easily assessed in a written format [2]. OSCEs are purported to be objective and structured but they are not necessarily considered superior to other methods, rather, they can be complementary within a programme of assessment [5]. However, when deciding to include a particular assessment into a curriculum one must consider sources of validity evidence to justify its position in the program [6]. An assessment with robust validity evidence enables assessors to trust that the scores obtained represent the construct it intends to measure [7].
Messick and Kane's modern validity frameworks aim to gather evidence from various sources in order to demonstrate construct validity (i.e., the degree to which the test measures what it purports to measure) [8][9][10][11]. However, the advantage of Kane's validity framework is that it prioritizes validity evidence in key phases or inferences within the validity argument: scoring, generalization, extrapolation and implications [12,13]. The first step of Scoring, seeks to ensure that the scores obtained from the observed actions best represent the performance [8]. The second step of Generalization refers to obtaining an overall test score that represents the general performance test setting in equivalent types of tests [8,12]. Thirdly, the Extrapolation phase aims to determine if the observed performance correlates with real world performance or other measures of the same or similar performance domains [14,15]. Lastly, the Implications of the assessment tool includes decision making (i.e., pass/fail) or consequences of the test on those assessed [14]. To date, there have been no published articles on assessing validity evidence in OSCEs within the Endocrinology & Metabolism (E&M) specialty.
The purpose of this study is two-fold: (1) to develop a pilot OSCE for E&M residents to assess their management of rare, emergent or complex E&M scenarios that may be missed in clinical training; and (2) to gather validity evidence for this OSCE in light of Kane's framework. We aim to address the following questions: To what extent does the validity evidence support the use of the E&M OSCE as a formative assessment for rare, emergent or complex cases? More specifically, to what degree does the OSCE represent the constructs it intends to measure? Finally, what is the perceived value for learning from a resident's perspective? In order to achieve this, we carefully designed an OSCE that represents what we intended to assess, while collecting validity evidence.

OSCE Development and Design Needs Assessment
An electronic survey (via Survey Monkey©) was sent to all 13 E&M Program Directors across Canada as well as 29 recent E&M graduates (i.e., those who graduated within the last two years) to seek their opinion regarding gaps in their residency training program and which topics they believe would be important to consider for an OSCE.
A list of rare and emergent cases was included in the survey to rank (determined from objectives of E&M training and content expert agreement), in addition to a free-text area to suggest topics. Consensus was ascertained using a modified Delphi technique, involving two rounds of ranking the top ranked priority topics. From this, a list of five top-ranked topics was identified, all of which are reflected in the Royal College of Physicians & Surgeons of Canada (RCPSC) Objectives for Training for E&M residents (http://www.royalcollege.ca/rcsite/ibdsearch-e?N=10000033+10000034+4294967098).

Case Development
Five cases based on the top-ranked were developed by a specialist in E&M and were reviewed by three additional content experts. Through an iterative process, each case was reviewed and revised by three of the study investigators (SD, JM and DP) and the current E&M Program Director at the University of Ottawa (UofO).

Setting and administration
The OSCE was administered at the UofO in the 2018-2019 academic year. To accommodate all candidates, the OSCE was administered twice, using one track (5 cases consecutively). Five candidates participated in each administration (total n = 10). Each administration contained the same five cases and each case lasted 12 min. Each candidate was assessed by a unique rater for each station and the raters remained unchanged for each administration.

Five E&M resident physicians [3 PGY (Post Graduate
Year)-4 s and 2 PGY-5 s] were recruited as participants. Additionally, five Internal Medicine (IM) residents (PGY1 to 3) were recruited as a comparison group. All residents went through an informed consent process with the Research Assistant. Immediately preceding the OSCE, each resident group participated in an orientation session (led by SD) to explain the purpose and structure of the OSCE and to address any concerns.

Examiners (raters)
Raters included faculty experts (four Endocrinologists and one Internal Medicine Specialist). An orientation session prior to the OSCE was provided to the examiners (led by DP) to explain the purpose and structure of the OSCE, to ensure that they were familiar with the use of scoring instruments, and to provide the opportunity to ask questions about the OSCE.

Standardized patients
Experienced standardized patients (SPs) were recruited and received training for their roles by experienced trainers, in line with current global standards of SP training [16].

Scoring Instruments
Participants were assessed by raters using scenariospecific scoring sheets (consisting of checklists and a series of rating scales) with items that were case-specific. Each contained "key feature" items that are deemed to be important actions necessary to meet the topic objectives. The case-specific checklists were developed by the principal investigator (SD) and reviewed using consensus agreement amongst content experts, including the E&M Program Director. Rating scales were used to rate performance in the areas of: (1) Organizational skills; (2) Ability to communicate plan; and (3) Ability to prioritize acute medical issues. A global rating score (GRS) designed to rate candidates' overall competence was also included.

Analyses
Using Kane's modern validity framework, sources of validity evidence were gathered and analyzed in the domains of Scoring, Generalization, Extrapolation, and Implications.

Scoring
Weighting for the checklist and rating scale components was determined by a panel of experts in E&M and OSCE administration. A total score for each case was derived by combining the total checklist scores (70 %) with the rating scales (30 %). Descriptive statistics were calculated, as well as item-total correlations for each case using SPSS Software version 25. To ensure the integrity of data, quality assurance measures were employed during data collection and data entry. Immediately following the OSCE, the examination staff ensured that all checklists and rating scales were completed accurately. Data entry was double checked by experienced staff who employed quality control checks to ensure accuracy of scores entered in analyses.

Generalizability
Although this was a small-scale OSCE, the blueprint was derived using consensus methods to gain input from various stakeholders.
Measures of Generalizability include the reliability (i.e., reproducibility) of the scores, and the degree to which the stations represent the domain of interest. Since stations have multiple factors that can contribute to variance, Generalizability Theory (G-theory) was applied to quantify to what degree each variable (i.e., resident type, training level, participants, or stations) contributed to the overall variability in the scores. To generate the variance components, a mixed analysis of variance was conducted with students nested with discipline and crossed with stations. These variance components were then used to generate the reliability of the exam scores. Because we were interested in scores and not the reliability of the pass/fail standard, a relative reliability was used. We also used the results of the generalizability analysis to conduct a decision study, which uses the variance components to derive estimates of reliability if various factors in the model are varied. This analysis will be useful for determining how many stations are needed to produce a reliable set of exam scores.

Extrapolation
The ability of the OSCE to discriminate between novice (PGY 1-3) and expert groups (PGY 4-5), was measured using an independent t test.

Implications
Although this was designed as a formative examination, the Borderline Regression Method (BRM) was used to demonstrate how to apply methods for standard setting. This method involves a linear regression approach where all candidates' checklist scores are regressed onto their global rating score to produce a linear equation [17]. The cut-score is determined by inserting the midpoint of the GRS (which is 3.5 on the current 6-point scale) into the equation, which results in a corresponding predicted checklist score [17].
Pass-fail decisions on this OCSE had no bearing on participants' progression through the E&M program and were used to help determine if certain stations were unfairly difficult or if there were areas of underperformance that would require attention. Identification of difficult stations were utilized to inform curriculum change to promote learning in weaker areas.
To obtain the residents' perspective, a post-OSCE survey was used to evaluate the degree of acceptability of the examination, and the degree to which they felt the OSCE has value for learning.
The top five selected topics in the "rare or complex" category were: (1) complex Cushing's disease, (2) investigation and management of hyperaldosteronism, (3) pre-op management of pheochromocytoma, (4) Graves' disease in pregnancy and (5) MEN syndromes (Fig. 2). The total frequencies per topic were totalled and the top 10 topics were subsequently used for ranking in the second survey.
There were 14 respondents from the second survey (6 PDs and 8 E&M Graduates) of the original 42 that were invited (33 % response rate). Five topics emerged from the ranking exercise in the second survey: (1) pre-operative management of pheochromocytoma; (2) thyroid storm; (3) pituitary apoplexy; (4) Graves' disease in pregnancy and (5) investigation and management of hyperaldosteronism (Fig. 3). These topics were used as the basis for OSCE case development.
To ensure that the top ranked cases adequately represented the construct we intended to measure, the study investigators (SD, JM, DP) reviewed the results in detail to come to a consensus and were deemed suitable to meet the objectives of this OSCE.

Evidence for validity Scoring
Mean scores out of 10 and item-total correlations (ITCs) per station are represented in Table 1. Stations scores with ITCs over 0.3 are considered to have good correlations with the total score. The station ITCs are quite high, indicating that scores for all stations are highly correlated with the total score for the entire OSCE. This is not surprising, given that the OSCE was designed to assess a relatively focused area of clinical practice (i.e., rare, emergent and complex conditions in E&M).

Generalizability
The results of the generalizability analysis are displayed in Table 2. The discipline (d) effect (Endocrinology & Metabolism versus Internal Medicine) accounts for 59 % of the variance of scores, indicating a difference between E&M residents and IM resident performance. Participants within each discipline (p:d) accounted for 10 % of the variance, indicating that there were small differences between people within each discipline. The differences between stations accounted for 14 % of the variance, however the interaction between stations and discipline (ds) was 0 %, suggesting that the stations performed similarly for each discipline.
The G-coefficient, considering both disciplines and stations simultaneously, was 0.74, which is sufficient for lower-stakes exams [18]. If this was designed to be a higher stakes examination, 7 stations in total would be needed to achieve a reliability of 0.80.

Extrapolation
The mean score in group 1 (E&M Residents) was 69.3 % for the entire OSCE, compared to 41.7 % for group 2 (IM group) ( Table 3). Independent t tests for group 1 versus group 2 show a significant difference between the two groups (p = 0.001) ( Table 4). Differences in performance based on discipline were calculated using a factorial ANOVA with discipline as a between subjects factor and the total score as the dependent measure. There was a statistically significant difference between IM and E&M Residents: F (1,8) = 23.46, (p = 0.001), with a very large effect size (partial eta 2 = 0.75).

Pass-fail results
Using the BRM, the overall OSCE cut score was determined by totaling the cut scores for all 5 stations, resulting in a total cut score of 29.6 out of 50 (equivalent to 59.2/100). Residents in the E&M Group passed more stations than the IM Group (Table 5). All 5/5 (100 %) of E&M Residents would have passed the entire OSCE, conversely, 0 % of IM Residents would have passed if this was considered high stakes, which is not surprising given that the was an OSCE designed to assess rare, emergent and complex conditions in E&M.

Resident Experience
The majority of IM residents had not encountered any of these OSCE scenarios in real life. However, the E&M residents reported variable exposures to these cases in the clinical world. Overall, residents felt that the OSCE cases were relevant to their learning with 57 % stating that the topics "mostly aligned with their learning objectives" and 43 % stating that they "completely aligned with their learning objectives". Further to this 57 % felt that the OSCE was "likely to drive further learning", and 43 % stated it was "very likely to drive further learning".
Examiner Experience All five examiners completed the post-OSCE survey, reporting that their case was either a "good" representation (80 %) or an "excellent" representation (20 %) of a true patient encounter. Furthermore, most rated the SP performances as "excellent" (80 %), while the remainder rated it as "good" (20 %) for their respective stations.

Discussion
Because many challenging clinical conditions may not be encountered during a resident's E&M training, we developed and administered a novel OSCE to assess their knowledge and skills in managing rare, emergent and complex conditions. Validity evidence for this OSCE was gathered and analyzed through the lens of Kane's validity framework. Evidence for 'scoring' was demonstrated through the use of multiple content experts in the development of the rating instruments, through the careful training of raters and SPs, as well as through the high item total correlations that were found. 'Generalization' evidence included rigorous methods to determine the exam content (i.e., national survey, consensus methods, and expert review) and relatively high exam reliability as determined by a G study. The differences demonstrated between experts and novices provided evidence for 'extrapolation'. In addition, the survey data confirmed that raters and trainees viewed the experience as authentic and relevant. And finally, although this was a formative OSCE, evidence for 'implications' was demonstrated through the use of standard-setting methods and by informing curricular changes. Current literature supports the value of simulation in procedural based specialties such as Surgery, Anesthesia, and Emergency Medicine for both training and assessment [19,20]. An OSCE was developed to assess resuscitation skills in Emergency Medicine, showing the value of simulation in acute and emergent settings, by differentiating between skill levels [19]. Similarly, a simulation curriculum called the National Anesthesiology Simulation Curriculum (CanNASC), has been implemented across 17 Anaesthesia training programs in Canada designed to assess knowledge or skills gaps in the curriculum [21]. In contrast, there is a lack of data on the use of simulation in medicine sub-specialties. Our study is the first to describe the development and evaluation of an OSCE for rare, emergent and complex conditions in E&M. Since some E&M conditions are encountered so infrequently, certain topics may be missed during the course of a resident's sub-specialty training. OSCEs therefore could have a role in providing an opportunity to practice in an environment that does not impact patient safety [22]. Although our OSCE was designed to specifically assess difficult areas of assessment in the E&M specialty, similar methods could be used to produce an assessment tool in any clinical context.
The OSCE was intended to represent a sample of priority topics felt to be rare, emergent or complex in the E&M specialty. At present, there have been no established methods for assessing E&M Residents for this category of conditions within the educational literature. The Royal College of Physicians and Surgeons of Canada require that E&M residents meet objectives including managing Endocrine emergencies, in addition to other rare conditions captured within this OSCE [23]. However, since there is no 'practical examination' for E&M board certification, residents may never be assessed in this domain. It would be important that E&M residents are exposed to these important clinical topics both to drive learning and provide an opportunity for competency assessment. As we progress towards CBME, E&M programs will require more collection of data to support competency decisions, particularly through observation of trainees. Assessment of competencies in the form of Entrustable Professional Activities (EPA) require progressive levels of observation [3]. The OSCE is a useful tool for providing opportunities for observation in circumstances where clinical scenarios are rarely encountered, therefore providing an opportunity for assessment in areas that may be missed in the clinical environment. As EPAs have become the main unit of competency assessment in post-graduate training programs, it would be important to ensure that future OSCE scores and rating scales align with milestones to inform EPAs.
As evidenced by the positive learning experience comments from the residents, having a formative OSCE in this category of topics, provides an opportunity to drive further learning. It is known that competence is "context specific" and can vary based on the situation, the learner and assessor [24]. However, within a programme of assessment, this OSCE can serve as one of many different instruments used to provide data towards the continuum of competency assessment.
Although it is recognized that it may not be possible to assess every competency in a CBME curriculum [3], providing a sample of potentially missed scenarios as simulated cases can serve as an impetus towards "assessment for learning". Including simulated cases within a programme of assessment, may allow residents to use feedback as a vehicle to re-focus their learning to fill competency gaps [5]. Pugh et al. studied how formative OSCEs affect learning by surveying Internal Medicine residents, illustrating that residents view OSCEs as "a hurdle to overcome", while simultaneously viewing it as a platform to receive feedback and learn [25]. The residents in our OSCE mirrored this belief that having exposure to these cases was likely to drive further learning. If this type of assessment is implemented into an E&M training program, residents may be more self-aware of their learning gaps and will hopefully mitigate these gaps before they enter independent practice.

Limitations
Potential limitations of this study include the limited sample size and the implementation of the OSCE at a single centre. If additional funding was available, recruitment of additional E&M residents outside of our institution would be beneficial to increase generalizability outside of our local context. We would have ideally administered the OSCE for E&M residents from multiple programs and would seek to do this at a national level, for example, at an annual conference. Although not all possible topics could be captured in this OSCE, the performance may be representative of resident performance on topics within this domain based on their similarity in cognitive complexity and level of expertise. Although we were able to differentiate between the junior IM group (PGY 1-3) group and E&M group (PGY 4-5) based on performance in this OSCE, we would want to further delineate performance between PGY-4 vs. PGY-5 vs. practising physician for this specialty exam. This would require a larger scale study with more resident participants in each PGY group if we intend to see a difference between years of training. Administering the OSCE annually using different cases within a similar clinical context would provide data to observe progression in achieving competencies.

Conclusions
The development and evaluation of a pilot OSCE designed for assessing rare, emergent and complex topics in Endocrinology & Metabolism was shown to be a feasible means to distinguish between levels of expertise. We applied Kane's framework to acquire validity evidence and we can infer that the scoring instrument was useful in rating candidates' performance. Importantly, the OSCE scores distinguished between novice-expert learners meaning that the difficulty of the content was set at the appropriate level. The reliability of the results was adequate for a low-stakes formative exam. As we move towards CBME, a novel OSCE that addresses these important clinical scenarios would be important to include within a programme of assessment. This type of exam could also have potential utility as a high-stakes exam for qualification at the end of E&M training as it assesses unique scenarios that are expected of an Endocrinologist.