Assessing the validity of an OSCE developed to assess rare, emergent or complex clinical conditions in endocrinology & metabolism

Dizon, Stephanie; Malcolm, Janine C; Rethans, Jan-Joost; Pugh, Debra

doi:10.1186/s12909-021-02653-4

Research article
Open access
Published: 20 May 2021

BMC Medical Education volume 21, Article number: 288 (2021)

Abstract

Background

Assessment of emergent, rare or complex medical conditions in Endocrinology and Metabolism (E&M) is an integral component of training. However, data are lacking on how this can best be achieved. The purpose of this study was to develop and administer an Objective Structured Clinical Examination (OSCE) for E&M residents, and to gather validity evidence for its use.

Methods

A needs assessment survey was distributed to all Canadian E&M Program Directors and recent graduates to determine which topics to include in the OSCE. The top 5 topics were selected using a modified Delphi technique. OSCE cases based on these topics were subsequently developed. Five E&M residents (PGY4-5) and five junior Internal Medicine (IM) residents participated in the OSCE. Performance of E&M and IM residents was compared and results were analyzed using a Generalizability study. Examiners and candidates completed a survey following the OSCE to evaluate their experiences.

Results

The mean score of IM and E&M residents was 41.7 and 69.3 % (p < 0.001), respectively, with a large effect size (partial η² = 0.75). Overall reliability of the OSCE was 0.74. Standard setting using a borderline regression method resulted in a pass rate of 100 % of E&M residents and 0 % of IM residents. All residents felt the OSCE had high value for learning as a formative exam.

Conclusions

The E&M OSCE is a feasible method for assessing emergent, rare and complex medical conditions and this study provides validity evidence to support its use in a competency-based curriculum.

Background

The shift towards competency-based medical education (CBME) in post-graduate medical education requires frequent assessment of physician competencies across various clinical contexts within each specialty [1]. However, there is significant variability in the cases that residents may encounter during their training. Because of this, residents may never be assessed on their ability to manage some rare, emergent or complex conditions that are essential to becoming an expert in their field. If residents have had limited or no exposure to these cases, their inexperience in managing patients with these conditions may negatively affect future patient outcomes. This raises the question of how to evaluate competencies in clinical scenarios that are not readily accessible in the learning environment. Ideally, one would observe how trainees perform during real clinical encounters; however, exposure to some conditions may be limited, so other assessment opportunities must be sought. In these cases, simulation in the form of Objective Structured Clinical Examinations (OSCEs) may be useful [2].

In a CBME model, it is important to demonstrate progression of clinical skills through frequent observational assessments. Within a programme of assessment, multiple methods of assessment can be combined to achieve an overall impression of competency within a specific domain [3]. Part of this design will involve an increasing number of workplace assessments to assess trainees’ progression along a continuum towards expertise. This is possible where a clinical problem is common and offers several opportunities for hands-on advancement of skills. However, for clinical scenarios that are rare, emergent or complex, assessing competence in a real-life setting may not be consistently possible during the training period.

Introducing a formative OSCE to address progressive assessment across training years may be of benefit in a CBME model where assessment data from multiple methods contribute to competency decisions [3]. Furthermore, the use of OSCEs as progress tests has been shown to be useful in discriminating between levels of training within an Internal Medicine (IM) program [4], but has not been described in Endocrinology programs. Although OSCEs are known to be resource-intensive, they can be viewed as a more suitable method for assessing cases that are neither easily accessible in the workplace (i.e., rare, emergent, or complex) nor easily assessed in a written format [2]. OSCEs are purported to be objective and structured, but they are not necessarily considered superior to other methods; rather, they can be complementary within a programme of assessment [5]. However, when deciding to include a particular assessment in a curriculum, one must consider sources of validity evidence to justify its position in the program [6]. An assessment with robust validity evidence enables assessors to trust that the scores obtained represent the construct it intends to measure [7].

Messick and Kane’s modern validity frameworks aim to gather evidence from various sources in order to demonstrate construct validity (i.e., the degree to which the test measures what it purports to measure) [8,9,10,11]. However, the advantage of Kane’s validity framework is that it prioritizes validity evidence in key phases or inferences within the validity argument: scoring, generalization, extrapolation and implications [12, 13]. The first step, Scoring, seeks to ensure that the scores obtained from the observed actions best represent the performance [8]. The second step, Generalization, concerns whether the overall test score represents general performance in the test setting and in equivalent types of tests [8, 12]. Thirdly, the Extrapolation phase aims to determine whether the observed performance correlates with real-world performance or with other measures of the same or similar performance domains [14, 15]. Lastly, the Implications of the assessment tool include decision making (i.e., pass/fail) and the consequences of the test for those assessed [14]. To date, there have been no published articles on assessing validity evidence in OSCEs within the Endocrinology & Metabolism (E&M) specialty.

The purpose of this study is two-fold: (1) to develop a pilot OSCE for E&M residents to assess their management of rare, emergent or complex E&M scenarios that may be missed in clinical training; and (2) to gather validity evidence for this OSCE in light of Kane’s framework. We aim to address the following questions: To what extent does the validity evidence support the use of the E&M OSCE as a formative assessment for rare, emergent or complex cases? More specifically, to what degree does the OSCE represent the constructs it intends to measure? Finally, what is the perceived value for learning from a resident’s perspective? In order to achieve this, we carefully designed an OSCE that represents what we intended to assess, while collecting validity evidence.

Methods

OSCE Development and Design

Needs Assessment

An electronic survey (via Survey Monkey©) was sent to all 13 E&M Program Directors across Canada as well as 29 recent E&M graduates (i.e., those who graduated within the last two years) to seek their opinion regarding gaps in their residency training program and which topics they believe would be important to consider for an OSCE.

The survey included a list of rare and emergent cases for respondents to rank (derived from the objectives of E&M training and content expert agreement), in addition to a free-text area for suggesting topics. Consensus was ascertained using a modified Delphi technique involving two rounds of ranking priority topics. From this, a list of five top-ranked topics was identified, all of which are reflected in the Royal College of Physicians & Surgeons of Canada (RCPSC) Objectives for Training for E&M residents (http://www.royalcollege.ca/rcsite/ibd-search-e?N=10000033+10000034+4294967098).

Case Development

Five cases based on the top-ranked topics were developed by a specialist in E&M and were reviewed by three additional content experts. Through an iterative process, each case was reviewed and revised by three of the study investigators (SD, JM and DP) and the current E&M Program Director at the University of Ottawa (UofO).

Setting and administration

The OSCE was administered at the UofO in the 2018–2019 academic year. To accommodate all candidates, the OSCE was administered twice, using one track (5 cases consecutively). Five candidates participated in each administration (total n = 10). Each administration contained the same five cases and each case lasted 12 min. Each candidate was assessed by a unique rater for each station and the raters remained unchanged for each administration.

Context and subjects

Five E&M resident physicians [3 PGY (Post Graduate Year)-4s and 2 PGY-5s] were recruited as participants. Additionally, five Internal Medicine (IM) residents (PGY1 to 3) were recruited as a comparison group. All residents went through an informed consent process with the Research Assistant. Immediately preceding the OSCE, each resident group participated in an orientation session (led by SD) to explain the purpose and structure of the OSCE and to address any concerns.

Examiners (raters)

Raters included faculty experts (four Endocrinologists and one Internal Medicine Specialist). An orientation session prior to the OSCE was provided to the examiners (led by DP) to explain the purpose and structure of the OSCE, to ensure that they were familiar with the use of scoring instruments, and to provide the opportunity to ask questions about the OSCE.

Standardized patients

Experienced standardized patients (SPs) were recruited and received training for their roles by experienced trainers, in line with current global standards of SP training [16].

Scoring Instruments

Participants were assessed by raters using scenario-specific scoring sheets consisting of case-specific checklists and a series of rating scales. Each sheet contained “key feature” items, that is, actions deemed necessary to meet the topic objectives. The case-specific checklists were developed by the principal investigator (SD) and reviewed using consensus agreement amongst content experts, including the E&M Program Director. Rating scales were used to rate performance in the areas of: (1) Organizational skills; (2) Ability to communicate plan; and (3) Ability to prioritize acute medical issues. A global rating score (GRS) designed to rate candidates’ overall competence was also included.

Analyses

Using Kane’s modern validity framework, sources of validity evidence were gathered and analyzed in the domains of Scoring, Generalization, Extrapolation, and Implications.

Scoring

Weighting for the checklist and rating scale components was determined by a panel of experts in E&M and OSCE administration. A total score for each case was derived by combining the total checklist scores (70 %) with the rating scales (30 %). Descriptive statistics were calculated, as well as item-total correlations for each case using SPSS Software version 25. To ensure the integrity of data, quality assurance measures were employed during data collection and data entry. Immediately following the OSCE, the examination staff ensured that all checklists and rating scales were completed accurately. Data entry was double checked by experienced staff who employed quality control checks to ensure accuracy of scores entered in analyses.
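The 70/30 weighting described above can be illustrated with a minimal Python sketch. It assumes that both components are first expressed as percentages of their maximum, which the text does not state explicitly; the function name `case_total` and the example values are hypothetical and are not study data.

```python
def case_total(checklist_pct: float, rating_pct: float,
               w_checklist: float = 0.70, w_rating: float = 0.30) -> float:
    """Combine checklist and rating-scale percentages into one case score.

    Assumption: both inputs are already on a 0-100 scale.
    """
    if abs(w_checklist + w_rating - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_checklist * checklist_pct + w_rating * rating_pct


# Hypothetical candidate: 60% of checklist points, 80% on the rating scales
print(round(case_total(60.0, 80.0), 1))  # 66.0
```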

Generalizability

Although this was a small-scale OSCE, the blueprint was derived using consensus methods to gain input from various stakeholders.

Measures of Generalizability include the reliability (i.e., reproducibility) of the scores, and the degree to which the stations represent the domain of interest. Since multiple factors can contribute to variance in station scores, Generalizability Theory (G-theory) was applied to quantify the degree to which each variable (i.e., resident type, training level, participants, or stations) contributed to the overall variability in the scores. To generate the variance components, a mixed analysis of variance was conducted with residents nested within discipline and crossed with stations. These variance components were then used to estimate the reliability of the exam scores. Because we were interested in scores and not the reliability of the pass/fail standard, a relative reliability was used. We also used the results of the generalizability analysis to conduct a decision study, which uses the variance components to derive estimates of reliability when various factors in the model are varied. This analysis is useful for determining how many stations are needed to produce a reliable set of exam scores.
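The logic of a decision study can be sketched in a few lines of Python. This is a simplified illustration that assumes a single facet (stations) and only two variance components: a person (universe-score) component and a residual person-by-station component. Under that assumption the relative G coefficient is var_person / (var_person + var_residual / n_stations). The variance values below are hypothetical and are not the estimates obtained in this study, whose design also included discipline.

```python
from typing import Optional


def relative_g(var_person: float, var_residual: float, n_stations: int) -> float:
    """Relative G coefficient for persons crossed with stations (one facet)."""
    if n_stations < 1:
        raise ValueError("n_stations must be at least 1")
    return var_person / (var_person + var_residual / n_stations)


def stations_needed(var_person: float, var_residual: float,
                    target: float, max_stations: int = 50) -> Optional[int]:
    """Smallest number of stations whose projected reliability meets the target."""
    for n in range(1, max_stations + 1):
        if relative_g(var_person, var_residual, n) >= target:
            return n
    return None  # target not reachable within max_stations


# Hypothetical variance components (NOT the study's estimates)
VAR_PERSON, VAR_RESIDUAL = 40.0, 100.0
for n in (5, 8, 10):
    print(n, round(relative_g(VAR_PERSON, VAR_RESIDUAL, n), 2))
print(stations_needed(VAR_PERSON, VAR_RESIDUAL, target=0.75))
```

Because the error term is divided by the number of stations, projected reliability rises with each added station but with diminishing returns, which is why a decision study is typically used to find the smallest acceptable exam length.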

Extrapolation

The ability of the OSCE to discriminate between the novice (PGY 1–3) and expert (PGY 4–5) groups was measured using an independent-samples t test.
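For illustration, the comparison can be sketched with the Python standard library. The sketch assumes the pooled-variance (Student) form of the test, which the text does not specify, and uses hypothetical scores rather than the study data. It returns the t statistic and degrees of freedom; obtaining the p value requires the t distribution from a statistics package (for example `scipy.stats.ttest_ind`). For a two-group comparison, eta squared follows directly from t as t² / (t² + df), and in a one-factor design it coincides with partial eta squared.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's t statistic (equal variances assumed) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se, df


def eta_squared(t: float, df: int) -> float:
    """Proportion of variance explained by group membership (two groups)."""
    return t * t / (t * t + df)


# Hypothetical total scores in percent (NOT the study data)
em_scores = [70.0, 65.0, 72.0, 68.0, 71.0]
im_scores = [40.0, 45.0, 38.0, 44.0, 42.0]
t, df = pooled_t(em_scores, im_scores)
print(f"t({df}) = {t:.2f}, eta squared = {eta_squared(t, df):.2f}")
```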

Implications

Although this was designed as a formative examination, the Borderline Regression Method (BRM) was used to demonstrate how to apply methods for standard setting. This method involves a linear regression approach where all candidates’ checklist scores are regressed onto their global rating score to produce a linear equation [17]. The cut-score is determined by inserting the midpoint of the GRS (which is 3.5 on the current 6-point scale) into the equation, which results in a corresponding predicted checklist score [17].
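The regression step can be sketched as follows. This is a minimal ordinary least-squares illustration, assuming the cut-score is computed for one station at a time; the function name `borderline_regression_cut` and the example data are hypothetical. Only the 3.5 midpoint of the 6-point GRS is taken from the text.

```python
def borderline_regression_cut(grs, checklist, borderline=3.5):
    """Predict the checklist score at the borderline global rating.

    Fits checklist = intercept + slope * grs by ordinary least squares,
    then evaluates the fitted line at the borderline GRS value.
    """
    n = len(grs)
    if n != len(checklist) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(grs) / n
    mean_y = sum(checklist) / n
    sxx = sum((x - mean_x) ** 2 for x in grs)
    if sxx == 0:
        raise ValueError("global ratings must vary to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(grs, checklist))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * borderline


# Hypothetical station data: global ratings (1-6) and checklist scores (%)
ratings = [2, 3, 3, 4, 5, 6]
scores = [35.0, 42.0, 48.0, 55.0, 68.0, 74.0]
print(round(borderline_regression_cut(ratings, scores), 1))
```

Candidates whose checklist score falls below the returned value would be classified as failing that station under this sketch.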

Pass-fail decisions on this OSCE had no bearing on participants’ progression through the E&M program and were used to help determine whether certain stations were unfairly difficult or whether there were areas of underperformance that would require attention. Identification of difficult stations was used to inform curriculum change to promote learning in weaker areas.

To obtain the residents’ perspective, a post-OSCE survey was used to evaluate the acceptability of the examination and the degree to which residents felt the OSCE had value for learning.

Results

Needs Assessment Survey

Seven out of 13 PDs (54 %) and 14/29 (48 %) E&M Graduates responded to the initial survey, with an overall response rate of 50 % (21/42). The top five selected topics from the “emergent” category in order of frequency were: (1) thyroid storm, (2) pituitary apoplexy, (3) severe hypocalcemia, (4) myxedema coma and (5) diabetic ketoacidosis in pregnancy (Fig. 1).

The top five selected topics in the “rare or complex” category were: (1) complex Cushing’s disease, (2) investigation and management of hyperaldosteronism, (3) pre-op management of pheochromocytoma, (4) Graves’ disease in pregnancy and (5) MEN syndromes (Fig. 2). The frequencies per topic were totalled and the top 10 topics were subsequently used for ranking in the second survey.

Of the original 42 invitees, 14 responded to the second survey (6 PDs and 8 E&M Graduates; 33 % response rate). Five topics emerged from the ranking exercise in the second survey: (1) pre-operative management of pheochromocytoma; (2) thyroid storm; (3) pituitary apoplexy; (4) Graves’ disease in pregnancy and (5) investigation and management of hyperaldosteronism (Fig. 3). These topics were used as the basis for OSCE case development.

To ensure that the top-ranked cases adequately represented the construct we intended to measure, the study investigators (SD, JM, DP) reviewed the results in detail and reached a consensus that the cases were suitable to meet the objectives of this OSCE.