How well do second-year students learn physical diagnosis? Observational study of an objective structured clinical examination (OSCE)

Background Little is known about using the Objective Structured Clinical Examination (OSCE) in physical diagnosis courses. The purpose of this study was to describe student performance on an OSCE in a physical diagnosis course. Methods Cross-sectional study at Harvard Medical School, 1997–1999, for 489 second-year students. Results Average total OSCE score was 57% (range 39–75%). Among clinical skills, students scored highest on patient interaction (72%), followed by examination technique (65%), abnormality identification (62%), history-taking (60%), patient presentation (60%), physical examination knowledge (47%), and differential diagnosis (40%) (p < .0001). Among 16 OSCE stations, scores ranged from 70% for arthritis to 29% for calf pain (p < .0001). Teaching sites accounted for larger adjusted differences in station scores, up to 28%, than in skill scores (9%) (p < .0001). Conclusions Students scored higher on interpersonal and technical skills than on interpretive or integrative skills. Station scores identified specific content that needs improved teaching.


Background
Learning the skills of physical diagnosis is a critical part of the medical school curriculum. While there is widespread agreement on what skills should be learned [1,2], there is little information on how well those skills are learned, especially among second-year students. Measuring skill acquisition objectively is the essential first step in improving clinical competence throughout undergraduate and postgraduate training [3,4].
During the past 25 years, the objective structured clinical evaluation or examination (OSCE) has become an important method of assessing skills at all levels of medical training [5,6], complementing traditional evaluations of knowledge that use written multiple choice questions and essay questions. Compared with other levels of training [7], little is known about the use of the OSCE in physical diagnosis courses for second-year medical students.
Several studies have used the OSCE to assess the effect of educational interventions on specific skills at the second-year level, such as history-taking for smoking [8], or examination of low back pain [9] or the breast [10,11]. Others have examined the use of different examination personnel as examiners or patients [12-14], compared students' course feedback to their OSCE performance [15], examined costs [12,16] or reliability and generalizability [7], compared training locations [17] or provided general descriptions of their OSCEs [18-22]. We found no studies that have used the OSCE to report comprehensively on the spectrum of skills learned in a second-year physical diagnosis course. A comprehensive investigation is likely to help determine what aspects of the educational process should be improved.
We used the OSCE to examine how well second-year students learned clinical skills in the second-year physical diagnosis course at Harvard Medical School. We were particularly interested in which skills students performed best and which they found most difficult. We assessed what factors affected their performance on the overall OSCE, and on individual skills and stations. Finally, we examined whether student OSCE scores varied from year to year, whether medical students performed differently from dental students, whether learning at different teaching sites affected student performance, and whether preceptors and examination logistics affected student scores.

Setting
This study took place at Harvard Medical School as part of the required second-year physical diagnosis course, Patient-Doctor II [4]. The course is taught from September to May in the same general sequence at 9 clinical sites affiliated with the medical school. Each site is assigned 6-45 students for the entire 220-hour course, including a total of 30 second-year students from Harvard School of Dental Medicine. These dental students are preparing for careers in consultative dentistry and are required to learn the same clinical skills as Harvard medical students. The course involves a total of almost 700 faculty members. One or two faculty members at each site function as site director(s) and are intimately involved in teaching the students and organizing other faculty to teach in the course.
Teaching sessions are organized by organ system. Students first learn skills by practicing on each other and by taking histories and performing physical examinations on selected patients. Each year, approximately 130 medical students and 30 dental students participate in the course. Site directors meet monthly as a group to determine the curriculum, teaching techniques, and evaluation of the course.

Objective structured clinical examination (OSCE)
Development
We developed our OSCE primarily for educational purposes: to identify skills that each student has learned well and those that need improvement during the final portion of the course. Performance on the OSCE is not formally factored into a student's grade for the course, but individual student OSCE scores are reviewed by site directors.
We designed the OSCE stations in 1994, pilot-tested them at evaluation sessions held in 1995 and 1996, and reported on our results for 1996 [23]. Following established methods [7,24,25], the course director and a committee of site directors and fourth-year student representatives developed case scenarios, detailed instructions and checklists consisting of questions or tasks for 16 stations focused on specific clinical areas. From 1994-1996, we refined the content of the stations and the OSCE organization through frequent discussions with all site directors and through feedback from students and OSCE preceptors. We made no changes to the exam during 1997-1999. Site directors determined that all OSCE questions reflected essential skills to be mastered by second-year students. We did not weight OSCE questions, stations or skills according to degree of difficulty. Annual feedback from students and faculty endorsed the face validity of the OSCE. In 1999, 90% of students and 91% of faculty agreed that the OSCE represented an appropriate and fair evaluation method, and that enough time was given to complete the stations.
In the 16-station OSCE, nine different formats were used alone or in combination: question and answer, preceptor role play, standardized patients, actual patients, mechanical or structural models, 35-mm slides, audiotape, videotape, and CD-ROM (Table 1). OSCE committee members designated each question or task in the 16 stations as one of 7 clinical skills, defined as follows: asking appropriate questions for the history (history-taking); performing the physical examination correctly (physical examination technique); understanding the pathophysiology of physical findings (physical examination knowledge); identifying abnormalities on physical examination (identification of abnormalities); developing appropriate differential diagnoses for the clinical information obtained (differential diagnosis); utilizing appropriate patient-doctor interaction techniques (patient interaction); and orally presenting the history and differential diagnosis after taking a clinical history (patient presentation). The total number of OSCE questions each year was 382, and the mean number of questions per skill was 55 (range 14-70), evenly distributed except for patient interaction and patient presentation.

Implementation
Each year, we held 10 sessions of the OSCE on 3 days (Monday, Wednesday and Friday afternoons) during a one-week period in April for all second-year students. Two consecutive, early and late afternoon sessions each consisted of the same 16 stations and lasted 2.5 hours. To accommodate all students, sessions were conducted simultaneously on 2 floors of the medical school's education center, for a total of 10 OSCE sessions. Other than by date and time, the sessions varied only in the assignment of preceptors. With the help of guides, timers and a strict schedule, students rotated through the 16 clinical stations, each precepted by a faculty member. All preceptors received standardized guidelines for checklists and feedback prior to each OSCE session, as did the standardized patients or actors for the abdominal pain, alcohol/abdominal exam, knee and thyroid stations. Fourteen stations were each 6 minutes in duration, and two (abdominal pain and headache) were 12 minutes in duration.
At each station, the student performed the indicated tasks for two-thirds of the time, while the faculty preceptor observed and checked off the tasks performed correctly, as defined by checklists, one for each student. All tasks performed or questions answered by each student were scored dichotomously as correct (1) or left blank (0) on the checklists. During the final one-third of time at each station, the preceptor provided feedback on the student's performance, as advocated by others [26]. Each year, approximately 150 preceptors participated in the OSCE, and 60% had experience with this OSCE and the checklists from prior years.

Data collection and analysis
Correct answers to all OSCE questions were recorded on checklists by preceptors, double-entered by research staff into an ASCII file, and analyzed in SPSS [27]. Total OSCE, skill and station scores were calculated as follows. Each task or question counted one point, and the sum of tasks performed or questions answered correctly for each station was designated the station score. The sum of station scores produced a total OSCE score for each student. Means of students' scores ± one standard deviation for each of the 16 stations were computed. To compute the skill scores, each task or question on the checklist for every station was designated as one of 7 skills. The sum of tasks performed or questions answered correctly for each skill produced a student's skill score. Means of students' scores for each of the 7 skills were computed. We combined the data from the 1997, 1998 and 1999 OSCEs.
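The scoring rules described above can be illustrated with a minimal sketch. The checklist rows, station names and the `percent_correct` helper below are hypothetical and serve only to show the arithmetic; the study itself used SPSS, and the conversion of raw points to percentages is our assumption (points earned divided by items attempted).

```python
from collections import defaultdict

# Hypothetical checklist rows for one student: (station, skill, correct),
# where correct is 1 if the preceptor checked the item and 0 if left blank.
checklist = [
    ("thyroid", "examination technique", 1),
    ("thyroid", "identification of abnormalities", 0),
    ("knee", "examination technique", 1),
    ("knee", "differential diagnosis", 0),
    ("headache", "history-taking", 1),
    ("headache", "differential diagnosis", 1),
]


def score_student(items):
    """Return (station_scores, skill_scores, total) as raw point counts."""
    station_scores = defaultdict(int)
    skill_scores = defaultdict(int)
    for station, skill, correct in items:
        station_scores[station] += correct  # one point per correct item
        skill_scores[skill] += correct      # same item pool, grouped by skill
    total = sum(station_scores.values())    # total = sum of station scores
    return dict(station_scores), dict(skill_scores), total


def percent_correct(items):
    """Assumed conversion: percentage of checklist items scored correct."""
    return 100.0 * sum(correct for _, _, correct in items) / len(items)


stations, skills, total = score_student(checklist)
```

Because station and skill scores are two groupings of the same items, both sets of scores sum to the same total, which is why the two are highly correlated.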
Total OSCE score, scores for each clinical skill, and scores for each station were the primary outcome variables. In addition to the checklists completed by faculty preceptors at each station for each student, we collected data on student, preceptor and examination variables to examine factors that might predict students' OSCE scores. Student variables were type of student (medical or dental), and teaching site (Site A-I). The preceptor variables were the floor (first or third) and session group (early or late afternoon) assigned to each OSCE preceptor. Examination variables consisted of OSCE year (1997, 1998 or 1999), the day each student took the OSCE (first, second or third), and sequence of stations.
For all predictor variables, total OSCE, skill and station score means were compared with one-way ANOVA. Predictor variables significantly associated at p < .05 with students' total OSCE score in univariate analysis were entered into a linear regression model, with the single dependent variable being a student's total OSCE score. The predictor variables were also entered into two multivariate analysis of variance (MANOVA) models, each of which included multiple dependent variables. As dependent variables, one model used clinical skill scores, and the second model used station scores. Separate models were used due to the high collinearity between the skill and station scores, since both of these scores drew from the same item pool. P-values within each MANOVA model were adjusted for multiple comparisons. In addition, we set the threshold for judging statistical significance at p ≤ .001 to further reduce the influence of multiple comparisons on p values.
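As an illustration of the univariate step only, the following sketch computes the one-way ANOVA F statistic from first principles. The site labels and scores are hypothetical, and the function name is ours; obtaining a p-value additionally requires the F distribution, which a statistics package would supply.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of per-group score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k = len(groups)      # number of groups (e.g., teaching sites)
    n = len(all_scores)  # total number of students
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical total OSCE scores (%) for students at three teaching sites.
site_scores = {
    "A": [55.0, 58.0, 61.0],
    "B": [56.0, 59.0, 62.0],
    "C": [59.0, 62.0, 65.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(site_scores.values()))
```

The function assumes at least two groups and some within-group variation; otherwise the ratio is undefined.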
Because it was not logistically possible to obtain interrater reliability due to the large number of preceptors, we used generalizability theory analysis [28]. This analysis accounts statistically for rater error by parsing out the variance relevant to the instrument in question. By modeling the variances as separate characteristics, we isolated the variance due to student ability, which in classical test theory is equivalent to true score variance. Other variances related to the test are treated as error variances. In this framework, we treated error due to differences in raters as error variance.
We calculated the Kuder-Richardson-20 coefficient of reliability, KR-20, for the total OSCE score, clinical skill and station scores. The KR-20 [29] is used for binary items and is comparable to Cronbach's alpha. This measure of internal consistency is the best measure of reliability when there are many more than two raters. It is equivalent to the generalizability or G coefficient which examines total scale scores across raters in a D-study scenario (total scores are normally distributed), when the main effect variance due to raters is assumed to be zero [30][31][32]. In our study, we assumed zero main-effect variance to be the average across the large pool of student raters, because student assignment to a preceptor for any given station was essentially random.
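For readers unfamiliar with the coefficient, a minimal sketch of the standard KR-20 formula, KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores), follows. The response matrix is hypothetical, and the use of the population variance of total scores is an assumption; conventions differ, and we do not know which one SPSS applied in this study.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of per-student lists of 0/1 item scores (equal length)."""
    n_students = len(responses)
    k = len(responses[0])                   # number of items
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)           # assumed: population variance
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n_students  # proportion correct
        sum_pq += p * (1 - p)               # item variance for a binary item
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Hypothetical responses: 4 students by 3 dichotomous checklist items.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
reliability = kr20(example)
```

The function needs at least two items and raises a `ZeroDivisionError` when every student has the same total score, since reliability is undefined without score variance.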

Results
Over three years, 489 second-year students (402 medical and 87 dental) and 445 faculty participated in the OSCE for the second-year physical diagnosis course. Students answered slightly more than half of all the OSCE items correctly, 57% ± 6% (Figure 1a), with almost no change over 3 years (p = .28). Individual student scores on the entire OSCE ranged from 39% to 75%.
For adjusted total OSCE scores, medical students scored 6% higher than dental students, 57% vs. 51% (p < .0001, Table 2). No other variable was found to predict total OSCE scores. For adjusted clinical skill scores, the largest score differences were associated with the student variable (medical vs. dental). Medical students' scores were 9% higher than dental students' scores for patient presentation (and were slightly but significantly higher for all other clinical skills except history-taking, not shown). Table 2 shows other significant differences among several tested variables and groups, but the absolute score differences for these variables were relatively small, 8% or less.
When we examined the mean total, clinical skill and station scores according to student, preceptor and examination variables, we found many statistically significant associations in the univariate analyses. Multivariable analyses yielded fewer but still similarly significant results. Table 2 presents the highest scoring groups of predictor variables and the largest adjusted differences between the highest scoring and the reference groups.
Adjusted station scores demonstrated the largest differences, notably for teaching sites (Table 2). For the thyroid station, the scores of students at site H were 28% higher than scores for students at reference site I. Other predictor variables accounted for smaller differences. Medical students' adjusted scores on the rectal/prostate station were 15% higher than dental students' scores. They were also significantly higher, though by less than 10%, for 8 other stations, and no different for 7 stations (not shown). Other variables (preceptor groups, OSCE day and OSCE year) also demonstrated some variation, with the largest differences being 14% among preceptor groups for the knee station.
Table 2 notes: We used MANOVA for comparisons among the predictor variables and multiple dependent variables. (1) Adjusted score difference denotes the adjusted difference between the mean scores of the highest scoring group and the reference group. All differences were significant at p ≤ .001. (2) C.I. denotes confidence interval. (3) n.a. denotes not applicable.
Because teaching sites demonstrated the greatest differences in OSCE station scores, even after adjustment for other variables, we examined detailed inter-site differences (Table 3). Eight adjusted station scores showed substantial and significant differences in student scores among teaching sites: thyroid (28%), knee (26%), ear (23%), arthritis (17%), heart (13%), mental status (11%), lung (11%) and skin (10%) (p ≤ .001). There were no significant inter-site differences for the breast, abdominal pain, presentation, headache, alcohol/abdominal exam, rectal/prostate, hemoptysis and calf pain stations. At every teaching site, adjusted scores for 1 or 2 stations were higher than at reference site I, while scores for 1 to 3 other stations were lower than those for the reference site.
The overall reliability coefficient for the OSCE of 382 items was .86 (Table 3), indicating good reliability of the OSCE total score [25,31,32]. The reliabilities of the clinical skill scores ranged from .57 to .77 (not shown). All but one of these scores (identification of abnormalities, .57) had a reliability coefficient of .65 or higher. Reliabilities for clinical skill scores were generally higher than for station scores, which ranged from .40 to .83 (Table 3).

Discussion
In an OSCE for a second-year physical diagnosis course, we found a similar pattern of clinical skill acquisition for three successive classes of students. Students performed better on interpersonal and technical skills (patient interaction, history-taking, physical examination technique, identification of abnormality, and patient presentation) than on interpretative or integrative skills (knowledge of the pathophysiology of physical examination findings, and differential diagnosis). Teaching sites differed widely from one another in performance on individual OSCE stations, only modestly on clinical skills, and not at all on total OSCE scores. Medical students scored somewhat better than dental students on the overall OSCE, all clinical skills except history-taking, and almost half of the stations.
To our knowledge, this study is the first to examine comprehensively student performance for general clinical skills and specific OSCE stations at the second-year student level. Other studies of OSCEs for second-year students have focused on specific skills or content [8-11], or logistics and psychometrics [7,12,16]. None of the other studies employed multivariable analysis in examining factors associated with OSCE performance. By including such analysis, we were able to hold student and examination variables constant in order to determine what parts of the curriculum students mastered best and which sites best taught specific physical diagnosis content.
Higher scores on technical and patient interaction skills, compared to integrative skills, are not surprising. Students at Harvard and in many medical schools begin to practice some interviewing, history-taking and patient interaction during the first year curriculum, and they spend the entire second-year physical diagnosis course learning the techniques of physical examination. Investigators have reported similar results in other settings. OSCE scores among clinical clerks were higher on history-taking/physical examination skills (mean score ± s.d., 61 ± 4%) and interviewing skills (69 ± 11%), and lower on problem solving (50 ± 6%) skills [33]. In a non-OSCE examination using patient management problems, second-year students scored 70 ± 9% on history, 66 ± 10% on physical examination, and 40 ± 15% on diagnosis [34]. However, in an OSCE for a second-year neurology skills course, this pattern did not hold: interpretative skill scores (76 ± 16%) were higher than technical performance scores (67 ± 17%), but no significance testing was reported [15].
Differential diagnosis has traditionally been considered a secondary goal of our physical diagnosis course, so performance might be expected to be lower. However, pathophysiology of disease is a major focus of the second-year curriculum. Lower performance in knowledge of the pathophysiology related to physical diagnosis, compared with technical performance of the physical examination, suggests that improvements integrating pathophysiology into the teaching of the history and physical examination are needed.
Our other key finding was the variable performance by students from different teaching sites on half the OSCE stations, despite similar performance by sites on the overall OSCE. Every site scored highest or next-to-highest on at least one station, and every site also scored lowest or next-to-lowest among sites on at least one station. Because of the large numbers of students in this study, even differences of 2% were statistically significant, but we consider differences greater than 10% to be educationally significant and worthy of targeted improvement efforts.
We found the largest differences for the thyroid, knee, ear, arthritis, heart, lung, mental status, and skin stations. While students may have acquired overall physical diagnosis skills similarly from site to site, our results suggest they did not learn equally at every site the skills required for adequate examination or understanding of these specific organ systems. Inter-site differences in content-specific station scores represent opportunities for teaching sites to learn from one another, using strategies such as structured clinical instruction modules [9,35] or reinforced practice [11] and developing more uniform learning objectives and curriculum.
Raw score results at one medical school must be interpreted with caution, since OSCEs at other schools may differ in degree of difficulty. The mean total OSCE score of 57% ± 6% in our study compares favorably with results from one report on second-year students (52 ± 6%) [36], a report on clinical clerks (57 ± 4%) [33], and a study of third-year medicine students (58%) [3], but less favorably with another report on second-year students, 70% [12]. None of these studies adjusted their student scores.
Consistent with a prior study from the U.K. [37], we found that dental students scored lower than medical students, but not at a level which raises serious concerns about their participation in the physical diagnosis course. While dental students scored lower on the majority of stations, they performed as well as medical students on some stations with content that is not related to their ultimate professional focus, such as breast, mental status and abdominal pain.
This study has several limitations. We have not directly assessed inter-rater reliability because of logistical and cost constraints. To address this methodological concern, we used generalizability theory (GT) to produce a measure of reliability similar in quality to inter-rater reliability [32].
There are a number of examples of the use of GT to account statistically for rater error [32,38-40]. Using GT can also overcome some problems inherent in inter-rater reliability, such as overestimating reliability [41]. Due to the large number of preceptors involved in our OSCE, we made the statistically reasonable assumption that any error due to rater differences is randomly distributed. Since randomly distributed error has a mean of zero, the error variance due to differences among all preceptors is zero. In our OSCE, the variation of individual raters around the mean station score of all raters is very close to 0 (e.g., .04 for the presentation station, data not shown), and the standard deviations of student scores are comparatively large (e.g., 15 for the presentation station). Finally, our GT-based assumption is especially appropriate when the test scores used in the analysis are created by summing many items across each scale. Summing in this fashion has the effect of further randomizing the error variance. The reliability, or internal consistency, of the overall OSCE was good at .86. The reliabilities of 6 of 7 skill scores and 9 of 16 station scores were acceptable at > .60.
Another benefit of the GT approach is that the reliability coefficient derived from the GT analysis is equivalent to Cronbach's alpha coefficient which, for binary items, is equivalent to the KR-20 reliability coefficient. The alpha coefficient is especially useful during test development because it gives a measure of how each item is contributing to the scale to which it has been assigned. This measure makes it easy to target items for editing or deletion if they are not performing well. Since we are ultimately interested in using the scale scores for our research study, the GT measure of reliability is appropriate for OSCEs involving many preceptors.
The validity of our OSCE is only partially established. While several features support its face and content validity, construct and criterion validity remain to be tested. Multiple refinements of stations over the two developmental years of the OSCE prior to this study yielded broad agreement among the teaching site directors that all OSCE questions reflected essential skills that should be taught to and mastered by second-year students. Five successive years of post-OSCE student and faculty evaluations have endorsed the OSCE as a highly appropriate and acceptable method of education and evaluation. Finally, a more recent investigation supports predictive validity of our OSCE. Physical diagnosis skills examined in the present study correlated with scores on the USMLE Step 1 exam, and the skills that foreshadow the clinical clerkships (identification of abnormality and development of differential diagnoses) best predicted USMLE scores [42].
Variation in skill scores may be due to different OSCE station content. Three of the skills drew their questions from a smaller number of stations: patient interaction, 3 stations; history-taking, 3 stations; presentation, 1 station. However, patient interaction and history-taking drew their questions from the same stations. More importantly, the remaining 4 skills each drew their questions from 6-8 stations. For these 4 skills (physical examination technique, physical examination knowledge, identification of abnormalities, and differential diagnosis), the range of case content is considerable and counters the concern that variation might be caused by case content rather than by student performance.
Variation in skill scores may be also due to inherent differences in the degree of difficulty of exam questions. In our exam, we did not weight OSCE questions according to degree of difficulty. We were not trying to create an exam in which all items were of equal difficulty. Instead, we created an OSCE in which the course directors considered all test items essential to be mastered by the students. The results showed variation in the degree to which the students mastered different clinical skills. Remarkable stability of overall scores over the three years of this study with three different cohorts of students provides evidence that there has been no significant "teaching to the OSCE." This finding is consistent with a prior study of fourth-year students [43].
The successful implementation of the OSCE at our medical school is relevant to all medical schools that face the logistical challenges posed by multiple sites and preceptors for student training in physical diagnosis. Furthermore, the results from the second-year OSCE reported here and our pre-fourth year OSCE [44] have been useful in helping to identify areas of weakness that could benefit from remediation prior to the start of clinical clerkships. This benefit is especially true for students with the lowest performance on individual stations and skills. For site directors and faculty, the OSCE has also helped identify those parts of the curriculum students had difficulty mastering. Holding a second-year OSCE prior to the end of a physical diagnosis course helps medical school faculty identify opportunities for remediation, focus the remaining sessions of the course, and improve future physical diagnosis teaching.

Conclusions
Objective identification of skills acquired in a physical diagnosis course is a necessary first step in improving the quality of both the teaching and the learning of those skills. In our OSCE for a second-year physical diagnosis course, students scored higher on interpersonal and technical skills than on interpretive or integrative skills. Station scores identified specific content needing improvements in students' integrative and organ system-specific skills of physical diagnosis, and in the teaching of these skills.