Learning objectives for joint examinations are included in the National Catalogue of Learning Objectives in Medicine [18], which was approved by every medical faculty in Germany. Previously, each medical school used its own surgical OSCE-stations (for example, assessing the knee joint by means of ligament stability tests only); we therefore created basic, consistent OSCE-stations that implemented a complete, structured examination of the knee or shoulder joint.
Checklists to assess structure, performance, and knowledge of the joint examination were developed, and students were scored on a 3-step Likert scale (part A) (Additional files 1 and 2). In addition, how well the student communicated and interacted with the patient was scored using a global rating scale (part B) (Additional file 3) comprising 5 items, each rated on a 5-step scale. Part B of the checklist was identical for both joint examinations. The scores from each joint examination were then combined so that part A contributed two-thirds and part B one-third of a student's total score.
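As an illustration, the weighting can be written as a simple calculation. The following is a minimal sketch only; the function name and the assumption that both parts are first expressed as percentages of their maximum scores are illustrative and not taken from the checklists.

```python
# Minimal sketch of the score weighting described above; the function name
# and the assumption that both parts are first expressed as percentages of
# their maximum scores are illustrative, not taken from the checklists.

def total_score(part_a_points: float, part_a_max: float,
                part_b_points: float, part_b_max: float) -> float:
    """Combine checklist part A and global rating part B into one score.

    Part A (structure, performance, knowledge) contributes two-thirds and
    part B (communication/interaction, 5 items on a 5-step scale)
    contributes one-third of the total, expressed in percent.
    """
    part_a_pct = part_a_points / part_a_max * 100
    part_b_pct = part_b_points / part_b_max * 100
    return (2 / 3) * part_a_pct + (1 / 3) * part_b_pct

# Example: 80% on part A and 60% on part B gives 2/3 * 80 + 1/3 * 60 ≈ 73.3%.
print(total_score(40, 50, 15, 25))  # ≈ 73.3
```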
Students had up to 5 min to perform and explain the joint examination on a standardized patient, an actor or actress instructed to play a patient in a standardized, consistent role (for example, a patient with a typical impingement syndrome of the shoulder).
Five German medical schools (referred to in the following as sites (S) 1–5) agreed to implement the standardized OSCE-stations in their local surgical OSCE.
To minimize bias arising from different examiners across sites, we appointed a single reference examiner to assess each student in addition to a local examiner. The reference examiner was a male orthopaedic surgery resident with extensive experience in assessing practical skills in OSCEs, for which he had completed several rater trainings beforehand. He scored every student with the original checklist, and his results were later used to compare outcomes across the different medical schools.
For this study, the outcomes of the basic, consistent part of the checklists were evaluated, and the scores of the reference examiner were compared with those of the local examiner to calculate interrater reliability. Because local examinations are the responsibility of each medical faculty, each medical school was allowed to add items for its local outcome, for example on further diagnostic investigations (ultrasound, X-ray, MRI); omitting a basic item, however, was not permitted. Some medical schools also used their own raw scoring system to match the scoring of their other OSCE-stations (for example, all scores were doubled). By converting scores to percentages, the different sites could be compared despite differing raw scoring systems, as long as all items were scored separately or the grouping into rubrics was comparable.
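The percentage conversion can be illustrated with a short sketch; the item scores and maxima below are hypothetical examples, not actual checklist data.

```python
# Sketch of the percentage conversion used to compare sites; item scores and
# maxima are hypothetical examples, not actual checklist data.

def to_percent(item_scores, item_maxima):
    """Express the summed raw item scores as a percentage of the achievable maximum."""
    return sum(item_scores) / sum(item_maxima) * 100

# The same performance yields the same percentage even if one site doubles
# all raw scores to match its other OSCE-stations:
site_x = to_percent([2, 1, 2], [2, 2, 2])   # 3-step Likert items scored 0-2
site_y = to_percent([4, 2, 4], [4, 4, 4])   # identical items, all scores doubled
assert abs(site_x - site_y) < 1e-9          # both ≈ 83.3%
```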
Depending on the medical faculty, between two and four local examiners with different levels of professional (clinical) experience administered the OSCE. Results were therefore correlated with the examiners' level of clinical experience and evaluated in relation to examiner gender.
Altogether, 180, 147, 137, 31, and 45 students from sites 1, 2, 3, 4, and 5, respectively, were included in the study. The local checklists of site 4 differed from the original, standardized checklists: although all agreed items were included, they were not scored separately and the 3-step Likert scale was not used; therefore, only the scores of the reference examiner (obtained with the original checklists) were used for evaluation at this site. Part B of the local checklist at site 3 was excluded because some agreed items were not scored separately.
The study was approved by the ethics committee of the organizing university.
Statistics
Because the reference examiner and one local rater assessed every student, the means and standard deviations of both ratings were calculated and compared. Additionally, results were calculated separately for male and female examiners. Significant differences between means were evaluated with analysis of variance (ANOVA) if the distribution was normal, or with the Kruskal-Wallis test if it was not. Significant differences between individual sites were identified by pairwise comparison using the Duncan test. Differences were considered significant at p < 0.05. Interrater reliability was calculated and expressed using Kendall's W coefficient. Kendall's tau-b coefficient was applied to evaluate the correlation between the examiners' level of clinical experience and the students' outcomes. To express the effect size of significant differences in the gender analyses, Cohen's d was calculated. IBM SPSS version 19 (SPSS, Inc., Chicago, IL, USA) was used for the statistical analyses.
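Because the original analyses were run in SPSS, the following Python sketch only illustrates analogous computations: all data are placeholders, Kendall's W and Cohen's d are implemented directly (without tie correction), and Duncan's post-hoc test is omitted because it is not available in scipy.

```python
# Hedged sketch of analyses analogous to those run in SPSS; data, group
# sizes, and pairings are placeholders. Duncan's post-hoc test is omitted.

import numpy as np
from scipy import stats

def kendalls_w(ratings: np.ndarray) -> float:
    """Kendall's W for an (m raters x n subjects) array, without tie correction."""
    m, n = ratings.shape
    ranks = np.apply_along_axis(stats.rankdata, 1, ratings)  # rank subjects per rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd

# Placeholder data: percentage scores for five sites, 40 students each.
rng = np.random.default_rng(0)
site_scores = [rng.normal(70, 10, 40) for _ in range(5)]

# Comparison of site means: ANOVA if normally distributed, Kruskal-Wallis otherwise.
f_stat, p_anova = stats.f_oneway(*site_scores)
h_stat, p_kw = stats.kruskal(*site_scores)

# Interrater reliability between reference and local examiner (Kendall's W),
# and correlation of examiner experience with student outcome (Kendall's tau-b,
# which is scipy's default variant). Both pairings below are stand-ins.
reference, local = site_scores[0], site_scores[1]
w = kendalls_w(np.vstack([reference, local]))
tau_b, p_tau = stats.kendalltau(np.arange(40), reference)

# Effect size of a group difference (e.g. male vs. female examiners, stand-ins).
d = cohens_d(reference, local)
```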