A systematic review of factors influencing student ratings in undergraduate medical education course evaluations

Background: Student ratings are a popular source of course evaluations in undergraduate medical education. Data on the reliability and validity of such ratings have mostly been derived from studies unrelated to medical education. Since medical education differs considerably from other higher education settings, an analysis of factors influencing overall student ratings with a specific focus on medical education was needed.

Methods: For the purpose of this systematic review, online databases (PubMed, PsycInfo and Web of Science) were searched up to August 1st, 2013. Original research articles on the use of student ratings in course evaluations in undergraduate medical education were eligible for inclusion. Included studies considered the format of evaluation tools and assessed the association of independent and dependent (i.e., overall course ratings) variables. Inclusion and exclusion criteria were checked by two independent reviewers, and results were synthesised in a narrative review.

Results: Twenty-five studies met the inclusion criteria. Qualitative research (2 studies) indicated that overall course ratings are mainly influenced by student satisfaction with teaching and exam difficulty rather than objective determinants of high quality teaching. Quantitative research (23 studies) yielded various influencing factors related to four categories: student characteristics, exposure to teaching, satisfaction with examinations and the evaluation process itself. Female gender, greater initial interest in course content, higher exam scores and higher satisfaction with exams were associated with more positive overall course ratings.

Conclusions: Due to the heterogeneity and methodological limitations of included studies, results must be interpreted with caution. Medical educators need to be aware of various influences on student ratings when developing data collection instruments and interpreting evaluation results. More research into the reliability and validity of overall course ratings as typically used in the evaluation of undergraduate medical education is warranted.

Electronic supplementary material: The online version of this article (doi:10.1186/s12909-015-0311-8) contains supplementary material, which is available to authorized users.

before 1978 USA X X
Of 158 students enrolled in an anatomy course at the Medical University of South Carolina, 113 provided data (45 before an exam, 42 after an exam, and 26 out of 71 who were sent the form 3 weeks later). Twenty students were excluded from the analysis; thus, the total response rate was 58.9% (93/158).
31 items with 5-point Likert scales; the first 20 items assessed attitudes (satisfaction) and produced a sum score (max. 100 for the most positive rating). Sum score = 81.2 ± 7.8.
High-achievers provided more positive ratings than low-achievers (r between performance and evaluation ratings: 0.42), but time of rating did not affect results; no interaction between achievement and timing. Lecture attendees had higher exam scores and provided better evaluation ratings (potential confounding by recency effects). No significant difference in performance levels between respondents and nonrespondents (post-exam group only).
9.5
[3] Scales with positive anchors on the left produced significantly more favourable ratings with less variance than scales with positive anchors on the right. In a course with more positive overall ratings, the primacy effect was stronger for SDs than for means and also stronger for scales with fewer options. In a course with less positive overall ratings, the primacy effect was stronger for means than for SDs and also stronger for scales with more options. When all positive items within a scale were collapsed into one new scale, this yielded higher mean scores than a sub-scale containing only negative items.
Negatively phrased items were associated with lower scale reliability and were less sensitive to change over time. Following publication of the exam blueprint, students were slightly more satisfied with both the exam and (non-significantly) the course itself.
8.5
[11] 1999-2002 Germany X
No information on sample size (169 lectures, 288 seminars)
13 to 15 items on 6-point scales
Factor analysis produced two factors in the 13- to 15-item tool; the first factor ('didactics') was correlated with initial interest (r = 0.59).
Higher lecture attendance (>80%) was associated with better ratings than less frequent attendance (<80%; effect size of the difference = 0.44). Mandatory seminars received better ratings than lectures with voluntary attendance.
8
[12] before 2000 USA X
34 out of 83 (41%) and 15 out of 81 (19%) fourth-year students completing paper and online forms, respectively
62 items on 5-point scales addressing different clerkships
1) Response rate: online 19%; paper 41%; more omitted items in online forms
2) Online (e-mailed) forms were returned more quickly than mailed paper forms.
3) No significant differences in ratings between online and paper forms.
8
[13] Think-aloud interviews were done while students completed these forms.
Evaluation items were ambiguous for some students. Student ratings were based on unique or unexpected criteria. The lower end of the rating scale tended to be avoided. Exams were not mentioned by students as potential confounders of overall ratings.
-
[16] 2004-2005 USA X
84 first-year and 64 third-year students enrolled in 5 specialty courses at Texas A&M University. Response rate 100% (mandatory)
Course-specific forms with 15-24 items on 5-point scales and 1 overall rating on a 4-point scale. Evaluation forms were completed either at the end of an entire course or at the end of a rotation within a course.
The following items were associated with better overall ratings:
a) Administrative aspects including course organization
b) Clearly communicated goals
c) Instructional staff responsiveness
Similar loadings were observed in different courses.
8
[17] With more elapsed weeks, quality mean ratings increased and variability decreased; effect sizes were small (around 0.06).
8
[19] 2006-2007 The Netherlands X X
Study 1: 380 first-year students; response rates: opinion condition 79%; prediction condition 60%
Study 2: 450 first-year students; response rates: opinion condition 88%; prediction condition a 76%; prediction condition b 70%
All students were enrolled in the 10-week 'Bodily functions and homeostasis' course at the University Medical Centre Groningen
Paper evaluation forms (9 items on 4-point scales) to be completed after the final course exam
Both prediction-based methods required fewer respondents than the opinion-based method. Informed prediction required the smallest sample size. Outcomes produced by all methods were fairly similar, but prediction-based methods produced less extreme results. This central tendency was more pronounced for items with more extreme ratings in the opinion condition.
Online evaluations were closed before students were informed about exam results.
Four factors were identified (loaded on by 11 out of 25 items):
a) Exams (fairness and alignment with course objectives)
b) Small-group learning
c) Basic science teaching
d) Teaching diagnostic approaches
Together, these explained 50% of the variance. Overall ratings were most strongly associated with ratings related to the exam. In the second year, exams were the only predictors of overall ratings.

7.5
*Year refers to the time when the study was conducted, not year of publication. Please see the reference list for year of publication.
**MERSQI Score was derived from two independent ratings for each study. Differences between the two raters were resolved by discussion. Qualitative studies did not receive a MERSQI rating. One paper ([20]) reported findings of two different studies. MERSQI scores for these two studies are displayed separately.
[21] The prediction-based method required fewer respondents than the opinion-based method.
Outcomes produced by the two methods were fairly similar, but overall, the prediction-based method produced less extreme results. Prediction-based outcome data were more robust against bias; individual ratings were more positive in students who were female and more satisfied with the exam.
11.5
[24] 2011 Germany X
573 out of 977 students in years 3-5 at Göttingen Medical School; response rates for individual teaching modules: 36.7-75.4%
a) Motivation survey (3 items on 6-point scales) at the start of each module
b) Traditional evaluation form with 6 items on 6-point scales (after each module)
c) Performance gain calculated from repetitive self-assessments (before and after each module). Average values for 15 learning objectives per teaching module
The traditional tool and the performance gain tool produced different module rankings. Motivation ratings obtained before module attendance were positively correlated with evaluation ratings obtained after the modules. All items on the traditional tool were highly correlated with each other; there was hardly any correlation with performance gain results.
8.5
[25] 2011 Germany X
17 self-selected students in years 3-4 at Göttingen Medical School
Does not apply
Student remarks were related to 4 distinct themes (teaching quality, perceptions of evaluation, data collection tools, evaluation consequences). Student ratings are mainly based on 'gut feelings' rather than objective benchmarks. Overall ratings are mainly influenced by student satisfaction with teaching and exam difficulty. Students are more satisfied with teaching if they feel that they have learned something. Low response rates may be due to evaluation overload or a lack of feedback following evaluation. Students preferred evaluations to occur after end-of-course exams. They also preferred online over paper evaluations and open questions / discussions over scaled questions.
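
Several entries in the table report response rates together with the underlying counts. As a reading aid, the arithmetic behind two of them (the anatomy course and the paper versus online comparison) can be retraced with a minimal sketch. The helper name `response_rate` and its `digits` parameter are illustrative assumptions and are not taken from any of the included studies; only the counts come from the table.

```python
def response_rate(respondents: int, enrolled: int, digits: int = 1) -> float:
    """Return the response rate in percent, rounded to `digits` decimals.

    Hypothetical helper for illustration only.
    """
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    return round(100 * respondents / enrolled, digits)


# Anatomy course entry: forms returned before the exam, after the exam,
# and by students who were sent the form 3 weeks later.
returned = sum((45, 42, 26))
# Twenty students were excluded from the analysis.
analysed = returned - 20
anatomy_rate = response_rate(analysed, 158)

# Paper versus online entry: 34 of 83 (paper) and 15 of 81 (online).
paper_rate = response_rate(34, 83, digits=0)
online_rate = response_rate(15, 81, digits=0)

print(returned, analysed, anatomy_rate, paper_rate, online_rate)
```

Under these assumptions, the computed values agree with the figures stated in the table (93 analysed respondents and 58.9% for the anatomy course; 41% and 19% for paper and online forms).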