Our research questions were whether Kampo medicine education can shift from an apprenticeship system to a standardized evaluation system, and whether an OSCE can be developed that evaluates competence in Kampo medicine with high reliability and content validity.
OSCEs with simulated patients have been reported to be more effective for students than role-play education [13]. In this study, we developed a new OSCE assessment method involving simulated patients for Kampo medicine education. In addition, because OSCEs have been reported to be an appropriate means of assessing communication skills in medical education, particularly medical interviews and attitudes [14], the OSCE in this study included interviews, Kampo-based medical examinations, and attitudes toward patients as assessment items. An important implication is that we succeeded in introducing a reliable and valid OSCE into Kampo medicine education, which suggests that Kampo medicine education could transition from an apprenticeship system to a standardized evaluation system within the framework of medical education. Regarding whether this OSCE meets the needs of Kampo, the clearly defined assessment items were assumed to have enabled the assessment of standardized clinical skills in Kampo medicine. This allowed us to formulate, as an evidence-based assessment standard, clinical skills and Kampo assessment items that had previously been taught and evaluated by only a limited number of Kampo instructors. Further study is needed to determine whether the Kampo-OSCE developed in this study meets existing needs.
Number of stations and content validity
In this study, we developed a highly reliable evaluation method focusing on Kampo medicine techniques, one of the consensus competencies identified by Kampo educators in Japanese medical schools, as well as a blueprint of the diagnostic process and basic theory of Kampo. Based on this blueprint, we selected a Kampo formula for the selection task and created three task scenarios.
We set three tasks and conducted the Kampo-OSCE at three stations. Regarding whether this number of stations was sufficient, eight stations has been reported to be a reasonable compromise for a screening test in terms of high sensitivity (88–89%) and specificity (83–86%) [15]. To our knowledge, no reports of OSCEs for Kampo examinations have been published, so it is unclear how many stations are needed to evaluate Kampo competence. However, the present findings suggest that test reliability may be maintained with a small number of tasks when evaluating experienced candidates, whereas three or more tasks appear desirable when evaluating inexperienced candidates.
In our study, Cronbach’s α across stations 1, 2, and 3 ranged from 0.59 to 0.95. One candidate showed a low α, suggesting that reliability may not be maintained across the three tasks for candidates with little experience in Kampo medicine. For candidates with Kampo experience, however, the three tasks were highly reliable. This suggests that inexperienced examinees may show differences between stations when they are affected by nervousness or are not skilled at a particular task, whereas examinees with extensive Kampo experience were less affected by the content and circumstances of each task and demonstrated a consistent level of Kampo examination ability.
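For readers who wish to reproduce this kind of internal-consistency analysis, the sketch below shows how Cronbach’s α across the three stations could be computed. The data layout (rows as scored checklist items or candidates, columns as stations) and the example scores are hypothetical assumptions for illustration; the study does not specify the exact score matrix used.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (rows x items) score matrix.

    Here the three columns would correspond to the three OSCE stations,
    and each row to a scored checklist entry or candidate (a hypothetical
    layout; the study does not specify the exact one).
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each station's scores
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical example: checklist items scored 0-2, columns = stations 1-3
example = np.array([
    [2, 2, 1],
    [1, 2, 2],
    [2, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
])
print(f"Cronbach's alpha = {cronbach_alpha(example):.2f}")
```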
Reliability
The reliability of the Kampo-OSCE developed in this study was sufficiently high. A previous systematic review of inter-rater reliability in communication skills assessment reported an agreement between reviewers of 0.45 [16].
In the present study, we assigned one evaluator to each station and conducted the evaluations under direct observation. Afterwards, three evaluators scored the performances individually while watching video recordings, and the inter-rater reliability of the three evaluators was examined. A high degree of reliability was obtained among the three raters, indicating that they evaluated the Kampo consultations similarly.
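As a minimal sketch of how such inter-rater reliability could be quantified, the code below computes a two-way random-effects intraclass correlation coefficient, ICC(2,1), from a candidates-by-raters score matrix. The choice of ICC(2,1) and the example scores are assumptions for illustration only, since the study does not state which reliability coefficient was used.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    'ratings' is a (candidates x raters) matrix; here the three columns
    would be the three video-based evaluators. This is one plausible
    coefficient choice; the study does not state which was actually used.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)

    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-candidate variation
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater variation
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical station scores from three raters for six candidates
scores = np.array([
    [18, 17, 18],
    [12, 13, 12],
    [20, 19, 20],
    [15, 15, 14],
    [ 9, 10, 10],
    [16, 17, 16],
])
print(f"ICC(2,1) = {icc2_1(scores):.2f}")
```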
On the other hand, direct visual and video-recorded assessments differ. The inter-rater reliability in this study was assessed using video recordings and was high. Previous reports have found that assessments using clinical imaging and video correlate well with OSCEs [17]; because high reliability was obtained in the video-based evaluation, the inter-rater reliability of the Kampo-OSCE was judged to be high. It will nevertheless be necessary to consider whether the OSCE is better evaluated by direct observation or after the fact from video recordings.
Limitations
There are several limitations to this study. The first is the number of cases, which does not cover all the important clinical concepts of Kampo medicine. The blueprint includes scenarios for important clinical concepts from the perspective of Kampo medicine, namely qi stagnation, qi counterflow pattern, static blood, and fluid retention, but not qi deficiency or blood deficiency. It may therefore be desirable to add stations for qi deficiency and blood deficiency.
The second is the training of simulated patients. In this study, two evaluators with sufficient experience in Kampo medicine conducted a preliminary examination to improve the content validity of the simulated patients and confirmed that the scenarios and the simulated patients’ findings were consistent. However, the more experienced examinees felt uncomfortable with the differences between real patients and the scenarios. This has been pointed out previously in a study in which 28 internists took part in an OSCE of a cardiac physical examination using three methods: real patients, cardiac audio–video simulations combined with “normal” standardized patients, and a cardiac patient simulator [18]. The correlation coefficients between participants’ physical examination skills and diagnostic accuracy were 0.39 (P < 0.05) for real patients, 0.29 for standardized patients, and 0.30 for the cardiac patient simulator, and the correlation was significantly higher for real patients than for the standardized patients combined with the audio–video system [18]. This suggests that diagnostic accuracy is higher with real patients, which may be a limitation of using simulated patients to evaluate physical examinations. Because our test was conducted with simulated patients, experienced doctors were more likely to notice the differences between the findings of real and simulated patients, which was confusing for them. Recruiting simulated patients whose findings match the scenario remains a challenge. In addition, specialists may feel uncomfortable with simulated patients, which may prevent them from obtaining high scores. It will be necessary to increase the number of medical specialists who take the test and to examine its reliability in that group.
Third, evaluators will need to be trained before the test can be applied as a standardized examination for students. Whether reliability can be maintained as the number of evaluators increases remains an issue for future work.