
Automatic analysis of summary statements in virtual patients - a pilot study evaluating a machine learning approach



The ability to compose a concise summary statement about a patient is a good indicator of the clinical reasoning abilities of healthcare students. To assess such summary statements manually, a rubric based on five categories (use of semantic qualifiers, narrowing, transformation, accuracy, and global rating) has been published. Our aim was to explore whether computer-based methods can be applied to automatically assess, in real time and based on the available rubric, summary statements composed by learners in virtual patient scenarios, and thus serve as a basis for immediate feedback to learners.


We randomly selected 125 summary statements in German and English composed by learners in five different virtual patient scenarios. We then manually rated these statements based on the rubric plus an additional category for the use of the virtual patient’s name. We implemented a natural language processing approach in combination with our own algorithm to automatically assess the same 125 statements and compared the results of the manual and automatic rating in each category.


We found moderate agreement between the manual and automatic rating in most of the categories. However, further analysis and development are needed, especially for a more reliable assessment of factual accuracy and the identification of patient names in the German statements.


Despite some areas for improvement, we believe that our results justify a careful display of the computer-calculated assessment scores as feedback to the learners. It will be important to emphasize that the rating is an approximation and to give learners the possibility to report supposedly incorrect assessments, which will also help us to further improve the rating algorithms.



Clinical reasoning is a complex set of skills healthcare students have to acquire during their education. Apart from face-to-face teaching scenarios, such as bedside teaching, clinical reasoning can be trained with web-based virtual patients (VPs) [1]. VPs are scalable, allow for deliberate practice, and provide a safe environment in which neither students nor patients are harmed.

CASUS [2] is a virtual patient software that supports clinical reasoning training in multiple ways: with a variety of question types, a specific clinical reasoning tool [3], and the composition of a summary statement by the learners.

Summary statements are short presentations of a patient, usually one to three sentences long. The ability to present a patient in such a concise way is a good indicator of clinical reasoning skills, because the student has to summarize and synthesize a patient’s information [4]. In CASUS, learners currently get feedback in the form of a static expert statement after having submitted their own statement, but the statements are not yet assessed automatically; thus, no dynamic and individual feedback is provided.

Smith et al. have developed a rubric to assess the quality of such summary statements and provide structured feedback to learners [5]. Their rubric includes five components: factual accuracy, appropriate narrowing of the differential diagnosis, transformation of information, use of semantic qualifiers (SQ), and a global rating. Each component can be rated on a two- or three-point scale. With this detailed assessment of different aspects, the rubric can help learners to monitor and assess their progress. However, this approach is based on human raters; for real-time rating and feedback in VPs, the summary statements have to be analyzed automatically.

In recent years, natural language processing (NLP) and machine learning (ML) tools have become more accessible as services and have also been applied in medical education [6]. Such techniques aim to enable computers to parse and interpret spoken or written human language as humans would do [6]; for example, Denny et al. describe the use of NLP to identify competencies from students’ clinical notes [7], and Spickard et al. extracted and cataloged concepts from students’ clinical notes to track their progress [8].

The aim of our project was to combine the rubric by Smith et al. with NLP approaches to test whether an automatic real-time assessment of summary statements can serve as a basis for providing structured qualitative feedback to learners without the need to manually train such a system for each individual VP.


From January 2017 to July 2019, 100 virtual patients in German and English were provided in two open-access courses in CASUS to healthcare students worldwide as a voluntary and self-directed training opportunity [2]. Each expert-reviewed VP included a clinical reasoning tool that was developed specifically to support the acquisition of clinical reasoning skills with a concept mapping approach [3]. Additionally, in each VP learners were prompted to compose a short statement summarizing the patient’s history; a brief introductory video explained the purpose and main components of such a statement [9]. Feedback was provided in the form of an exemplary summary statement composed by the VP author. Overall, during this period of data collection, learners created 1505 summary statements in German and English.

For the purpose of this project we selected five VPs covering a broad range of key symptoms, such as fever, abdominal pain, or dyspnea, with acute or chronic courses of disease and different final diagnoses, such as asthma, ulcerative colitis, or pneumonia. From these five virtual patients we randomly selected 125 summary statements in both languages and collected them in an Excel file. Two healthcare professionals (IK, IH) independently rated the 125 statements based on the assessment rubric published by Smith et al. (Table 1). Additionally, to emphasize a patient-centered approach, we included a new category to assess whether the patient was addressed by his or her name in the statements. After studying and discussing the assessment rubric (Table 1), the two healthcare professionals independently rated 25 statements and then discussed any divergent codings. After reaching consensus in all categories, they coded the remaining 100 statements. Disagreements between the raters were resolved in a personal discussion, and consensus was reached in all cases.

Table 1 Rating rubric suggested by Smith et al. (0 = None, 1 = Some, 2 = Appropriate) [5] and additional category “patient name”

Based on a focused internet search we evaluated potential NLP tools and software solutions that could support the analysis of summary statements by creating a semantic structure of the written texts. We decided to try the Python framework spaCy [10] because it is

  • applicable for summary statements in English, German, and potentially other languages

  • potentially applicable for real-time assessment via an API

  • open-source.

For optimal results, we followed a two-step approach combining the available metadata of the VP for each category and the controlled vocabulary thesaurus MeSH (Medical Subject Headings) with an analysis with spaCy.

First, we created with spaCy a tree of entities, sentences, and tokens of the summary statements.

Second, we used the spaCy tree to assess the five components of the rubric and the additional patient name category (see Table 2).

Table 2 Computer-based calculation of the scores in the six categories

For both steps we applied general rules and no VP-specific algorithms to guarantee the applicability of our approach for a broad variety of VPs.

For real-time feedback the time needed to calculate the rating is an important factor; thus, we optimized the algorithm in terms of performance and recorded the time needed for the longest summary statement.

For comparing the manual and the automatic rating we calculated Cohen’s kappa using SPSS version 26, with values of 0.01 to 0.20 considered none/slight, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1.00 almost perfect agreement.
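The agreement measure can also be reproduced outside SPSS. The following is a minimal sketch of unweighted Cohen's kappa together with the agreement bands quoted above; the function names and the example ratings are invented for illustration and are not taken from the study data.

```python
from collections import Counter


def cohens_kappa(manual, automatic):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    if len(manual) != len(automatic) or not manual:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(manual)
    # Observed agreement: share of statements with identical ratings.
    observed = sum(m == a for m, a in zip(manual, automatic)) / n
    # Expected chance agreement, derived from the marginal label frequencies.
    manual_counts = Counter(manual)
    automatic_counts = Counter(automatic)
    expected = sum(manual_counts[k] * automatic_counts[k] for k in manual_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both ratings consist of one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Map a kappa value to the agreement bands quoted in the text."""
    if kappa >= 0.81:
        return "almost perfect"
    if kappa >= 0.61:
        return "substantial"
    if kappa >= 0.41:
        return "moderate"
    if kappa >= 0.21:
        return "fair"
    return "none/slight"


# Invented ratings for ten statements in one rubric category (0, 1, or 2).
manual_ratings = [0, 1, 1, 2, 0, 1, 2, 1, 0, 1]
automatic_ratings = [0, 1, 2, 2, 0, 1, 1, 1, 0, 0]
kappa = cohens_kappa(manual_ratings, automatic_ratings)
print(round(kappa, 3), agreement_label(kappa))
```

For an ordinal 0/1/2 scale a weighted kappa would also be conceivable; the sketch uses the unweighted form because the text does not mention weighting.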

We received ethical approval from the Ethical Committee of the University of Munich for the anonymous collection and analysis of summary statements.


The comparison of manual and computer-based rating in the six categories is shown in Table 3. The detailed results for 50 exemplary summary statements can be found in Additional file 1.

Table 3 Comparison of manual (columns) and automatic (rows) rating of summary statements in the six categories and Cohen’s kappa as measure of agreement between the manual and the automatic rating

Overall, Table 3 shows substantial agreement (κ ≥ .61) between the manual and the automatic rating in the category “patient name”, fair agreement for the category “factual accuracy”, and moderate agreement (κ ≥ .41) for all other categories. Complete mismatches with a rating distance of 2 occur in two categories (appropriate narrowing and transformation), each with one statement rated 2 manually and 0 automatically.
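A cross-tabulation of this kind, with manual ratings in columns and automatic ratings in rows, and the count of complete mismatches can be sketched as follows. The helper names and the six example ratings are hypothetical and serve only to illustrate the layout.

```python
from collections import Counter


def crosstab(manual, automatic, levels=(0, 1, 2)):
    """Rows are automatic ratings, columns are manual ratings."""
    counts = Counter(zip(automatic, manual))
    return [[counts[(a, m)] for m in levels] for a in levels]


def complete_mismatches(manual, automatic):
    """Number of statements whose two ratings differ by 2."""
    return sum(abs(m - a) == 2 for m, a in zip(manual, automatic))


# Invented ratings for six statements in one three-level category.
manual = [2, 1, 0, 2, 1, 1]
automatic = [0, 1, 0, 2, 2, 1]
table = crosstab(manual, automatic)
mismatch_count = complete_mismatches(manual, automatic)
```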

When looking into the results of the analysis of German and English summary statements, we detected some issues in the “patient name” category. The NLP identified all 35 persons in the English statements, with two false hits, but for German statements none of the 10 patient names were identified.

The following shows an example of a summary statement for a VP with tuberculosis: “67 year old patient, presents with a cough that lasted 3 months. Has a smoking history. Has experienced weight loss and loss of appetite. Green sputum. Earlier diagnosed with hypertension, treated in China.”

The NLP tree of this statement is shown in Fig. 1.

Fig. 1 NLP tree of an exemplary summary statement indicating the type of entity, such as noun, verb, or adjective, and the type of dependencies between entities. For example, “3” is a numeric modifier (nummod) for “months”. The list of annotations can be found at

Our algorithm was then able to identify and classify the following terms:

“67 year old (date) patient, presents with a cough (finding) that lasted 3 months (duration). Has a smoking (finding) history. Has experienced weight loss (finding) and loss of appetite (finding). Green sputum (anatomical term). Earlier diagnosed with hypertension (diagnosis, hyper = SQ), treated in China (country).”

This leads to the following calculated scores:

  1. SQ = 0 (1 SQ identified)

  2. Appropriate narrowing = 1 (3 matching terms with expert statement or VP metadata)

  3. Transformation = 1 (2 terms indicating a transformation)

  4. Accuracy = 1 (no incorrect information identified)

  5. Patient name = 0 (no patient name identified)

  6. Global rating = 1 (sum = 3)

The rater assessed the exemplary statement similarly; only the transformation was rated 0 instead of 1.

The measured time for the summary statement analysis was on average 1.3 s, with a maximum of 5.8 s for the longest statement.


The aim of our project was to test whether an automatic rating of summary statements based on the rubric provided by Smith et al. can be used for providing real-time feedback to learners by applying general rules without having to train a system specifically for a VP. Overall, we believe that the results of our combined approach for the six components are promising, showing moderate agreement between the manual and automatic rating for most of the categories and only very few complete mismatches with a rating distance of 2.

For some components, we identified difficulties in achieving more reliable results. The main challenge in the category “patient name” was the German statements, in which we could not identify names or persons at all due to the limitations of the NLP model. This could be solved by providing the name of the VP as metadata and comparing it directly with the statement.
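The metadata-based comparison proposed here could look roughly like the sketch below. It assumes that the VP metadata contains the patient's name as a plain string; the function name and the example names are invented, and the matching rule (any name part as a whole word, ignoring case) is only one possible choice.

```python
import re


def mentions_patient_name(statement, patient_name):
    """Return 1 if any part of the VP's name occurs as a whole word, else 0."""
    # Split the metadata name into parts so that the surname or first name alone counts.
    parts = [p for p in re.split(r"\s+", patient_name.strip()) if len(p) > 1]
    for part in parts:
        # \b prevents "Anna" from matching inside a longer word such as "Annahme".
        if re.search(rf"\b{re.escape(part)}\b", statement, flags=re.IGNORECASE):
            return 1
    return 0


# Hypothetical German statements and metadata names.
with_name = mentions_patient_name(
    "Frau Müller, 45 Jahre, stellt sich mit Fieber vor.", "Petra Müller"
)
without_name = mentions_patient_name(
    "45-jährige Patientin mit Fieber und Husten.", "Petra Müller"
)
```

Because the comparison works on the raw text, it does not depend on the language model recognizing persons, which is the limitation described for the German statements.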

With only fair agreement (κ = .366), the category “factual accuracy” in particular requires further refinement. Of our 125 randomly selected summary statements, only 17 were rated as not accurate in the manual assessment, and only five of these were correctly identified by our algorithm. This low number and the great variety of potential errors in statements make it difficult to achieve a more reliable detection of non-accurate statements. To further improve our algorithm to detect errors, we will have to specifically collect and analyze more non-accurate statements. Despite the importance of accuracy for the rating of a statement, it seems to be a difficult category to rate; in the study by Smith et al., interrater reliability was also lowest for this category [5]. Their plan for improvement was the further development of the rubric from a binary to a multiple-option category. Such a specification might also help to further develop our algorithm to categorize and detect potential error sources.

In contrast to the rating rubric by Smith et al., we calculated a more specific ratio for all categories except patient name, factual accuracy, and global rating, which was then translated by thresholds into the 0/1/2 rubric. In doing so, we lost some information that could give learners a better and more accurate understanding of their performance.
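The loss of information caused by such a threshold translation can be illustrated with a short sketch. The cut-off values `some` and `appropriate` below are placeholders chosen for the example and are not the thresholds used in the study.

```python
def ratio_to_rubric_score(ratio, some=0.25, appropriate=0.6):
    """Translate a ratio in [0, 1] into the 0/1/2 rubric scale.

    The cut-offs are hypothetical placeholders, not the study's values.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    if ratio >= appropriate:
        return 2
    if ratio >= some:
        return 1
    return 0


# Two statements with clearly different ratios receive the same rubric score.
low_score = ratio_to_rubric_score(0.27)
high_score = ratio_to_rubric_score(0.58)
print(low_score, high_score)
```

Showing the ratio itself rather than only the translated score would keep this difference visible to the learner.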

The analysis of a summary statement is a complex task, requiring an average of 1.3 s per statement, with 58 of the longer statements requiring more than 1 s, which according to Nielsen is the limit for an uninterrupted flow of thought [12]. Hence, displaying the analysis results as real-time feedback to the learners in their learning analytics dashboard will require a pre-calculation in the background to guarantee an uninterrupted user experience.
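One way to organize such a background pre-calculation is sketched below, assuming the analysis can be started when the learner submits the statement and read later when the dashboard is opened. The class `RatingCache` and the stand-in function `analyze_statement` are hypothetical; the real analysis would replace the stand-in.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyze_statement(text):
    """Stand-in for the real analysis; sleeps briefly to imitate a slow NLP call."""
    time.sleep(0.05)
    return {"length": len(text.split())}


class RatingCache:
    """Starts the analysis on submission and hands out the result later."""

    def __init__(self, analyzer, workers=2):
        self._analyzer = analyzer
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = {}

    def submit(self, statement_id, text):
        # Returns immediately; the analysis runs in a worker thread.
        self._pending[statement_id] = self._pool.submit(self._analyzer, text)

    def result(self, statement_id, timeout=None):
        # Blocks only if the background job has not finished yet.
        return self._pending[statement_id].result(timeout=timeout)

    def shutdown(self):
        self._pool.shutdown(wait=True)


cache = RatingCache(analyze_statement)
start = time.perf_counter()
cache.submit("s1", "67 year old patient, presents with a cough that lasted 3 months.")
submit_seconds = time.perf_counter() - start  # waiting time at submission
rating = cache.result("s1")  # fetched later, e.g. when the dashboard is opened
cache.shutdown()
```

A thread pool keeps the sketch short; for a CPU-heavy analysis a separate worker process or a task queue may be the better fit.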


For our project, we randomly selected 125 statements from five VPs, which is quite a low number compared to the overall number of summary statements already collected and the number of VPs available in the CASUS system. When selecting the VPs for the project our aim was to cover a broad spectrum of findings and differential diagnoses, but we cannot exclude that for specific VPs the algorithm might return less accurate ratings. More testing with a higher number of summary statements from the five VPs and from additional VPs is needed to further validate our results. Finally, we cannot exclude that due to a volunteer bias the summary statements are more homogeneous than without such a bias. However, assuming that volunteer learners tend to be more motivated and engaged [13], but also having only a few statements with a global rating of 2 (see Table 3), we believe that it is unlikely that such a bias had an influence on our results. Unfortunately, we do not have similar studies to compare our results to.


Overall, most of the categories show moderate agreement between the manual and the automatic rating, which we think is a justifiable starting point for careful feedback to the learners about their performance in summary statement composition as part of the learning analytics dashboard. However, we would refrain from displaying the absolute rubric scores (0, 1, or 2) and would instead display the underlying ratio in each category. It will also be important to emphasize the possibility of false interpretations of the automatic rating and give learners the chance to provide feedback concerning the assessment of their statement. This feedback will also form an important step in further improving our algorithm.

Apart from analyzing summary statements, our approach might also be a first step for analyzing other texts composed by healthcare students, such as e-portfolio entries.

Availability and requirements

Project name: Effective clinical reasoning in virtual patients

Project home page:

Operating system(s): Platform independent

Programming language: Java, Python

Other requirements: none

License: e.g. MIT

Any restrictions to use by non-academics: none

Availability of data and materials

The dataset with the 125 summary statements in English and German, including the results of the manual and automatic rating, can be obtained from the authors; an exemplary dataset of 50 summary statements in English is included in this article as Additional file 1.



NLP: Natural language processing


SQ: Semantic qualifiers


VP: Virtual patients


  1. Cook DA, Triola MM. Virtual patients: a critical literature review and proposed next steps. Med Educ. 2009;43(4):303–11.

  2. CASUS virtual patient system. Available from Accessed 12 Feb 2020.

  3. Hege I, Kononowicz AA, Adler M. A clinical reasoning tool for virtual patients: design-based research study. JMIR Med Educ. 2017;3(2):e21.

  4. Bowen JL. Educational strategies to promote clinical diagnostic reasoning. N Engl J Med. 2006;355:2217–25.

  5. Smith S, Kogan JR, Berman NB, Dell MS, Brock DM, Robins LS. The Development and Preliminary Validation of a Rubric to Assess Medical Students’ Written Summary Statements in Virtual Patient Cases. Acad Med. 2016;91(1):94–100.

  6. Chary M, Parikh S, Manini A, Boyer E, Radeous M. A Review of Natural Language Processing in Medical Education. Western J Emerg Med. 2018;20(1):78–86.

  7. Denny JC, Spickard A, Speltz PJ, Porier R, Rosenstiel DE, Powers JS. Using natural language processing to provide personalized learning opportunities from trainee clinical notes. J Biomed Inform. 2015;56:292–9.

  8. Spickard A, Ridinger H, Wrenn J, O’brien N, Shpigel A, Wolf M, et al. Automatic scoring of medical students’ clinical notes to monitor learning in the workplace. Med Teach. 2014;36(1):68–72.

  9. Video about a summary statement composition. Accessed 12 Feb 2020.

  10. spaCy natural language processing. Accessed 12 Feb 2020.

  11. Connell KJ, Bordage G, Chang RW. Assessing clinicians’ quality of thinking and semantic competence: a training manual. Chicago: University of Illinois at Chicago, Northwestern University Medical School, Chicago; 1998.

  12. Nielsen Norman Group. Response Times: The 3 important limits. Accessed 12 Feb 2020.

  13. Callahan CA, Hojat M, Gonnella JS. Volunteer bias in medical education research: an empirical study of over three decades of longitudinal data. Med Educ. 2007;41(8):746–53.


We would like to thank the learners who worked in CASUS with the virtual patients and provided the summary statements that formed the basis for this project.


The virtual patients were part of a project that received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 654857. The project is part of the DID-ACT project, which is co-funded by the Erasmus+ Programme of the European Union (612454-EPP-1-2019-1-DE-EPPKA2-KA). Open Access funding enabled and organized by Projekt DEAL and publication funded by the DID-ACT project.

Author information




IK and IH manually rated the data. MA and IH developed the software to automatically rate the data. IH drafted the manuscript and IK and MA contributed significantly. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Inga Hege.

Ethics declarations

Ethics approval and consent to participate

We obtained ethical approval from the ethical committee at Ludwig-Maximilians Universität Munich, Germany (reference number: 260–15) for an anonymous analysis of the data. Consent to participate is not applicable. The virtual patients provided to the learners were fictional and not based on any real persons.

Consent for publication

Not applicable.

Competing interests

MA is CEO of the not-for-profit company Instruct, which develops and distributes the VP system CASUS, which was used to collect the summary statements. IH is a member of the editorial board of BMC Medical Education. Otherwise, the authors declare that they have no conflict of interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Exemplary 50 summary statements with manual and automatic rating. The appendix contains 50 analyzed English summary statements with the manual and automatic rating based on the rubric and the time of the automatic assessments.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Hege, I., Kiesewetter, I. & Adler, M. Automatic analysis of summary statements in virtual patients - a pilot study evaluating a machine learning approach. BMC Med Educ 20, 366 (2020).
