
AI in medical education: uses of AI in construction type A MCQs


Abstract

Background

The introduction of competency-based education models and student-centered learning, together with the increased use of formative assessments, has created a demand for high-quality test items. This study aimed to assess the use of an AI tool to generate type A MCQs and to evaluate their quality.

Methods

A cross-sectional analytic study was conducted from June 2023 to August 2023, using a formative team-based learning (TBL) session. The AI tool (ChatPdf.com) was selected to generate type A MCQs. The generated items were evaluated using a questionnaire for subject experts and an item (psychometric) analysis. The expert questionnaire addressed item quality and a rating of item difficulty.

Results

The total number of recruited staff members (experts) was 25, and the questionnaire response rate was 68%. The quality of the items ranged from good to excellent. None of the items contained a scenario or vignette; all were direct questions. According to the experts' ratings, easy items represented 80%, and only two items (20%) were rated moderately difficult. Only one of the two moderately difficult items had a matching difficulty index. The total number of students participating in the TBL was 48. The mean mark was 4.8 ± 1.7 out of 10, and the KR20 was 0.68. Most items were moderately difficult (90%), and only one was difficult (10%). The discrimination index of the items ranged from 0.77 to 0.15. Items with excellent discrimination represented 50% (5), three items (30%) had good discrimination, one item (10%) had poor discrimination, and one was non-discriminating. The functional distractors numbered 26 (86.7%), and the non-functional distractors numbered four (13.3%). According to the distractor analysis, 60% of the items were excellent and 40% were good. A positive correlation was found between the difficulty and discrimination indices (r = 0.30, p = 0.4).

Conclusion

Items constructed using AI had good psychometric properties and quality, measuring higher-order domains. AI allows the construction of many items within a short time. We hope this paper brings the use of AI in item generation and the associated challenges into a multi-layered discussion that will eventually lead to improvements in item generation and assessment in general.


Background

The introduction of competency-based education models and student-centered learning, together with the increased use of formative assessment, has created a demand for high-quality test items [1]. Moreover, the popularity of progress and exit tests requires large numbers of test items. MCQs are the most commonly used assessment tools because their reliability and validity are well established and they can cover a wide range of knowledge [2,3,4,5,6]. However, the construction of high-quality type A MCQs has been reported to be difficult and time-consuming [4, 7,8,9].

Many authors have published guidelines for constructing high-quality MCQs [2, 9,10,11]. In general, these guidelines can be classified as pre-construction, during-construction, and post-construction. Pre-construction guidelines include the presence of a valid blueprint and of the content material from which the MCQs will be constructed. The most important post-construction (use) guideline is item analysis, because it provides item writers with feedback about item quality.

The growing need for test items requires new approaches to item construction and generation in addition to the traditional method [1]. Artificial intelligence (AI) can provide such advancements. AI commonly refers to computer technologies that mimic or simulate processes supported by human intelligence; some can perform tasks that involve human interpretation and decision-making [12]. Education is considered the most relevant field of AI application [13]. The use of AI in medical education gained early attention, and the guidelines set by UNESCO aim to achieve excellence in education [14, 15]. In education, AI is embedded in many technological innovations that provide learning analytics, recommendations, and diagnostic tools for a variety of purposes [16]. AI in education is not limited to traditional face-to-face teaching and smart learning environments; it is primarily used in e-learning to enable automated and personalized learning processes. These processes are based on adaptive learning, machine learning, ontologies, semantic technologies, natural language processing, and deep learning [17]. In medical education, AI applications have been linked to feedback, simulation-based training in medicine, adaptive learning systems, generated assessment tasks, self-assessment, automatic scoring of student work, and virtual operative assistants [18].

AI has been shown to be useful in feedback, assessment, and formative evaluation [17]. AI-driven applications for automated question generation (AQG) are promising advancements. They can significantly simplify the process of generating meaningful and relevant questions from textual educational material, facilitating personalized and engaging learning experiences, efficient evaluation of students' understanding, targeted feedback, and improved educational outcomes for students [19, 20]. In AQG, content experts are required to articulate the factors that would guide them through the different ways of solving a clinical problem [1].

AI applications also make assessments easier to construct and maintain. Jia et al. (2021) proposed a two-step method to improve the quality of automatically constructed assessments [21]: the first step applies a rough-answer and key-sentence tagging scheme, and the second captures inter-sentence and intra-sentence relations using an answer-guided graph convolutional network for question generation. Success with such approaches has been reported to require large-scale, relevant datasets for training the question generators [22].

AI-driven applications, tools, and techniques can reduce the challenges of traditional methods, namely the effort of item construction and the maintenance of item stability [23]. With AI, it becomes easier and more feasible to construct and update items (questions) and to build item banks. Different AI applications and tools have been used and recommended for item generation; some are costly and require substantial technical support, although they yield good results [24,25,26,27]. This study used a free and simple tool to generate items. The study aimed to assess the use of an AI tool to generate type A MCQs and to evaluate their quality.

Methods

The study design was cross-sectional analytic [28], conducted at the Department of Basic Medical Sciences (anatomy unit) from June 2023 to August 2023. The sampling technique was total coverage of the students registered in the musculoskeletal course, College of Medicine (2022–2023); 48 participants were included. The activity used was team-based learning (TBL), and the number of items (questions) was 10, in accordance with the regulations and recommendations for conducting TBL.

The study context

This study utilized formative TBL during the musculoskeletal course (2022–2023). The TBL content was anatomy, and the topic was the posterior abdominal wall. The TBL and topic were chosen to ensure the one-dimensionality of the content material, which can affect internal consistency. TBL is a student-centered instructional methodology for large classes of students divided into small teams of five to seven students [29]. TBL comprises three parts: pre-class, in-class, and post-class [29, 30]. The pre-class part is divided into teacher responsibilities and student responsibilities.

Teacher responsibilities include selecting the content material and objectives, creating the student teams, providing students with the TBL objectives and the recommended materials or textbooks, and creating the individual readiness test (i-Rat) and group readiness test (g-Rat). The students' responsibility is to study according to the provided TBL objectives and reading materials. Team creation and assignment are handled by the college's medical education unit, and teams are kept together for an academic year.

The in-class part is likewise divided into student and teacher responsibilities. Students answer the i-Rat and then, through the g-Rat, use group skills and dynamics to fill any knowledge gaps. The teacher's responsibilities are to lead the discussion, open new threads, and provide clarification through a mini-lecture if needed. The last in-class part is the application of the learned knowledge. The post-class part includes appeals or assignments to support the application of knowledge. The i-Rat comprised 10 type A MCQs (reused in the g-Rat). The TBL was conducted by a content expert (anatomist) [29,30,31] and was managed according to the institute's regulations.

The selection of an AI tool

The AI tool was selected according to the following criteria: 1) availability, 2) free of charge, 3) ease of use, 4) support for the PDF file format (as all the reference books are available as PDFs), 5) safety, and 6) confidentiality. Applying these criteria yielded two candidate tools, ChatPdf.com and ChatPDF.ai; ChatPdf.com (https://www.chatpdf.com/) was used for this study.

The selected tool uses natural language processing algorithms and deep learning technology to enable users to ‘chat’ with PDF documents. It scans the PDF using optical character recognition and extracts the text, which becomes the data source that the AI analyzes and responds from.

Generation of questions (items)

A PDF file on the posterior abdominal wall was prepared from a standard anatomy textbook. The file was a section of the textbook, which is one of the books recommended for student teaching and learning.

Process of generation

  1. The PDF file was uploaded to the tool.

  2. The tool took about one minute to read the file and produce a short summary of it.

  3. The tool was instructed to construct type A MCQs (Fig. 1).

  4. The tool then constructed the questions in order.

Fig. 1 Uploading of the PDF file and construction of the questions
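In principle, the same four steps could be scripted rather than performed through the web interface. The sketch below is illustrative only: the study used the ChatPdf.com web interface, and the endpoint paths, header name, and response fields shown here are assumptions modelled on the vendor's publicly documented API and may differ.

```python
import requests

API_KEY = "YOUR_CHATPDF_API_KEY"         # assumed: an API key issued by the service
BASE_URL = "https://api.chatpdf.com/v1"  # assumed endpoint; the study used the web interface

# Steps 1-2: upload the PDF so the tool can read it and build its summary.
with open("posterior_abdominal_wall.pdf", "rb") as fh:
    upload = requests.post(
        f"{BASE_URL}/sources/add-file",
        headers={"x-api-key": API_KEY},
        files={"file": ("posterior_abdominal_wall.pdf", fh, "application/pdf")},
        timeout=120,
    )
upload.raise_for_status()
source_id = upload.json()["sourceId"]    # assumed response field

# Steps 3-4: instruct the tool to construct type A MCQs from the uploaded content.
prompt = (
    "Construct 10 type A multiple choice questions (one best answer, a stem and "
    "four options) covering the posterior abdominal wall."
)
reply = requests.post(
    f"{BASE_URL}/chats/message",
    headers={"x-api-key": API_KEY},
    json={"sourceId": source_id, "messages": [{"role": "user", "content": prompt}]},
    timeout=120,
)
reply.raise_for_status()
print(reply.json()["content"])           # the generated items, returned as text
```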

Question evaluation and assessment

The generated items were evaluated using a questionnaire for subject experts and an item (psychometric) analysis. The questionnaire was tested in a pilot study; neither the experts who took part in the pilot nor its results were included in the main analysis.

The expert questionnaire on item construction and quality was based on and adapted from the work of Case and Swanson (2016) and other authors [2, 4, 32] (Table 1).

Item difficulty was rated using a modified Angoff method (Table 1) [33, 34].

Deviation among expert ratings was limited to 30%. When a rating deviated beyond this limit, the expert concerned was asked either to re-rate the item in light of the other experts' ratings or to keep their original rating [33].

In the item difficulty part, experts were asked to rate the items as easy, moderately difficult, or difficult using a three-point Likert scale. The questionnaires were distributed to a target group of experts selected according to the following criteria: staff members in medical colleges (within the KSA), experience in teaching human anatomy, experience in medical education, and work in a medical college with a similar or equivalent curriculum (problem-based, SPICES). The total number of recruited staff members was 25.
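A minimal sketch of how the expert ratings could be consolidated, assuming the 30% limit refers to the share of experts who deviate from the consensus rating for an item; the data and the decision rule below are illustrative assumptions, not the study's instrument.

```python
from collections import Counter

# Hypothetical expert ratings, one list per item, on the three-point scale
# used in the questionnaire (1 = easy, 2 = moderately difficult, 3 = difficult).
ratings = {
    "item_1": [1, 1, 1, 2, 1],
    "item_2": [2, 2, 3, 2, 2],
}

DEVIATION_LIMIT = 0.30  # assumed reading of the 30% limit on rating deviation

for item, votes in ratings.items():
    consensus, agree = Counter(votes).most_common(1)[0]
    deviation = 1 - agree / len(votes)  # share of experts deviating from the consensus
    action = "ask expert(s) to re-rate" if deviation > DEVIATION_LIMIT else "accept"
    print(f"{item}: consensus = {consensus}, deviation = {deviation:.0%} -> {action}")
```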

The generated questions (items) were used in the i-Rat of the TBL. The students' responses to the i-Rat were verified, marked, and analyzed, and an item analysis was then generated. Item analysis provides feedback about item construction, validity, and reliability after an item has appeared in an examination [5, 35, 36]. The parameters of item analysis include the KR20, difficulty index (p-value), discrimination index, and distractor efficiency (analysis) [5, 8, 37].

Cronbach’s alpha (KR20) was used to estimate exam reliability (internal consistency) [38]. It also describes the dimensionality of the exam and the extent to which the exam measures the same concept or construct [37, 39]. The value of KR20 is affected by the number of items in the exam, the difficulty index, the number of examinees, and their performance [37, 39, 40]. A value of 0.7 has been reported as acceptable for a short test (fewer than 50 items) and 0.8 for a large exam (more than 50 items) [41]. Another classification regards values above 0.7 as good, 0.6–0.7 as acceptable, 0.5–0.6 as poor, and values below 0.5 (particularly below 0.30) as unreliable [42].

The difficulty index (Dif. index) was calculated as the percentage of students who answered the item correctly (absolute difficulty) [37]. It ranges from 0 to 100, where higher values indicate easier questions and lower values indicate harder items. Item difficulty was categorized as difficult (< 39), moderate (40–80), or easy (> 80) [37, 43].

The item discrimination index (power) (Dis. index) was calculated as the difference between the number of correct responses in the upper 27% and the lower 27% of examinees, divided by the number of participants in each group. The discrimination index (DIS) ranges from 1.0 to − 1.0; a positive value indicates that high achievers answered the item correctly more often than low achievers, and vice versa [35, 44]. The DIS was categorized as excellent (≥ 0.35), good (0.25–0.34), acceptable (0.21–0.24), or poor (≤ 0.20) [45, 46].

For distractor analysis, any distractor selected by fewer than 5.0% of the students was considered non-functional (NFD), whereas a functional distractor (FD) is one selected by 5% or more [47]. According to the number of NFDs, items were classified as excellent (0 NFDs, score 100), good (1 NFD, score 66.6), moderate (2 NFDs, score 33.3), or poor (3 NFDs, score 0).
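As a worked illustration of these definitions, the sketch below computes the difficulty index, discrimination index, KR20, and a simple distractor check from a simulated 48 × 10 response matrix; the data, and the single item used for the distractor check, are illustrative rather than the study's responses.

```python
import numpy as np

# Illustrative response matrix: one row per student, one column per item,
# 1 = correct, 0 = incorrect. Simulated data, not the study's responses.
rng = np.random.default_rng(seed=1)
responses = rng.integers(0, 2, size=(48, 10))

n_students, n_items = responses.shape
totals = responses.sum(axis=1)

# Difficulty index: percentage of students answering each item correctly.
difficulty_index = responses.mean(axis=0) * 100

# Discrimination index: proportion correct in the upper 27% of total scorers
# minus the proportion correct in the lower 27%.
k = max(1, round(0.27 * n_students))
order = np.argsort(totals)
discrimination_index = responses[order[-k:]].mean(axis=0) - responses[order[:k]].mean(axis=0)

# KR20: internal consistency for dichotomously scored items.
p = responses.mean(axis=0)
kr20 = (n_items / (n_items - 1)) * (1 - (p * (1 - p)).sum() / totals.var(ddof=1))

# Distractor check for a single item: distractors chosen by < 5% of students
# are non-functional (NFD); with three distractors per item, 0/1/2/3 NFDs
# correspond to excellent/good/moderate/poor.
choices = rng.choice(list("ABCD"), size=n_students)  # simulated option choices; "A" is the key
non_functional = [opt for opt in "BCD" if np.mean(choices == opt) < 0.05]

print(np.round(difficulty_index, 1))
print(np.round(discrimination_index, 2))
print(round(kr20, 2), non_functional)
```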

Table 1 Checklist for MCQ item quality

Ethical consideration

The study was approved by the Research and Ethics Committee of the College of Medicine, University of Bisha (KSA). The students were informed about the study, and written consent was obtained from the participating students as part of the i-Rat. The students were informed that the TBL was formative and that their participation in the study did not impact their grades or grade point average (GPA). In addition, the students could attend the TBL without their responses being included in the study. Permission to use the AI tool was obtained through the tool's support team.

Statistical analysis

Students’ responses to the i-Rat and the questionnaires were tabulated in MS Excel 2016 and analyzed with SPSS version 25. The KR20, difficulty index (absolute difficulty), discrimination index, and distractor efficiency (analysis) were calculated. The relationship between the difficulty index and the discrimination index was evaluated by Pearson correlation, and a p-value of < 0.05 was considered statistically significant. Data are presented as mean ± standard deviation and as frequency tables.
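The same correlation can be reproduced outside SPSS; the sketch below uses scipy with illustrative per-item values, not the study data.

```python
from scipy.stats import pearsonr

# Illustrative per-item indices; the study computed the same test in SPSS v25.
difficulty_index = [55, 48, 42, 60, 39, 51, 47, 44, 58, 33]
discrimination_index = [0.45, 0.38, 0.52, 0.30, 0.15, 0.41, 0.47, 0.62, 0.25, 0.77]

r, p = pearsonr(difficulty_index, discrimination_index)
print(f"r = {r:.2f}, p = {p:.3f}")  # p < 0.05 is taken as statistically significant
```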

Results

The total number of student participants was 48. The mean age and GPA of students were 20.2 ± 0.4 and 3.25 ± 0.35, respectively. All the students were in level three.

The questionnaire to experts

The total number of participating staff members (experts) was 25, and the questionnaire response rate was 68%. According to the questionnaire responses, item quality ranged from good to excellent. None of the items contained a scenario or vignette; all were direct questions. All items tested knowledge and passed the hand-cover test. The distractors and correct options followed the expected grammatical and logical forms. Only one item contained a repeated word between the stem and an option (a clang clue), and its difficulty and discrimination indices were within acceptable levels. The generated items were free of absolute terms and of the option “none of the above” (Fig. 2).

According to the expert ratings, the average rating of the items was easy: easy items represented 80%, and only two items (20%) were rated moderately difficult (Table 2). Only one of the two moderately difficult items had a matching difficulty index; for the remaining items, the expert rating differed from the calculated difficulty index.

Item analysis

The total number of students participating in the TBL was 48. The mean mark was 4.8 ± 1.7 out of 10; the maximum and minimum marks were 9 and 2, respectively. The KR20 was 0.68. The item analysis of the i-Rat is presented in Table 2. The average difficulty index was 47.7 ± 4.8 (moderate difficulty). Most items were moderately difficult (90%), and only one was difficult (10%). The discrimination index of the items ranged from 0.77 to 0.15, with a mean of 0.41 ± 0.08 (excellent discrimination). Items with excellent discrimination represented 50% (5), three items (30%) had good discrimination, one item (10%) had poor discrimination, and one was non-discriminating. The non-discriminating item had a moderate difficulty index. The total number of distractors was 30. The functional distractors numbered 26 (86.7%), and the non-functional distractors (NFDs) numbered 4 (13.3%). According to the distractor analysis, 60% of the items were excellent and 40% were good (Table 2). Items with deviating parameters in the item analysis (one difficult item, one non-discriminating item, and one with poor discrimination power) were checked for possible content, technical, or grammatical flaws.

The Pearson correlation test showed a positive correlation between the difficulty and discrimination indices (r = 0.30, p = 0.4).

Table 2 Item analysis of the i-Rat and expert rating of the items

Fig. 2 Item quality (n = 10)

Discussion

The items generated by the AI tool (ChatPdf.com) showed good psychometric properties. Most items were moderately difficult, and more than half had excellent discrimination and functional distractors. The generated items had acceptable average difficulty and discrimination indices, although the experts' average rating of the items was easy. Despite the expert rating, these findings satisfy the core psychometric criteria that educators look for. Similar findings have been reported for traditionally constructed anatomy [48], community medicine [49], and multidisciplinary summative examinations [50, 51].

The mean score on the i-Rat was 4.8 ± 1.7. The standard deviation was small, suggesting a low degree of variation in the students' scores: scores clustered around the mean, with limited dispersion between the minimum and maximum. Such clustering may indicate that the students' performance and competence were similar, or that the items evaluated students' knowledge consistently.

Despite the small number of items (10), the KR20 was 0.68, which is acceptable for an in-class assessment [52]. The KR20 is reported to be affected by many factors, such as exam time, the number and inter-relation of the items (dimensionality), item difficulty, variation in examinee responses, and the number of examinees [37]. The test time and number of items in the current study were in accordance with the recommendations. The TBL content was a single anatomical topic, and the session was conducted according to the standards of TBL conduction [29]. Most of the items were moderately difficult, and the total number of participants was 48. The relatively low KR20 was therefore attributed mainly to the small number of items, a view supported by the early work of Kehoe (1994), who reported that KR20 values as low as 0.5 are satisfactory for exams of 10–15 items [53].
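For context (the formula is not reproduced in the original text), the standard KR20 expression makes the dependence on test length explicit, where k is the number of items, p_i is the proportion of examinees answering item i correctly, q_i = 1 − p_i, and σ_X² is the variance of the total scores:

$$\mathrm{KR}_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^{2}}\right)$$

For a short test, the summed item variances remain large relative to the total-score variance, so the coefficient tends to be lower than for a longer test drawn from the same material, which is consistent with Kehoe's observation.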

Items with deviating item-analysis parameters were checked, and none contained content, technical, or grammatical flaws. The mean difficulty index of the i-Rat was 47.7 (moderate difficulty), which is lower than the values obtained by Escudero et al., Rao et al., and Licona-Chávez et al. for balanced high-stakes exams [5, 54, 55]. The recommended composition of an ideal difficulty-balanced exam is 5% easy items, 5% difficult items, 20% moderately easy, 20% moderately difficult, and 50% of average difficulty [5]. In the current study, moderately difficult items made up 90% of the test, compared with the recommended 20%.

The mean discrimination index of the i-Rat was 0.41, higher than that described by Licona-Chávez et al. [5]. Despite one non-discriminating item, the test discriminated well between higher- and lower-achieving students.

The present findings indicate a positive correlation between the difficulty and discrimination indices (r = 0.30, p = 0.4). Other authors have reported significant linear and dome-shaped relationships between the difficulty and discrimination indices [6, 56], indicating that an increase in the difficulty index is associated with an increase in the discrimination index. In contrast to the current findings, Alareifi (2023) described a strong negative correlation between the two indices (r = − 0.936, p = 0.00) [57]. Difficult items without content, technical, or grammatical flaws tend to be answered correctly by high achievers rather than low achievers; logically, such items can discriminate between students and therefore have a high discrimination index. A negative correlation between the difficulty and discrimination indices would instead imply that easy items discriminate well, yet easy items answered by most examinees cannot discriminate between them.

In the current study, the experts rated most items as easy and only two as moderately difficult, whereas the calculated difficulty index was moderate for most items and only one item was difficult. According to the difficulty index, the experts therefore underestimated item difficulty. The literature contains variation and contradiction in experts' judgments of item difficulty [58,59,60]. Experts have been reported to estimate the relative difficulty of items better than their absolute difficulty [58], and to underestimate the proportion of easy items. These discrepancies between the absolute difficulty index and the expert ratings could be due to differences in experts' opinions about the items, their experience, and the process they use to estimate the difficulty of test items [59].

Items generated through AQG have psychometric properties similar to those of traditionally constructed items [1, 61]. In the current study, the AI-generated items had acceptable difficulty indices, discrimination, and distractor performance, and the KR20 was affected mainly by the small number of items. According to the present findings, the AI-generated items were produced in less time, required no review for language editing or construction, and incurred no cost. These findings address the core objective of reducing the burden on teachers while providing the required number of good-quality items. Human input will still guide data entry by selecting content material appropriate to the intended assessment and will safeguard item quality by avoiding overlapping items and related technical flaws. Teachers will also continue to control the distribution of items and retain their role in assessing and judging students.

Conclusion

Items constructed using AI had good psychometric properties and quality, measuring higher-order domains. AI allows the construction of many items within a short time. We hope this paper brings the use of AI in item generation and the associated challenges into a multi-layered discussion that will eventually lead to improvements in item generation and assessment in general.

The study limitations

The sample size was small (48 participants), and the number of items was small. The study context was limited to TBL. We did not test whether the AI tool would generate the same questions (items) on other occasions.

Recommendation

Future studies should involve larger groups of students and more items, and a follow-up study should assess the psychometric stability of the generated items. Future research should also use AI tools to generate items that include case scenarios and clinical reasoning. This study can serve as a base for studies of different AI tools and applications for generating or constructing different types of items and for analyzing their psychometric properties.


Data availability

The supporting data will be available upon request to the corresponding author.


Abbreviations

AQG:

Automated Question Generation

GPA:

Grade Point Average

AI:

Artificial Intelligence

MCQ:

Multiple Choice Questions

TBL:

Team-Based Learning

i-Rat:

Individual Readiness Test

g-Rat:

Group Readiness Test

KR20:

Kuder-Richardson 20

DIF:

Difficulty Index

DIS:

Discrimination Index

DE:

Distractor Efficiency

FD:

Functional Distractor

NFD:

Non-Functional Distractor

References

  1. Pugh D, De Champlain A, Gierl M, Lai H, Touchie C. Can automated item generation be used to develop high quality MCQs that assess application of knowledge? Res Pract Technol Enhanced Learn. 2020;15:1–13.


  2. Naidoo M. The pearls and pitfalls of setting high-quality multiple choice questions for clinical medicine. South Afr Family Practice: Official J South Afr Acad Family Practice/Primary Care. 2023;65(1):e1–e4.


  3. Scott KR, King AM, Estes MK, Conlon LW, Jones JS, Phillips AW. Evaluation of an intervention to improve quality of single-best answer multiple-choice questions. West J Emerg Med. 2019;20(1):11–4.


  4. Rahim MF, Bham SQ, Khan S, Ansari T, Ahmed M. Improving the quality of MCQs by enhancing cognitive level and using psychometric analysis: improving the quality of MCQs by enhancing cognitive level. Pakistan J Health Sci 2023:115–21.

  5. Licona-Chávez AL, Velázquez-Liaño LR. Quality assessment of a multiple choice test through psychometric properties. MedEdPublish. 2020;9(91):1–12.


  6. Ramzan M, Khan KW, Bibi S, Imran SS. Difficulty and discrimination analysis of end of term multiple-choice questions at Community Medicine Department, Wah Medical College. Pakistan Armed Forces Med J. 2021;71(4):1308–10.


  7. Thorat S, Gupta M, Wasnik M. Item analysis–utility for increase in MCQ validity in biochemistry for MBBS students. J Educ Technol Health Sci. 2019;6(3):88–9.


  8. Khare AS, Kadam AA, Verma A, Avachar K. Analysis of Difficulty Index of single best response type of multiple choice questions in Physiology by Post Validation. Lat Am J Pharm. 2023;42(3):50–5.


  9. Kumar AP, Nayak A, Chaitanya KMS, Ghosh K. A Novel Framework for the generation of multiple choice question stems using semantic and machine-learning techniques. Int J Artif Intell Educ 2023:1–44.

  10. Suryono W, Harianto BB. Item analysis of multiple choice questions (MCQs) for dangerous Goods courses in Air Transportation Management Department. Technium Soc Sci J. 2023;41:44.


  11. Uddin ME. Common item violations in multiple choice questions in Bangladeshi recruitment tests. Local Research and Glocal perspectives in English Language Teaching: teaching in changing Times. edn.: Springer; 2023. pp. 377–96.

  12. Matheny ME, Whicher D, Thadaney Israni S. Artificial Intelligence in Health Care: a Report from the National Academy of Medicine. JAMA. 2020;323(6):509–10.


  13. Hooda M, Rana C, Dahiya O, Rizwan A, Hossain MS. Artificial intelligence for assessment and feedback to enhance student success in higher education. Math Probl Eng. 2022;2022:1–19.


  14. Dhara S, Chatterjee S, Chaudhuri R, Goswami A, Ghosh SK. Artificial Intelligence in Assessment of Students’ Performance. Artificial Intelligence in Higher Education. edn.: CRC; 2023. pp. 153–67.

  15. Miao F, Holmes W, Huang R, Zhang H. AI and education: a guidance for policymakers. UNESCO Publishing; 2021.

  16. Zhai X, Chu X, Chai CS, Jong MSY, Istenic A, Spector M, Liu J-B, Yuan J, Li Y. A review of Artificial Intelligence (AI) in education from 2010 to 2020. Complexity. 2021;2021:1–18.


  17. González-Calatayud V, Prendes-Espinosa P, Roig-Vila R. Artificial Intelligence for Student Assessment: a systematic review. Appl Sci. 2021;11(12):5467–82.


  18. González-Calatayud V, Prendes-Espinosa P, Roig-Vila R. Artificial intelligence for student assessment: a systematic review. Appl Sci. 2021;11(12):1–15.


  19. Mirchi N, Bissonnette V, Yilmaz R, Ledwos N, Winkler-Schwartz A, Del Maestro RF. The virtual operative assistant: an explainable artificial intelligence tool for simulation-based training in surgery and medicine. PLoS ONE. 2020;15(2):e0229596.


  20. Turner L, Hashimoto DA, Vasisht S, Schaye V. Demystifying AI: current state and future role in Medical Education Assessment. Acad Med 2023:10–37.

  21. Jia X, Zhou W, Sun X, Wu Y. Eqg-race: Examination-type question generation. In: Proceedings of the AAAI conference on artificial intelligence: 2021; 2021: 13143–13151.

  22. Swiecki Z, Khosravi H, Chen G, Martinez-Maldonado R, Lodge JM, Milligan S, Selwyn N, Gašević D. Assessment in the age of artificial intelligence. Computers Education: Artif Intell. 2022;3:100075–85.


  23. Circi R, Hicks J, Sikali E. Automatic item generation: foundations and machine learning-based approaches for assessments. Front Educ. 2023;8:858273–8.


  24. Wu F, Lu C, Zhu M, Chen H, Zhu J, Yu K, Li L, Li M, Chen Q, Li X. Towards a new generation of artificial intelligence in China. Nat Mach Intell. 2020;2(6):312–6.


  25. Sounderajah V, Ashrafian H, Golub RM, Shetty S, De Fauw J, Hooft L, Moons K, Collins G, Moher D, Bossuyt PM. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ open. 2021;11(6):e047709.


  26. Choi J. Automatic item generation with machine learning techniques. In: Application of Artificial Intelligence to Assessment edn. Edited by Jiao H, Cissitz R. USA: Information Age Publishing; 2020: 189–210.

  27. Kurdi G, Leo J, Parsia B, Sattler U, Al-Emari S. A systematic review of automatic question generation for educational purposes. Int J Artif Intell Educ. 2020;30:121–204.


  28. Rezigalla AA. Observational study designs: Synopsis for selecting an appropriate study design. Cureus. 2020;12(1):e6692–6700.


  29. El-Ashkar A, Aboregela A, Alam-Eldin Y, Metwally A. Team-based learning as an inspiring tool for teaching parasitology in the integrated curricula. Parasitologists United J. 2023;16(1):64–72.


  30. Burgess A, Haq I, Bleasel J, Roberts C, Garsia R, Randal N, Mellis C. Team-based learning (TBL): a community of practice. BMC Med Educ. 2019;19(1):1–7.


  31. Burgess A, van Diggele C, Roberts C, Mellis C. Team-based learning: design, facilitation and participation. BMC Med Educ. 2020;20(2):1–7.


  32. Case SM, Swanson DB. Writing one-best-answer questions for the Basic and Clinical sciences. Constructing written test questions for the Basic and Clinical sciences. 3 ed. Philadelphia: National Board of Medical Examiners; 2016. pp. 31–66.


  33. Rezigalla AA. Angoff’s method: the impact of raters’ selection. Saudi J Med Med Sci. 2015;3(3):220–5.


  34. Joseph MN, Chang J, Buck SG, Auerbach MA, Wong AH, Beardsley TD, Reeves PM, Ray JM, Evans LV. A novel application of the modified Angoff method to rate case difficulty in simulation-based research. Simul Healthc. 2021;16(6):e142–50.


  35. Kumar D, Jaipurkar R, Shekhar A, Sikri G, Srinivas V. Item analysis of multiple choice questions: a quality assurance test for an assessment tool. Med J Armed Forces India. 2021;77:85–S89.


  36. Adiga MNS, Acharya S, Holla R. Item analysis of multiple-choice questions in pharmacology in an Indian Medical School. J Health Allied Sci NU. 2021;11(03):130–5.


  37. Rezigalla AA. Item analysis: Concept and application. In: Medical Education for the 21st Century edn. Edited by Firstenberg MS, Stawicki SP: IntechOpen; 2022: 105–120.

  38. Tavakol M, Dennick R. Making sense of Cronbach’s alpha. Int J Med Educ. 2011;2:53–5.


  39. Panayides P. Coefficient alpha: interpret with caution. Europe’s J Psychol. 2013;9(4):687–96.


  40. Reinhardt BM. Factors Affecting Coefficient Alpha: A Mini Monte Carlo Study. In: The annual meeting of the Southwest Educational Research Association, San Antonio: ERIC; 1991: 1–32.

  41. Bell BA. Pretest–Posttest Design. In: Encyclopedia of research design. Volume 2, edn. Edited by Salkind NJ. Thousand Oaks: SAGE Publications, Inc.; 2014: 1087–1092.

  42. Hassan S, Hod R. Use of item analysis to improve the quality of single best answer multiple choice question in summative assessment of undergraduate medical students in Malaysia. Educ Med J. 2017;9(3):33–43.


  43. Kaur M, Singla S, Mahajan R. Item analysis of in use multiple choice questions in pharmacology. Int J Appl Basic Med Res. 2016;6(3):70–173.


  44. Date AP, Borkar AS, Badwaik RT, Siddiqui RA, Shende TR, Dashputra AV. Item analysis as tool to validate multiple choice question bank in pharmacology. Int J Basic Clin Pharmacol. 2019;8(9):1999–2003.


  45. Elfaki OA, Bahamdan KA, Al-Humayed S. Evaluating the quality of multiple-choice questions used for final exams at the Department of Internal Medicine, College of Medicine, King Khalid University. Sudan Med Monit. 2015;10(4):123–7.


  46. Rezigalla AA. Item analysis: Concept and application. In: Medical Education for the 21st Century edn. Edited by Firstenberg MS, Stawicki SP. London: Intechopen; 2022: 1–16.

  47. Shenoy V, Ravi P, Chandy D. A cross-sectional study on Item Analysis of Prevalidated and nonvalidated anatomy multiple-choice questions. Natl J Clin Anat. 2023;12(2):94–7.


  48. D’Sa JL, Visbal-Dionaldo ML. Analysis of multiple choice questions: Item Difficulty, discrimination index and distractor efficiency. Int J Nurs Educ. 2017;9(3):109–14.


  49. Gajjar S, Sharma R, Kumar P, Rana M. Item and test analysis to identify quality multiple choice questions (MCQs) from an assessment of medical students of Ahmedabad, Gujarat. Indian J Community Medicine: Official Publication Indian Association Prev Social Med. 2014;39(1):17–20.


  50. Mitra N, Nagaraja H, Ponnudurai G, Judson J. The levels of difficulty and discrimination indices in type a multiple choice questions of pre-clinical semester 1 multidisciplinary summative tests. IeJSME. 2009;3(1):2–7.


  51. Kumar D, Jaipurkar R, Shekhar A, Sikri G, Srinivas V. Item analysis of multiple choice questions: a quality assurance test for an assessment tool. Med J Armed Forces India. 2021;1(77):85–S89.


  52. Hassan S, Hod R. Use of item analysis to improve the quality of single best answer multiple choice question in summative assessment of undergraduate medical students in Malaysia. Educ Med J 2017, 9(3).

  53. Kehoe J. Basic item analysis for multiple-choice tests. Practical Assess Res Evaluation. 1994;4(10):1–3.


  54. Escudero EB, Reyna NL, Morales MR. The level of difficulty and discrimination power of the Basic Knowledge and Skills Examination (EXHCOBA). In: Revista electrónica de investigación educativa vol. 2, 2000 edn; 2000: 1–16.

  55. Rao C, Kishan Prasad H, Sajitha K, Permi H, Shetty J. Item analysis of multiple choice questions: assessing an assessment tool in medical students. Int J Educational Psychol Researches. 2016;2(4):201–4.


  56. Sim S-M, Rasiah RI. Relationship between item difficulty and discrimination indices in true/false-type multiple choice questions of a para-clinical multidisciplinary paper. Annals-Academy Med Singap. 2006;35(2):67–81.


  57. Alareifi RM. Analysis of MCQs in summative exam in English: Difficulty Index, discrimination index and relationship between them. J Eduction Hum Sci. 2023;20:124–35.


  58. Soraya S, Shabani A, Kamalzadeh L, Kashaninasab F, Rashedi V, Saeidi M, Seddigh R, Asadi S. Predictability of discrimination coefficient and Difficulty Index of Psychiatry multiple-choice questions. J Iran Med Council. 2021;4(3):165–72.


  59. Hambleton RK, Jirka SJ. Anchor-based methods for judgmentally estimating item statistics. Handbook of test development. edn.: Routledge; 2011. pp. 413–34.

  60. Attali Y, Saldivia L, Jackson C, Schuppan F, Wanamaker W. Estimating item difficulty with comparative judgments. ETS Res Rep Ser. 2014;2014(2):1–8.


  61. Gierl MJ, Lai H, Pugh D, Touchie C, Boulais A-P, De Champlain A. Evaluating the psychometric characteristics of generated multiple-choice test items. Appl Measur Educ. 2016;29(3):196–210.



Acknowledgements

The authors acknowledge the students and experts who participated in this study. Great appreciation goes to Professor Masoud Ishaq (College of Medicine, University of Bisha, Saudi Arabia), Dr. Montaner Mohammed Alhassan (Faculty of Applied Medical Sciences, Jizan University, Saudi Arabia), and Dr. Ammar Ibrahim (College of Medicine, University of Bisha, Saudi Arabia) for their valuable feedback. Special thanks and appreciation to the Dean and Administration of the College of Medicine (University of Bisha, Saudi Arabia) for helping and allowing the use of facilities. The authors are thankful to the Deanship of Graduate Studies and Scientific Research at the University of Bisha for supporting this work through the Fast-Track Research Support Program.

Funding

No funds received.

Author information

Authors and Affiliations

Authors

Contributions

The author confirms sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation.

Corresponding author

Correspondence to Assad Ali Rezigalla.

Ethics declarations

Ethics approval and consent to participate

The study was approved by the Research and Ethics Committee of the College of Medicine, University of Bisha (KSA). The students were informed about the study, and written consent was obtained from the participating students as part of the i-Rat. The students were informed that the TBL was formative and that their participation in the study did not impact their grades or GPA. In addition, the students could attend TBL without including their responses in the study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Rezigalla, A.A. AI in medical education: uses of AI in construction type A MCQs. BMC Med Educ 24, 247 (2024). https://doi.org/10.1186/s12909-024-05250-3
