Constructing a question bank based on script concordance approach as a novel assessment methodology in surgical education
© Aldekhayel et al.; licensee BioMed Central Ltd. 2012
Received: 26 March 2012
Accepted: 14 October 2012
Published: 24 October 2012
The Script Concordance Test (SCT) is a new assessment tool that reliably assesses clinical reasoning skills. Previous descriptions of developing SCT question banks were merely subjective. This study addresses two gaps in the literature: 1) conducting the first phase of a multistep validation process of the SCT in Plastic Surgery, and 2) providing an objective methodology to construct a question bank based on the SCT.
After developing a test blueprint, 52 test items were written. Five validation questions were developed and a validation survey was established online. Seven reviewers were asked to answer this survey. They were recruited from two countries, Saudi Arabia and Canada, to improve the test’s external validity. Their ratings were transformed into percentages. Analysis was performed to compare reviewers’ ratings by looking at correlations, ranges, means, medians, and overall scores.
Scores of reviewers’ ratings were between 76% and 95% (mean 86% ± 5). We found poor correlations between reviewers (Pearson’s: +0.38 to −0.22). The range of ratings for individual validation questions varied between 0 and 4 (ratings were given on a scale of 1–5). Means and medians of these ranges were computed for each test item (mean: 0.8 to 2.4; median: 1 to 3). A subset of 27 test items was generated based on a set of inclusion and exclusion criteria.
This study proposes an objective methodology for validating an SCT question bank. The validation survey is analyzed from all angles, i.e., reviewers, validation questions, and test items. Finally, a subset of test items is generated based on a set of criteria.
Keywords: Plastic surgery; Script concordance approach; Question bank; Surgical education
Research concerning the assessment of clinical reasoning skills has been extensive in the last few decades. Kreiter et al. suggest three potentially measurable aspects related to clinical reasoning: (1) to assess whether important information was collected and retained by the physician; (2) to assess diagnosis and management outcomes resulting from the integration of new clinical information with preexisting knowledge structures; and (3) to assess the development of those preexisting knowledge structures. According to Kreiter, the script concordance test (SCT), which was described originally by Charlin and collaborators in 2000, is one method that reliably assesses those aspects of clinical reasoning. It has emerged from two theories of clinical reasoning: hypothetico-deductive and illness script theories [4, 5]. The hypothetico-deductive theory implies that when physicians encounter a problem in a real-life setting (a diagnostic, investigative, or therapeutic problem), they generate multiple preliminary hypotheses and then test each one to confirm or eliminate these hypotheses until a final decision is reached [6, 7]. The illness script theory provides one way of explaining this concept. It indicates that knowledge is organized in networks and that when a new situation is faced, one would activate prior networks to make sense of this new situation [6, 8, 9]. Schmidt et al. elaborate that these scripts emerge from expertise and hence are refined with experience as each new encounter is compiled into relevant mental networks.
The script concordance test (SCT) was designed to probe whether the organization of knowledge networks allows for competent decision-making processes. It places the examinees in a written and authentic environment that resembles their real-life situations. It is based on the following principles [3, 11–14]: (1) tasks should be challenging even for experts but still appropriate for the examinees’ level; (2) items should reflect authentic clinical situations and be presented in a vignette format; (3) each item is composed of a clinical scenario and followed by 3–5 questions related to diagnostic, investigative, or management problems; (4) judgments are measured on a 5-point Likert scale for each question; and (5) test scoring is based on an aggregate scoring method.
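The aggregate scoring principle can be illustrated with a small sketch. The text above does not spell out the scoring formula, so the code assumes the convention commonly described for SCTs: each answer earns the number of panel members who chose it divided by the number who chose the modal answer. The function names, the -2 to +2 coding of the 5-point scale, and the panel data are hypothetical.

```python
from collections import Counter


def aggregate_credits(panel_answers):
    """Map each chosen option to partial credit: votes / votes for the modal option."""
    counts = Counter(panel_answers)
    modal_votes = max(counts.values())
    return {option: votes / modal_votes for option, votes in counts.items()}


def score_examinee(examinee_answers, panel_by_question):
    """Sum the credit earned on each question; options no panelist chose earn 0."""
    total = 0.0
    for answer, panel in zip(examinee_answers, panel_by_question):
        total += aggregate_credits(panel).get(answer, 0.0)
    return total


# Hypothetical panel of 10 experts answering one question (assumed -2..+2 coding).
panel = [1, 1, 1, 1, 1, 1, 0, 0, 0, 2]
credits = aggregate_credits(panel)

# An examinee who picks the modal answer on one question and a minority
# answer on another identical question earns full plus partial credit.
example_total = score_examinee([1, 0], [panel, panel])
```

Under this convention an examinee is not penalized for agreeing with a minority of experts, which is why the method is considered suitable for situations without a single correct answer.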
Over the last decade, extensive research has been conducted that confirms the validity and reliability of the SCT in various medical disciplines. However, to the best of our knowledge, the validity of the SCT has not yet been examined in plastic surgery, which is known for its controversies and uncertainty; therefore, clinical reasoning is a fundamental cornerstone in the assessment of plastic surgery residents.
Downing discussed five sources of validity evidence based on the Standards for Educational and Psychological Testing: (1) content; (2) response process; (3) internal structure; (4) relationship to other variables; and (5) consequences. The current study aims to assess the content source of validity for two reasons: (i) not all sources of validity evidence are required in all assessments; and (ii) at this phase of question bank construction, we do not have any sources of evidence other than the content validity. Other sources of validity evidence (e.g., internal structure and response process) can be assessed after applying this test to plastic surgery residents in the third phase.
All previous studies [3, 12, 13, 17–20] that examined the validity of the SCT have provided a brief description of question bank construction and a merely subjective method of validating it. Therefore, the present study aims to propose a novel objective methodology for the construction of a question bank in plastic surgery based on the script concordance approach, which will help in standardizing the test writing process of SCT across various disciplines. The construction of the SCT comprises three successive phases: (1) the construction and validation of a question bank; (2) the establishment of a scoring grid; and (3) the application of the test to examinees. This study represents the first phase: question bank construction. Subsequent phases will be conducted in future studies.
[Table 1: test blueprint giving the number of items per domain (e.g., Pediatric Plastic Surgery); table body not recovered]
The first step in writing the test items was to invite two academic plastic surgeons at King Saud University and King Saud bin Abdulaziz University for Health Sciences, Riyadh, to develop a pool of real-life clinical scenarios for use in the SCT. They were asked to: (i) describe authentic clinical situations that contain an element of uncertainty; and (ii) specify for each situation: a) the relevant hypotheses, investigation strategies, or management options; b) the questions they ask when taking a patient history, the signs they look for during the physical examination, and the tests they order to solve the problem; and c) the clinical information, whether positive or negative, they would look for in these queries. Multiple drafts were generated and revised until the test writers reached consensus on the final draft.
A validation survey (Figure 2) was established on the online survey software, SurveyMonkey™, to validate the question bank draft and the test blueprint. The survey was sent to seven academic plastic surgeons in Riyadh, Saudi Arabia, and in Toronto and Montreal, Canada, who met the following inclusion criteria: (1) being an academic, certified plastic surgeon involved in teaching plastic surgery residents; and (2) having a minimum of 10 years of experience in practice. The reviewers were selected by convenience sampling. Ethical approval was obtained from the Institutional Review Board at King Abdullah International Medical Research Center (KAIMRC), Riyadh. All reviewers provided informed consent before answering the online survey. The survey began with a Likert-type question asking whether the test blueprint is representative of the educational objectives of Plastic Surgery residency training programs. Then, each test item (a clinical scenario followed by 3 to 5 questions) was presented in the survey, followed by the five validation questions described previously (Figure 2).
Analysis of the reviewers’ ratings:
Analysis of the validation questions:
Analysis of the test items:
For ranking purposes, the overall scores of the test items were divided into percentiles: 75th, 50th, and 25th. An item reduction process was then carried out to reduce the number of test items from 52 to a minimum of 20, the number required for the SCT to achieve high reliability (Cronbach’s alpha > 0.75). This subset of test items was generated using a set of inclusion and exclusion criteria. The criteria were set arbitrarily and then checked with a sensitivity analysis: one criterion was changed at a time and the resulting output was inspected until the selected items were those with the highest ratings. This reduces the margin of error introduced by setting the criteria arbitrarily. The criteria are:
Inclusion criteria:
□ All items above the 50th percentile (total score ≥ 86%);
□ All items with a mean of the range ≤ 2; and
□ All items with a median of the range ≤ 2.
Exclusion criterion:
□ Any item with a range of 4 on any validation question.
These criteria were applied to each domain of the test blueprint separately so as not to disturb the structure of the test. The generated subset of test items will serve as the final draft of the question bank.
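The item reduction criteria listed above can be sketched in code. This is an illustration under stated assumptions, not the authors' SPSS procedure: it assumes that an item's overall score is the sum of all reviewers' 1–5 ratings expressed as a percentage of the maximum possible, and that the "range" of a validation question is the highest minus the lowest rating it received. All function names and the sample ratings are hypothetical.

```python
from statistics import mean, median


def overall_percent(ratings_by_vq, scale_max=5):
    """Assumed overall score: sum of all ratings as a percentage of the maximum."""
    total = sum(sum(vq) for vq in ratings_by_vq)
    count = sum(len(vq) for vq in ratings_by_vq)
    return 100.0 * total / (scale_max * count)


def rating_ranges(ratings_by_vq):
    """Range (max - min) of the reviewers' ratings for each validation question."""
    return [max(vq) - min(vq) for vq in ratings_by_vq]


def is_eligible(ratings_by_vq, cutoff_pct=86.0):
    """Apply the three inclusion criteria, then the exclusion criterion."""
    ranges = rating_ranges(ratings_by_vq)
    included = (
        overall_percent(ratings_by_vq) >= cutoff_pct
        and mean(ranges) <= 2
        and median(ranges) <= 2
    )
    excluded = 4 in ranges  # a range of 4 on any validation question
    return included and not excluded


# Hypothetical items: five validation questions, each rated by five reviewers.
items = {
    "consistent_high": [[5, 5, 4, 5, 5], [5, 4, 5, 5, 4], [4, 4, 5, 4, 4],
                        [4, 5, 4, 4, 5], [5, 5, 4, 4, 5]],
    "one_outlier": [[5, 5, 5, 5, 1], [5, 5, 5, 5, 5], [5, 5, 5, 5, 5],
                    [5, 5, 5, 5, 5], [5, 5, 5, 5, 5]],
    "consistent_low": [[4, 4, 4, 4, 4], [4, 4, 4, 4, 4], [4, 4, 4, 4, 4],
                       [4, 4, 4, 4, 4], [4, 4, 4, 4, 4]],
}
selected = [name for name, ratings in items.items() if is_eligible(ratings)]
```

The second item shows why the exclusion criterion is needed: a single extreme rating barely moves the mean or median of the ranges, so only the range-of-4 rule removes it. To respect the blueprint structure, the same filter would be run on each domain's items separately.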
Statistical analysis was performed using SPSS version 18 (IBM; Chicago).
Five out of seven reviewers answered the validation survey completely (response rate 71%): two Saudis and three Canadians. They represented four different academic institutions in Riyadh, Toronto, and Montreal. Regarding the test blueprint (Table 1), three reviewers (60%) were in relative agreement that it was reasonably representative of the major instructional objectives of the plastic surgery residency program, one reviewer (20%) was uncertain, and one (20%) relatively disagreed, suggesting that more burn and reconstruction items must be added. Other comments suggested adding skin pathology as a separate entity in the blueprint, although there were few questions on this subject in the reconstruction domain.
Analysis of reviewers’ ratings:
The item scores given by the first reviewer ranged between 40% and 80% (mean 70% ± 10), for the second reviewer between 55% and 100% (mean 96% ± 8.4), for the third reviewer between 50% and 100% (mean 76.8% ± 16.7), for the fourth reviewer between 70% and 100% (mean 94% ± 8), and for the fifth reviewer between 60% and 100% (mean 93% ± 9.5).
Analysis of validation questions:
Pearson correlation coefficients of each reviewer against the average of the remaining reviewers for each validation question (VQ1-VQ5) and for the overall score
The scores for the first validation question (VQ1) ranged between 80% and 100% (mean 91% ± 5.6), for VQ2 between 60% and 100% (mean 91% ± 6.7), for VQ3 between 60% and 95% (mean 82% ± 7), for VQ4 between 50% and 95% (mean 81% ± 9.7), and for VQ5 between 60% and 95% (mean 85% ± 6.7).
Analysis of the test items:
Ranges of validation questions (VQ) ratings [table body not recovered; column header: No. of items]
Ranges of the 4th validation question (written quality of test items) for the Saudi and Canadian groups [table body not recovered; column header: No. of items]
The overall scores of the test items ranged between 76% and 95% (mean 86% ± 5). These scores were then divided into percentiles: 75th at 90%, 50th at 86%, and 25th at 82%.
The process of subset generation using the inclusion/exclusion criteria yielded 27 eligible items, which are considered to comprise the final draft of the question bank.
The script concordance test was developed in 2000 by Charlin and collaborators, who aimed to assess clinical reasoning skills. It places the examinees in a written and authentic environment that resembles their real-life situations. It utilizes an aggregate scoring method that is most suitable for such ambiguous situations. Meterissian indicated that these situations can force a surgeon to deviate from the preoperative plan, and such decisions under pressure could negatively affect patients’ outcomes. Thus, the objective of this study was to address two gaps in the literature: the first goal was to conduct the first phase of a multistep validation study of the SCT in the context of plastic surgery, and the second was to provide an objective method to establish and validate a question bank based on the SCT.
The first phase in the multistep validation process is the construction of a question bank. It comprises four sub-steps: (1) developing a test blueprint; (2) writing test items; (3) validating the question bank draft by external reviewers; and (4) analyzing the validation survey results and generating a subset of the question bank that will be used in the second phase of the SCT validation process, i.e., the establishment of a scoring grid.
Fifty-two test items composed of 158 questions were written, representing the first draft of the question bank. Gagnon et al. found that a 25-item SCT with 3 questions per item achieved the highest reliability (Cronbach’s alpha > 0.80) with the minimum cognitive demand on examinees (test time of one hour) and a minimal workload for the test writers. However, when constructing the question bank, one must keep in mind that a significant number of items will be discarded or rewritten during the question bank reviewing process and score grid establishment. Meterissian suggested an initial 100-question SCT to provide a margin for the item reduction process. Item reduction occurs at two levels: the first is based on reference panel comments, and the second occurs following an analysis of reference panel scores, where items with extreme variability should be discarded. The validation survey enabled us to select the best test items: 27 items composed of 83 questions met the set criteria. Moreover, a good margin was obtained for further reduction of the number of items in the second phase (establishing the score grid) while maintaining high reliability (Cronbach’s alpha > 0.75).
The question bank validation process is a crucial step in constructing the SCT. It ensures face validity (whether the questions test clinical reasoning skills) and content validity (whether the questions are relevant and representative of the training program objectives). For content validation purposes, we developed five validation questions (Figure 2) examining five different domains: (i) relevance to the training program objectives; (ii) cultural sensitivity; (iii) structural quality of test questions; (iv) written quality of questions; and (v) plausibility of provided options.
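Because the reliability targets above are stated in terms of Cronbach’s alpha, a short sketch of the standard formula may help: alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the sample scores are hypothetical, and no examinee data exist at this phase of the study; reliability would only be computed once the test is administered.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: one row per examinee, one column per item.

    Uses the sample variance throughout. Raises ZeroDivisionError if every
    examinee has the same total score (alpha is undefined in that case).
    """
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three examinees whose item scores rise and fall together.
perfectly_consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha_consistent = cronbach_alpha(perfectly_consistent)

# Hypothetical data with less agreement between items.
mixed = [[1, 3, 2], [2, 1, 3], [3, 2, 3], [3, 3, 1]]
alpha_mixed = cronbach_alpha(mixed)
```

Alpha approaches 1 when items vary together across examinees and drops, possibly below zero, when they do not, which is why adding well-functioning items tends to raise it.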
The analysis of the validation survey was approached from three angles: reviewers, validation questions, and test items. This was performed to determine whether all elements of the validation process had been examined because any element could be a threat to this process. For instance, one might consider that a reviewer who persistently under- or over-rates test items, or even a poorly written validation question, could affect the validation process if that situation is not taken into consideration and controlled.
The analysis of reviewers’ ratings aimed to assess agreement between reviewers. Correlations of each reviewer against the pool of the remaining reviewers were poor. Surprisingly, even correlations between paired reviewers were poor. Such poor correlations could be attributed to any of the following: (a) small sample size (5 reviewers); (b) poorly written validation questions; or (c) heterogeneity of the reviewers, i.e., different cultural and subspecialty backgrounds. However, the validation question analysis did not show a consistently poor VQ; although VQ4 demonstrated a high variability, further in-depth analysis (Table 4) provided an explanation for this finding. It is important to note that the sample size could have had a negative effect; however, we cannot ignore the possibility that the heterogeneity of the reviewers might have been the cause of such poor correlations. We decided to give equal weight to all reviewers’ ratings, and we generated a subset of test items based on them. One strategy to address such poor correlation is the development of inclusion/exclusion criteria that aim to select the best rated items to be included in the second phase of the validation process (establishing the scoring grid).
This study has a few limitations. In addition to the previously discussed poor correlations between reviewers, certain test items exhibited a high level of disagreement for certain validation questions. For instance, one test item received one “strongly disagree” rating and four “strongly agree” ratings. These items were eventually excluded from the final draft of the question bank because they did not meet the inclusion criteria, but such unexplained disagreement between the reviewers is striking. Another limitation of the study is the limited access to the reviewers, who are from different institutions and countries. Ideally, a poorly rated test item would be rewritten and resubmitted to the reviewers for revalidation. Other sources of evidence for the construct and criterion validity and reliability will be collected in future studies as the question bank undergoes the application phase. Finally, although the overall validation process could seem complicated, it provides test writers with a validated and objective methodology for constructing SCT-based question banks.
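The comparison of each reviewer against the pool of the remaining reviewers can be sketched as follows. This is a minimal illustration, assuming that each reviewer has one percentage score per test item and that the "pool" is the item-wise mean of the other reviewers; the function names and the scores are hypothetical, and the original analysis was run in SPSS.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def reviewer_vs_rest(scores):
    """Correlate each reviewer's item scores with the mean of the other reviewers.

    scores[r][i] is the score reviewer r gave to item i.
    """
    correlations = []
    for r, own_scores in enumerate(scores):
        others = [row for o, row in enumerate(scores) if o != r]
        rest_mean = [sum(column) / len(column) for column in zip(*others)]
        correlations.append(pearson(own_scores, rest_mean))
    return correlations


# Hypothetical scores: three reviewers rating four items.
scores = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 2, 2],
]
correlations = reviewer_vs_rest(scores)
```

A reviewer whose ratings run against the trend of the others, like the third row here, shows up with a negative coefficient, which is the pattern the analysis is designed to expose.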
This study represents the first phase of SCT validation in the context of plastic surgery: the construction of a question bank. It proposes an objective methodology for validation of the question bank. Basically, after experts develop the test blueprint and write test items, a validation survey should be established and then sent to external reviewers. Analysis of the validation survey should be conducted from all possible angles, e.g., reviewers’ correlations, validation questions, and test items. Finally, a subset of test items should be generated based on a set of inclusion and exclusion criteria. Further studies will be conducted to complete the remaining phases of the SCT validation (establishing a score grid and application to plastic surgery residents).
The study has not received financial grants from any institution.
The study is a Master’s thesis done in partial fulfillment of the Masters of Medical Education program at College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia.
We express our sincere gratitude to all reviewers in Saudi Arabia and Canada for their significant contribution to this study by completing the validation survey. Ms. Shahla Althukair kindly assisted with the statistical analysis. Ms. Hala Alsaleem, Ms. Rahaf Abu Nameh, and other secretaries of plastic surgeons in Canada are appreciated for their administrative support.
- Norman G: Research in clinical reasoning: past history and current trends. Med Educ. 2005, 39 (4): 418-427. 10.1111/j.1365-2929.2005.02127.x.
- Kreiter CD, Bergus G: The validity of performance-based measures of clinical reasoning and alternative approaches. Med Educ. 2009, 43 (4): 320-325. 10.1111/j.1365-2923.2008.03281.x.
- Charlin B, Roy L, Brailovsky C, Goulet F, van der Vleuten C: The Script Concordance test: a tool to assess the reflective clinician. Teach Learn Med. 2000, 12 (4): 189-195. 10.1207/S15328015TLM1204_5.
- Charlin B, Brailovsky C, Leduc C, Blouin D: The diagnosis script questionnaire: a new tool to assess a specific dimension of clinical competence. Adv Health Sci Educ Theory Pract. 1998, 3 (1): 51-58. 10.1023/A:1009741430850.
- Gagnon R, Charlin B, Roy L, St-Martin M, Sauve E, Boshuizen HP, van der Vleuten C: The cognitive validity of the script concordance test: a processing time study. Teach Learn Med. 2006, 18 (1): 22-27. 10.1207/s15328015tlm1801_6.
- Charlin B, Tardif J, Boshuizen HP: Scripts and medical diagnostic knowledge: theory and applications for clinical reasoning instruction and research. Acad Med. 2000, 75 (2): 182-190. 10.1097/00001888-200002000-00020.
- Williams RG, Klamen DL, Hoffman RM: Medical student acquisition of clinical working knowledge. Teach Learn Med. 2008, 20 (1): 5-10. 10.1080/10401330701542552.
- Collard A, Gelaes S, Vanbelle S, Bredart S, Defraigne JO, Boniver J, Bourguignon JP: Reasoning versus knowledge retention and ascertainment throughout a problem-based learning curriculum. Med Educ. 2009, 43 (9): 854-865. 10.1111/j.1365-2923.2009.03410.x.
- Charlin B, Boshuizen HP, Custers EJ, Feltovich PJ: Scripts and clinical reasoning. Med Educ. 2007, 41 (12): 1178-1184. 10.1111/j.1365-2923.2007.02924.x.
- Schmidt HG, Norman GR, Boshuizen HP: A cognitive perspective on medical expertise: theory and implication. Acad Med. 1990, 65 (10): 611-621. 10.1097/00001888-199010000-00001.
- Charlin B, van der Vleuten C: Standardized assessment of reasoning in contexts of uncertainty: the script concordance approach. Eval Health Prof. 2004, 27 (3): 304-319. 10.1177/0163278704267043.
- Fournier JP, Demeester A, Charlin B: Script concordance tests: guidelines for construction. BMC Med Inform Decis Mak. 2008, 8: 18. 10.1186/1472-6947-8-18.
- Carriere B, Gagnon R, Charlin B, Downing S, Bordage G: Assessing clinical reasoning in pediatric emergency medicine: validity evidence for a Script Concordance Test. Ann Emerg Med. 2009, 53 (5): 647-652. 10.1016/j.annemergmed.2008.07.024.
- Lambert C, Gagnon R, Nguyen D, Charlin B: The script concordance test in radiation oncology: validation study of a new tool to assess clinical reasoning. Radiat Oncol. 2009, 4: 7. 10.1186/1748-717X-4-7.
- Downing SM: Validity: on meaningful interpretation of assessment data. Med Educ. 2003, 37 (9): 830-837. 10.1046/j.1365-2923.2003.01594.x.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing (U.S.): Standards for educational and psychological testing. 1999, Washington, DC: American Educational Research Association.
- Meterissian SH: A novel method of assessing clinical reasoning in surgical residents. Surg Innov. 2006, 13 (2): 115-119. 10.1177/1553350606291042.
- Sibert L, Darmoni SJ, Dahamna B, Hellot MF, Weber J, Charlin B: On line clinical reasoning assessment with Script Concordance test in urology: results of a French pilot study. BMC Med Educ. 2006, 6: 45. 10.1186/1472-6920-6-45.
- Cohen LJ, Fitzgerald SG, Lane S, Boninger ML: Development of the seating and mobility script concordance test for spinal cord injury: obtaining content validity evidence. Assist Technol. 2005, 17 (2): 122-132. 10.1080/10400435.2005.10132102.
- Meterissian S, Zabolotny B, Gagnon R, Charlin B: Is the script concordance test a valid instrument for assessment of intraoperative decision-making skills? Am J Surg. 2007, 193 (2): 248-251. 10.1016/j.amjsurg.2006.10.012.
- Charlin B, Desaulniers M, Gagnon R, Blouin D, van der Vleuten C: Comparison of an aggregate scoring method with a consensus scoring method in a measure of clinical reasoning capacity. Teach Learn Med. 2002, 14 (3): 150-156. 10.1207/S15328015TLM1403_3.
- Gagnon R, Charlin B, Lambert C, Carriere B, Van der Vleuten C: Script concordance testing: more cases or more questions? Adv Health Sci Educ Theory Pract. 2009, 14 (3): 367-375. 10.1007/s10459-008-9120-8.
- Charlin B, Gagnon R, Pelletier J, Coletti M, Abi-Rizk G, Nasr C, Sauve E, van der Vleuten C: Assessment of clinical reasoning in the context of uncertainty: the effect of variability within the reference panel. Med Educ. 2006, 40 (9): 848-854. 10.1111/j.1365-2929.2006.02541.x.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6920/12/100/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.