The script concordance test (SCT) was developed in 2000 by Charlin and collaborators, who aimed to assess clinical reasoning skills. It places examinees in a written yet authentic environment that resembles real-life clinical situations, and it uses an aggregate scoring method that is well suited to such ambiguous situations. Meterissian indicated that these situations can force a surgeon to deviate from the preoperative plan, and such decisions under pressure could negatively affect patient outcomes. Thus, this study aimed to address two gaps in the literature: the first goal was to conduct the first phase of a multistep validation study of the SCT in the context of plastic surgery, and the second was to provide an objective method for establishing and validating a question bank based on the SCT.
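To illustrate the aggregate scoring principle, the sketch below credits each answer in proportion to how strongly the reference panel endorsed it: the modal answer earns full credit, and every other endorsed option earns a fraction of it. This is a minimal illustration of the standard SCT scoring rule, not the scoring grid to be developed in the second phase; the function name and panel data are hypothetical.

```python
from collections import Counter

def aggregate_scores(panel_answers):
    """Credit for each Likert option, proportional to panel endorsement.

    The modal answer earns full credit (1.0); every other option earns
    the fraction of panelists who chose it relative to the modal count.
    """
    counts = Counter(panel_answers)
    modal_count = max(counts.values())
    return {option: n / modal_count for option, n in counts.items()}

# Example: a 15-member reference panel answering one SCT question
# on a -2..+2 Likert scale (illustrative data only).
panel = [-1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
credit = aggregate_scores(panel)
# An examinee choosing the modal answer (+1) receives 1.0; choosing
# +2 or 0 receives 3/8 = 0.375; an unchosen option (-2) scores 0.
print(credit)
```

Because credit reflects the distribution of expert opinion rather than a single keyed answer, this scheme rewards partial concordance with the panel, which is what makes it suitable for ambiguous clinical situations.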
The first phase of the multistep validation process is the construction of a question bank. It comprises four sub-steps: (1) developing a test blueprint; (2) writing test items; (3) validating the question bank draft through external reviewers; and (4) analyzing the validation survey results and generating a subset of the question bank to be used in the second phase of the SCT validation process, i.e., the establishment of a scoring grid.
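For orientation, a question-bank entry in the classic SCT format pairs a short clinical vignette with several hypothesis/new-information questions, each answered on a five-point Likert scale. The minimal data model below is an assumption for illustration; the field names and the sample item are not taken from the study's blueprint.

```python
from dataclasses import dataclass, field

# Anchors of the classic -2..+2 SCT Likert scale.
LIKERT = {-2: "ruled out / much less likely",
          -1: "less likely",
           0: "neither more nor less likely",
           1: "more likely",
           2: "confirmed / much more likely"}

@dataclass
class SCTQuestion:
    hypothesis: str        # "If you were thinking of ..."
    new_information: str   # "... and then you find ..."

@dataclass
class SCTItem:
    vignette: str               # short, authentic clinical scenario
    blueprint_domain: str       # maps the item to the test blueprint
    questions: list[SCTQuestion] = field(default_factory=list)

# Hypothetical usage:
item = SCTItem(
    vignette="A patient presents 6 hours after free-flap breast "
             "reconstruction with a congested-appearing flap.",
    blueprint_domain="breast reconstruction",
)
item.questions.append(SCTQuestion(
    hypothesis="venous thrombosis of the pedicle",
    new_information="the Doppler signal over the pedicle is lost",
))
```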
Fifty-two test items comprising 158 questions were written, representing the first draft of the question bank. Gagnon et al. found that a 25-item SCT with three questions per item achieved the highest reliability (Cronbach's alpha > 0.80) with minimal cognitive demand on examinees (a test time of one hour) and a minimal workload for the test writers. However, when constructing the question bank, one must keep in mind that a significant number of items will be discarded or rewritten during the question bank review and score grid establishment. Meterissian suggested an initial 100-question SCT to provide a margin for the item reduction process. Item reduction occurs at two levels: the first is based on reference panel comments, and the second follows an analysis of reference panel scores, where items with extreme variability should be discarded. The validation survey enabled us to select the best test items: 27 items comprising 83 questions met the predefined criteria. This leaves a good margin for further item reduction in the second phase (establishing the score grid) while maintaining high reliability (Cronbach's alpha > 0.75).
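The reliability figures above refer to Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores) for k items. A minimal sketch of the computation follows; the demo matrix is illustrative, not study data.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an examinees x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Example: 4 examinees x 3 items (illustrative numbers only)
demo = [[0.8, 1.0, 0.6],
        [0.4, 0.5, 0.3],
        [1.0, 0.9, 0.8],
        [0.2, 0.4, 0.1]]
print(round(cronbach_alpha(demo), 2))
```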
The question bank validation process is a crucial step in constructing the SCT. It ensures face validity (whether the questions test clinical reasoning skills) and content validity (whether the questions are relevant to and representative of the training program objectives). For content validation purposes, we developed five validation questions (Figure 2) examining five different domains: (i) relevance to the training program objectives; (ii) cultural sensitivity; (iii) structural quality of the test questions; (iv) written quality of the questions; and (v) plausibility of the provided options.
The analysis of the validation survey was approached from three angles: reviewers, validation questions, and test items. This was done to ensure that every element of the validation process was examined, because any one of them could threaten it. For instance, a reviewer who persistently under- or over-rates test items, or a poorly written validation question, could bias the validation process if not identified and controlled for.
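In practice, these three angles amount to summarizing the same rating matrix along each of its axes. The sketch below assumes the survey ratings are held in a long-format table; the column names and values are hypothetical, not the study's schema or data.

```python
import pandas as pd

# Long-format validation-survey ratings: one row per
# (reviewer, item, validation question) triple, rated on a
# 1-5 Likert scale.
ratings = pd.DataFrame({
    "reviewer": ["R1", "R1", "R2", "R2", "R3", "R3"],
    "item":     ["I1", "I2", "I1", "I2", "I1", "I2"],
    "vq":       ["VQ1", "VQ1", "VQ1", "VQ1", "VQ1", "VQ1"],
    "rating":   [5, 4, 4, 2, 5, 1],
})

# The same matrix summarized along each of the three axes:
by_reviewer = ratings.groupby("reviewer")["rating"].agg(["mean", "std"])
by_vq       = ratings.groupby("vq")["rating"].agg(["mean", "std"])
by_item     = ratings.groupby("item")["rating"].agg(["mean", "std"])
```

A persistently deviant reviewer surfaces in `by_reviewer`, a poorly written validation question in `by_vq`, and a problematic item in `by_item`, which is why all three views are needed.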
The analysis of the reviewers' ratings aimed to assess agreement between reviewers. Correlations between each reviewer and the pool of the remaining reviewers were poor. Surprisingly, even correlations between paired reviewers were poor. Such poor correlations could be attributed to any of the following factors: (a) the small sample size (5 reviewers); (b) poorly written validation questions; or (c) the heterogeneity of the reviewers, i.e., their different cultural and subspecialty backgrounds. However, the validation question analysis did not reveal a consistently poor VQ; although VQ4 demonstrated high variability, further in-depth analysis (Table 4) provided an explanation for this finding. The small sample size could have had a negative effect, but we cannot rule out the heterogeneity of the reviewers as the cause of the poor correlations. We therefore gave equal weight to all reviewers' ratings and generated a subset of test items based on them. One strategy to address such poor correlation is the development of inclusion/exclusion criteria that select the best-rated items for inclusion in the second phase of the validation process (establishing the scoring grid).
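The reviewer-versus-pool comparison can be sketched as follows. The study does not specify the correlation statistic, so Pearson's r is used here for illustration (Spearman's rho would be a reasonable alternative for ordinal Likert data), and the demo matrix is hypothetical.

```python
import numpy as np

def reviewer_vs_rest(ratings):
    """Pearson r of each reviewer against the mean of the others.

    `ratings` is a reviewers x questions matrix of Likert scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    out = {}
    for i in range(ratings.shape[0]):
        rest = np.delete(ratings, i, axis=0).mean(axis=0)
        out[f"reviewer_{i + 1}"] = np.corrcoef(ratings[i], rest)[0, 1]
    return out

# Example: 5 reviewers x 6 questions (illustrative scores only)
demo = [[5, 4, 3, 5, 2, 4],
        [4, 4, 2, 5, 3, 3],
        [2, 5, 4, 1, 5, 2],
        [5, 3, 3, 4, 2, 5],
        [3, 3, 5, 2, 4, 1]]
print(reviewer_vs_rest(demo))
```

Uniformly low values across reviewers point to a panel-level problem (sample size or heterogeneity) rather than a single outlying rater, which is consistent with the decision to weight all reviewers equally.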
This study has a few limitations. In addition to the poor correlations between reviewers discussed above, certain test items exhibited a high level of disagreement on certain validation questions. For instance, one test item received one “strongly disagree” rating and four “strongly agree” ratings. These items were eventually excluded from the final draft of the question bank because they did not meet the inclusion criteria, but such inexplicable disagreement between reviewers is striking. Another limitation is the limited access to the reviewers, who are from different institutions and countries; ideally, a poorly rated test item would be rewritten and resubmitted to the reviewers for revalidation. Further evidence of construct validity, criterion validity, and reliability will be collected in future studies as the question bank enters the application phase. Finally, although the overall validation process may seem complicated, it provides test writers with a validated, objective methodology for constructing SCT-based question banks.