Fixed or mixed: a comparison of three, four and mixed-option multiple-choice tests in a Fetal Surveillance Education Program

Background: Despite the widespread use of multiple-choice assessments in medical education, current practice and published advice concerning the number of response options remain equivocal. This article describes an empirical study contrasting the quality of three 60-item multiple-choice test forms within the Royal Australian and New Zealand College of Obstetricians and Gynaecologists (RANZCOG) Fetal Surveillance Education Program (FSEP). The three forms are described below.
Methods: The first form featured four response options per item. The second form featured three response options, having removed the least functioning option from each item in the four-option counterpart. The third test form was constructed by retaining the best performing version of each item from the first two test forms. It contained both three- and four-option items.
Results: Psychometric and educational factors were taken into account in formulating an approach to test construction for the FSEP. The four-option test performed better than the three-option test overall, but some items were improved by the removal of options. The mixed-option test demonstrated better measurement properties than the fixed-option tests, and has become the preferred test format in the FSEP. The criteria used were reliability, errors of measurement and fit to the item response model.
Conclusions: The position taken is that decisions about the number of response options be made at the item level, with plausible options being added to complete each item on both psychometric and educational grounds rather than complying with a uniform policy. The point is to construct the better performing item in providing the best psychometric and educational information.


Background
Several studies provide advice on the matter of the optimal number of multiple-choice question response options [1][2][3][4]. There are also several medical education assessment programs with long-standing policies and practices of their own [5,6]. What becomes apparent is that recommendations are generally conditional upon assumptions and contextual factors in each assessment setting. What follows is a review of a number of prominent approaches regarding the number of MCQ response options.
The first approach to specifying multiple-choice option numbers might be termed traditional, where four or five options are used by convention [3]. Tarrant, Ware and Mohammed [3] explained that in many organisations the number of response options is uniformly fixed across all questions, and that this policy has little if any psychometric grounding. This is prominent in medical education assessments despite studies advocating the benefits of using fewer or varied option numbers [5,7]. Common four- and five-option approaches have certain drawbacks, especially where plausible alternatives become difficult to construct.
The next approach is to take the emergent majority recommendation, which might be termed the meta-analytical convention, based on empirical and theoretical studies. The consensus here is that three options is optimal [4,8,9]. This recommendation is often based on assumptions that the time taken to respond to three-option items is proportionally less than the time taken to respond to four-option items, as determined by the number of options alone [8]. Therefore some of the advocates of a three-option approach base their standpoint in part on the benefits of being able to construct and administer a larger number of items per unit time, thereby increasing content coverage and potentially increasing test reliability [9]. However, this assumption has been refuted on several occasions with the recognition that several other features of an item will influence response time [6,10,11]. Another argument in favour of a three-option policy relates to plausibility of options. Several studies have concluded that four- or five-option items rarely contain a full set of plausible alternatives [3]. Options attracting fewer than 5% of total respondents (or distractors that discriminate positively) are often classified as 'non-functioning' and this has often led to the recommendation for their removal [12,13].
When the number of items is fixed and the number of options is manipulated, mixed results and recommendations are reported; while some studies have identified a small or negligible change in item difficulty or discrimination for different option numbers [7,14-17], others have found decreases in difficulty and discrimination for smaller option numbers [3]. The results depend on the quality of the removed response options. From a theoretical perspective, the addition of distractor options which discriminate appropriately (negatively) should improve the overall item discrimination [18]. There may also be an increase in the difficulty of the item, because the additional option possibly increases cognitive load and the proportion of test takers guessing the correct answer tends also to decrease with increasing option numbers.
Psychometric indices were not the only determinant of item quality. Educational value was also considered. Uniform response option policies (whether three, four or more) are at odds with the advice from Frary [19] and Swanson [20], who both argue that some questions invite particular sets of alternatives on curriculum grounds. By this rationale, educationally or clinically related alternatives are included and irrelevant alternatives are not. For example, according to the RANZCOG Intrapartum Fetal Surveillance Guidelines [21] there are four broad categories of fetal heart rate deceleration. Any item assessing the ability of practitioners to distinguish the type of deceleration pictured in an accompanying cardiotocograph (CTG) might therefore have the four categories of deceleration as options. As cited earlier, there is no psychometric reason that all items must have the same number of options. In most content areas, it could be argued that there is no educational reason either. Tarrant, Ware and Mohammed provided the following synopsis.
So while in most circumstances, three options would be sufficient, item writers should write as many good distractors as is feasible given the content area being assessed. Additionally, when reviewing item performance on previous tests, test developers and item writers should not eliminate options that perform adequately simply to conform to a pre-set number of options [3].
This view, combined with the practice of including educationally important alternatives, might be termed an item-level approach to determining the number of response options.
For some items FSEP subject-matter experts reported difficulty in producing a plausible fourth option. It was reported to result in additional time being spent in item-writing workshops for what were typically minimal gains in psychometric quality. Further, the risk of introducing problematic options is arguably increased when subject-matter experts are required to add options which they would otherwise omit.
From the literature some basic conclusions emerge. Where the number of items is not fixed, maximising the number of items with three response options usually emerges as the best approach. This is conditional on gains such as time savings and improved content sampling. If the number of items is fixed or testing times are more invariant than the proportionality assumption predicts, the addition of plausible options can marginally improve the quality of the test, depending upon item content and option quality. Beyond psychometric considerations, the flexibility to add educationally or clinically important alternatives, in spite of low selection frequencies, provides an opportunity to bolster arguments in support of content validity.
We undertook the present study with the assumption that the number of questions will be held reasonably constant in a final version of the assessment system. While we did not have accurate information about response times, and acknowledge that such information would be useful, we did not see any potential reduction in the number of response options as an opportunity to increase the number of test items in the FSEP. Further, in the domain of fetal surveillance knowledge it has been reported that 25 to 50 questions might provide an adequate sampling of content [22]. On this basis, we believe that the FSEP 60 item test forms provide adequate scope for sampling critical content. The addition of extra items would have more influence on increasing reliability and decreasing measurement error, rather than addressing content shortcomings. As discussed in subsequent sections, decreasing measurement error remains an important objective in the FSEP context.

Method
Three versions of a 60-item FSEP multiple-choice assessment were compiled. The first version contained four options per item, one correct option and three incorrect options. The second version contained three options per item, with one less incorrect option. This three-option version was constructed in the following way: for items that had been used in previous four-option assessment forms the least frequently chosen option was removed to construct a three-option item. If a four-option item contained a positively discriminating incorrect option with a reasonable selection frequency, then this was removed instead. These two criteria (less than 5% selection frequency and the sign of the distractor discrimination) appear to be the most commonly reported in studies concerned with identifying non-functioning distractors [3]. For a smaller number (11) of new four-option items that had yet to be trialled, subject-matter experts eliminated what they perceived to be the least plausible incorrect option, based on experience with similarly styled/structured items, to produce a three-option item. Cizek and O'Day [14] showed that subject-matter experts' selections can be highly consistent with empirical data about these relative frequencies. In compiling the first two test forms, exactly the same items in terms of the item stem appeared in each test. The item order was also preserved across test forms to avoid order effects.
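The removal rule described above can be sketched in code. This is only an illustration under stated assumptions: the function name `option_to_remove`, the data layout, the example values, and the 5% cut-off used to stand in for a "reasonable" selection frequency are all hypothetical and are not taken from the FSEP procedure.

```python
def option_to_remove(options, correct, min_freq=0.05):
    """Pick the distractor to drop from a four-option item.

    options:  dict mapping option label -> (selection proportion, point-biserial)
    correct:  label of the correct option
    min_freq: ASSUMED cut-off for a "reasonable" selection frequency
    """
    distractors = {k: v for k, v in options.items() if k != correct}

    # A distractor that discriminates positively and is chosen reasonably
    # often takes precedence for removal.
    positive = [k for k, (freq, pbis) in distractors.items()
                if pbis > 0 and freq >= min_freq]
    if positive:
        return max(positive, key=lambda k: distractors[k][1])

    # Otherwise remove the least frequently chosen distractor.
    return min(distractors, key=lambda k: distractors[k][0])


# Hypothetical item statistics: (selection proportion, point-biserial)
item_a = {"A": (0.62, 0.35), "B": (0.20, -0.22), "C": (0.15, -0.18), "D": (0.03, -0.05)}
item_b = {"A": (0.55, 0.30), "B": (0.18, -0.20), "C": (0.20, 0.08), "D": (0.07, -0.10)}

print(option_to_remove(item_a, "A"))  # least frequently chosen distractor
print(option_to_remove(item_b, "A"))  # positively discriminating distractor
```

In the second hypothetical item the positively discriminating option is removed even though another distractor is chosen less often, which mirrors the precedence described in the text.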
The third test was constructed as a mixed-option test form. This test was largely constructed from the items trialled in the fixed-option test forms. A total of ten three-option items were retained. A total of 38 items were sourced from the four-option test. These retained items represented the version of each item which discriminated best in the trial. Also, 12 new items were introduced. Of these, only three were completely new, whilst nine were items which underwent minor amendment. In order to avoid contamination by the new items, only data from the subset of 48 common items are used to derive test and item indices.
Details about the test content and the target population are described in Zoanetti et al. [23]. The three-option test was administered to 646 practitioners and the four-option test was administered to 763 practitioners. Test administration took place in a number of testing sessions. Fixed-option test booklets were distributed randomly across testing sessions so that an assumption of equivalent ability distributions could be supported.
The mixed-option test was administered to a different sample of 1044 practitioners from the FSEP target population. A comparison of values for various test statistics across the three 48 item subsets was then made.
A variety of indices of test and item quality were computed. These included: the mean difficulty of the tests in terms of classical test theory (CTT) item facility values, the mean of the item discrimination values, and the internal consistency index Cronbach's [24] Alpha. Item indices included: CTT item discrimination (point biserial), item facility as the percentage of correct responses for an item, the number of non-functioning options, the item fit following scaling with the Rasch model, and the standard error of measurement (SEM) following scaling with the Rasch model. These latter two Rasch-based statistics were included given the intended scaling of FSEP assessment forms onto a common latent metric (refer to [23] for more details). Statistical tests were also conducted to analyse differences between a number of these indices including mean facility, discrimination and reliability across test forms. These tests included paired sample t-tests to evaluate facility and discrimination differences, and Feldt and Kim's [25] test for comparing reliability coefficients from different test forms.
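The classical indices listed above can be illustrated with a short sketch. It assumes a matrix of dichotomous (0/1) item scores, uses an uncorrected point-biserial (the item is included in the total score), and runs on a tiny made-up data set; the function names and data are hypothetical and do not reproduce the FSEP analysis.

```python
from statistics import mean, pstdev, pvariance


def facility(item_scores):
    """Item facility: percentage of correct (1) responses."""
    return 100.0 * mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    mx, my = mean(item_scores), mean(total_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    return cov / (pstdev(item_scores) * pstdev(total_scores))


def cronbach_alpha(responses):
    """Cronbach's Alpha; responses is a list of persons, each a list of item scores."""
    k = len(responses[0])
    item_vars = [pvariance([person[i] for person in responses]) for i in range(k)]
    total_var = pvariance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up responses: 5 test takers by 4 items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(person) for person in responses]
item_1 = [person[0] for person in responses]

print(facility(item_1))
print(point_biserial(item_1, totals))
print(cronbach_alpha(responses))
```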
The Rasch measurement error for person scores is a function of the number of items and the targeting of each item's difficulty to the estimated person ability. Unlike CTT, where a SEM is a property of the test and is assumed to be constant across all test takers, the SEM for person scores under the Rasch paradigm varies with a person's scaled score (the estimate of a person's latent ability). Our interest in this assessment context is that the SEM for a person should be minimized. More specifically, when a pass standard is established for the FSEP assessment, our interest will be in minimising SEM for practitioners scoring near pre-determined cut-scores. Small SEM values reduce the uncertainty surrounding decisions about whether test takers either exceed or do not meet specified standards. It also means that the measurement process will support decisions about a greater proportion of the test takers. Test takers for whom it cannot be determined with high likelihood whether they exceed the pass standard will require additional evidence to be considered about their competencies before any high-stakes decision can be made. For additional explanations of Rasch measurement error, we refer the interested reader to Schumacker and Smith [26].
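To make the dependence on targeting concrete, the sketch below uses the standard result for the dichotomous Rasch model that the SEM at ability $\theta$ is $1/\sqrt{\sum_i P_i(\theta)\,(1 - P_i(\theta))}$, where $P_i$ is the probability of a correct response to item $i$. The item difficulties and ability values are hypothetical, chosen only for illustration.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def rasch_sem(theta, difficulties):
    """SEM at ability theta: inverse square root of the test information."""
    info = 0.0
    for b in difficulties:
        p = rasch_prob(theta, b)
        info += p * (1 - p)
    return 1.0 / math.sqrt(info)


# Hypothetical 60-item test with difficulties spread evenly from -2 to +2 logits
difficulties = [-2 + 4 * i / 59 for i in range(60)]

for theta in (-3.0, 0.0, 3.0):
    print(theta, round(rasch_sem(theta, difficulties), 3))
```

Under these assumptions the SEM is smallest for abilities near the centre of the item difficulties and grows for abilities far from them, which is why targeting items near a cut-score matters for pass/fail decisions.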

Results
The results for this study are presented in three sections. The first two sections detail differences in test and item statistics between the three-option and four-option test forms. The third section compares the statistical characteristics of the mixed option version against the fixed option versions of the test.

Test statistics for the fixed-option forms
The first statistical comparison concerned the relative difficulty of items from the two test forms. In the present study, the mean CTT item facility on the three-option test was higher than the mean item facility on the four-option test by 5.7%. This difference was significant when calculated via a paired sample t-test of item facility across all items (t=-6.358, df=59, p<0.001). A total of 46 of the 60 items became easier upon removal of the least attractive incorrect option.
Also noted was a modest difference in the internal consistency index, Cronbach's Alpha. The four-option test had an Alpha value of 0.791 while the three-option test had an Alpha value of 0.775. The difference in Cronbach's Alpha values illustrates that the four-option test had slightly superior internal consistency. Feldt and Kim [25] developed a test for comparing reliability coefficients from similar tests. Their W statistic approximately follows a central F-distribution with $N_1 - 1$ and $N_2 - 1$ degrees of freedom, where $N_1$ and $N_2$ are the two sample sizes. The critical F value was calculated as $F_{\mathrm{crit}}(645, 762) = 1.13$ at $\alpha = 0.05$. It was then determined that the test statistic $W \approx F = 1.08 < 1.13$, indicating that the four-option form did not have a statistically significant higher reliability coefficient than the three-option form (p = 0.16).
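The reported value of W can be checked with a few lines of arithmetic. The sketch assumes the usual ratio form of the statistic for independent samples, $W = (1 - \alpha_1)/(1 - \alpha_2)$, with the three-option form as sample 1; the critical value is simply copied from the text rather than recomputed, and the function name is hypothetical.

```python
def feldt_w(alpha_1, alpha_2):
    """Ratio statistic for comparing two independent Alpha coefficients (assumed form)."""
    return (1 - alpha_1) / (1 - alpha_2)


alpha_three, n_three = 0.775, 646   # three-option form
alpha_four, n_four = 0.791, 763     # four-option form

w = feldt_w(alpha_three, alpha_four)
df = (n_three - 1, n_four - 1)
f_crit = 1.13  # critical value as reported in the text, not recomputed here

print(round(w, 2), df, w > f_crit)
```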
Nonetheless, while there were no significant differences between the two fixed item forms, we determined how many additional three-option items of comparable quality might be needed to obtain an increase in Cronbach's Alpha. Assuming equivalence of ability distributions of the practitioner samples, the implication of a larger Cronbach's Alpha value is that the four-option test would provide more reproducible estimates of the relative ranking of practitioners. Importantly, for the three-option test to rival the four-option test in terms of this index, it is estimated by use of the Spearman-Brown [27,28] prophecy formula that an additional 6 items of equivalent quality to those already on hand would have been required.
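The estimate of six additional items can be reproduced from the Spearman-Brown prophecy formula. Solving $\rho^* = k\rho / (1 + (k - 1)\rho)$ for the lengthening factor gives $k = \rho^*(1 - \rho) / (\rho(1 - \rho^*))$. The sketch below applies this to the two Alpha values reported above; the function name is hypothetical and rounding up to whole items is an assumption.

```python
import math


def lengthening_factor(rho_current, rho_target):
    """Spearman-Brown factor by which a test must be lengthened to reach rho_target."""
    return (rho_target * (1 - rho_current)) / (rho_current * (1 - rho_target))


n_items = 60
k = lengthening_factor(0.775, 0.791)
extra_items = math.ceil(n_items * k) - n_items  # round up to whole items

print(round(k, 3), extra_items)
```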
The mean item discrimination of the two tests was also compared using a paired sample t-test. The mean of the differences was less than 0.01 and not significant (t=0.867, df=59, p=0.389). Interestingly, the non-parametric correlation of item discrimination values between the two forms was modest at 0.72. This suggested some re-ordering of the relative discriminating power of items had occurred following removal of the least functioning distractor. The standard deviation of the differences was 0.07, also suggesting the presence of reasonable variation at the item level. This highlighted the importance of examining changes across test forms at the level of individual items. The next section takes this approach.

Item statistics for the fixed-option forms
The items were analysed and the following statistics were calculated: facility (percentage of correct respondents), item discrimination (r), response option (A, B, C or D) frequencies and missing response frequencies (percentage of respondents) and response option point biserial (Pt Bis) values. These values are displayed in Table 1.
Several things were evident from an inspection of the item analysis. Three items had not functioned well. These were item 1, item 9 and item 54. These were flagged for qualitative review and replacement in subsequent versions of the assessment. Net decreases in CTT item difficulty were observed for a total of 50 of the 60 items, supporting the theory that removal of negatively discriminating options will increase item facility. Only two items had increased in difficulty by what was considered a substantive, though arbitrary, amount (>5%) upon reduction to three options. These were item 26 (6.45% lower facility) and item 58 (14.70% lower facility). At the other extreme, the facility of item 37 increased 26.74% upon removal of the least functioning option. Net decreases in discrimination were observed for a total of 31 of the 60 items, representing a fairly even split. However, in some individual cases the reduction in discrimination was as large as 0.18 (item 53) and the increase as large as 0.16 (item 45).
The average number of non-functioning options per item in the four-option test form was 1.18. The definition of non-functioning distractors used by Tarrant, Ware and Mohammed [3] was used in deriving these figures. This definition requires that an incorrect option is selected by less than 5% of test takers, or that an incorrect option discriminates positively. This latter criterion should be treated with caution given that such statistics are sensitive to small numbers of respondents in low-response categories. Nonetheless these criteria are mirrored here. Item 45 is one example of an item for which the removal of the least effective distractor (this time on the basis of discrimination) resulted in a marked improvement in item quality. Item characteristic curves were produced for both versions of this item using ConQuest [29] item response modelling software (refer to Figure 1 and Figure 2). In a good item, the category curves for distractors should decrease in probability value with increasing test taker ability. In Figure 1 it can be seen that option D does not behave in this way. Its removal resulted in a sharp increase in item discrimination (refer to Table 1) and an improvement in the fit of the data to the item response model (refer to Figure 1 and Figure 2). An evaluation of this result and the specific item is outlined in the discussion section of this article.
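The definition of a non-functioning distractor given above translates directly into a counting rule. The following sketch is illustrative only: the function name, data layout and option statistics are hypothetical, and no attempt is made to handle the small-sample caution noted in the text.

```python
def count_non_functioning(distractors, min_freq=0.05):
    """Count distractors chosen by < 5% of test takers or discriminating positively.

    distractors: list of (selection proportion, point-biserial) for incorrect options
    """
    return sum(1 for freq, pbis in distractors if freq < min_freq or pbis > 0)


# Hypothetical four-option items (statistics for the three incorrect options only)
items = {
    "item_x": [(0.20, -0.22), (0.15, -0.18), (0.03, -0.05)],  # one low-frequency option
    "item_y": [(0.18, -0.20), (0.20, 0.08), (0.07, -0.10)],   # one positive discriminator
    "item_z": [(0.02, -0.03), (0.01, 0.02), (0.04, -0.06)],   # very easy item
}

per_item = {name: count_non_functioning(d) for name, d in items.items()}
print(per_item, sum(per_item.values()) / len(per_item))
```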

Comparing the mixed-option and fixed-option test forms
In the following comparisons, statistics were derived for a subset of 48 out of the 60 items. These items remained unchanged in terms of content and ordering across test forms.
The results in Table 2 suggest that, in this case, the mixed-option format was superior in terms of reliability and mean discrimination. Interestingly, it was found to be easier than the two fixed-option tests. Feldt and Kim's [25] test for determining whether reliability coefficients from independent tests are equal was again applied to test the alternative hypothesis that the reliability of the mixed-option test form was greater than that of the four-option test form. The critical F value for this test was calculated as $F_{\mathrm{crit}}(762, 1043) = 1.12$ at $\alpha = 0.05$. It was then determined that the test statistic $W \approx F = 1.17 > 1.12$, illustrating that the difference is significant at the 5% level (p-value = 0.02). Next we examined whether there are differences in the measurement error surrounding estimated person scores. The mixed-option test form reduces measurement error, albeit slightly (Figure 3). In this context, given the consequence of the test score interpretation, even small reductions are important but it is likely to be more efficient to increase the number of items in the instrument.
Figure 1 Relationship between candidate ability and probability of correct answer for a single test item with three distractors.

Discussion
It appears that there are advantages to be had in using a mixed option number mode of testing. At least this appears to be the case when comparing three-option and four-option alternatives. At the item level it is possible to increase the discrimination and content validity of individual items by adding plausible options and avoiding problematic options. This provides a basis for an evidence-based approach to item development where the number of response options for each question is determined independently. The approach can be applied at several important junctions during the assessment design and analysis process. First, item writers can apply the policy during item construction. Second, subject-matter experts can apply the policy during item panelling. Third, the policy can also be referred to during item analysis review. The following discussion examines how this policy might be implemented in the FSEP context and more broadly. It is evident that a rigid adherence to a fixed number of options regardless of the quality of options is counter-productive in terms of the quality of the psychometric information to be obtained from test administration.
Instructions to item writers might explain that only options which are educationally important and plausible should be written until a minimum of three are produced. If more are immediately forthcoming they may also be added. The minimum of three is chosen based on studies revealing that reliability increases tend to be more pronounced between two-option and three-option tests than between three-option and four- or five-option tests [18]. Preventing item writers from labouring over a fourth or fifth option is one way to extract the benefits of time savings afforded by using fewer options.
The FSEP item writing process is presently conducted using a round table audit of each newly constructed item. During this process options which are implausible are challenged and replacements are suggested. The item-level policy promoted in this article would result in a subtle change to the present process: if the task of replacing a challenged option for a four-option item became unfeasible, it could be abandoned and the item accepted as a three-option item. Alternatively, additional plausible options could be put forward by panel members at this time. The process would result in items with three or more response options. Finally, item analysis data could be used to affirm that response options are at the very least discriminating appropriately. For low frequency options, caution with regard to sample sizes will be needed. Small, positive biserial values may emerge by chance if the number of test takers selecting certain response options is small. Recommendations that options selected by fewer than 5% of test takers should be removed from items should be made conditional on item facility. In some examination contexts, where mastery of particular knowledge or skills is expected and therefore included in the corresponding assessment blueprint, there may be a reasonable number of items with high facility. In these cases it is recommended that options not be discarded on the basis of frequency data alone. As reported in the results section, 16 items (26.7%) had zero non-functioning distractors, 22 items (36.7%) had one, 17 items (28.3%) had two, and five items (8.3%) had three. These results are not considered meaningful without first examining the facility of the items from which they arose. For example, the minimum facility of the five items with three non-functioning distractors is 87.68%. It is completely reasonable that items with very high facility cannot support many functioning distractors [6]. Yet these items are still necessary for fulfilling the content coverage requirements of the assessment blueprint. Another implication of these results is that for approximately three quarters of the four-option items, at least one non-functioning distractor was available for exclusion.
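The distribution of non-functioning distractors quoted above is consistent with the mean of 1.18 per item reported in the results and with the "approximately three quarters" figure. A short check, using only the counts given in the text:

```python
# Number of items with 0, 1, 2 and 3 non-functioning distractors (from the text)
counts = {0: 16, 1: 22, 2: 17, 3: 5}

n_items = sum(counts.values())
mean_nf = sum(k * n for k, n in counts.items()) / n_items
share_with_any = (n_items - counts[0]) / n_items

print(n_items, round(mean_nf, 2), round(100 * share_with_any, 1))
```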
As identified in the results section and depicted in Figure 1 and Figure 2, the discrimination and model fit of item 45 improved markedly upon removal of a non-functioning distractor. Fit to the Rasch model is usually determined in two ways. The first is called infit and is the value of the mean squared deviation from the expected response pattern weighted by the item variance. The second is called outfit and it is determined by the unweighted mean squared deviation from the expected response pattern. The unweighted fit statistic is more sensitive to outliers within the data. The lower infit (weighted mean square) value indicates that the responses to the item have become less random and instead are better aligned with test taker ability as predicted by the measurement model. In this case, the three-option version of the item is of acceptable quality and need not be further modified. Qualitative review of this item revealed that option D ("This CTG is not reflective of the fetal condition") could be considered an 'easy out' option. As discussed, this option appealed to a reasonable number of test-takers irrespective of ability. Removal of this option effectively forced test-takers to choose from options which better revealed their level of understanding of the fetal physiology indicated by the CTG. Item 32 similarly contained this 'easy out' option as its option D, and also exhibited improved discrimination upon its removal (Table 1). This information has since been fed back into item writing workshops, where inclusion of 'opt out' options to make up option numbers has been discouraged. Generally, the results presented in this article are consistent with those reported by Cizek, Robinson and O'Day [14] where test-level variation is modest but item-level variation can be considerable, following a reduction of one response option.
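The two fit statistics described above can be illustrated for a single dichotomous item. The sketch uses the common textbook forms, with infit as the sum of squared residuals divided by the sum of response variances and outfit as the mean of squared standardised residuals. It is not a reproduction of the ConQuest computation, and the abilities, item difficulty and responses are hypothetical.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, abilities, b):
    """Return (infit, outfit) mean squares for one item of difficulty b."""
    sq_resid = 0.0   # sum of squared residuals
    var_sum = 0.0    # sum of model variances
    z_sq_sum = 0.0   # sum of squared standardised residuals
    for x, theta in zip(responses, abilities):
        p = rasch_prob(theta, b)
        var = p * (1 - p)
        sq_resid += (x - p) ** 2
        var_sum += var
        z_sq_sum += (x - p) ** 2 / var
    return sq_resid / var_sum, z_sq_sum / len(responses)


abilities = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
consistent = [0, 0, 1, 1, 1, 1]   # responses in line with ability
aberrant = [0, 0, 1, 1, 1, 0]     # the most able test taker answers incorrectly

print(item_fit(consistent, abilities, b=0.0))
print(item_fit(aberrant, abilities, b=0.0))
```

In this made-up example a single unexpected response from a high-ability test taker raises both statistics, and raises outfit considerably more than infit, which illustrates the greater sensitivity of the unweighted statistic to outliers.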
Given that items would routinely be stored in an item bank, and potentially rotated in and out of test forms, their optimisation is an important component of the FSEP. This is an important point which suggests that test or aggregated statistics like mean facility or discrimination should not form the basis of item writing or test construction policies alone. Cizek et al. [14] have made similar remarks.
This study has provided some useful empirical information from which to determine a policy for FSEP test item writing. That stated, a number of assumptions are made and a number of limitations have been identified. These are outlined in the following paragraphs.
One assumption made throughout the study is that the samples of practitioners taking the different assessment forms are representative of each other and of the FSEP target population. The large sample sizes and the relatively random distribution of three- and four-option booklets to different testing groups would go some way to ensuring this. Nonetheless, this was recognised as a source of error when making comparisons between item and test performance indices across test forms. This would also lead to some uncertainty concerning the generalisability of these results.
The second part of this study comparing the fixed-option and mixed-option forms has the design limitation that a subset of 12 questions was not common across test forms. Despite restricting the analysis to the common subset and majority, any influence of the smaller disparate subset on test taker responses to the common questions cannot be accounted for. Based on efforts during test construction to avoid inter-item dependencies, it is hoped that any influence would be small.
A further limitation of this study is that the approach taken is post-hoc. That is, the options removed were those already identified as non-functioning. This is not the same as recommending writing just three options, since item writers are unlikely to include non-functioning options purposefully. In other words, during item writing, it is often unknown which option will be non-functioning. Therefore the impact of writing only three options could be more severe if items were constructed that way. This is another limitation of the study design, in that it produced three-option items which, by design, should be of a higher quality than those constructed by item writers aware that three-option items are sufficient.
The interplay between option number, test length and test reliability deserves further attention. The theoretically predicted result that approximately 6 additional three-option items might be needed to compensate for the reliability reduction from four-option items is rather inconclusive. Whether test takers could reasonably answer 66 three-option questions in the time it would take to answer 60 four-option questions is an empirical question for this context. Based on a meta-analysis by Aamodt and McShane [30] it could go either way. They estimated that in the time it would take to complete a test with 100 four-option questions an additional 12 three-option questions could be completed (so 112 three-option items in total). Another estimate reported in Rogausch, Hofer and Krebs [6] suggests that about three or four extra items per hour of testing time could be answered based on removal of one option per item. The FSEP test duration is one hour, suggesting that an extra 6 three-option items may not necessarily be accommodated in the testing period. A follow up study looking at the time taken to complete the various test forms would provide useful additional information in this context. Following this, formulae for projecting the increase in reliability owing to increased item numbers could be used to model how mixed-option test forms with different proportions of three-option and four-option items might perform.
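As a rough illustration of the kind of projection mentioned above, the Spearman-Brown formula $\rho^* = k\rho / (1 + (k - 1)\rho)$ can be applied to the three-option Alpha of 0.775 for different numbers of extra items. The sketch assumes the extra items are of equivalent quality to the existing ones and treats the three or four extra items per hour from [6] as directly applicable to the FSEP, which has not been verified; the function name is hypothetical.

```python
def projected_reliability(rho, k):
    """Spearman-Brown projection of reliability when test length is multiplied by k."""
    return k * rho / (1 + (k - 1) * rho)


alpha_three = 0.775   # three-option form, 60 items
alpha_four = 0.791    # four-option form, 60 items
n_items = 60

for extra in (3, 4, 6):
    k = (n_items + extra) / n_items
    projected = projected_reliability(alpha_three, k)
    print(extra, round(projected, 3), projected >= alpha_four)
```

Under these assumptions, six extra items bring the projected value to about the four-option figure, whereas three or four extra items fall slightly short, which is in line with the caution expressed in the text.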
Finally, the empirical components of this study apply to a particular sample of items from a broader item bank. It is not known to what extent the results would generalise to other contexts within medical education and beyond.

Conclusions
In this study we sought to determine a policy for item development and test construction for the FSEP assessment. A review of literature and existing assessment practices identified a number of feasible approaches, each underpinned by various traditions, assumptions and empirical findings. The commonly reported finding that test difficulty decreases slightly and mean item discrimination remains unchanged when the least functional distractor is removed from all items was supported in the FSEP context. The finding that items perform no worse with three options than with four options when the least functional distractor is removed was not supported in the FSEP context. Instead, there was appreciable variation at the individual item level. These findings mirror those in another medical education assessment context [14], and contribute to the idea that these trends are generalisable. This discouraged the recommendation of a blanket policy for the number of options. The view was taken that where plausible and educationally important options could be included in an item they should be, without regard for the total option number. Indeed, for other items, where specifying more than two plausible options would be difficult, item writers would not be obliged to spend excessive time trying to construct additional, potentially poor quality options. These policies were seen as the most evidence-based approach for maximising the quality of the FSEP test.