There appear to be advantages in using a mixed-option-number mode of testing, at least when comparing three-option and four-option alternatives. At the item level, the discrimination and content validity of individual items can be increased by adding plausible options and avoiding problematic ones. This provides a basis for an evidence-based approach to item development in which the number of response options is determined independently for each question. The approach can be applied at several important junctures during assessment design and analysis. First, item writers can apply the policy during item construction. Second, subject-matter experts can apply it during item panelling. Third, the policy can also be referred to during item analysis review. The following discussion examines how this policy might be implemented in the FSEP context and more broadly. It is evident that rigid adherence to a fixed number of options, regardless of their quality, is counter-productive in terms of the psychometric information obtained from test administration.
Instructions to item writers might explain that only options which are educationally important and plausible should be written, until a minimum of three are produced. If more are immediately forthcoming, they may also be added. The minimum of three is based on studies showing that reliability gains tend to be more pronounced between two-option and three-option tests than between three-option and four- or five-option tests. Relieving item writers of the need to labour over a fourth or fifth option is one way to realise the time savings afforded by using fewer options.
The FSEP item writing process is presently conducted using a round-table audit of each newly constructed item. During this process, options which are implausible are challenged and replacements are suggested. The item-level policy promoted in this article would result in a subtle change to the present process: if the task of replacing a challenged option for a four-option item proved infeasible, it could be abandoned and the item accepted as a three-option item. Alternatively, additional plausible options could be put forward by panel members at this time. The process would result in items with three or more response options.
Finally, item analysis data could be used to confirm that response options are at the very least discriminating appropriately. For low-frequency options, caution with regard to sample sizes will be needed: small positive biserial values may emerge by chance if the number of test takers selecting a given response option is small. Recommendations that options selected by fewer than 5% of test takers should be removed from items should be made conditional on item facility. In some examination contexts, where mastery of particular knowledge or skills is expected and therefore included in the corresponding assessment blueprint, there may be a reasonable number of items with high facility. In these cases it is recommended that options not be discarded on the basis of frequency data alone. As reported in the results section, 16 items (26.7%) had zero non-functioning distractors, 22 items (36.7%) had one, 17 items (28.3%) had two, and five items (8.3%) had three. These results are not considered meaningful without first examining the facility of the items from which they arose. For example, the minimum facility of the five items with three non-functioning distractors is 87.68%. It is entirely reasonable that items with very high facility cannot support many functioning distractors. Yet these items are still necessary for fulfilling the content coverage requirements of the assessment blueprint. Another implication of these results is that for approximately three quarters of the four-option items, at least one non-functioning distractor was available for exclusion.
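As a rough illustration of this kind of option-level review, the sketch below computes the selection frequency and point-biserial discrimination of each option for a single hypothetical item. The response data, the answer key and the flagging thresholds are invented for the example and are not drawn from the FSEP data; in practice a dedicated package would be used.

```python
# Illustrative distractor analysis for one hypothetical item (key = 'A').
# All data below are invented for the sketch.

def point_biserial(selected, scores):
    """Correlation between a 0/1 option-selection indicator and total scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    if var == 0:
        return 0.0
    p = sum(selected) / n              # proportion selecting the option
    if p in (0.0, 1.0):
        return 0.0
    mean_sel = sum(s for s, x in zip(scores, selected) if x) / sum(selected)
    return (mean_sel - mean) / var ** 0.5 * (p / (1 - p)) ** 0.5

choices = list("AABACADABAAACABABAAA")            # option chosen by each taker
totals  = [52, 48, 30, 55, 33, 50, 57, 41, 58, 47,
           49, 44, 28, 51, 35, 46, 25, 43, 54, 56]  # total test scores

for option in "ABCD":
    sel = [1 if c == option else 0 for c in choices]
    freq = sum(sel) / len(sel)
    rpb = point_biserial(sel, totals)
    # A distractor is suspect if rarely chosen or if it attracts able takers.
    flag = "non-functioning?" if option != "A" and (freq < 0.05 or rpb >= 0) else ""
    print(f"{option}: freq={freq:.2f}, r_pb={rpb:+.2f} {flag}")
```

Conditioning the frequency rule on item facility, as argued above, would mean suppressing the frequency flag for very easy items rather than discarding their distractors automatically.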
As identified in the results section and depicted in Figure 1 and Figure 2, the discrimination and model fit of item 45 improved markedly upon removal of a non-functioning distractor. Fit to the Rasch model is usually determined in two ways. The first, called infit, is the mean squared deviation from the expected response pattern weighted by the item variance. The second, called outfit, is the unweighted mean squared deviation from the expected response pattern; being unweighted, it is more sensitive to outliers in the data. The lower infit (weighted mean square) value indicates that responses to the item have become less random and are instead better aligned with test taker ability as predicted by the measurement model. In this case, the three-option version of the item is of acceptable quality and need not be further modified. Qualitative review of this item revealed that option D (“This CTG is not reflective of the fetal condition”) could be considered an ‘easy out’ option. As discussed, this option appealed to a reasonable number of test takers irrespective of ability. Removing it effectively forced test takers to choose from options which better revealed their level of understanding of the fetal physiology indicated by the CTG. Item 32 similarly contained this ‘easy out’ option as its option D, and also exhibited improved discrimination upon its removal (Table 1). This information has since been fed back into item writing workshops, where the inclusion of ‘opt out’ options to make up option numbers has been discouraged. Generally, the results presented in this article are consistent with those reported by Cizek, Robinson and O’Day, where test-level variation is modest but item-level variation can be considerable following the reduction of one response option.
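To make the two fit statistics concrete, the following sketch computes infit and outfit mean squares for a single dichotomous item under the Rasch model. The person abilities, item difficulty and response vector are invented for illustration and do not correspond to any FSEP item.

```python
import math

# Illustrative infit/outfit mean-square calculation for one dichotomous
# Rasch item. Abilities, difficulty and responses are invented.

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

thetas = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]   # person abilities (logits)
b = 0.2                                            # item difficulty (logits)
x = [0, 0, 1, 0, 1, 1, 1]                          # observed responses

p = [rasch_p(t, b) for t in thetas]                # expected response pattern
w = [pi * (1 - pi) for pi in p]                    # response variances
sq = [(xi - pi) ** 2 for xi, pi in zip(x, p)]      # squared residuals

# Outfit: unweighted mean of standardised squared residuals.
outfit = sum(s / wi for s, wi in zip(sq, w)) / len(x)
# Infit: squared residuals weighted by the response variances.
infit = sum(sq) / sum(w)

print(f"infit MS = {infit:.2f}, outfit MS = {outfit:.2f}")
```

Both statistics have an expected value of 1 for well-fitting responses; because outfit gives each standardised residual equal weight, a single surprising response from a very able or very weak test taker inflates it more than it inflates infit.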
Given that items would routinely be stored in an item bank, and potentially rotated in and out of test forms, their optimisation is an important component of the FSEP. This suggests that test-level or aggregated statistics, such as mean facility or discrimination, should not alone form the basis of item writing or test construction policies. Cizek et al. have made similar remarks.
This study has provided some useful empirical information from which to determine a policy for FSEP test item writing. That said, a number of assumptions and limitations have been identified; these are outlined in the following paragraphs.
One assumption made throughout the study is that the samples of practitioners taking the different assessment forms are representative of each other and of the FSEP target population. The large sample sizes and the relatively random distribution of three- and four-option booklets to different testing groups would go some way to ensuring this. Nonetheless, this was recognised as a source of error when making comparisons between item and test performance indices across test forms. This would also lead to some uncertainty concerning the generalisability of these results.
The second part of this study, comparing the fixed-option and mixed-option forms, has the design limitation that a subset of 12 questions was not common across test forms. Although the analysis was restricted to the common majority of questions, any influence of the smaller disparate subset on test taker responses to the common questions cannot be accounted for. Based on efforts during test construction to avoid inter-item dependencies, it is hoped that any such influence would be small.
A further limitation of this study is that the approach taken was post hoc: options were removed because they were found to be non-functioning. This is not the same as recommending that only three options be written, since item writers are unlikely to include non-functioning options purposefully; during item writing, it is often unknown which option will turn out to be non-functioning. The impact of writing only three options could therefore be more severe if items were constructed that way. In other words, the study design produced three-option items which, by design, should be of higher quality than those constructed by item writers who know in advance that three options are sufficient.
The interplay between option number, test length and test reliability deserves further attention. The theoretically predicted result that approximately six additional three-option items might be needed to compensate for the reliability reduction relative to four-option items remains inconclusive. Whether test takers could reasonably answer 66 three-option questions in the time it would take to answer 60 four-option questions is an empirical question for this context. Based on a meta-analysis by Aamodt and McShane, it could go either way: they estimated that in the time it would take to complete a test with 100 four-option questions, an additional 12 three-option questions could be completed (112 three-option items in total). Another estimate, reported in Rogausch, Hofer and Krebs, suggests that about three or four extra items per hour of testing time could be answered following the removal of one option per item. The FSEP test duration is one hour, suggesting that an extra six three-option items may not necessarily be accommodated in the testing period. A follow-up study examining the time taken to complete the various test forms would provide useful additional information in this context. Following this, formulae for projecting the increase in reliability owing to increased item numbers could be used to model how mixed-option test forms with different proportions of three-option and four-option items might perform.
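One such formula is the Spearman–Brown prophecy formula. The sketch below shows how it could be used to project the number of extra three-option items needed to recover the reliability of a four-option form; the reliability figures are invented for the example and are not the study's actual values.

```python
# Spearman-Brown projection sketch with invented reliability figures.

def spearman_brown(rel, factor):
    """Projected reliability when test length changes by `factor`."""
    return factor * rel / (1 + (factor - 1) * rel)

def length_factor_needed(rel_current, rel_target):
    """Lengthening factor required to reach `rel_target` from `rel_current`."""
    return (rel_target * (1 - rel_current)) / (rel_current * (1 - rel_target))

rel_four = 0.80    # hypothetical reliability of the 60-item four-option form
rel_three = 0.785  # hypothetical reliability of a 60-item three-option form

factor = length_factor_needed(rel_three, rel_four)
extra_items = factor * 60 - 60
print(f"lengthening factor = {factor:.3f}; extra items needed = {extra_items:.1f}")
```

Running the same projection across a range of three-option/four-option mixes would support the modelling of mixed-option test forms proposed above.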
Finally, the empirical components of this study apply to a particular sample of items from a broader item bank. It is not known to what extent the results would generalise to other contexts within medical education and beyond.