Comparison of formula and number-right scoring in undergraduate medical training: a Rasch model analysis
BMC Medical Education volume 17, Article number: 192 (2017)
Abstract
Background
Progress testing is an assessment tool used to periodically assess all students at the end-of-curriculum level. Because students cannot know everything, it is important that they recognize their lack of knowledge. For that reason, the formula-scoring method has usually been used. However, where partial knowledge needs to be taken into account, the number-right scoring method is used. Research comparing both methods has yielded conflicting results. As far as we know, all of these studies used Classical Test Theory or Generalizability Theory to analyze the data. In contrast to these studies, we explore the use of the Rasch model to compare both methods.
Methods
A 2 × 2 crossover design was used in a study where 298 students from four medical schools participated. A sample of 200 previously used questions from the progress tests was selected. The data were analyzed using the Rasch model, which provides fit parameters, reliability coefficients, and response option analysis.
Results
The fit parameters were in the optimal interval ranging from 0.50 to 1.50, and the means were around 1.00. The person and item reliability coefficients were higher in the number-right condition than in the formula-scoring condition. The response option analysis showed that the majority of dysfunctional items emerged in the formula-scoring condition.
Conclusions
The findings of this study support the use of number-right scoring over formula scoring. Rasch model analyses showed that tests with number-right scoring have better psychometric properties than tests with formula scoring. However, choosing the appropriate scoring method should depend not only on psychometric properties but also on self-directed test-taking strategies and metacognitive skills.
Background
Progress testing is a systematic, longitudinal assessment method by which students are periodically assessed at the end-of-curriculum level. Research has shown that the progress test is a valid and reliable tool for measuring knowledge growth [1,2,3], that it reduces examination stress, and that it positively influences student learning [4].
Over the past few decades, test scores on assessment tools based on multiple-choice questions (MCQs) have been calculated in two ways: “number-right scoring” and “formula scoring.” Number-right scoring implies that only the number of correct answers is taken into account when calculating the total score, and that incorrect answers are not subtracted from the total score. Number-right scoring has frequently been applied for a number of reasons. First, its simplicity allows for an uncomplicated interpretation of the results for both students and professionals. Second, number-right scoring allows students to answer all questions, and their partial knowledge is included in the outcomes. If students have partial knowledge about an item and can rule out alternatives with more or less certainty, they will obtain higher scores [5]. Third, under the presumption that the test tries to measure the knowledge a student has and not just the knowledge that they are confident in using, the willingness to guess is not accounted for in number-right scoring, which reduces bias regarding construct-irrelevant sources of variance due to risk-avoidance behavior.
Although formula scoring is not frequently used, except for progress tests in medicine, it gives students the opportunity to acknowledge that they do not know the correct answer instead of forcing them to guess [6]. It is important to realize that students cannot know everything. Because the students who take a progress test differ widely in knowledge level, the inclusion of an “I don’t know” option becomes a logical choice. In progress tests using formula scoring, an “I don’t know” option, which does not lead to a penalty, is included. When such a scoring method is applied, junior students tend to answer a smaller percentage of the questions than senior ones. Formula scoring offers an individualized way of correcting for guessing and may reduce random guessing to as low as 2% of the items [7].
Comparisons between number-right scoring and formula scoring have been the subject of study for many years. Data comparing the reliability of both methods have yielded conflicting results. Formula scoring has shown an increase [6, 8] and a decrease [9] in the reliability coefficient as compared to number-right scoring. This increase in reliability, however, might be related to other constructs that are reflected in the final score [10,11,12], such as risk-taking strategies [6, 13,14,15], gender [16,17,18,19], self-efficacy beliefs, and metacognitive skills, instead of students’ medical knowledge alone [6, 20, 21]. From a practical perspective, one could argue that knowledge is only useful if the student is willing to use it and that focusing only on the knowledge in the ‘heads’ of students might be a case of construct-underrepresentation. Furthermore, students have differed in their tendency to choose the “I don’t know” option [17, 19, 22].
This study aims to answer the following research questions:
a) Which scoring method provides fewer dysfunctional items?
b) Which scoring method provides the most reliable score?
Traditionally, Classical Test Theory (CTT) and Generalizability Theory analyses have been used to investigate differences between number-right and formula scoring [6, 8, 9]. In contrast to these previous studies, we have based our data analyses on Item Response Theory (IRT). IRT was chosen because it allows for an estimate of student ability (theta) that is independent of item selection; moreover, item difficulty (b) can be estimated in a way that is independent of the sample of students. These two properties are called parameter invariance. Additionally, IRT provides an estimate of the measurement error at each point of the ability scale (theta), which allows for an estimation of the reliability of each student’s performance. Despite evidence of the advantages of IRT models over CTT [23], it is only possible to take full advantage of IRT if two assumptions are met. The first assumption is unidimensionality, which implies that a single underlying trait accounts for the performance of the student. The second assumption is local independence, which implies that, once ability is taken into account, responses to different test items are not related to each other [24]. For more information about IRT and the comparison between IRT and CTT, see Downing (2003) [25] and De Champlain (2010) [26]. Since IRT models are more sensitive to construct-irrelevant sources of variance, we expected that the tests taken using the number-right scoring condition would be more reliable and have better validity. In addition, fewer dysfunctional items should emerge for the tests that use the number-right scoring condition.
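To make the roles of theta and b concrete, the following is a minimal sketch of the dichotomous Rasch model and of the ability-dependent standard error mentioned above. It is an illustration under textbook assumptions, not the estimation routine used in this study; the function names are hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct answer for ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def information_at(theta, difficulties):
    """Sum of item informations p * (1 - p) at a given ability level."""
    total = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        total += p * (1.0 - p)
    return total


def standard_error(theta, difficulties):
    """Standard error of the ability estimate: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(information_at(theta, difficulties))
```

When ability equals difficulty, the probability of success is 0.5 and each item contributes its maximum information of 0.25, so the standard error grows as items move away from a student's ability level.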
Methods
To answer our research questions, an experiment was designed comparing the number-right and formula-scoring methods using a 2 × 2 crossover design (Table 1). For the first test of group A, formula scoring was used and, for the second, number-right scoring, whereas group B was tested the other way around. This design avoided cueing and priming effects, and ensured similar student knowledge levels.
Participants and procedure
Medical students from years 2, 3, and 4 were invited to participate in the experiment. Unlike year-one students, their knowledge levels were expected to be sufficient to provide useful information, and they would then be more likely to make an educated guess instead of not answering an item (the “I don’t know” option). Additionally, years 2, 3, and 4 medical students were chosen because they were in a structured learning environment, where there was likely to be more homogeneity in the cohorts in terms of educational experience. Two hundred ninety-eight students from four Dutch medical schools participated in the experiment (Table 1).
In this particular research field, it is important for the participating students to already be acquainted with the blueprint and the test format. Our participants were familiar with both types of questions and scoring methods, since they had taken both kinds of tests at least five times. This provided a methodological advantage that enabled us to better establish construct validity through the comparison of scores, minimizing measurements of other traits [5, 11].
Instruments
The Dutch progress test covers the whole domain of medical knowledge at end level, based on the Dutch National Blueprint for the Medical Curriculum. The progress test is administered simultaneously four times a year to all medical students who take part in the consortium. At that time, roughly 10,000 students take the progress test. Each progress test consists of 200 multiple-choice questions. Since 2005, the Dutch Interuniversity Progress Test has comprised items with a varying number of response options, ranging from 2 to 5. The penalty for guessing on each item varies with the number of answer options (−1/[number of answer options − 1]), ranging from −1.00 for two-option items to −0.25 for five-option items.
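The two scoring rules can be sketched as follows. This is a minimal illustration, assuming that an “I don’t know” response is represented by `None` and that each item carries its own number of answer options; the function and parameter names are hypothetical and do not come from the progress test software.

```python
from fractions import Fraction


def number_right_score(responses, key):
    """Count the correct answers; wrong answers carry no penalty."""
    return sum(1 for given, correct in zip(responses, key) if given == correct)


def formula_score(responses, key, n_options):
    """Add 1 per correct answer, subtract 1/(k - 1) per wrong answer on a
    k-option item, and add nothing for "I don't know" (None)."""
    total = Fraction(0)
    for given, correct, k in zip(responses, key, n_options):
        if given is None:
            continue  # "I don't know": no credit, no penalty
        if given == correct:
            total += 1
        else:
            total -= Fraction(1, k - 1)
    return total


# One correct answer, one wrong answer on a three-option item (penalty 1/2),
# one "I don't know", and one wrong answer on a five-option item (penalty 1/4).
key = ["a", "b", "c", "d"]
n_options = [2, 3, 4, 5]
responses = ["a", "c", None, "a"]
```

For this response pattern the number-right score is 1 and the formula score is 1 − 1/2 − 1/4 = 1/4, which shows how the penalty grows as the number of answer options shrinks.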
We selected 250 questions out of seven progress tests that had been administered between 2005 and 2007. Subsequently, we reduced the number of questions to 200 items with a p-value > .25, where the p-value indicates the probability of the question being answered correctly in a cohort of students. We created two equal tests of 100 multiple-choice questions, based on the progress test blueprint. Both sets of 100 questions were equally distributed in terms of mean p-values, based on the results of graduate-level students, through use of the sum of p-values, the sum of p-corrected, the total of “I don’t know” options chosen, and the total number of distractors per question (2, 3, or 4). All those statistics are based on Classical Test Theory and were gathered from the quality control of the Dutch progress test consortium.
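The item selection criterion can be illustrated with a short sketch. It assumes a hypothetical matrix of 0/1 item scores (rows are students, columns are items) and computes the classical p-value as the proportion of correct answers; the actual selection drew on the consortium's quality-control statistics, which are not reproduced here.

```python
def item_p_values(score_matrix):
    """Proportion of students answering each item correctly (CTT p-value)."""
    n_students = len(score_matrix)
    return [sum(column) / n_students for column in zip(*score_matrix)]


def select_items(p_values, threshold=0.25):
    """Indices of items whose p-value is strictly above the threshold."""
    return [index for index, p in enumerate(p_values) if p > threshold]


# Four students, three items: item 1 has p = 0.25 and is therefore dropped.
scores = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 1],
]
p_values = item_p_values(scores)
kept = select_items(p_values)
```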
Students were divided into two groups: group A took the first set of 100 items under formula-scoring conditions and group B the same items under number-right scoring conditions. For the second set of 100 items, it was the other way round: group A under number-right scoring conditions and group B under formula-scoring conditions. For the test using formula scoring, students could choose an “I don’t know” option. For the test using number-right scoring, the “I don’t know” option was not available, and students had to give an answer. An example of a question in the formula-scoring test is:
In patients with hydrocephalus, the cerebrospinal fluid is in most cases rerouted through a shunt system from the lateral ventricles
a) To the venous system
b) To the thoracic duct
c) To the peritoneal cavity
d) To the spinal cord
e) I don’t know
The same question was in the number-right test.
In patients with hydrocephalus, the cerebrospinal fluid is in most cases rerouted through a shunt system from the lateral ventricles
a) To the venous system
b) To the thoracic duct
c) To the peritoneal cavity
d) To the spinal cord
Data analysis based on item response theory (IRT)
There are several IRT models available, but the Rasch model was used for several reasons. First, it is a simpler and stricter model than the 2-parameter and the 3-parameter logistic models, which means that the Rasch model is more sensitive to data that violate its assumptions [26, 27], thus allowing dysfunctional items to be identified. Second, the Rasch model requires a smaller sample size: for a two-tailed 99% confidence interval, the minimum sample size is 108 subjects [28]. Third, it is widely used in medical education [29,30,31,32,33].
Preliminary analysis
Unidimensionality was tested using the Principal-Components Analysis of Residuals (PCAR) and a fit-only approach [34]. The latter uses two fit statistics, infit and outfit, for both persons and items. Infit is information-weighted and therefore less sensitive to outliers, whereas outfit is unweighted and more sensitive to them. Both infit and outfit were calculated using the mean square (MS). The optimal fit value is 1.00 [35], with an acceptable range from 0.50 to 1.50 [36], for both persons and items. However, violations of the fit range for persons are expected and better tolerated, whereas items with infit or outfit higher than 2.0 are a threat to the validity of the test [36] and are recommended for exclusion.
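The two mean-square statistics can be sketched for a single dichotomous item as follows. This is a minimal illustration using the standard textbook definitions (outfit as the mean of squared standardized residuals, infit as the sum of squared residuals divided by the sum of model variances); it assumes person abilities and the item difficulty are already estimated and is not Winsteps code. The function names are hypothetical.

```python
import math


def rasch_p(theta, b):
    """Rasch probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one item.

    responses: 0/1 score of each person on the item
    thetas:    ability estimate of each person (logits)
    b:         difficulty estimate of the item (logits)
    """
    squared_residuals = []
    variances = []
    squared_z = []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        variance = p * (1.0 - p)
        residual_sq = (x - p) ** 2
        squared_residuals.append(residual_sq)
        variances.append(variance)
        squared_z.append(residual_sq / variance)
    infit = sum(squared_residuals) / sum(variances)  # information-weighted
    outfit = sum(squared_z) / len(squared_z)         # unweighted
    return infit, outfit
```

A single unexpected wrong answer from a very able student inflates outfit much more than infit, which is why outfit is described as the outlier-sensitive statistic.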
For the Principal-Components Analysis of Residuals, we first considered whether another dimension would have more than two items. If so, we further investigated the amount of explained variance. Correlations of the standardized residuals were calculated to check local independence. If items present a correlation lower than 0.7, the local independence assumption holds.
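The local independence check can be sketched as a pairwise correlation of standardized residuals. The sketch below assumes dichotomous items with known ability and difficulty estimates and uses a plain Pearson correlation with the 0.7 cut-off named above; the helper names and the dictionary layout are hypothetical.

```python
import math


def standardized_residuals(responses, thetas, b):
    """Standardized residual (x - p) / sqrt(p * (1 - p)) for each person."""
    residuals = []
    for x, theta in zip(responses, thetas):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        residuals.append((x - p) / math.sqrt(p * (1.0 - p)))
    return residuals


def pearson(u, v):
    """Pearson correlation of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (c - mean_v) for a, c in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((c - mean_v) ** 2 for c in v))
    return cov / (sd_u * sd_v)


def locally_dependent_pairs(residuals_by_item, threshold=0.7):
    """Item pairs whose residual correlation reaches the threshold."""
    items = sorted(residuals_by_item)
    flagged = []
    for position, first in enumerate(items):
        for second in items[position + 1:]:
            r = pearson(residuals_by_item[first], residuals_by_item[second])
            if r >= threshold:
                flagged.append((first, second, r))
    return flagged
```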
Linking and equating
Linking and equating were not deemed necessary, because both groups answered the same multiple-choice questions. Our 2 × 2 crossover design (Table 1) ensured similar student knowledge levels in both scoring methods, which controlled for guessing and discrimination of the items throughout the groups. Furthermore, a post hoc analysis of the level of students’ ability revealed no significant difference between students in Tests 1 and 2 (t = 1.803, p = 0.07 and t = 1.771, p = 0.08, respectively). Since the data were analyzed using the Rasch model, which has the property of parameter invariance, all four groups were comparable.
Calibration of the Rasch models
The four tests were analyzed and calibrated separately, since we were interested in comparing the psychometric properties of both scoring methods. Because of that, the most appropriate Rasch model for each condition needed to be chosen. For formula scoring, we used the Rasch Partial Credit model for polytomous categories, since the categories follow an ordinal arrangement with the right answer having the highest value (5), the question mark having the second highest (4), and the penalties having the lowest values, representing the amount of penalty (3, 2, and 1). The penalty was recoded according to the number of answer options: wrong answers on two-option items were recoded as one, on three-option items as two, and on four-option items as three, since the penalty is higher in cases of fewer distractors. For the number-right scoring, we used the Rasch dichotomous model. All data were analyzed using Winsteps 3.70.1.1 (Winsteps Rasch Measurement 2009).
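The recoding of formula-scored responses into ordered categories can be sketched as below. The mapping covers only the cases stated above (two-, three-, and four-option items); how other item formats were handled is not described, so the sketch raises an error for them rather than guessing. The names are hypothetical, and `None` again stands for the “I don’t know” option.

```python
# Category assigned to a wrong answer, keyed by the number of answer options.
# Fewer options mean a larger penalty and therefore a lower category.
PENALTY_CATEGORY = {2: 1, 3: 2, 4: 3}

QUESTION_MARK = 4  # "I don't know"
RIGHT = 5          # correct answer


def recode_formula_response(given, correct, n_options):
    """Map one formula-scored response to an ordinal category from 1 to 5."""
    if given is None:
        return QUESTION_MARK
    if given == correct:
        return RIGHT
    if n_options not in PENALTY_CATEGORY:
        raise ValueError(f"no penalty category defined for {n_options} options")
    return PENALTY_CATEGORY[n_options]
```

Under number-right scoring the same response would simply be scored 1 if correct and 0 otherwise, which is why the dichotomous model suffices for that condition.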
To answer our first research question, a response-option analysis was conducted to evaluate the average ability for each response option. This analysis checks the category order (whether the categories of polytomous items are ordered as expected).
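A response-option analysis of this kind can be sketched as follows: compute the mean ability of the students who ended up in each category of an item and flag the item when a lower category does not have a lower mean than the next higher one. Treating ties as disordered in every comparison is an assumption of this sketch, and the function names are hypothetical.

```python
from collections import defaultdict


def mean_ability_by_category(categories, abilities):
    """Average ability (logits) of the students observed in each category."""
    grouped = defaultdict(list)
    for category, theta in zip(categories, abilities):
        grouped[category].append(theta)
    return {category: sum(values) / len(values) for category, values in grouped.items()}


def is_disordered(category_means):
    """True if mean ability fails to increase strictly with category value."""
    ordered = [category_means[category] for category in sorted(category_means)]
    return any(lower >= higher for lower, higher in zip(ordered, ordered[1:]))


# Item where students choosing the question mark (4) are, on average,
# more able than students answering correctly (5): flagged as disordered.
means = mean_ability_by_category([5, 4, 1], [0.0, 0.5, -1.0])
flag = is_disordered(means)
```

For a number-right item the same check reduces to comparing the mean ability of the wrong (0) and right (1) groups.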
To answer our second research question, we calculated two reliability coefficients based on the Rasch model, one for the person and another for the item. The latter is an indication of the adequacy of the sample size: low item reliability means that the sample size is not large enough to estimate the parameters. The person reliability is equivalent to the traditional test reliability (e.g., Kuder-Richardson 20, Cronbach’s alpha); low values can indicate a small number of items or a narrow range of person measures. The person reliability coefficient is calculated using measurement standard errors.
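One common way to express such a coefficient is as the share of observed variance in the measures that is not attributable to measurement error. The sketch below follows that definition, assuming a list of person (or item) measures in logits and their standard errors; it is an illustration of the idea, and the exact computation in Winsteps may differ in detail. The function name is hypothetical.

```python
from statistics import pvariance


def separation_reliability(measures, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_variance = pvariance(measures)
    error_variance = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    true_variance = max(observed_variance - error_variance, 0.0)
    return true_variance / observed_variance


# Two persons at -1 and +1 logits (observed variance 1.0), each measured
# with a standard error of 0.5 (error variance 0.25).
reliability = separation_reliability([-1.0, 1.0], [0.5, 0.5])
```

The formula makes the two causes of low reliability named above visible: fewer items raise the standard errors, and a narrow range of measures lowers the observed variance.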
Results
First, we will describe the analyses of dimensionality, fit parameter, and local independence. After that, we will present the Rasch reliability coefficients for person and item. Finally, we will describe the dysfunctional items.
Preliminary analysis
The four tests had three or four items in the first contrast, which could indicate a second dimension. The variance explained by the items in the number-right scoring condition was more than five times the variance explained by the first contrast: 17.9% vs. 3.3%. In addition, the explained variance in the first contrast was smaller than the variance explained by persons and items. Comparable values were found in the formula-scoring condition: the explained variances were 17.9% for the items and 3.7% for the first contrast.
Regarding the items, the fit parameters were in the optimal interval from 0.50 to 1.50 [36], and the means were near 1.00, which is the optimal value for the infit and outfit. Mean, standard deviation, minimum and maximum of measurement, infit, outfit, and error based on Rasch outcomes are shown in Table 2.
There was only one item in the formula-scoring condition of group B that had an outfit higher than 2.00. Regarding the person parameters, there were some violations of the maximum and minimum values of the recommended interval, especially in the formula-scoring condition.
Regarding local independence, the highest correlation of the standardized residuals was 0.35, well below the 0.7 threshold, so the local independence assumption holds. Locally dependent items are considered threats to unidimensionality [24, 25].
Which scoring method provides fewer dysfunctional items?
There was a clear difference in the number of dysfunctional items between the formula-scoring and number-right tests. In the formula-scoring condition, most dysfunctional items were found (1) when participants in the question-mark category had higher or equal ability versus those in the right-answer category (n = 7) and (2) when participants in the penalty category had higher ability versus those with a correct answer or a question mark (n = 25). For both groups in the number-right condition, (1) 5 items had the higher ability in the wrong category, and (2) one item had the same ability between the right and wrong categories. Table 3 summarizes the dysfunctional items in terms of the relationship between ability and category.
Based on these findings, all dysfunctional items were excluded from the model for further analysis. After the exclusion of items, the variance explained by the items increased, and the fit parameters were in the optimal interval. There was no item with an infit or outfit above 2.0.
Which scoring method provides the most reliable score?
Interestingly, the reliability coefficients for person were higher after the exclusion of the items, whereas the reliability coefficients for the items were similar for both scoring methods. Table 4 shows the Rasch reliability coefficients for person and item for each test after the exclusion. The reliability coefficients ranged from 0.73 to 0.82 for the persons and from 0.94 to 0.96 for the items. The item reliability coefficients were comparable in both conditions. However, the person reliability coefficients were higher in the number-right (0.80 and 0.82) than in the formula-scoring condition (0.73 and 0.77) on Tests 1 and 2, respectively.
In Figs. 1 and 2, the influence of both scoring methods on the same items is visualized for Tests 1 and 2. As shown on the left side, the items using the formula-scoring method ranged from −2 to 2 logits for both tests, while the items using the number-right scoring method ranged from −5 to 3 and −3 to 3 logits. The items using formula scoring varied less in terms of difficulty than the items using number-right scoring, resulting in lower discrimination regarding student ability. Because of that, the students subjected to number-right scoring could be better differentiated in both tests than those students subjected to formula scoring. The difference in variability also explains why the reliability for number-right scoring was higher than for formula scoring.
Discussion
In this study, the Rasch model methodology was used to investigate whether number-right or formula scoring should be preferred for progress testing. The outcomes of the Rasch model analysis showed that item reliability coefficients were comparable. Number-right scoring presented higher person reliability coefficients and fewer dysfunctional items than formula scoring.
Our methodology and findings differ from previous studies in several ways. The 2 × 2 crossover design is especially useful for avoiding cueing and priming effects during data collection. Moreover, we ensured that all students answered different tests in both conditions, which allowed us to assume similar knowledge levels in both conditions. Another methodological difference was the use of the Rasch model. To our knowledge, this has not been done in previous studies. Regarding our results, two main findings emerged. First, person reliability coefficients, which are similar to CTT reliability coefficients, were clearly higher for number-right scoring for both tests, which contradicts some previous studies [6, 8]. Higher person reliability indicates that the test can differentiate better between levels of student ability and that obtaining the same ordering of students using repeated measurements is more likely [35]. This study shows that it is possible to obtain higher reliability coefficients with fewer items when the Rasch model is used. Further studies are necessary to investigate whether our findings are transferable to other years in medical school.
Second, the response option analysis showed clear differences between number-right scoring and formula scoring. The formula-scoring tests produced around three times more dysfunctional items. In theory, the question-mark category could have higher ability averages, since students who know the content would also be aware of what they do not know. However, the highest number of dysfunctional items emerged when students in the penalty category had a higher average ability than students in the right or question-mark categories. At the same time, our results showed that there were only two items that were dysfunctional in both scoring conditions. Therefore, we believe that formula scoring could be a source of dysfunctionality. To our knowledge, this is the first study to indicate that formula scoring may be a contributing factor in this phenomenon. Further studies are necessary to investigate whether formula scoring contributes to item misfit.
Some limitations have to be considered. First, students’ test-taking strategies may change after a series of tests. In this particular study, however, students were already acquainted with both scoring methods. The second limitation may be that the experimental setting is somewhat artificial. In reality, the progress test is a mix of summative and formative formats, so the scores in our study may be biased by the students’ willingness to participate. The formative format allows students to receive feedback without the risk of being categorized. A summative decision is only made after a series of progress tests. Third, there may be small recognition effects due to our item sample. Some of the students may have answered some of the questions three or more years earlier. The final limitation may be that the reliability estimates could not be compared between years of medical school separately.
Despite the importance of the psychometric properties of a test, other aspects should be taken into consideration, especially because the progress test is just one of the many assessment tools that are used to evaluate student learning. Since we do not expect junior students to be able to answer all questions, the inclusion of an “I don’t know” option becomes a logical choice. However, a recent study has demonstrated that students in the later years are more likely to guess and actually answer a question incorrectly than first-year medical students [37], which raises the question of the educational purpose of the “I don’t know” option. At the same time, formula scoring may penalize students with more knowledge, since they are less likely to guess and so do not answer items that they only have partial knowledge about [11]. Additionally, the use of formula scoring causes bias due to both item-specific and systematic willingness to guess. Item-specific means that students weigh the penalty for an incorrect answer against the probability of a correct answer [38]. Systematic willingness to guess means that some students are more willing to guess than others; for example, male students appear to guess more often than female students [16]. Formula scoring may encourage students to use self-directed test-taking strategies. This may happen, for example, if an item has a higher penalty because it has fewer response options. Whether a student will answer an item will therefore not just depend on the student’s estimate of the probability of answering the item correctly but also on the risk-avoidance behavior of the student [14]. This may introduce noise into the test, since the score variance may also be influenced by self-efficacy beliefs and metacognitive skills instead of students’ medical knowledge alone [6, 20, 21]. Our finding that the person reliability coefficient is lower in the formula-scoring condition supports these considerations.
It is, however, encouraging that the item reliability coefficients of both conditions were similar in terms of the impact of formula scoring on students’ learning behavior. Future studies are necessary in order to investigate whether the use of the “I don’t know” option leads to increased self-efficacy beliefs. Further research on the use of Rasch analysis for progress testing is still necessary, especially taking into account the longitudinal character of the test.
Conclusions
Rasch model analyses showed that number-right tests have better psychometric properties than formula-scoring tests. Based on our psychometric analysis alone, the use of the number-right scoring method seems logical for multiple-choice question tests.
Abbreviations
?: Question mark
CTT: Classical test theory
FSA: Formula-scoring group A
FSB: Formula-scoring group B
IRT: Item response theory
NA: Not applicable
NRA: Number-right scoring group A
NRB: Number-right scoring group B
P: Penalty
R: Right
SD: Standard deviation
W: Wrong
References
1. Muijtjens AM, Schuwirth LT, Cohen-Schotanus J. Differences in knowledge development exposed by multi-curricular progress test data. Adv Health Sci Educ. 2008;13:593–605.
2. Wrigley W, Van der Vleuten CPM, Freeman A, Muijtjens A. A systemic framework for the progress test: strengths, constraints and issues: AMEE guide no. 71. Med Teach. 2012;31:683–97.
3. De Champlain AF, Cuddy MM, Scoles PV, Brown M, Swanson DB, Holtzman K, et al. Progress testing in clinical science education: results of a pilot project between the National Board of Medical Examiners and a US medical school. Med Teach. 2010;32:503–8.
4. Schuwirth LWT, Van der Vleuten CPM. The use of progress testing. Perspect Med Educ. 2012;1(1):24–30.
5. Lord FM. Formula scoring and number-right scoring. J Educ Meas. 1975;12(1):7–11.
6. Muijtjens AMM, Van Mameren H, Hoogenboom RJI, Evers JLH, Van der Vleuten CPM. The effect of a “don’t know” option on test scores: number-right and formula scoring compared. Med Educ. 1999;33:267–75.
7. Van Til CT. Voortgang in voortgangstoetsing: studies naar de aansluiting van de voortgangstoets op probleemgestuurd onderwijs [in Dutch]. [S.l.: s.n.] 1998.
8. Keislar ER. Test instructions and scoring method in true-false tests. J Exp Educ. 1953;21(3):243–9.
9. Traub RE, Hambleton RK, Singh B. Effects of promised reward and threatened penalty on performance of a multiple-choice vocabulary test. Educ Psychol Meas. 1969;29(4):847–61.
10. Diamond J, Evans W. The correction for guessing. Rev Educ Res. 1973;43:181–91.
11. Bliss LB. A test of Lord’s assumption regarding examinee guessing behavior on multiple-choice tests using elementary school students. J Educ Meas. 1980;17(2):147–52.
12. Albanese MA. The projected impact of the correction for guessing on individual scores. J Educ Meas. 1988;25:149–57.
13. Lord FM. Formula scoring and validity. Educ Psychol Meas. 1963;23:663–72.
14. Espinosa MP, Gardeazabal J. Optimal correction for guessing in multiple-choice tests. J Math Psychol. 2010;54(5):415–25.
15. Messick S. Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741–9.
16. Budescu D, Bar-Hillel M. To guess or not to guess: a decision-theoretic view of formula scoring. J Educ Meas. 1993;30(4):277–91.
17. Byrnes JP, Miller DC, Schafer WD. Gender differences in risk taking: a meta-analysis. Psychol Bull. 1999;125:367.
18. Kelly S, Dennick R. Evidence of gender bias in true-false-abstain medical examinations. BMC Med Educ. 2009;9:32.
19. Ravesloot CJ, Van der Schaaf MF, Muijtjens AMM, Haaring C, Kruitwagen CLJJ, Beek FJA, Bakker J, Van Schaik JPJ, Ten Cate TJ. The don’t know option in progress testing. Adv Health Sci Educ. 2015;20(5):1325–38.
20. Rowley GL, Traub RE. Formula scoring, number-right scoring, and test-taking strategy. J Educ Meas. 1977;14(1):15–22.
21. Kubinger KD, Wolfsbauer C. On the risk of certain psychotechnological response options in multiple-choice tests: does a particular personality handicap examinees? EJPA. 2010;26(4):302–8.
22. Kampmeyer D, Matthes J, Herzig S. Lucky guess or knowledge: a cross-sectional study using the Bland and Altman analysis to compare confidence-based testing of pharmacological knowledge in 3rd and 5th year medical students. Adv Health Sci Educ. 2014;20(2):431–40.
23. Magno C. Demonstrating the difference between classical test theory and item response theory using derived test data. TIJEPA. 2009;1(1):1–11.
24. Baghaei P. Local dependency and Rasch measures. Rasch Meas Trans. 2008;21(3):1105–6.
25. Downing SM. Item response theory: applications of modern test theory in medical education. Med Educ. 2003;37:739–45.
26. De Champlain AF. A primer on classical test theory and item response theory for assessments in medical education. Med Educ. 2010;44:109–17.
27. Masters GN. Item discrimination: when more is worse. J Educ Meas. 1988;25(1):15–29.
28. Linacre J. Sample size and item calibration stability. Rasch Meas Trans. 1994;7(4):328.
29. Schulman JA, Wolfe EW. Development of a nutrition self-efficacy scale for prospective physicians. J App Meas. 1999;1(2):107–30.
30. Bhakta B, Tennant A, Horton M, Lawton G, Andrich D. Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education. BMC Med Educ. 2005;5(1):9.
31. McManus IC, Thompson M, Mollon J. Assessment of examiner leniency and stringency (“hawk-dove effect”) in the MRCP (UK) clinical examination (PACES) using multi-facet Rasch modelling. BMC Med Educ. 2006;6(1):42.
32. Lange R, Verhulst SJ, Roberts NK, Dorsey JK. Rasch analysis of professional behavior in medical education. Adv Health Sci Educ. 2015;20(5):1–16.
33. Malau-Aduli BS, Teague PA, Turner R, Holman B, D'souza K, Garne D, Van Der Vleuten C. Improving assessment practice through cross-institutional collaboration: an exercise on the use of OSCEs. Med Teach. 2015;38(3):1–9.
34. Tennant A, Pallant JF. Unidimensionality matters! (a tale of two Smiths?). Rasch Meas Trans. 2006;20(1):1048–51.
35. Bond TG, Fox CM. Applying the Rasch model: fundamental measurement in the human sciences. Mahwah: Erlbaum; 2001.
36. Wright B, Linacre J. Reasonable mean-square fit values. Rasch Meas Trans. 1994;8(3):370.
37. Cecilio-Fernandes D, Kerdijk W, Jaarsma ADC, Tio RA. Development of cognitive processing and judgments of knowledge in medical students: analysis of progress test results. Med Teach. 2016;38(11):1125–9.
38. Maguire T, Skakun E, Harley C. Setting standards for multiple-choice items in clinical reasoning. Eval Health Prof. 1992;15(4):434–52.
Acknowledgements
The authors would like to thank Mrs. Tineke Bouwkamp-Timmer for her feedback on the final version of the article and her editorial help. The authors would also like to thank the Dutch Interuniversity Progress Test group for their support in organizing this study.
Funding
This research was partially funded by CAPES – Brazilian Federal Agency for Support and Evaluation of Graduate Education – grant 9568131, awarded to Dario Cecilio-Fernandes.
Availability of data and materials
All the supporting data is included as tables and figures.
Author information
Contributions
HM, LS, and JCS conceived the original idea of the experiment. All authors contributed substantially to the conception and design of the study. HM gathered the data and previously analyzed the data under the supervision of LS and JCS. DCF further analyzed the data and wrote the first draft of the manuscript under the supervision of JCS and RT. All authors contributed to the interpretation of the data and revised it critically in terms of major intellectual content. All authors approved the final manuscript for submission. DCF and HM contributed equally to this manuscript.
Ethics declarations
Ethics approval and consent to participate
The data were collected for another study at a time when there was no formal ethical approval process for such studies, and ethical approval was not sought. At the moment, there is an ethical approval committee, but a reanalysis of historical data is automatically ruled exempt. Our work was carried out in accordance with the Declaration of Helsinki and the privacy policy of the University of Groningen. Before the analysis, all data were anonymized and handled with confidentiality.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Cecilio-Fernandes, D., Medema, H., Collares, C.F. et al. Comparison of formula and number-right scoring in undergraduate medical training: a Rasch model analysis. BMC Med Educ 17, 192 (2017). https://doi.org/10.1186/s12909-017-1051-8
Keywords
 Assessment
 Multiple choice questions
 Formula scoring
 Number-right scoring
 Rasch model
 Reliability
 Validity
 Construct-irrelevant variance