Practice effects in medical school entrance testing with the undergraduate medicine and health sciences admission test (UMAT)

Background The UMAT is widely used for selection into undergraduate medical and dental courses in Australia and New Zealand (NZ). It tests aptitudes thought to be especially relevant to medical studies and consists of 3 sections – logical reasoning and problem solving (UMAT-1), understanding people (UMAT-2) and non-verbal reasoning (UMAT-3). A substantial proportion of all candidates re-sit the UMAT. Re-sitting raises the issue as to what might be the precise magnitude and determinants of any practice effects on the UMAT and their implications for equity in subsequent selection processes. Methods Between 2000 and 2012, 158,909 UMAT assessments were completed. From these, 135,833 cases were identified where a candidate had sat once or more during that period with 117,505 cases (86.5%) having sat once, 14,739 having sat twice (10.9%), 2,752 thrice (2%) and 837, 4 or more times (0.6%). Subsequent analyses determined predictors of multiple re-sits as well as the magnitude and socio-demographic determinants of any practice effects. Results Increased likelihood of re-sitting the UMAT twice or more was predicted by being male, of younger age, being from a non-English language speaking background and being from NZ and for Australian candidates, being urban rather than rurally based. For those who sat at least twice, the total UMAT score between a first and second attempt improved by 10.7 ± 0.2 percentiles, UMAT-1 by 8.3 ± 0.2 percentiles, UMAT-2 by 8.3 ± 0.2 percentiles and UMAT-3 by 7.7 ± 0.2 percentiles. An increase in total UMAT percentile score on re-testing was predicted by a lower initial score and being a candidate from NZ rather than from Australia while a decrease was related to increased length of time since initially sitting the test, older age and non-English language background. Conclusions Re-sitting the UMAT augments performance in each of its components together with the total UMAT percentile score. Whether this increase represents just an improvement in performance or an improvement in understanding of the variables and therefore competence needs to be further defined. If only the former, then practice effects may be introducing inequity in student selection for medical or dental schools in Australia or NZ.

Results: Increased likelihood of re-sitting the UMAT twice or more was predicted by being male, of younger age, being from a non-English language speaking background and being from NZ and for Australian candidates, being urban rather than rurally based. For those who sat at least twice, the total UMAT score between a first and second attempt improved by 10.7 ± 0.2 percentiles, UMAT-1 by 8.3 ± 0.2 percentiles, UMAT-2 by 8.3 ± 0.2 percentiles and UMAT-3 by 7.7 ± 0.2 percentiles. An increase in total UMAT percentile score on re-testing was predicted by a lower initial score and being a candidate from NZ rather than from Australia while a decrease was related to increased length of time since initially sitting the test, older age and non-English language background. Conclusions: Re-sitting the UMAT augments performance in each of its components together with the total UMAT percentile score. Whether this increase represents just an improvement in performance or an improvement in understanding of the variables and therefore competence needs to be further defined. If only the former, then practice effects may be introducing inequity in student selection for medical or dental schools in Australia or NZ.

Background
Practice effects to improve performance in tests of cognitive ability are well established [1] but the magnitude and determinants of such effects may well vary according to the test. The Undergraduate Medicine and Health Sciences Admission Test (UMAT) is a cognitive ability test designed and delivered by the Australian Council for Educational Research (ACER) on behalf of the UMAT Consortium of Australian and New Zealand universities [2]. It has been devised to determine aptitude for study in medicine, dentistry and other courses for health professionals. The rapid expansion and increasing popularity of commercially available preparation courses for the UMAT [3] has resulted in a reciprocal increase in interest in the potential magnitude and determinants of any practice effects that may be improving test performance.
The UMAT is not a test of personality, values or attitudes, nor is it a measure of previous learning in academic domains. It comprises three sections: UMAT-1 -Logical Reasoning and Problem Solving, UMAT-2 -Understanding People and UMAT-3 -Non-verbal Reasoning and is a 3-hour paper-and-pencil test. Each section is assessed using multiple choice questions with 4 or 5 optional responses, of which only one is correct. The three sections in 2012 comprised 48 items over 70 minutes in UMAT-1, 44 items over 55 minutes in UMAT-2 and 44 items over 55 minutes in UMAT-3. In Section 1 (UMAT-1) candidates are required to exercise logical reasoning and problem solving skills using both inductive and deductive reasoning, with an emphasis on logical argument in working to a solution. In Section 2 (UMAT-2) the emphasis is on assessing empathy and emotional intelligence with candidates required to show an understanding of the thoughts, feelings, behaviour and intentions portrayed within each question. Section 3 (UMAT-3) evaluates a candidate's non-verbal reasoning skills and aims to measure cognitive ability independent of language ability and specific cultural knowledge. ACER provides example questions in an information booklet and informs potential candidates that some practice in answering questions of a similar type, and under similar test conditions to those in the actual test, may be a useful preparation. They also provide two practice tests online for candidates who are registered to sit the UMAT. Whether preparation or actual coaching for the test makes a difference to test scores, however, remains controversial [3,4].
Practice effects for the UMAT may be of substantial magnitude as recently reported in a relatively small study from New Zealand [4]. In that study results from 263 students who sat the UMAT both in 2010 and 2011 were analysed. Approximately 85% of the cohort improved their overall score, 11% decreased and 4% obtained exactly the same score. The mean increase averaged approximately 9.3% for the total score, 9.4% for UMAT-1, 7% for UMAT-2 and 10.6% for UMAT-3. When considered in terms of relative percentile score the study cohort improved from a median total UMAT score percentile of 52% on the first attempt to 73% on the second. ACER in their annual reports have also consistently reported an increase in standardised scores in each section of the UMAT in those who sit in consecutive years [2]. In 2012 they reported that the majority of students who were re-sitting the test improved their scores but with estimated mean increases that were lower than those seen in the NZ study (UMAT-1, 6.4%, UMAT-2, 5.5% and UMAT-3, 3.1%).
As a cognitive ability test, the construct validity and predictive validity of the UMAT have been subjected to recurrent careful analysis [5,6]. If practice effects are indeed substantive they could potentially influence either the construct validity or predictive validity of the test or both. We have therefore ascertained all cases between 2000 and 2012 where candidates re-sat the UMAT and now report the magnitude of the practice effects observed on re-testing and discuss some of their possible determinants.

Methods
Between 2000 and 2012, 158,909 UMAT assessments were completed. After generating a unique ID for each candidate based on name, date of birth and gender 135,833 cases were identified where a candidate had sat once or more during that period with 117,505 cases (86.5%) having sat once, 14,739 having sat twice (10.9%), 2,752 thrice (2%) and 837, 4 or more times (0.6%). Subsequent analyses determined predictors of multiple re-sits as well as the magnitude and socio-demographic determinants of any practice effects.
Practice effects were examined in relation to total UMAT percentile scores and the percentile scores in each of its 3 subsections -UMAT-1 -Logical Reasoning and Problem Solving, UMAT-2 -Understanding People and UMAT-3 -Non-verbal Reasoning. Each year ACER reports candidate results as scores on a separate scale for each of UMAT-1, UMAT-2 and UMAT-3 with an overall UMAT score calculated as the average of these three scores. Over the years 2000 to 2012, the scale has changed several times as the test has developed. As a result scores are not necessarily comparable between years. However, percentile ranks enable a measure of the relative standing of a candidate within each cohort. Given the large numbers in each cohort it is likely that there is reasonable comparability in the competence of candidates at specified percentile ranks across cohorts. Hence the use of percentile ranks as the measure in this study.
The practice effects were further evaluated by dividing the cohort into quartiles of initial test performance and analysing the upper and lower quartiles in relation to number of times the UMAT was sat. Predictors of the magnitude of any practice effects were evaluated from a number of socio-demographic indices collected on enrolment for the UMAT, including age, gender, postal address, language spoken at home, type of secondary school, country of origin and self-identification as being of Aboriginal or Torres Strait Islander origin (ATSI). Language spoken at home was classified according to the Australian Standard Classification of Languages (ASCL), 2011 [7]. For multivariate analysis this was collapsed into 4 groups -English, European languages, Asian languages and all other languages. Type of secondary school was divided into one of 5 groupsgovernment (publicly funded), independent (fee paying), Catholic, Technical and Further Education institutions (TAFEpublic provider of predominantly vocational tertiary education courses) and Other. For those with an Australian address, an index of rurality was generated by linking each candidate's postcode to the Accessibility/Remoteness Index of Australia (ARIA) [8]. ARIA values are grouped into one of five categories: Highly Accessible (ARIA score 0-1.84), Accessible (ARIA score >1.84 -3.51), Moderately Accessible (ARIA score >3.51 -5.80), Remote (ARIA score >5.80 -9.08) and Very Remote (ARIA score >9.08 -12).
The project has been approved by the Human Research Ethics Committee at the University of Western Australia (file reference RA/4/1/2178).

Statistics
The magnitude of change between the first test and first re-sit were compared by paired T-tests. Comparisons of scores between those who sat once versus those who resat the test were made by unpaired T-tests. Univariate comparisons of performance during multiple re-sits of the UMAT were made by repeated measures analysis of variance (with post-hoc comparisons by Bonferroni correction). Multivariate analyses utilised linear regression to assess the independent relationships of change in total UMAT, UMAT-1, UMAT-2 and UMAT-3 score between the first and second tests with age, gender, type of secondary school, language spoken at home, country of origin, ARIA score and self-identification as ATSI as independent predictors. Logistic regression analysis was utilised to ascertain potential independent predictors of the likelihood of sitting the UMAT more than once. All analyses were carried out utilising IBM SPSS Statistics Version 20.0.

Socio-demographic data
Of the 135,833 cases where a candidate had sat once or more during the period 2000 to 2012, 1.1% were aged less than or equal to 16 yr, 35.6% were 17 yr, 43.3% were 18 yr, 8.4% were 19 yr, 9.9% were aged 20 to 30 yr and 1.7% were greater than 30 yr. Females comprised 57.8% of all cases and males 42.2%. English was spoken at home by 69.2%, Asian languages by 27.1%, European languages by 2% and other languages by 1.7%. School of origin was a government school for 47%, independent school for 32.8%, Catholic school for 17.6%, TAFE college for 0.4% and Other school for 2.2%. The majority of the population were Australian (86.9%) with 12.1% from New Zealand and 0.9% from other countries. Only 0.4% self-declared as ATSI in origin. For Australian residents (N = 118,086), 93.4% were living in Highly Accessible areas, 4.9% in Accessible areas, 1.2% in Moderately Accessible areas, 0.4% in Remote areas and 0.1% in Very Remote areas. The socio-demographic profile by year in which the UMAT was sat is outlined in Additional file 1. It demonstrates an increasing number of 18 and 19 yo sitting the test over time with a decreasing number of those either 17 yo or less or those greater than 20 yo sitting the test. The proportion of females has decreased slightly while that of males has increased. The proportion of those from an Asian language background has almost doubled from 18% in 2001 to 32% in 2012. The proportion from a government school background has slightly increased while those from an independent or Catholic school background has commensurately decreased. The numbers of New Zealand cases increased substantially from 2005 onwards with introduction of the UMAT as a selection tool for the 2 NZ medical schools, but have remained relatively steady since at approximately 13-15% of the cohort.

Magnitude of practice effects
The changes in total UMAT percentile score and each of its subsections are depicted in Figure 1. The predominant increment for total UMAT occurred between the first attempt and the first re-sit. RANOVA indicated a further significant increase in percentile score at the second re-sit (P < 0.001, Bonferroni correction) but no further increment with subsequent re-sits. For UMAT-1 further increases in relative performance extended out to the 4 th re-sit (P = 0. 024 compared to the 4 th re-sit, Bonferroni correction). For UMAT-2 increases in relative percentile extended to only the 3 rd re-sit (P = 0. 001 compared to the 2 nd re-sit, Bonferroni correction) while for UMAT-3 there was only an increase in relative percentile between the first attempt and the first re-sit with no further increment at any subsequent re-sit.
For those who sat at least twice (N = 18,328), the total UMAT percentile score between a first and second attempt improved in 71.1%, decreased in 28.7% and was identical in 0.2% of candidates. Total UMAT score improved by a mean of 10.7 ± 0.2 percentiles (Paired T-test P < 0.001). The UMAT-1 percentile score improved in 66.6%, decreased in 33.2% and was identical in 0.2% of candidates with a mean increase of 8.3 ± 0.2 percentiles (Paired T-test P < 0.001). The UMAT-2 percentile score improved in 62.6%, decreased in 37.3% and was identical in 0.1% of candidates and increased overall by 8.3 ± 0.2 percentiles (Paired T-test P < 0.001). The UMAT-3 percentile score improved in 61.1%, decreased in 38.7% and was identical in 0.2% of candidates with a mean overall increase of 7.7 ± 0.2 percentiles (Paired T-test P < 0.001).
Total UMAT percentile score and percentile score in each of its 3 subsections by number of times candidates re-sat the test between 2000 and 2012 are depicted for those in the lowest performance quartile in their first test in Figure 2 and for those in the highest performance quartile in Figure 3. In those in the lowest quartile of initial performance the increases were by 16.2 ± 0.3 percentiles for total UMAT score (Paired T-test P < 0.001), 16.8 ± 0.3 percentiles for UMAT-1 (Paired T-test P < 0.001), 21.1 ± 0.3 percentiles for UMAT-2 (Paired T-test P < 0.001) and 24.8 ± 0.4 percentiles for UMAT-3 (Paired T-test P < 0.001). For those in the highest quartile of initial performance there were decreases by 1.4 ± 0.3 percentiles for total UMAT score (Paired T-test P < 0.001), 3.6 ± 0.3 percentiles for UMAT-1 (Paired T-test P < 0.001), 9.3 ± 0.3 percentiles for UMAT-2 (Paired T-test P < 0.001) and 8.5 ± 0.3 percentiles for UMAT-3 (Paired T-test P < 0.001).

Predictors of magnitude of practice effects
Linear regression analyses of potential predictors of the change in overall performance in total UMAT percentile score and each of its sub-sections on re-testing are outlined in Tables 1, 2, 3 and 4. Performance in UMAT-1, UMAT-2 and UMAT-3 were first forced into the model to correct for regression to the mean. Lower percentile scores in UMAT-1, UMAT-2 or UMAT-3 predicted higher increments in each of these scores on re-testing as anticipated. Separate models (data not shown) indicated that for those in the lowest quartile of total UMAT percentile score at initial testing, the predicted increment on re-testing was 18.8 percentiles (CI 18.0, 19.7). For those in the lowest quartile of UMAT-1 it was 22.5 percentiles (CI 21.6, 23.4), for those in the lowest quartile of UMAT-2 it was 33.5 percentiles (CI 32.5, 34.6) and for those in the lowest quartile of UMAT-3 it was 35.7 percentiles (CI 34.7, 36.7). The number of years between the first test and the first re-sit was then forced into the model, adjusting for a predicted negative association between time interval and any measured practice effects. The predicted negative association was highly significant for UMAT-2 and UMAT-3 but was not seen for UMAT-1 performance. Together these predictors accounted for 9.7% of the variance in performance between the first and second tests in the total UMAT, 12.7% in UMAT-1, 18.5% in UMAT-2 and 24.5% in UMAT-3. Finally, the socio-demographic predictors were entered into the model. The significant predictors of an increase in overall performance were being male or being a candidate from New Zealand rather than from Australia, while performance on re-sit was lower for older candidates and for those from Asian language, European language or Other language backgrounds compared to an English language background. Together these predictors only Sat UMAT once Sat UMAT twice Sat UMAT three times Sat UMAT four times Sat UMAT five times Sat UMAT six times increased the variance for the full model for total UMAT to 12.2%, UMAT-1 to 15.3%, UMAT-2 to 22.7% and UMAT-3 to 27% (Tables 1, 2, 3 and 4). The increase in performance for those from NZ was primarily driven by an increase in the UMAT-3 percentile score while the decrease in percentile scores in those with a non-English language background was not evident for UMAT-3 in those from Asian languages background (Table 4).
Predictors of re-sitting the UMAT When comparing those who sat once versus those who re-sat, the percentile scores in UMAT-1 (52.3 ± 0.09 vs 47.4 ± 0.20, Unpaired T-test, P < 0.001) and UMAT-2 (52.7 ± 0.09 vs 48.6 ± 0.21, Unpaired T-test, P < 0.001) were initially lower in those who re-sat while those in UMAT-3 were initially higher (51.2 ± 0.8 vs 53.7 ± 0.20, Unpaired T-test, P < 0.001). The results from binomial logistic regression of predictors of the likelihood of sitting the UMAT twice or more are outlined in Table 5.
For those in the lowest quartile for UMAT-1 or UMAT-2 percentile score at the first test, the likelihood of re-sitting was increased (OR 2.  The relative amount of variance explained by the full model was small (6.6%). In a separate model for Australian students only, being rurally based rather than urban was also a significant predictor with rurally based candidates less likely to re-sit (OR, 0.78, 95% CI, 0.71, 0.85, P < 0.001). Finally, logistic regression models were run in those who sat 3 or more times (Table 6) or 4 or more times ( Table 7). The pattern of predictor variables for multiple re-sits were largely unchanged in these models, while the relative strength of the odds ratios was increased for each predictor variable together with the relative amount of variance explained by each model.

Discussion
This study has identified substantial practice effects for the UMAT which serve to enhance performance, especially between the first and second occasion of testing. The UMAT is now delivered to approximately 15,000 students each year in Australia and New Zealand [2] and hence a more comprehensive appreciation of practice effects will better inform its application in the selection of students for medical and dental schools. Practice effects in cognitive ability tests are now well described, with a meta-analysis [1] estimating that the magnitude of the improvement in test scores from the first test to the first re-sit would be approximately a quarter of the standard deviation from the first test, and a further fifth of a standard deviation to the second re-sit. With respect to high stakes selection such as for medical or dental school selection, re-sitting of admission tests is common. Approximately 40% of students who sat a combination cognitive ability test and science knowledge test for admission to medical schools in Belgium re-sat the test in a 4-year period [9]. Analysis of re-test effects in that study showed a one third of a standard deviation increment in score on the knowledge test and a one half of  standard deviation increase for the cognitive ability test. For the UMAT we now report quantitatively comparable results, with an estimated increment at the first re-sit of approximately two fifths of a standard deviation from the first test and a further one fifth of a standard deviation at the second re-sit. For a candidate at the 50 th percentile in the first test this would have translated to an increase to the 60 th percentile at the second test and to the 63 rd percentile at the third test. Our results complement and extend previous reports which have been conducted on a substantially smaller scale [2,4]. Given that 13.5% of students over the 13 years of observation were re-sitting the test these practice effects may be of sufficient magnitude and prevalence to warrant reconsideration in current approaches to medical student selection in Australia and New Zealand. Practice effects that improve performance in cognitive tests may arise from a number of different potential sources. The authors of the 2012 ACER report on the UMAT [2] favoured issues associated with repetition, such as less confusion because of test preparation and increased familiarity with the test, which may reduce anxiety and improve performance (so called construct irrelevant factors). This understanding of practice effects would lead to the conclusion that the first test score was not a true reflection of the candidate's ability because of confounding by these factors, and that the construct validity of the test was therefore unchanged on re-testing. However, this cannot be assumed as the entire explanation without further scrutiny. Indeed when item level data were carefully analysed in the cognitive test used for selection of medical students in Belgium [10] it was concluded that re-testing actually led to a change in the measurement properties of the test, such that the predictive validity of the re-test score in relation to academic performance in the first 3 years at medical school was compromised. Evidence for the hypothesis that retesting and practice effects can alter the psychometric properties and subsequent predictive validity in cognitive tests has been reported by others [11,12] but not by all observers [13]. If practice effects influence the psychometric properties of a test of cognitive ability then the underlying principle of stability of construct relevant variance of the test is violated and calls into question its subsequent predictive validity and utility in selection [13]. Item level analysis for questions delivered in consecutive tests is therefore now required to better delineate the extent to which the practice effects in the UMAT described here reflect construct irrelevant factors versus construct relevant factors. A separate analysis of the high achieving candidates in the 2012 ACER report [2] identified an actual decrease in UMAT performance rather than an increase, raising the possibility that a substantial proportion of any practice effect may simply represent regression to the mean. This result is similar to the outcome we have seen when comparing candidates in the highest quartile of performance in the UMAT and each of its subsections to those in the lowest quartile. However, when 2 previous studies were re-analysed to better understand the potential influence of regression to the mean on practice effects in cognitive ability tests [1], it was estimated that less than 10% of the practice effect size could be attributed to regression to the mean.
The practice effects identified on re-sitting the UMAT may at least in part represent improvements secondary to either coaching and/or practice between attempts. They may also have been confounded by coaching and/ or practice before the first attempt itself which would serve to mitigate any improvement seen at a subsequent attempt. Such coaching in Australia is reported to be as high as 56% of all candidates [3] while for the Medical College Admission Test (MCAT), widely used for medical student selection in North America, a review identified 4 studies in the area where the prevalence of coaching was 22%, 25%, 38% and 72% of candidates respectively [14]. We have not been able to measure such potential influences in this study. However, Griffin et al. [15], in a study of 287 applicants to the School of Medicine at the University of Western Sydney, reported that in the 51.4% of the cohort who had prior UMAT coaching, there was no coaching effect on UMAT-1 and UMAT-2. However, there was evidence of a small coaching effect on UMAT-3 (approximately 3.8% increase in score). This was not significant after adjustment for prior academic performance, age and gender. In a subsequent study [3] they assessed UMAT performance in 402 students from 31 Australian secondary schools who completed the UMAT in their final secondary school year. Again only the UMAT-3 score exhibited an improvement in the 56.2% of the cohort who had been coached, with an increment in mean score of approximately 7.7%. After controlling for prior academic performance, gender and school type, the effect of coaching was again no longer significant. Of interest, however, there was an interaction between prior academic performance and coaching, in that those with higher academic performance performed better with coaching on UMAT-3, while those with weaker academic performance performed worse. Similar results have been reported from New Zealand [4] where a commercial preparation course was shown to have no effect on UMAT-1, UMAT-2 or total UMAT score but UMAT-3 scores increased by 7.1 % in those with no preparation course compared to 16.5% in those who participated in a preparation course. The overall conclusion from these studies is that coaching and practice between sitting and re-sitting the UMAT have had little if no effect on UMAT-1 and UMAT-2 while the effect on UMAT-3 has been relatively small and insufficient to influence the overall UMAT score utilised for medical student selection.
While not always explicitly acknowledged, and not a part of classical test theory, tests of aptitude or proficiency are composed of items differing in difficulty, with the easier items generally at the beginning of a test. The expectation is that the more able students will answer correctly, not only the easier items which the less able students will be able to answer correctly, but in addition, more difficult items which the less able will not be able to answer correctly. Answering the more difficult items, in addition to the easier items, implies a greater understanding on the variable of assessment. In the terminology of Chomsky [16], who distinguished between competence and performance where the latter is an imperfect indicator or manifestation of the former, not only do the students with a higher score have a better performance, but the higher performance implies greater competence. The practice effects in test administration, especially in longitudinal research design, and perhaps in high stakes testing, have been viewed in the literature as detrimental to the validity of the testing [17]. In particular, using Chomsky's terminology again, it is implied that the student's score on a second occasion is inflated in performance relative to the student's competence. In other more colloquial terms, there has been learning/teaching to the test, rather than improved understanding of the variable to be assessed by the test. Taking the perspective that greater competence implies answering correctly, not only the easy items, but also the more difficult items, this detrimental perspective implies that the effect of practice is to be able to answer the more difficult items, as well as the easier items, without having improved in competence. For the UMAT we identified that the major practice effect was between the first and second test, with a smaller but still significant improvement by the third test. No further improvement was seen with further retesting. Others who have investigated the nature of practice effects of re-testing in other settings have also observed the largest increment with successive testing is between the first and second occasion [18]. The key observation in that study, however, was that on the second occasion of testing, the increased total score was a result of students answering correctly relatively easy items and not the difficult items. Thus they increased their performance, as evidence by their total score, but not their competence. An increase in performance, rather than competence, has implications for understanding the effects of practice on repeated testing, and further, may have implications for the kind of practice that is provided to ensure that students who sit only once are not disadvantaged relative to those who sit a second time. On the other hand, if not only performance, but apparent competence is improved, then it would indicate that not only has practice eliminated errors in easy items, but that it has permitted answering correctly the more difficult items. This concept needs evaluation through item level analysis of the UMAT, appreciating that the average difficulty of test items has not always been linked to the magnitude of practice effects in other settings [19]. The other confounding factor in understanding practice effects in tests like the UMAT is the background influence of guessing. Waller [20] made the case that guessing on a multiple choice item results from the item being too difficult for the student, and not from some property of the item. Taking this perspective, Andrich et al. [21] studied the effects of guessing by considering that guessing makes difficult items appear relatively easier than if there is no guessing. To study this effect, they were able to eliminate the effect of guessing by a scaling of the items which estimated their relative difficulties. The most surprising result from that study was that it was not so much that the less proficient students, who might have guessed some items correctly, who were advantaged in the original analysis, but that the more proficient students were disadvantaged. This is understood by recognising that the more proficient students answer difficult items correctly at a greater rate than the less proficient students, even when allowing that the less proficient might have guessed. As a result the more proficient students are not rewarded as much as they should be for answering the difficult items correctly. Adjusting the scale for guessing, and then estimating person proficiencies based on all responses of each student, provides a stronger linearisation of the total scores than an original analysis. This more robust approach should also be incorporated in individual item level analysis in the UMAT if further studies of observed practice effects are to more fully elucidate their nature and implications. A final confounding factor in relation to practice effects may be that applicants who actually choose to re-sit the UMAT are different from those who give up after a single test, or from those who are successfully selected into medical or dental school at their first attempt, and that these differences may have an impact on the magnitude of subsequent practice effects. There have been very few attempts to look at correlates of withdrawal or persistence in relation to multiple re-sits in aptitude tests and none in relation to the UMAT. The single strongest predictor in this study was UMAT percentile score on initial testing. Being male, being younger, being from NZ and being from non-English language backgrounds (although only of borderline significance for those of European language backgrounds) all predicted a greater likelihood of a re-sit even after adjustment for the influence of initial UMAT percentile score. Of interest, it was having a low score in UMAT-1 -Logical reasoning and Problem Solvingthat was the single strongest predictor of a re-sit while in contrast, those with a weaker performance in UMAT-3 -Non-verbal Reasoningwere less likely to re-sit. The profile of initial under-performance in UMAT-1 and over-performance in UMAT-3 in those who choose to re-sit recalls the profile we have previously reported in UMAT performance for those from Asian language backgrounds [22]. Given that those from an Asian language background were also nearly 20% more likely to re-sit the test this may have, at least in part, contributed to this finding in a cohort where approximately 27% of the sample reported speaking an Asian language at home. The second strongest predictor was being a student from New Zealand, with a 3.6 fold increase in the likelihood of a re-sit. In New Zealand a larger number of students sit the UMAT twice because entrance into medicine occurs after a one year open entry biomedical course and so they sit both in the last year of secondary school and again in the first year at university [4]. However, our analyses of those who sat four or more times still showed that being from New Zealand continued to be an even stronger predictor, with a 6.4 fold increase in the odds for New Zealanders of 4 or more re-sits. Moreover, the correlates of a re-sit (being male, being from a non-English background, being younger, and being from NZ) were also correlates of a weaker practice effect at the first re-sit, with the exception of being from NZ, which predicted a relatively greater practice effect than being from Australia. Whether this persistence by those from NZ is linked to higher motivation or un-measured socio-cultural differences remains speculative. A contributing factor, however, may be that the 2 medical schools in NZ are both undergraduate entry alone whereas in Australia there are now 12 graduate entry medical programs and so failure to be selected into an undergraduate course may preferentially lead to subsequent sitting of the Graduate Australian Medical Schools Admissions Test (GAMSAT) for graduate school entry rather than repeated re-sits of the UMAT as seen in NZ.

Study limitations
There are potential limitations to fully understanding the effect of practice from our data. These limitations are associated with having only total percentile scores in the analyses, rather than item level data. First, it is not possible to tell whether the improvement was uniform across the items or whether it was in the easy or difficult items. Second, with multiple choice items, there is a potential for students to guess the difficult items correctly with an impact on the scale of assessment. Guessing also interacts with possible practice effects in the responses that are analysed. Third, the analyses were based on average percentile locations and therefore changes in the percentile locations between years. Because these are therefore relative improvements, it is difficult to assess finer details of the basis for improvement from repeated sitting of the examination. The use of a unique identifier based on name, date of birth and gender to identify subjects who had sat once or more from 2000 to 2012 may have failed to identify some subjects who changed their name over the period. However, these numbers are likely to have been small in this overwhelmingly young cohort of predominantly high school students. Finally, the UMAT has been delivered since the early 1990's and so given that this data-set was confined to all who sat from 2000 onwards it is possible that for some subjects data analysed as results from the first test during this period may have in fact come from second or subsequent re-sits of the test. This would have if anything weakened the magnitude of the practice effects seen.