Table 3 Summary of Reviewed Articles

From: Systematic review of specialist selection methods with implications for diversity in the medical workforce

Article (bolded authors claimed evidence of bias)

Description

Main findings

Diversity conclusions

Strengths/limitations

MERSQI Score (a)

(11.3/18 over all articles)

Canada (1 article)

9

MacLellan et al. (2010) [25]

Compared IMG and DMG performance on in- and end-training exams

End-training exam pass rate: IMG 56% versus DMG 93.5% (p < .0001)

IMG: Low pre-selection scores of IMGs are consistent with low pass rates on certification exams

Strengths: Multiple year, large sample

Limitations: Exploratory, single program, single specialty

9

UK (7 articles)

15.2

Esmail et al. (2013) [26]

Compared IMG with DMG performance on end-training exams (GP/Family medicine)

URM failed first attempt more than white DMG (OR 3.5, p < .001)

IMG failed first attempt more than white DMG (OR 14.7, p < .001)

URM/IMG: Higher failure rates in domestic and foreign URM/IMG are partly explained by lower pre-selection academic achievement, and may also reflect bias during clinical OSCE-based exams

Strengths: Complete cohort, large sample, multiple years, end-training outcome

Limitations: Exploratory, single specialty

15.8

McManus et al. (2014) [27]

Compared IMG with DMG performance on end-training exams (GP/Family medicine & Internal medicine)

IMG performed worse than DMG on end-training exams (~ 1.25 SD)

IMG: Lower pre-selection scores are an accurate measure of suitability for training

Raising cutoffs is needed for equivalence with DMG but would affect workforce

Strengths: Follow-up study, multiple programs, large sample, multiple years

Limitations: Two specialties

15.2

Patterson et al. (2018) [28]

Measured factors associated with differences in performance of IMG and DMG on end-training exams (GP/Family medicine)

Clinical skill performance better predicted by SJT than CPST (beta 0.26 v 0.17)

SJT mediated relationship between English fluency and clinical skills performance

IMG: IMG performance on end-of-training exams is predicted by socio-linguistic factors not clinical knowledge and skills

Strengths: National cohort, large sample, multiple years, end-of-training follow-up

Limitations: Exploratory study, single specialty

14.6

Tiffin et al. (2014) [29]

Measure IMG performance during residency

IMG more likely to receive unsatisfactory ARCP than DMG (OR 1.63, p < .05)

IMG: PLAB language exam does not establish linguistic equivalence of IMG and DMG

Thresholds would need to be increased to achieve equivalence, but would affect workforce and decrease diversity

Strengths: National cohort, large sample

Limitations:

14.6

Tiffin et al. (2018) [30]

Measure bias against IMG in resident selection comparing pre-training academic attainment with in-training assessment

UK overseas graduates more likely deemed appointable than IMG (OR 1.29, p < .05) but more likely to later receive less satisfactory ARCP (OR 1.20, p < .05)

IMG: Bias favouring UK born graduates trained overseas versus IMGs may be due to excessive weight given to interview

Strengths: National cohort, large sample, all specialties

Limitations: Incomplete data set

15.8

Wakeford et al. (2015) [31]

Measure correlation between GP/Family medicine and Internal medicine exam performance by ethnicity

High correlation between GP/IM exam performance, suggesting validity of each assessment (and does not suggest bias against URM)

URM performed less well

URM: No evidence of bias against URM; differences in assessment likely to reflect true differences in ability

Strengths: National cohort, multiple years, large sample

Limitations: Exploratory, two specialties

15.8

Woolf et al. (2019) [32]

Identified by specific search terms

Measure effect of gender on specialty training selection

Across all specialties female applicants had:

• No difference in applications

• Increased offers (OR 1.4, p < .001)

• Increased acceptance (OR 1.43, p < .001)

2 specialties had significant gender differences in applications (both favouring women):

• Paediatrics (OR 1.57, p < .05)

• GP (OR 1.23, p < .05)

Gender: Gender segregation in specialties is due to differential application rates, not instrument bias; research is needed on why men are less likely to apply for GP/Paediatric training, and less likely to accept GP training if offered

Strengths: Follow-up study, national cohort, large sample, multiple specialties

Limitations: 1–2 years intake, incomplete data set

14.6

US (27 articles)

10.4

Aisen et al. (2018) [33]

Identified by specific search terms

Examine effect of gender on urology applicant academic achievement and selection into specialty

Higher % of males matched (73% v 67%)

Among matched applicants:

• Males less honors (2.8 v 2.2, p < .021)

• Males higher USMLE1 (245.9 v 240.8, p < .001)

Gender: Male/Female candidates had similar pre-selection results and no evidence of bias in selection

Strengths: Moderate size

Limitations: Exploratory, single program, single specialty, 1–2 years intake

11.3

Brandt et al. (2013) [34]

Examine effect of gender on O&G applicant academic achievements and selection into specialty

No gender difference on USMLE

Females more likely to have honors (51% v 41%, p < .021) and published (87% v 79%, p < .01)

Gender: Male/Female candidates had similar USMLE1 scores; higher female honors may explain lower rate of male applications for O&G training

Strengths: Large sample, multiple years

Limitations: Exploratory, single program, single specialty, incomplete data set

11.3

Chapman et al. (2019) [35]

Identify factors associated with under-representation of women across medical specialties

Female representation higher in specialties with lower mean USMLE1 entry score (p < .017)

1% increase in female faculty prevalence associated with 1.45% increase in female trainees in specialty (p < .001)

Gender: No evidence of USMLE 1 bias against females

Association between female faculty and female trainees suggests mentoring may increase diversity

Strengths: National cohort, large sample, all specialties

Limitations: Exploratory, 1–2 years intake, incomplete data set

9

De Oliveira et al. (2012) [36]

Identified by specific search terms

Measure factors associated with selection to anaesthetics residency including gender, age, country of training

Factors associated with selection:

• Female

• Younger

• Higher USMLE 2

• DMG

Gender/Age: Bias favouring selection of female and younger applicants

Strengths: Large sample

Limitations: Exploratory, single program, single specialty, 1–2 years intake, inferences made without statistical test

12.4

Dirschl et al. (2006) [37]

Identified by specific search terms

Measure whether gender and academic scores can predict orthopaedic end-of-training exams

12.5% female applicants

Faculty ratings of training were not associated with academic scores

Gender: No gender bias detected

Strengths: Follow-up study, large sample, multiple years

Limitations: Single program, single specialty

9

Driver et al. (2014) [38]

Identify factors associated with ophthalmology selection including IMG status

Increased % of selection associated with:

• Higher USMLE1 (OR 3.22, p < .05)

• Letters of recommendation (OR 6.2, p < .05)

• Publications (OR 3, p < .05)

IMG: Design prevented conclusions about bias

Strengths: National cohort, large sample, multiple years

Limitations: Exploratory, single specialty

11.3

Durham et al. (2018) [39]

Measure effect of gender on selection into neurosurgical training

13.8% female applicants

USMLE1 higher for selected (233 v 211, p < .001)

Females had lower OR of matching (0.59, p < .001)

Females had lower mean USMLE1 scores (222 v 230, p < .001)

Gender: USMLE 1 is best predictor of selection

Reduced female selection partially explained by lower USMLE 1 scores

Possible bias remains after multivariate analysis

Strengths: Statewide cohort, large sample, multiple years

Limitations: Exploratory, single specialty

11.3

Edmond et al. (2001) [40]

Identified by specific search terms

Measure bias against African Americans due to USMLE 1 in internal medicine residency selection

Mean USMLE1 of African Americans was 200, non-AA was 216

OR for rejection of AA varied from 3 to 6 (p < .05)

Race: USMLE 1 reduces selection of African Americans

Strengths: Large sample

Limitations: Exploratory, single program, single specialty, 1–2 years intake, uncontrolled confound

12.4

Filippou et al. (2019) [41]

Measure gender bias in letters of recommendation for urology resident applicants

LoR for males had:

• More authentic tone

• More references to personal drive, work, and power

LoR referring to power more likely to be associated with selection

Gender: Gender bias in letters of recommendation may reduce selection of females

Strengths: Moderate sample

Limitations: Exploratory, single program, single specialty, 1–2 years intake

9

French et al. (2019) [42]

Measure gender bias in LoR for general surgery resident applicants

Female authors wrote longer letters

Gender: No gender bias detected in letters of recommendation

Strengths: Large sample, adequate power

Limitations: Exploratory, single program, single specialty, 1–2 years intake

7.9

Friedman et al. (2017) [43]

Measure gender bias in standardised versus narrative LoR for otolaryngology surgery residents

No difference in ranking of male/female applicants

Female writers produce LoRs different to male writers (p < .05)

LoRs written for female applicants less positive than those written for male applicants (p < .05)

Gender: Standardised letters of recommendation have reduced but not eliminated biases that contribute to reduced selection of females

Strengths: Moderate sample

Limitations: Exploratory, single program, single specialty, 1–2 years intake

7.9

Gardner et al. (2019) [44]

Measure effect of USMLE cutoffs on underrepresented minorities in general surgery training

Reducing USMLE1 cutoffs and adding SJT screening increased URMs offered interview by 8%

Gender/URM: USMLE 1 screening reduces selection of URMs for interview

Does not claim bias

Strengths: Multiple program sample, large sample

Limitations: Exploratory, single specialty, 1–2 years intake

9

Girzadas et al. (2004) [45]

Measure effect of gender on SLoR for emergency medicine residency

Female applicants with a female LoR author had OR 2 of receiving the highest ranking on LoR (p = .023)

Gender: No gender bias detected in letters of recommendation

Strengths: Large sample

Limitations: Exploratory, single program, single specialty, 1–2 years intake, selection process changed during study

7.9

Hewett et al. (2016) [16]

Measure gender bias in radiology residency selection

24% female applicants

Females were

• 30% of offered interviews

• 38% of top quartile (p < .001)

• 25% of selected

Female applicants average USMLE1 score was 5 points lower (p < .05)

Female applicants had higher mean interview scores (p < .05)

Gender: Bias favouring female applicants

Associated with lower female USMLE1 scores

Associated with higher female interview scores

Strengths: Multiple years intake, large sample

Limitations: Exploratory, single program, single specialty, variable selection/scoring methods

11.3

Hoffman et al. (2020) [46]

Measure gender bias in LoR for pediatric surgery residency selection

Female LoR had more communal phrases (p < .01)

Gender: Gender biases against females in LoRs may affect selection into training

Strengths: Multiple years intake

Limitations: Exploratory, single program, single specialty, small sample, ad-hoc measures

7.9

Hoffman et al. (2019) [47]

Measure gender bias in LoR for transplant surgery resident applicants

Male applicant LoR had more agentic terms (p < .05)

LoR written by senior staff more likely to describe female applicants with communal terms (p < .05)

Gender: Gender biases in LoRs against females may affect selection into training

Strengths: Moderate sample size, multiple years intake

Limitations: Exploratory study, single program, single specialty, limited power

7.9

Hopson et al. (2019) [48]

Identified by specific search terms

Measure influence of gender on outcome of emergency medicine selection interviews

No significant difference on standardised video interview

Gender: No gender bias detected on standardised video interview

Strengths: Multiple program cohort, large sample size, adequate power reported

Limitations: Exploratory study, single specialty, 1–2 years intake, aggregates heterogeneous groups, ad-hoc measures

10.1

Kobayashi et al. (2019) [49]

Measure influence of gender on LoR in orthopaedic surgery residency

Female applicants had:

• Longer LoR (p < .003)

• More “achieve” words (p < .0001)

No differences for male v female authors

Gender: No gender bias detected on letters of recommendation

Strengths: Large sample

Limitations: Exploratory study, single program, single specialty, 1–2 years intake, ad-hoc measures

11.3

Lin et al. (2019) [50]

Measure gender bias in LoR for ophthalmology residency

M/F applicants had similar:

• USMLE1

• Academic achievement

LoR for male applicants had:

• Less feel words (p < .041)

• Less biological words (p < .028)

Gender: Gender biases in LoRs against females may affect selection into training

Strengths: Moderate sample size

Limitations: Exploratory, single program, single specialty, 1–2 years intake, ad-hoc measures

11.3

Lypson et al. (2010) [51]

Identified by specific search terms

Measure correlation between USMLE scores and clinical competence at beginning of residency across specialties

USMLE1 scores lower for URM (212 v 230, p < .001)

URM not significantly worse than non-URM on OSCE stations at beginning of residency

URM: USMLE 1 scores are biased against URMs, revealed by similar OSCE scores at beginning of residency

Strengths: Multiple specialties, multiple years intake

Limitations: Exploratory, single program, small sample, limited power

7.9

Norcini et al. (2014) [52]

Predict patient outcomes of IMGs from USMLE scores across specialties

Increased USMLE2 CK score associated with decreased mortality as a physician

1 SD on USMLE 2 CK associated with 4% improvement in mortality

IMG: USMLE2 CK scores are a valid measure of suitability for IMG selection/certification

Strengths: Follow-up study, statewide sample, large sample, multiple specialties, multiple years intake, patient outcomes

Limitations: Unmeasured confounds

14.5

Poon et al. (2019) [53]

Identified by specific search terms

Compare orthopaedic residency enrolment rates and academic metrics of applicants and matriculated residents by race/ethnicity

URM were 29% of applicants and 25% of enrolments

White/Asian applicants had higher USMLE1 than Black applicants (234 v 218, p < .05)

URM: USMLE1 screening may contribute to lower rates of application of URMs

Bias not evaluated

Strengths: National cohort, large sample, adequate power

Limitations: Important variables not measured

13.5

Quintero et al. (2009) [54]

Measure whether personality similarity biases the selection of orthopaedic residents

Clinicians rated candidates more favourably when they shared personality characteristics (p = .044)

Personality: Increased awareness of implicit biases may reduce inequity of current selection processes

Strengths: Moderate sample size

Limitations: Exploratory, single program, single specialty, 1–2 years intake, limited power, follow-up to selection, protocol variations

12.4

Scherl et al. (2001) [55]

Measure gender bias in orthopaedic resident selection

No significant difference in selection of male and female charts

Gender: No gender bias detected based on gendered versions of applicant charts

Strengths: Experimental design

Limitations: Exploratory, single program, small sample, selection bias, partial blinding

11.3

Stain et al. (2013) [56]

Identified by specific search terms

Measure attributes of top-ranked applicants to general surgery residency

Males had higher USMLE1 (238 v 230, p < .001)

Males/Females had similar USMLE2 scores (245 v 244, p = .54)

Highly competitive programs associated with

• USMLE1 (RR 1.36)

• Publications (RR 2.2)

• Asian (RR 1.7 v white)

Gender: No gender bias detected based on pre-selection academic achievements

Strengths: National cohort, moderate sample size

Limitations: Single program, single specialty, ad-hoc measures

12.4

Unkart et al. (2016) [57]

Measure reduction in general surgical residency applications among candidates self-identified as “disadvantaged”

URM were:

• Older at entry (24 v 23, p < .001)

• Lower MCAT (30 v 33, p < .001)

• More likely to choose a less competitive specialty (p < .03)

URM/Gender: No bias detected based on USMLE 1

Strengths: National cohort, multiple years intake, large sample

Limitations: Aggregates heterogeneous groups, limited follow-up

12.4

Villwock et al. (2019) [58]

Identified by specific search terms

Measure effect of STAR tool for selecting otolaryngology residency candidates to interview

USMLE scores significantly increased after STAR tool

No differences in gender/URM before/after introduction of STAR selection tool

URM/Gender: STAR selection tool did not increase representation of URM/Gender

Strengths: Moderate sample size

Limitations: Single program, exploratory

7.9

  1. Abbreviations: ARCP, Annual Review of Competence Progression; CPST, Clinical Problem Solving Test; DMG, Domestic Medical Graduate; IMG, International Medical Graduate; LoR, Letter of Recommendation; PLAB, Professional and Linguistic Assessment Board; SJT, Situational Judgement Test; URM, Underrepresented minority
  2. a MERSQI scores include subscales which are not applicable for all articles; scores are scaled after removal of these subscales to allow comparison with a maximum score of 18 for all articles (Reed et al, 2007) [17]
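
The scaling described in the MERSQI footnote can be illustrated with a minimal sketch. It assumes the rescaling is a simple proportion: the raw total is divided by the maximum attainable on the applicable subscales and multiplied by 18. The function name `scale_mersqi`, its parameters, and the example numbers are hypothetical and are not taken from the review or from any reviewed article.

```python
def scale_mersqi(raw_score: float, applicable_max: float, full_max: float = 18.0) -> float:
    """Rescale a raw MERSQI total to the full 18-point range.

    Assumption: scaling is proportional, i.e. the raw score is expressed
    as a fraction of the maximum attainable on the applicable subscales
    (applicable_max) and then multiplied by the full maximum (18).
    """
    if not 0 < applicable_max <= full_max:
        raise ValueError("applicable_max must be in (0, full_max]")
    if not 0 <= raw_score <= applicable_max:
        raise ValueError("raw_score must be in [0, applicable_max]")
    return raw_score / applicable_max * full_max


# Illustrative numbers only: 9.5 points out of an applicable maximum of 15
print(round(scale_mersqi(9.5, 15.0), 1))
```

Under this assumption, an article that scores full marks on every applicable subscale reaches 18 regardless of how many subscales were removed, which is what allows scores to be compared across articles.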