Open Access

Construct validation of judgement-based assessments of medical trainees’ competency in the workplace using a “Kanesian” approach to validation

BMC Medical Education 2015, 15:237

Received: 1 October 2015

Accepted: 16 December 2015

Published: 30 December 2015



Background

Evaluations of clinical assessments that use judgement-based methods have frequently shown them to have sub-optimal reliability and internal validity evidence for their interpretation and intended use. The aim of this study was to enhance that validity evidence by evaluating the internal validity and reliability of competency constructs from supervisors’ end-of-term summative assessments of prevocational medical trainees.


Methods

The populations were medical trainees preparing for full registration as a medical practitioner (n = 74) and supervisors who undertook ≥2 end-of-term summative assessments (n = 349) from a single institution. Confirmatory factor analysis was used to evaluate assessment internal construct validity. The hypothesised competency construct model to be tested, identified by exploratory factor analysis, had a theoretical basis established in the workplace-psychology literature. Comparisons were made with competing models of potential competency constructs, including the competency construct model of the original assessment. The optimal model for the competency constructs was identified using model fit and measurement invariance analysis. Construct homogeneity was assessed by Cronbach’s α. Reliability measures were variance components of individual competency items and the identified competency constructs, and the number of assessments needed to achieve adequate reliability of R > 0.80.


Results

The hypothesised competency constructs of “general professional job performance”, “clinical skills” and “professional abilities” provided a good model fit to the data, and a better fit than all alternative models. Model fit indices were χ2/df = 2.8; RMSEA = 0.073 (CI 0.057–0.088); CFI = 0.93; TLI = 0.95; SRMR = 0.039; WRMR = 0.93; AIC = 3879; and BIC = 4018. The optimal model had adequate measurement invariance, with nested analysis of important population subgroups supporting the presence of full metric invariance. Reliability estimates for the competency construct “general professional job performance” indicated a resource-efficient and reliable assessment for such a construct (6 assessments for an R > 0.80). Item homogeneity was good (Cronbach’s alpha = 0.899). The other competency constructs are resource intensive, requiring ≥11 assessments for a reliable assessment score.


Conclusions

Internal validity and reliability of clinical competence assessments using judgement-based methods are acceptable when the actual competency constructs used by assessors are adequately identified. Validation of the interpretation and use of supervisors’ assessments in local training schemes is feasible using standard methods for gathering validity evidence.


Keywords: Internal validity, Psychometrics, Workplace-based assessment, Medical education, Competency constructs, Clinical competence


Background

The evaluations of judgement-based clinical performance assessments have consistently shown problems with reliability and validity [1, 2]. Documentation of the varying influences of context on assessment ratings [3], including the effect of rater experience [4], the type of assessor [5] and variability in understanding about the meaning and interpretation of competency domain constructs [6], highlight some of the issues about these important types of assessments. The validation of workplace-based assessments (WBAs) remains an area of ongoing improvement as identified by Kogan and colleagues: “Although many tools are available for the direct observation of clinical skills, validity evidence and description of educational outcomes are scarce” [2].

An argument-based approach to validation followed by evaluation, an approach long championed by Michael Kane [7–9], provides a framework for evaluating claims of competency based on assessment scores obtained from many different forms of assessment [10]. Within this framework, the educator states explicitly and in detail the proposed interpretation and use of the assessment scores, and these are then followed by evaluation of the plausibility of the proposals [10]. Such a framework is also supported by R L Brennan, who argues validation simply equates to using interpretative/use arguments (IUAs) plus evaluations: “What is required is clear specifications of IUAs and careful evaluation of them” [11]. If claims of interpretation and use from an assessment cannot be validated, then “they count against the test developer or user” [11]. This theoretical framework for validation is potentially useful for evaluating new, but also established, methods of assessment of postgraduate medical trainees. It should be noted that this approach is one of a number of validity theory proposals that continue to evolve [12–15].

Previously we have identified concerns about the validity of a former supervisor-based end-of-term assessment for pre-vocational trainees in one institution in Australia [16, 17]. A face-value claim for these supervisor assessments is the eligibility of a trainee for full registration as a competent medical practitioner. The pre-existing domains meant to be assessed were Clinical Competence, Communication Skills, Personal and Professional Abilities, and Overall-rating. If a trainee received an assessment indicating competence in these domains, as identified by the supervisor in each term, then they were suitable for full and unconditional registration. A further face-value claim from the assessment relates to the original concept of formative assessment. The trainee is given the same assessment half-way through a term as a feedback and learning assessment; the feedback “score” with associated advice is thus provided as an improvement process. The basic assessment format continues in Australia, although the competency items and domains identified have changed. Our previous observations questioned these face-value assumptions and raised the possibility of an alternative dominant competency domain with acceptable reliability, namely a general professional job performance competency construct [16, 17].

Validation of judgement-based assessments should ideally proceed systematically and iteratively within a theory base. Using Kane’s validation framework [10], an IUA can be provided that adequately represents the intended interpretation and use of the assessment, and how it will be evaluated, including checking its inferences and assumptions. A general professional job performance competency construct is a potentially valuable construct that can be used in any broader assessment program, though as one of many competencies expected in a well-trained medical practitioner. The presence of a general factor in performance, independent of halo and other common method biases, has theoretical support from observations in the organisational psychology literature [18].

Confirmatory factor analysis (CFA) is commonly used to evaluate the internal construct validity of assessments. CFA is a structural equation modelling (SEM) method that uses directional hypothesis testing to evaluate the validity of non-directly observable (latent) constructs, which are identified through observable variables or items. For example, in Fig. 1 the competency domain General Professional Job Performance (Factor 1) is a latent competency concept that is hypothesised to be measurable by a number of observable behaviours and activities. CFA tests the directional hypothesis that an individual’s competency for this construct results in particular activities, such as good medical record management, among other observable behaviours. That is, the presence of a high-standard General Professional Job Performance competency results in good medical record behaviour. If the directional relationship is confirmed in a CFA construct validation process, the measurable behaviours can then be used to confirm the presence and quality of a General Professional Job Performance competency for the trainee.
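The directional logic can be illustrated with a short simulation (a hypothetical Python sketch with made-up loadings, not the study data): a single latent competency generates six observed item scores, and the item scores then correlate with the latent factor roughly in proportion to their loadings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 349  # same number of assessments as in this study

# One latent competency (e.g. "general professional job performance")
latent = rng.normal(size=n)

# Six observed items driven by the latent factor (illustrative loadings)
loadings = np.array([0.89, 0.80, 0.78, 0.75, 0.72, 0.70])
items = latent[:, None] * loadings + rng.normal(scale=0.5, size=(n, 6))

# Item-factor correlations recover the directional relationship
r = np.array([np.corrcoef(latent, items[:, j])[0, 1] for j in range(6)])
```

In a real CFA the latent scores are never observed directly; software such as Mplus (used in this study) estimates the loadings from the item covariances instead.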
Fig. 1

Optimal Model, Parameter Estimates and Error Estimates (Residual variances). (See Model Structure in the text for an explanation of the diagram)

The aim of this study was to evaluate the internal validity and reliability of competency constructs for prevocational medical trainees, in particular to determine whether a potentially useful competency construct defined as a “general professional job performance” competency is valid and reliable for the particular context in which it was measured [17]. Individual training programs need to validate their own assessments, judgement-based assessments in particular, because such assessments relying on an individual’s judgement have no inherent transferrable reliability and validity. In Kane’s framework the assessment outcome measure needs to be valid for the context in which it is applied and for what the results are used for [10].


Population and educational context

The population and context have previously been described [16, 17]. In brief, the populations are medical trainees preparing for unconditional registration and their supervisors, who also undertake the assessment. Supervisors are specialty-level consultants in a hospital network including secondary and tertiary level hospitals. The assessments used in this study were end-of-term and summative. Trainee scores for each individual competency item on each assessment are considered the primary unit of analysis. The assessment pro forma has been previously provided [16, 17]. A total of 74 trainees provided assessments: 64 trainees had 5 assessments, 12 had 4, and 2 had 3. Analysis was restricted to supervisors with 2 or more assessments; only 6.3 % of all assessments involved a supervisor with a single assessment, leaving 349 usable assessments. Otherwise there were no exclusion criteria, and all other assessments performed were included for all trainees, all supervisors and all competency items assessed, as previously described [17].

Exploratory factor analysis, as a first-order model with correlated factors, provided the proposed constructs to be considered in the second-order factor model analysis using CFA [17]. The second-order model represents the hypothesis that the multiple, seemingly distinct individual competency items, as described on the assessment form, can be accounted for by one or more common underlying higher-order constructs or domains. The individual competency items (observed variables) are the first-order variables and the factors (competency domains or constructs) are the second-order variables in the model (Fig. 1).


CFA is a form of structural equation modelling (SEM). SEM is used to test complex relationships between observed (measured) and unobserved (latent) variables and also relationships between two or more latent variables [19]. The purpose of the CFA is to examine a proposed measurement model and compare the model fit to other alternative models to ensure the proposed model is the most consistent with participants’ responses.


Each assessment competency item is the unit of analysis for each assessment (n = 349 assessments), and the reliability study has a single-facet design with rater nested in trainee. The variance component for each observed competency item, the percent of variance for each trainee competency score and the individual item reliability coefficient (R-value) were estimated as previously described [16, 17]. Consistency of the item scores for the factors identified (competency domain constructs) was estimated by Cronbach’s alpha. The number of assessments to achieve a minimum acceptable reliability (NAAMAR) coefficient of ≥0.8 was calculated as a potential benchmarking statistic, as previously described [16, 17].
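The reliability and NAAMAR calculations can be sketched as follows (a minimal illustration using made-up variance components, not the study's estimates):

```python
def reliability(var_trainee, var_error, n_assess):
    """Reliability of the mean of n_assess ratings in a single-facet,
    rater-nested-in-trainee design:
    R = var_trainee / (var_trainee + var_error / n)."""
    return var_trainee / (var_trainee + var_error / n_assess)

def naamar(var_trainee, var_error, target=0.80):
    """NAAMAR: smallest whole number of assessments giving R >= target."""
    n = 1
    while reliability(var_trainee, var_error, n) < target:
        n += 1
    return n

# Illustrative components: trainee variance 1.0, error variance 1.5
print(naamar(1.0, 1.5))  # -> 6 assessments reach R = 0.80
```

With a larger error component the assessment burden grows quickly, which is the resource trade-off discussed in the Results.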

Sample size

An a priori evaluation indicated that the sample size was sufficient for a CFA. Using an anticipated effect size of 0.1 as the minimum absolute anticipated effect size for the model, a statistical power level of 0.90, 3 latent variables, 11 observed (indicator) variables, and a probability level <0.05, the minimum sample size for the model structure is 129 and the minimum sample size to detect the effect is 149 assessments.

Missing data

Only 2.6 % of all scores (127 of 4886) were missing, an amount which would normally be considered low and dealt with by simple methods such as trimming. However, the competency items Emergency Skills, Teaching and Learning and Procedural Skills accounted for 93 % (118/127) of all the missing values. Although Little's MCAR test [20] was non-significant (χ2 = 180.441, df = 172, p = .314), the pattern of distribution of the missing values indicated a non-random occurrence. Therefore these items were removed and the analysis used the remaining 11 competency items. Automatic imputation of missing score values was performed (IBM SPSS version 19). A repeat factor analysis using the imputed values demonstrated the same factor structure and similar factor loadings.
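The skew in the missingness pattern can be checked in a few lines (the per-item counts below are a hypothetical split; only the 118 and 127 totals come from the analysis above):

```python
# Hypothetical per-item missing-value counts, consistent with the
# reported totals: 118 of 127 missing values sit in three items.
missing_counts = {
    "Emergency Skills": 44,
    "Teaching and Learning": 40,
    "Procedural Skills": 34,
    "remaining 11 items combined": 9,
}
total = sum(missing_counts.values())  # 127 missing of 4886 scores
top3 = total - missing_counts["remaining 11 items combined"]
share = top3 / total
print(f"{share:.0%} of missing values sit in just 3 of 14 items")
```

A concentration like this is what motivates removing the affected items rather than trimming or imputing across the board.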


The assumption of non-normality was made for the CFA in view of the possibility of range restriction and other common method biases such as halo, leniency and stringency. The estimation method was the Mean- and Variance-adjusted Maximum Likelihood (MLMV).

Model fit

Common fit indices are chi-square (χ2), the significance of χ2, the ratio of χ2 to degrees of freedom, Akaike information criterion (AIC), Bayes information criterion (BIC), Tucker–Lewis index (TLI), comparative fit index (CFI), root mean square error of approximation (RMSEA with 95 % CI), standardised root mean square residual (SRMR) and the weighted root mean residual (WRMR) [19, 21].
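A small helper can make the conventional acceptance thresholds explicit (the cut-offs follow the benchmarks cited with Table 2, e.g. Schreiber et al.; the function itself is a hypothetical illustration, not part of the study's analysis):

```python
def fit_acceptable(chi2_df, rmsea, cfi, tli, srmr):
    """Apply conventional acceptance thresholds to common fit indices."""
    return {
        "chi2/df": chi2_df < 3.0,  # common rule of thumb
        "RMSEA": rmsea < 0.08,     # <0.06 ideal, <=0.08 acceptable
        "CFI": cfi >= 0.90,        # >=0.95 ideal, <0.90 reject
        "TLI": tli >= 0.90,        # >=0.95 ideal, <0.90 reject
        "SRMR": srmr < 0.08,       # <0.08 acceptable
    }

# The optimal model's indices reported in this study pass every check
print(all(fit_acceptable(2.8, 0.073, 0.93, 0.95, 0.039).values()))
```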


The coefficients of hypothesised relationships and the significance of individual structural path relationships (z values associated with the structural coefficients, with standard errors (SE) for both standardised and unstandardised estimates) are provided by default in Mplus software Version 7.11.

Sensitivity analysis by model comparisons

After examination of parameter estimates, fit indexes, and residuals, model comparisons and model modifications to the original hypothesized model were a priori planned to identify any possible better fitting and more parsimonious models [21].

Measurement invariance

Evidence of whether construct validity is the same across 2 or more population groups was evaluated by traditional methods to identify measurement invariance across groups [19, 22–24]. Demonstrating measurement invariance supports the use of the assessment across gender, race, and other demographically different subgroups that can be tested [25].
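One widely used practical criterion compares the unconstrained (configural) model with the model whose factor loadings are constrained equal across groups, accepting metric invariance if CFI worsens by no more than about 0.01 (Cheung and Rensvold's rule; shown here as an illustrative helper, not necessarily the exact criterion applied in this study):

```python
def metric_invariance_ok(cfi_configural, cfi_constrained, tol=0.01):
    """Support metric invariance if constraining factor loadings equal
    across groups worsens CFI by no more than tol (default 0.01)."""
    return (cfi_configural - cfi_constrained) <= tol

print(metric_invariance_ok(0.93, 0.925))  # negligible change: supported
print(metric_invariance_ok(0.93, 0.90))   # large drop: loadings differ
```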

Common method variance (CMV) analysis

CMV is error variance shared among variables that is introduced as a function of measuring them with the same method and/or source [26, 27]. The causes of CMV in rater-based assessments relate to issues such as leniency, stringency, range reduction of scores and the halo effect. CMV was estimated using the correlational marker method and the unmeasured latent method construct (ULMC) approach. Since an a priori marker variable was not included in the original assessment, the variable with the smallest positive correlation in the data set was used as the marker [26, 27].
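The correlational marker adjustment can be sketched as follows (a Lindell-and-Whitney-style partial correlation; the 0.353 marker and 0.697 correlation are the extremes of the item inter-correlations reported in the Results, used here purely for illustration):

```python
def marker_adjusted_r(r_observed, r_marker):
    """Partial out the marker correlation (an estimate of common method
    variance) from an observed correlation:
    r_adj = (r_observed - r_marker) / (1 - r_marker)."""
    return (r_observed - r_marker) / (1.0 - r_marker)

# Smallest item-item correlation (0.353) as the marker; the largest
# observed correlation (0.697) shrinks but remains clearly positive,
# i.e. the relationship is not fully attributable to CMV.
print(round(marker_adjusted_r(0.697, 0.353), 2))  # -> 0.53
```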


The original EFA was performed using IBM SPSS version 19 and the follow-on CFA was performed using Mplus Version 7.11 (Muthén & Muthén). The path diagram was created with IBM AMOS version 21, which was also used as a sensitivity analysis for replicating the analysis and for assessing measurement invariance with an ML estimator.

Ethics approval and consent

As only retrospective analyses of routinely collected and anonymised data were performed, the study was approved by ACT Health Human Research Ethics Committee’s Low Risk Sub-Committee approval number ETHLR.15.027. The ethics committee did not require consent to be obtained or a waiver of consent. The study was carried out in accordance with the Declaration of Helsinki. The anonymity of the participants was guaranteed.


Descriptive statistics

Table 1 displays descriptive statistics and zero-order correlations for variables measuring trainee competence as rated by their supervisor. Due to the large number of inter-correlations and the increased risk of a type I error, an adjusted α level of 0.001 was used to indicate significant bivariate relationships and model fit statistics. Correlations between items varied from 0.353 to 0.697, and all were significant (p < 0.001).
Table 1

Descriptive statistics, correlations, and reliability results for the competency items, and the standardised estimates and reliability results of the modelled constructs

The diagonal cells contain percent variance for the score due to the trainee; all remaining variance is considered error variance; p < 0.001 for all correlations

All 2-tailed p-values <0.001 (see Fig. 1 for factor structure)

aStandardised Estimates of constructs with the items defining those constructs (SE) in shaded areas

EFA Factor structure

The total variance accounted for increased to 71.9 % (full results available on request). Following imputation of the missing values, the 3-factor model accounted for approximately 73 % of the variance.

Measurement models

Confirmatory factor analysis

The hypothesised model tested was the factor structure identified after removal of potentially biasing competency items (Emergency Skills, Procedural Skills and Teaching and Learning), imputation of missing data, and the consolidation of Overall Rating, Time Management Skills, Medical Records, Communication skills, Teamwork Skills and Professional Responsibility attitude as the dominant first construct (Factor 1) called a “general professional job performance” competency construct. Factor 2 and Factor 3 were named “clinical skills” competency and “professional abilities” competency respectively. The standardised parameter-estimates with the standard error are presented in Table 1. All item loadings exceeded 0.60 and all differed reliably from zero (p < .0001).

Model structure

The hypothesised CFA model with continuous factor indicators is shown in the diagram (Fig. 1). The model has 3 correlated factors, with the first factor being measured by 6 continuous observed variables, the second measured by 3 and the third with 2 observed variables.

The ellipses represent the latent constructs (factors). The rectangles are the observed variables (competency items). The circles are the error terms for each competency item. Bidirectional arrows between the factors indicate correlation, with an assigned correlation coefficient (e.g. the correlation coefficient between Factor 1 and Factor 2 is 0.87). Unidirectional arrows indicate predictive relationships. For example, each of the first 6 observed variables is predicted by the latent variable (Factor 1), and the associated numbers are the standardised regression coefficients.

The directed arrows from the factors (latent variables) to the items (observed variables) indicate the loadings of the variables on the proposed latent factor. Each of the observed variables for the 3 latent competency domains has an associated error term (residual), indicating that each observed variable is only partially predicted by the latent factor it is intended to measure; the remainder is error. The numbers to the right of the observed variables are R-squared values (communalities in factor analysis), giving the proportion of variance in the individual item explained by the latent competency factor. As an example of interpretation, a one standard deviation increase on Factor 1 (job performance competence) is associated with a 0.89 standard deviation increase in the “overall rating” score, equivalent to a correlation of 0.89 between the factor and the observed variable. The amount of variance in the overall rating score explained by the competency construct (Factor 1) is 0.79, or 79 %. The same interpretation can be made for all the individual item–factor relationships in Fig. 1.

Model fit

Parameter estimates obtained for the hypothesised measurement model are presented in Table 2, along with the model fit for other contending models available from the data and the context. The 3-factor model structure from the EFA identifying a possible general job performance factor, as described in Table 1, has the best model fit.
Table 2

Model Fit Indexes for alternative non-nested models

Ideal benchmarksa for each index:
- Chi-squared (χ2) and ratio of χ2 to df: useful for nested models
- Akaike information criterion (AIC): smaller the better; for model comparison (non-nested)
- Bayes information criterion (BIC): smaller the better; for model comparison (non-nested)
- Tucker–Lewis index (TLI): ≥0.95 ideal; <0.90 reject
- Comparative fit index (CFI): ≥0.95 ideal; <0.90 reject
- Root mean square error of approximation (RMSEA) (95 % CI): <0.06 ideal; ≤0.08 acceptable, with narrow 95 % confidence intervals
- Standardised root mean square residual (SRMR): <0.08 acceptable
- Weighted root mean residual (WRMR): <0.90

Model χ2 values (all p < 0.001):
- 3 Factor Model 1b: χ2 = 116.563
- 3 Factor Model 3c: χ2 = 223.258
- 3 Factor Model 4d: χ2 = 121.571
- 3 Factor Modele: χ2 = 211.42
- 1 Factorf: χ2 = 170.483
- 2 Factorg: χ2 = 139.489
- 1 Factor OC Modelh: χ2 = 46.586

aFrom (Schreiber et al., 2006)

b3 Factor Model 1 = Factor structure from SPSS EFA identifying a possible general job performance factor as Factor 1

c3 Factor Model 3 = Factor structure from EFA using the a priori defined competency domains as 3 proposed Factors

d3 Factor Model 4 = Factor structure from SPSS EFA using the a priori defined competency domains as 3 proposed Factors but with potentially redundant items removed (Procedural Skills, Emergency Skills, and Teaching and Learning)

e3 Factor model from original EFA with all 14 items

f1 Factor model with all 14 items

g2 Factor model with all 14 items

h1 Factor model with only those items within the “operational competence” construct and no other items

Model fit comparative analysis

As briefly stated in the introduction, the assessment was originally defined by 3 domains plus an “overall rating” item [17]. The original domains consisted of items thought to measure “clinical skills”, “communication skills”, and “professional competencies”. This original domain structure was analysed by CFA as a sensitivity analysis, first with all the competencies and then again with the poorly performing items removed. Model fit indices for both were less optimal than for the hypothesised model. When forced 1- and 2-factor models were evaluated, the model fit indices were again less optimal (Table 2). The parsimonious model with only 11 items and 3 factors, with a Factor 1 construct reflecting competencies consistent with general professional job performance, had the best model fit.

Model parameters

The parameter indices for the optimal model reported in Table 1 are also illustrated by the standardised loadings (Fig. 1). The item loadings confirm that all 3 factors are well defined by the items. All the unstandardised variance components of the factors are statistically significant, indicating that the amount of variance accounted for by each factor is significantly different from zero. The R2 estimates, which give the amount of variance explained for each competency item, are only moderate. The standardised variance explained is >0.50 for every item except “knowledge”, indicating adequate although not ideal convergent validity. All residual correlations were low, ranging between 0 and 0.028, with no tendency toward positive or negative values (data not shown but available on request).

Reliability of the model

The scores showed sufficient internal consistency for a composite of the scores to be used as a measure of the different constructs. Within a single-level analysis, Cronbach’s alpha for Factor 1 was 0.899 (standardised alpha also 0.899), which indicates a high level of internal consistency for the scale with this specific sample within the context. Removal of any item results in a lower Cronbach's alpha. Cronbach’s alpha for Factor 2 was 0.786 (standardised 0.788) and for Factor 3 was 0.745 (standardised 0.745).
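Cronbach's alpha as reported above can be computed directly from an item-score matrix; a minimal numpy sketch (illustrative data, not the study's scores):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_observations, k_items) matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_vars / total_var)

# Three raters scoring two perfectly consistent items -> alpha = 1.0
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```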

As an a posteriori evaluation a second-order factor analysis model was investigated with the first-order factors used as indicators of a second-order factor, that is, an overall latent variable at a higher level in a model structure with a third level. The model fit was not improved (Ratio of χ2 to df = 2.8; RMSEA = 0.073 (CI 0.057-0.088); CFI = 0.946; TLI = 0.927; SRMR = 0.039; WRMR = 0.93; AIC = 3879; and BIC = 4018).

The number of assessments needed to achieve an acceptable minimum reliability level of ≥ 0.80 remains essentially unchanged from previous observations [17] (Table 3). Only 6 assessments for construct 1 are needed to provide a reliable composite score for the construct expressed by the items.
Table 3

Reliability for Competency Items

For each competency item and each modelled competency domain construct, the table reports: variance components, variance SEMa, percent of total variance of trainees’ scores, the individual item reliability coefficient (R), and NAAMARb.

Rows: Overall Rating; Communication Skills; Teamwork Skills; Professional Responsibility; Time Management Skills; Medical Records; Knowledge Base; Clinical Skills; Clinical Judgement; Awareness of Limitations; Professional Obligations; Competency Domain Construct 1; Competency Domain Construct 2; Competency Domain Construct 3.
a Standard Error of the Measurement

bNAAMAR = number (rounded to the nearest digit) of assessments for an adequate minimum acceptable reliability level of R = 0.80, with the NAAMAR calculated from the formula R (reliability coefficient) = σ2subjects/(σ2subjects + σ2error/n), where n = assessments needed per trainee to attain the desired reliability coefficient

Measurement invariance

The model fit for all subgroups analysed as separate but nested groups was acceptable (Table 4). Testing for statistical invariance across nested sub-group comparisons (using AMOS and maximum likelihood estimator) indicated acceptable to moderately good model fit for all subgroups. This can be taken as support for configural invariance, i.e., equality in the number of latent factors across the major subgroups analysed. Testing for practical invariance across the subgroups also indicated acceptable comparisons with negligible difference in the CFI, TLI and SRMR between the respective groups, supporting the presence of full metric invariance (Table 4).
Table 4

Measurement invariance for nested model comparisons of major sub-groupsa

For each comparison, fit is reported for the unconstrained model and for the model with all factor loadings constrained equal: χ2b, χ2/df, p-value for ∆χ2, CFI, TLI, SRMR, and RMSEA (90 % CI).

Comparisons: Female and Male Supervisors; Female and Male Trainees; Overseas Trained Doctors (OTDs) and Australian Trained Doctors (ATDs).

aAssuming models unconstrained to be correct

bAll p-values <0.001 for the model χ2

χ2 minimum fit function chi-square, RMSEA root mean square error of approximation, CFI comparative fit index, TLI Tucker–Lewis index, SRMR standardised root mean square residual, Δ parameter difference between constrained and unconstrained model

CMV analysis

The CMV analysis indicated that method bias was probably present. The partial correlational marker method controlling for CMV, using the lowest item-item correlation (0.353) and the lowest item-factor correlation (0.653) as the marker, demonstrated in both cases a reduction in the correlations, although the correlations remained significant, indicating that the relationships were still valid despite the CMV bias (results available on request). This was supported by the observations from the ULMC method, with a reduction in all item-factor correlations after a common-factor ULMC analysis. Model fit was also less optimal when adjusted for CMV (ratio of χ2 to df = 4.6 with a change (Δ) = 1.3; AIC = 2393; Δχ2 = 49; TLI = 0.093; CFI = 0.095; RMSEA = 0.095; and SRMR = 0.043). These observations indicate a probable confounding problem from CMV, but not enough to explain all the observed relationships.


This report provides further evidence that the competency domain constructs identified by supervisors can differ from the competency domains presumed to have been assessed. The alternative constructs have internal validity and show measurement invariance between important subgroups of trainees. However, only one competency construct, defined as a “general professional job performance” competency, has a level of reliability that can be pragmatically applied, needing only 6 supervisor assessments to achieve an acceptable level of reliability. For the competency of “general professional job performance”, trainees can be confident that their score interpretation is both precise and accurate if 6 assessments are obtained over a year.

A person competent in general professional job performance would be considered valuable in any very complex work context, especially when the health of other individuals is involved. In the workplace all the characteristics required for Factor 1 would be invaluable, namely: (1) communication: the “ability to communicate effectively and sensitively with patients and their families”; (2) teamwork skills: the “ability to work effectively in a multidiscipline team”; (3) professional responsibility: demonstrated through “punctuality, reliability and honesty”; (4) time management skills: the ability to “organize and prioritize tasks to be undertaken”; (5) medical records: the ability to “maintain clear, comprehensive and accurate records”; and (6) the link to the overall rating.

It is not surprising that these characteristics are identified by supervisors, aggregate together as indicated in the correlative factor analysis, are anticipated as a theoretical possibility in the organisational literature, and are confirmed in the internal validity analysis. They are all competency behaviours that, when displayed by an individual, could lead to positive and effective outcomes within an organisational context and be noticed by a supervisor; applied optimally, they would make work-life easier for the supervisor. These are also behavioural constructs that are not specific to medical practice or training and would be expected to be identifiable in any complex professional workplace. They are also commonly associated with professionalism in general [28].

Exploratory factor analysis has commonly been used as part of the evaluation of validity for global ratings of trainee competences. Comparable past evaluations of supervisors rating trainees’ competencies have made observations similar to those of the current study, as identified in our previous review [17]. Indeed, a more recent study of a similar Australian junior doctor population also found variation between the domain constructs actually assessed and the domains expected to be assessed [29]. Moreover, from an Australian perspective, other evaluative research has identified concerns about the assessment of a similar junior doctor population [30–32], with observations indicating “that the tools and processes being used to monitor and assess junior doctor performance could be better” [32].

We have contributed to the literature, which we have reviewed previously [16, 17], by providing an evaluation of confounding influences on supervisor assessments, such as type of supervisor and gender, which has not been routinely undertaken in the validity evaluation of supervisor assessments. Similarly, the use of CFA or other forms of SEM, together with a reliability analysis, has not routinely been applied to the validity evaluation of these types of global assessment methods but is clearly feasible.

Practical implications

An important practical implication is that fewer assessments are needed to achieve a reliable score for a genuinely valid competency construct. Needing fewer assessments saves time for the institution, supervisors and trainees.

We have also shown that, within a single training program, it is feasible to identify a new main construct that supervisors use when assessing trainees' competence and, at the same time, to demonstrate that a previously used assessment method lacks validity evidence.

In addition, we have shown that validation methods in local training programs can be strengthened by applying traditional methodology to evaluating the constructs supervisors actually use. Stronger validation methods also improve the possibility of benchmarking between institutions. Moreover, the quality of training may be improved by developing other valid competency constructs that supervisors can assess, allowing a broader range of competencies to be sampled.

Fine-tuning the quality of supervisors' assessments is also potentially resource-effective, improving the assessment built into daily work and identifying areas needing improvement. The methods used in this study have the potential to evaluate the validity of assessments occurring in the "authentic clinical environment and aligning what we measure with what we do" [33].

The need to "develop tools to meaningfully assess competencies" [34] remains pressing, especially for competency assessment in the workplace [33]. Carraccio and Englander raise the issue of the local relevance of any assessment program: "Studying the implementation of assessment tools in real-world settings—what works and what doesn't for the faculty using the tool—thus becomes as critical as demonstrating its reliability and validity" [33].

Limitations of the analysis and observations

Generalisability of the observations

As with all such internal structure analyses of locally obtained data, these observations may not be generalisable, and the analysis would need to be replicated within each individual assessment program. The conclusions are limited to the particular sample, variables and time frame represented by the dataset [35]. The results are subject to selection effects, including biases imposed by the individuals sampled, the types of measures, the occasions of assessment and the period in which it was performed. Such potential biases pose problems for all WBAs.

The response to the generalisability issue for WBAs is that each assessment process should be validated in each individual training program; only the methodology can be generalised. Gathering validity evidence is cyclical and should be part of a continuing quality assurance process. Gathering validity evidence and reporting it to standard-setting bodies is now routine for training and learning programs in general education [36], and is becoming accepted practice in medical education even though the requirements differ [37, 38].

Common method biases

Common method biases leading to CMV exist when some of the differential covariance among items is due to the measurement method rather than the latent factors [19]. The CMV analysis indicated a probable confounding effect, inflating the associations between the competency domain constructs and the items; however, CMV does not account for all of the variance. Because a major cause of CMV is obtaining measures from the same rater or source, one way of controlling for it is to collect measures from different sources [26], that is, from many different assessors; the reliability analysis provides guidance on the minimum number potentially needed. The influence of this confounding can therefore potentially be reduced by developing assessment programs that use multiple sources of evidence of competency [39]. Wherever possible, intermediate- and high-stakes decisions should be "based on multiple data points after a meaningful aggregation of information" and "supported by rigorous organisational procedures to ensure their dependability" [40].
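One widely known, if coarse, diagnostic for CMV discussed by Podsakoff et al. [26] is Harman's single-factor test: if a single factor captures the majority of the standardised variance across items from one source, method variance is a plausible concern. The sketch below estimates the first principal component's share of variance from the item correlation matrix by power iteration; it is our illustration with invented ratings, not the study's ULMC-based analysis.

```python
def first_factor_share(data):
    """Harman-style check: share of total standardised variance captured by
    the first principal component of the item correlation matrix."""
    k = len(data)       # items
    n = len(data[0])    # observations per item

    def standardise(xs):
        m = sum(xs) / n
        sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
        return [(x - m) / sd for x in xs]

    z = [standardise(col) for col in data]
    corr = [[sum(z[i][t] * z[j][t] for t in range(n)) / n for j in range(k)]
            for i in range(k)]
    # Power iteration for the largest eigenvalue of the correlation matrix.
    v = [1.0] * k
    for _ in range(200):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k))
    return lam / k      # trace of a k-by-k correlation matrix is k

# Invented ratings: 4 items from the same rater, 8 trainees (not study data).
data = [
    [4, 5, 3, 4, 5, 4, 2, 5],
    [4, 4, 3, 5, 5, 4, 2, 4],
    [3, 5, 3, 4, 4, 4, 3, 5],
    [4, 5, 2, 4, 5, 5, 2, 4],
]
share = first_factor_share(data)
print(round(share, 2))
```

A large first-factor share is only suggestive, not proof of CMV, which is why source triangulation across multiple assessors remains the preferred remedy.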

Other potential confounding

The tendency to be lenient or severe in ratings is not consistent across jobs, and the accuracy of performance assessment is partly situation-specific [41]. The validity of assessments may vary within training programs, for example with the timing of the assessment, trainee improvement, term culture and type of training. However, this is true of all WBAs, and identifying potential confounders will remain a perennial issue; developing methods to do so that are applicable to individual training programs is an ongoing improvement goal for medical education.


Conclusions

The validity and reliability of clinical performance assessments using judgement-based methods are acceptable when the actual competency constructs used by assessors are identified using standard validation methods, in particular for a general professional job performance competency construct. Validating these forms of assessment in local training schemes is feasible using accepted methods for gathering evidence of validity.

Availability of supporting data

We are willing to share the data on request, and are prepared to work with any interested researchers on re-analysis of the data, particularly for a systematic review using participant-level data.



Abbreviations

IUA: interpretative/use arguments
CFA: confirmatory factor analysis
SEM: structural equation modelling
EFA: exploratory factor analysis
Number of assessments to achieve a minimum acceptable reliability
AIC: Akaike information criterion
BIC: Bayes information criterion
TLI: Tucker–Lewis index
CFI: comparative fit index
RMSEA: root mean square error of approximation
CI: confidence interval
SRMR: standardised root mean square residual
WRMR: weighted root mean residual
SE: standard error
CMV: common method variance
ULMC: unmeasured latent method construct
MLMV: mean- and variance-adjusted maximum likelihood


Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

Department of Cardiology, The Canberra Hospital
Department of Educational Research and Development, Maastricht University
Clinical Trial Service Unit, University of Oxford


  1. Govaerts MJ, van der Vleuten CP, Schuwirth LW, Muijtjens AM. Broadening perspectives on clinical performance assessment: rethinking the nature of in-training assessment. Adv Health Sci Educ Theory Pract. 2007;12:239–60.
  2. Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: a systematic review. JAMA. 2009;302:1316–26.
  3. Dijksterhuis MGK, Schuwirth LWT, Braat DDM, Teunissen PW, Scheele F. A qualitative study on trainees’ and supervisors’ perceptions of assessment for learning in postgraduate medical education. Med Teach. 2013;35:e1396–402.
  4. Ferguson KJ, Kreiter CD, Axelson RD. Do preceptors with more rating experience provide more reliable assessments of medical student performance? Teach Learn Med. 2012;24:101–5.
  5. Beckman TJ, Cook DA, Mandrekar JN. Factor instability of clinical teaching assessment scores among general internists and cardiologists. Med Educ. 2006;40:1209–16.
  6. Reeves S, Fox A, Hodges B. The competency movement in the health professions: ensuring consistent standards or reproducing conventional domains of practice? Adv Health Sci Educ Theory Pract. 2009;14:451–3.
  7. Kane MT. The validity of licensure examinations. Am Psychol. 1982;37:911–8.
  8. Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527–35.
  9. Kane M. Validating the interpretations and uses of test scores. In: Lissitz R, editor. The Concept of Validity: Revisions, New Directions and Applications. Charlotte, NC: Information Age Publishing Inc; 2009. p. 39–64.
  10. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50:1–73.
  11. Brennan RL. Commentary on “validating the interpretations and uses of test scores”. J Educ Meas. 2013;50:74–83.
  12. Sireci SG. Packing and unpacking sources of validity evidence: history repeats itself again. In: Lissitz R, editor. The Concept of Validity: Revisions, New Directions and Applications. Charlotte, NC: Information Age Publishing Inc; 2009. p. 19–37.
  13. Zumbo BD. Validity as contextualized and pragmatic explanation, and its implication for validation practice. In: Lissitz R, editor. The Concept of Validity: Revisions, New Directions and Applications. Charlotte, NC: Information Age Publishing Inc; 2009. p. 65–82.
  14. Mislevy RJ. Validity from the perspective of model-based reasoning. In: Lissitz R, editor. The Concept of Validity: Revisions, New Directions and Applications. Charlotte, NC: Information Age Publishing Inc; 2009. p. 83–108.
  15. Markus KA, Borsboom D. Frontiers of Test Validity Theory: Measurement, Causation, and Meaning. London: Routledge, Taylor & Francis Group; 2013.
  16. McGill D, Van der Vleuten C, Clarke M. Supervisor assessment of clinical and professional competence of medical trainees: a reliability study using workplace data and a focused analytical literature review. Adv Health Sci Educ Theory Pract. 2011;16:405–25.
  17. McGill DA, van der Vleuten CPM, Clarke MJ. A critical evaluation of the validity and the reliability of global competency constructs for supervisor assessment of junior medical trainees. Adv Health Sci Educ Theory Pract. 2013;18:701–25.
  18. Viswesvaran C, Schmidt FL, Ones DS. Is there a general factor in ratings of job performance? A meta-analytic framework for disentangling substantive and error influences. J Appl Psychol. 2005;90:108–31.
  19. Brown TA. Confirmatory Factor Analysis for Applied Research. New York: The Guilford Press; 2006.
  20. Little RJA. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988;83:1198–202.
  21. Schreiber JB, Nora A, Stage FK, Barlow EA, King J. Reporting structural equation modeling and confirmatory factor analysis results: a review. J Educ Res. 2006;99:323–38.
  22. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Modeling. 1999;6:1–55.
  23. Marsh HW, Hau KT, Wen Z. In search of golden rules: comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Struct Equ Modeling. 2004;11:320–41.
  24. Gregorich SE. Do self-report instruments allow meaningful comparisons across diverse population groups? Testing measurement invariance using the confirmatory factor analysis framework. Med Care. 2006;44:S78–94.
  25. Schmitt N, Kuljanin G. Measurement invariance: review of practice and implications. Hum Resource Manag Rev. 2008;18:210–22.
  26. Podsakoff PM, MacKenzie SB, Lee JY, Podsakoff NP. Common method biases in behavioral research: a critical review of the literature and recommended remedies. J Appl Psychol. 2003;88:879–903.
  27. Richardson HA, Simmering MJ, Sturman MC. A tale of three perspectives: examining post hoc statistical techniques for detection and correction of common method variance. Organ Res Meth. 2009;12:762–800.
  28. Eraut M. Developing Professional Knowledge and Competence. London: RoutledgeFalmer; 1994.
  29. Carr S, Celenza A, Lake F. Assessment of junior doctor performance: a validation study. BMC Med Educ. 2013;13:129.
  30. Bingham CM, Crampton R. A review of prevocational medical trainee assessment in New South Wales. Med J Aust. 2011;195:410–2.
  31. Zhang JJ, Wilkinson D, Parker MH, Leggett A, Thistlewaite J. Evaluating workplace-based assessment of interns in a Queensland hospital: does the current instrument fit the purpose? Med J Aust. 2012;196:243.
  32. Carr SE, Celenza T, Lake FR. Descriptive analysis of junior doctor assessment in the first postgraduate year. Med Teach. 2014;36:983–90.
  33. Carraccio CL, Englander R. From Flexner to competencies: reflections on a decade and the journey ahead. Acad Med. 2013;88:1067–73.
  34. van der Vleuten CP, Schuwirth LW. Assessing professional competence: from methods to programmes. Med Educ. 2005;39:309–17.
  35. MacCallum RC, Austin JT. Applications of structural equation modeling in psychological research. Annu Rev Psychol. 2000;51:201–26.
  36. Linn RL. The concept of validity in the context of NCLB. In: Lissitz R, editor. The Concept of Validity: Revisions, New Directions and Applications. Charlotte, NC: Information Age Publishing Inc; 2009. p. 195–212.
  37. General Medical Council. Tomorrow’s Doctors. London: General Medical Council; 2009.
  38. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system - rationale and benefits. N Engl J Med. 2012;366(11):1051–6.
  39. Schuwirth LWT, van der Vleuten CPM. Programmatic assessment: from assessment of learning to assessment for learning. Med Teach. 2011;33:478–85.
  40. van der Vleuten CPM, Schuwirth LWT, Driessen EW, Govaerts MJB, Heeneman S. 12 tips for programmatic assessment. Med Teach. 2014:1–6. [Epub ahead of print].
  41. Borman WC. Consistency of rating accuracy and rating errors in the judgment of human performance. Organ Behav Hum Perform. 1977;20:238–52.


© McGill et al. 2015