Fifty-nine (59) volunteer teachers, participating in a teacher training course at Karolinska Institutet, were asked to fill in the Vermunt questionnaire. Their responses were then scored by summing the coded answers to the items on the questionnaire (see Additional file 1 and [3]).

Three responses, each on an ordered 5-point Likert-type scale, were collected from each of the AA, AM and AR dimensions.

When defining a latent trait, i.e. an underlying unobservable variable such as attitude or social competence, the location and variability of the measure are arbitrary, as is its distribution.

Different coding or labelling will change all of these characteristics. However, the ranking of the respondents on the latent trait axis should be appropriate and invariant to the choice of coding.

The most straightforward approach to measuring a latent trait is simply to summarise the answers to the items as they are coded. In the questionnaire evaluated in this study (see Additional file 1), the items are coded on an ordered 5-point scale. A large sum is taken to indicate a 'high degree of activating', and a low sum a 'low degree of activating'.

An alternative approach is offered by Item Response Theory (IRT), in which the actual perception of the items can be accounted for. The ideas from IRT can be applied in many ways; see [5] for a general discussion. One such approach is evaluated in this study and compared with the conventional sum score. The methods are briefly described below; further information about the mathematical theory can be found in [6] and [7]. For this study, we have chosen a 2PL model with item-specific discrimination. The more parsimonious 1PL model, with equal discrimination for all items, could have been chosen. However, the model must be flexible enough that a possible rejection of the intended raw sum score approach does not depend on the application of an overly parsimonious model. Models with even more complexity (e.g. with both item- and category-specific discrimination) are out of the question, as they have too many parameters in relation to the moderate sample size. The actual sample size might be seen as too small even for a 2PL model, but its flexibility is needed, and it should be emphasised that no final model is sought. The estimated models will, of necessity, be very approximate, with large SEs for the parameters, but they can nevertheless form a basis for evaluating some basic characteristics of the questionnaire and the use of the raw sum score (or some transformation of it) as a relevant measure.

### Method 1, the sum score approach

Let us assume that the location (the difficulty) is the same for all items within a dimension and that there is a constant distance between sequential categories within questions. This will correspond to the definition of the sum score approach.

The sum score might be in good agreement with the underlying degree of activating, but it may also be far from the intended measure if the items are not suitable for summation. The well-known main requirements for a sum score to work are the following:

- 1.
The distances between the steps in the graded scale are constant and equal.

- 2.
The difficulty, or the weight of the item, is the same for all items.

- 3.
The items should work in the same direction towards a common underlying trait (often called scalability).

Even if these characteristics are intended at the construction of the questionnaire, the population to which it is applied may perceive the items differently. This risk of misunderstanding is even more cumbersome when the questionnaire is 'moved' to a different population, in a different environment with, for example, a different language or culture.

The characteristics stated above for the set of items are fixed in advance and do not take into account how they might change when applied. Under classical test theory, the teacher's test score is the sum of the scores received on the items of the test. A teacher's latent trait is calculated according to an external fixed scale, decided independently of the intended population. The basis is usually (or should be) some reference set of individuals. However, there is usually no straightforward linear relationship between the sum score and a position on the constructed latent trait.

Thus, the answers as coded are summarised like this:

*SumAA = Q4 + Q7 + Q17; SumAM = Q6 + Q10 + Q15; and SumAR = Q2 + Q9 + Q13.*

SumAA, SumAM and SumAR represent the tendency of teachers to activate, to a greater or lesser extent, the student activities that have been suggested as important for learning. This 'latent trait' is hereafter called 'teacher tendency' with respect to Activating Application, Activating Meaning and Activating Reproduction.
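The sum scores above can be computed directly from the coded answers. A minimal sketch (the response data shown are invented for illustration; the item-to-dimension grouping follows the text):

```python
# Item-to-dimension grouping as given in the text.
DIMENSIONS = {
    "SumAA": ("Q4", "Q7", "Q17"),
    "SumAM": ("Q6", "Q10", "Q15"),
    "SumAR": ("Q2", "Q9", "Q13"),
}

def sum_scores(answers):
    """Raw sum score per dimension: simply add the coded item values (1-5)."""
    return {dim: sum(answers[q] for q in items)
            for dim, items in DIMENSIONS.items()}

# Invented answers for one teacher, coded 1-5 per item.
teacher = {"Q2": 3, "Q4": 4, "Q6": 2, "Q7": 5, "Q9": 3,
           "Q10": 4, "Q13": 2, "Q15": 3, "Q17": 4}
scores = sum_scores(teacher)   # {'SumAA': 13, 'SumAM': 9, 'SumAR': 8}
```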

### The problem of non response

When, for some respondents, there is a non-response to a particular question, this has to be handled in order to produce justified abilities for all respondents. In other words, values have to be imputed. A simple and reasonable method is to look for colleagues with profiles similar to that of the participant with a non-response. The median or most frequent value in this set is then imputed for the non-response. Such a procedure can be refined by iteration, but this is not done here as there are rather few missing data. A disadvantage of this method is a bias towards a more homogeneous sample (i.e. a more favourable sample) than could be expected from a complete sample. A more complicated situation arises when the non-responses are not due to oversight but rather because the question is interpreted as irrelevant by the respondent (something we seldom know). In such a case, no value should be imputed.
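The profile-matching imputation can be sketched as follows. The data structures and the strict requirement of an identical profile on the remaining items are illustrative assumptions, not the exact procedure used in the study:

```python
from statistics import median

def impute_by_profile(respondents, idx, missing_item, items):
    """Fill in a non-response with the median answer among colleagues whose
    answers on the remaining items of the dimension exactly match the
    incomplete respondent's profile (a strict variant of 'similar profile')."""
    target = respondents[idx]
    profile = [target[q] for q in items if q != missing_item]
    matches = [r[missing_item] for i, r in enumerate(respondents)
               if i != idx and missing_item in r
               and [r[q] for q in items if q != missing_item] == profile]
    if not matches:
        raise ValueError("no colleague with a matching profile")
    return median(matches)

# Invented mini-sample: respondent 0 lacks an answer to Q17.
sample = [{"Q4": 3, "Q7": 4},
          {"Q4": 3, "Q7": 4, "Q17": 5},
          {"Q4": 3, "Q7": 4, "Q17": 3},
          {"Q4": 1, "Q7": 2, "Q17": 1}]
value = impute_by_profile(sample, 0, "Q17", ["Q4", "Q7", "Q17"])  # median of [5, 3] -> 4
```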

In this material, there is just one non-response, in the AA dimension. Only one colleague has the same profile, so the imputation is simple in this case.

For the AR dimension there are 7 non-responses, all for Q13. Obviously, these cannot be considered values missing at random. An imputation according to the 'simple profile' principle might be applied to obtain a sum score for these 7 teachers, and the teacher tendency might be estimated using an IRT model without any imputation of individual values. Unfortunately, the evaluation revealed that the sum score does not work, nor could a reasonable IRT model be found. As a consequence, the imputation/estimation was abandoned and the sample of 52 complete teacher responses was used.

The AM dimension is complete.

### Method 2, the IRT approach

Under Item Response Theory, the primary interest is the teacher's score on each individual item, rather than on the test sum score.

### The parametric IRT approach

As with the sum score approach, a latent scale is constructed or identified. However, the advantage of IRT is its independence of the coding. An individual's position on the scale is estimated from the answer profile. This profile is related to the difficulty of the item and its item thresholds (characterised by the overall relative frequency of answers at the different item levels), as well as to the item's quality, which in essence means its ability to discriminate between individuals on the scale. The item difficulty is anchored at a location on the latent trait.

The individual values on the latent trait scale are directly related to the odds of answering at different levels of the items: the higher the score, the larger the probability of answering at high levels in a positively ordered item set. To allow flexibility without too many parameter estimates, the so-called 2PL graded response model is chosen.
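To make the 2PL graded response model concrete, the following sketch computes the category probabilities for one polytomous item in Samejima's graded response model; the discrimination and threshold values are invented for illustration:

```python
import math

def category_probs(theta, a, thresholds):
    """2PL graded response model (Samejima): the cumulative probability of
    answering at category k or above is a logistic curve,
        P(X >= k | theta) = 1 / (1 + exp(-a * (theta - b_k))),
    with item discrimination a and ordered thresholds b_1 < ... < b_{K-1}.
    Category probabilities are differences of successive cumulative curves."""
    cum = ([1.0]
           + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
           + [0.0])
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Invented 5-level item: moderate discrimination, thresholds spread over the trait.
probs = category_probs(theta=0.0, a=1.2, thresholds=[-1.5, -0.5, 0.5, 1.5])
# probs sums to 1; a higher theta shifts probability mass to higher categories.
```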

For further details of the actual IRT approach, see Additional file 2.

### The nonparametric 'Mokken scalability' approach

The Mokken scalability analysis [6] is an efficient method for evaluating the extent (scalability) to which the items in a questionnaire work together to form one underlying latent trait. However, it does not estimate the teacher tendency; it only evaluates whether the respondents can be reasonably ordered by the sum score. Three measures are essential in such an analysis:

- 1.
The item pair scalability, H_{ij}, in essence the correlation between two ordered variables.

- 2.
The item scalability, H_{i}, an item's correlation with the remaining set of variables.

- 3.
The item set scalability, H, the total correlation for the set of variables.

The scalability can be viewed as the observed correlation divided by the maximum correlation attainable for the observed data (which, in contrast to the continuous case, is < 1).
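This 'observed over maximum' view of the item-pair scalability can be sketched as follows. For samples of equal length, the maximum covariance given the item marginals is attained when both items are sorted the same way (comonotone pairing); the item scores are invented:

```python
def covariance(x, y):
    """Plain sample covariance (divisor n; the divisor cancels in the ratio)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def h_ij(x, y):
    """Item-pair scalability H_ij: the observed covariance divided by the
    maximum covariance attainable under the observed marginals, which for
    samples is reached by pairing both items in sorted order."""
    return covariance(x, y) / covariance(sorted(x), sorted(y))

# Invented item scores for four respondents.
perfect = h_ij([1, 2, 3, 4], [1, 2, 3, 4])   # 1.0: both items order respondents identically
partial = h_ij([1, 2, 3, 4], [2, 1, 4, 3])   # 0.6: some ordering violations
```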

An overall requirement is that the scalabilities according to 2 and 3 above must be positive.

If a reasonable scalability is found (a rule of thumb is > 0.3), the sum score approach might be accepted.

The characteristics of the scoring procedure should, ideally, be calculated from a large reference set and then applied to the actual 'test set'. However, as is often the case, no reference set is available, which leads us to use the actual sample as its own reference.

In this study, the questionnaire will be evaluated in two steps:

- 1.
A nonparametric scalability analysis. If the scalability is found to be insufficient, the sum score can be ruled out as inappropriate.

- 2.
A parametric IRT model will be applied to investigate item difficulty and item discrimination.

The assumption of three latent traits, AA, AM and AR, implies that three separate analyses are needed for this study.

The scalability analysis is performed by the Mokken program [8].

The parametric IRT approach is evaluated by the Parscale computer program [9].