Analyses of inter-rater reliability between professionals, medical students and trained school children as assessors of basic life support skills.

BACKGROUND
Training of lay-rescuers is essential to improve survival-rates after cardiac arrest. Multiple campaigns emphasise the importance of basic life support (BLS) training for school children. Trainings require a valid assessment to give feedback to school children and to compare the outcomes of different training formats. Considering these requirements, we developed an assessment of BLS skills using MiniAnne and tested the inter-rater reliability between professionals, medical students and trained school children as assessors.


METHODS
Fifteen professional assessors, 10 medical students and 111-trained school children (peers) assessed 1087 school children at the end of a CPR-training event using the new assessment format. Analyses of inter-rater reliability (intraclass correlation coefficient; ICC) were performed.


RESULTS
Overall inter-rater reliability of the summative assessment was high (ICC = 0.84, 95 %-CI: 0.84 to 0.86, n = 889). The number of comparisons between peer-peer assessors (n = 303), peer-professional assessors (n = 339), and peer-student assessors (n = 191) was adequate to demonstrate high inter-rater reliability between peer- and professional-assessors (ICC: 0.76), peer- and student-assessors (ICC: 0.88) and peer- and other peer-assessors (ICC: 0.91). Systematic variation in rating of specific items was observed for three items between professional- and peer-assessors.


CONCLUSION
Using this assessment and integrating peers and medical students as assessors gives the opportunity to assess hands-on skills of school children with high reliability.


Background
According to the International Liaison Committee on Resuscitation (ILCOR)-guidelines of 2010 the most important determinant of survival from sudden cardiac arrest is the presence of a trained lay rescuer who is ready, willing, and able to act [1].
All over the world initiatives aim to increase the ability and awareness of potential lay rescuers. Initiatives focus on providing adequate training material and recommend to introduce basic life support (BLS) training in schools [2]. Different training concepts e.g. standard instructor-led hands-on training in groups and self-instruction with training videos have been shown to improve BLS skills of school children [3]. It is difficult to compare the effectiveness of the different training concepts because there is no uniform assessment [4]. The ILCOR-guidelines 2010 outline the need for further research on optimizing assessment of CPR skills to support individual learning by providing feedback (formative assessment). It also provides the opportunity to compare the effectiveness of different training formats (summative assessment) [3]. However, assessing BLS skills in a school setting is challenging because assessment in this context takes up a lot of manpower. Because mannequins with included digital skillreporters are expensive and their transport complex, cost-effective and portable MiniAnne mannequins (Laerdal®) are widely used for BLS hands-on training of school children.
We designed a study to improve the feasibility to assess BLS skills at schools. By including medical students and trained school children (peers) as assessors we reduced the personnel requirement of professional assessors.
The two aims of the study were to develop a contextappropriate assessment of hands-on BLS skills of school children at schools and to test the inter-rater reliability when using professionals, medical students and trained school children as assessors.

Setting
This validation study was associated with a study investigating the effectiveness of peer-led hands-on BLS training for school children compared with professional-led training [5]. The study took place at eight schools in Hamburg, Germany, during the awareness-weeks of the national campaign "einlebenretten" ("save one life") in September 2013 and 2014 and the local campaign "Hamburg rettet Leben" ("Hamburg saves lifes") in March 2014. The assessment was part of an educational event at the schools and included school children of grades 7 to 10 (twelve to sixteen years old). The educational event lasted three hours and consisted of three parts. Part one was a 30-min lecture based on a lecture for the school children (available at the platform "www.einlebenretten.de") provided by a physician. Part two consisted of 45 min hands-on training in basic life support (BLS) in groups of 10-14 school children with two instructors using "MiniAnne" (Laerdal®). Every group used 4 MiniAnne mannequins and every child practiced all skills evaluated with the formative-assessment in two-rescuer-scenarios. The third part was the "Assessment".

Assessment
For the practical assessment, we used a standardised three minutes/two-rescuer scenario assessed by a checklist. The examinees were supposed to perform the initial check for responsiveness and breathing and start CPR. They were supposed to tell an assistant rescuer to call the emergency medical service (EMS) and get an automated external defibrillator (AED) and attach it to the mannequin. The assistant rescuer was a child assessed immediately before. We set up numbered assessment-sites (up to eight), each with a MiniAnne and two examiners. The assessors were mixed based on alphabetical lists. We classified professionals and medical students as reference-assessors and tried to combine one reference-assessor with one peerassessor. At some schools there were more assessmentsites than professional assessors. In this case we used assessment pairs consisting of two peers. If there were more professional assessors we combined one professional and one student or two professional assessors. The school children were randomly assigned to the assessment-sites, read the case vignette, and started the assessment upon an acoustic signal.
The assessors judged independently based on a structured assessment checklist consisting of 15 items. All school children received feedback about their performance based on 15 items (formative assessment) from a professional-instructor after their assessment. Eight items of this assessment were relevant to either "pass" or "fail" the assessment. These eight items (summative assessment) were used to compare training outcomes between the peer-led and professionalled training.

Assessors
Trained school children (peers), professional instructors and medical students acted as instructors and assessors at the school event. Peers were school children of the participating schools who were trained beforehand. The training of the peers included two three-hour training sessions provided by two anaesthesiologists who were experienced in instructing medical students in BLS. The first session focused on a knowledge-based access to cardiac arrest and resuscitation as well as hands-on training in BLS. During the second session the peers were taught the 4-step-approach (sequence of teaching steps recommended in the ERC instructor manual) [6] and practiced to rate BLS-performance based on our formative assessment-checklist.
The potential peers were recruited by their teachers. Participation was offered to all school children of a class/ course by their teachers. The school children participated voluntarily.
The medical students were in their final year of study which consists of four months in internal medicine, four months in surgery and four months in a subject of their choice. All students in our study had chosen their elective in the clinic of anaesthesiology of the University Hospital Hamburg-Eppendorf and participated voluntarily in the educational events for the school children.
The professional instructors were recruited from the staff of the Clinic of Anaesthesiology at the University Hospital Hamburg-Eppendorf. All of the professionals were experienced in training others in BLS and also volunteered to participate in the training at the schools. All assessors were trained up to 2 weeks in advance to use the formative assessment-checklist. The training was a group-training for 5-20 assessors. BLS skills were rated during three CPR-scenarios by all assessors and a reference rater (ERC-ALS-instructor). The individual results were discussed and aligned with the results of the reference-rater (frame-of-reference training) [7].

Assessment-instrument First step
We developed a formative assessment-checklist (Table 1) based on the learning objectives for the educational events and providing a structured guide for giving feedback during the hands-on training and after the practical examination. The instructions "check-call-compress" of the initiative "einlebenretten" formed the basis of the formative assessment to maintain structure during the educational event. In a stepwise process, formative assessment was developed by listing all instructions of "einlebenretten" [8] and the ERC-BLS-algorithm [9] and uniting the items. We added parameters, which were based on international consensus to define expected dimension of performance. To rate sufficient compression depth we used the "klicking" of the MiniAnne mannequin indicating compression depth between 42.5-57.5 mm. The compression rate was counted using a tapping beat-counter (free application for android: BPM Tap by Ignace Maes and free application for i-phone: BPM_counter by Torsten Klaus). Hand-activated stop-watch measurements were used to measure the lengths of time until starting chest-compression and the lengths of time of interruptions of chest-compression.

Second step
We developed a summative assessment-checklist to score the outcome of the educational event and to compare the groups. Targeting a binary outcome for every examinee (pass/fail), the trial leader reduced the items stepwise generating pass-relevant items. Quality was ensured with respect to the two most common flaws in examinations, "construct underrepresentation" and "construct irrelevant variance" [10]. Construct underrepresentation was avoided by including only outcome relevant items and excluding items oversampling of one content area by doublescoring. According to current literature, we focused on chest-compression as the key determinate to improve survival and excluded mouth-to-mouth ventilation.
Construct irrelevant variance was achieved by excluding items with unreliable scoring or representing not mainly No interruption of more than 10 s task-specific knowledge and skills of the examinee. The final checklist contained 15 Items of the formative assessment and eight of those, ( Table 2) were relevant to pass the exam. The summative assessment was judged "passed", when all eight items of the checklist were rated with "yes" by both examiners. If one item was judged "no" the overall assessment was judged as "failed".

Statistical analysis
We used the intraclass correlation coefficient (ICC) to assess inter-rater reliability of the assessment between the different groups of assessors. Statistical analyses were performed with SPSS Version 21 (IBM SPSS Inc., Chicago, IL, USA) using a two-way-random-model (confidenceinterval 95 %  [5]. We expected the sample size of main study as adequate to evaluate intraclass correlation coefficient (ICC) between professionals, medical students and peers with narrow confidence limits.

Results
One hundred eleven peer-instructors, 10 medical students and 15 professional trainers evaluated 1087 school No interruption of more than 10 s children with this checklist. Demographic data of the assessors are presented in Table 3. One hundred fourteen school children forgot their written informed consent and 84 were evaluated by only one assessor. Overall data of 889 summative-assessments were used to calculate interrater reliability (Fig. 1). Mean age of the assessed school children was 14 years, 51 % were female and 49 % male school children.

Inter-rater reliability
Overall inter-rater reliability ( Table 4)  . Agreement between peer-and student assessors was high. Disagreement (assessment judged as "failed" by only one of the two assessors) occurred comparably often in the group of peer-and student-assessors. The scoring of peerand professional-assessors differed in 65 of 339 ratings. Peers rated the examinations of the school children more often as "failed" (n = 43) compared to professionals (n = 22). Professionals, compared to medical students, rated the examination of school children more often as "failed" (number of examinations only failed according to the scoring of professional-assessors: n = 7/student-assessors: n = 3). Peers and professionals rated the items "check responsiveness", "correct compression point" and "no-flow-time" significantly different (p-value of McNemar's test for "check responsiveness" = 0.004, "correct compression point" = 0.001 and "no-flow-time" < 0.001). The school children failed those items more often, when rated by peers than by professionals. Failure-rates of the other items did not differ between peer-and professionalassessors (Table 5). Failure-rates of the school children rated by medical students and professionals and those rated by students and peers differed not significantly according to McNemar's test.

Discussion
In the current study, we found a good inter-rater reliability between professional-and peer-assessors (ICC > 0.7) and optimal (ICC > 0.8) to excellent (ICC > 0.9) inter-rater reliability between peer-and other peer-or student-assessors, according to intra-class correlation coefficient (ICC) and the 95 %-confidence interval of ICC.
School children were more likely to pass the exam when rated by a professional than by a peer because they failed the items "check responsiveness", "correct compression point" and "no-flow-time" more often, when rated by peers than by professionals. These results are in line with the current literature indicating examinees are more likely to pass the exam, when rated by professional-instructors than by (student-) peer-instructors [11,12].
Though we can't explain the differences on item level based on current literature, we assume that the peers rated the items "check responsiveness", "correct compression point" and "no-flow-time more "text-book" based on definitions and metrics. Whereas professionals may have used their medical experience to judge items "yes" that aren't necessarily "correct" based on guideline definitions but work out in a real-life.
Failure-rates of the school children rated by medical students and professionals did not differ significantly for the whole assessment. There was no systematic scoring difference between these two groups of assessors on single item level. Neither could we demonstrate a significant, systematic scoring-difference between student-and peer-assessors.
This study adds two important aspects to the current literature: 1) This study developed a context-appropriate assessment of hand-on BLS-skills of school children. It is reliable using professional instructors, medical students and peers as assessors. Additionally, the new assessment meets other performance-indicators like feasibility, cost effectiveness, specificity and fidelity [6]. Using low-cost assessment instruments (MiniAnne and smart-phones of the assessors) and including medical students as well as trained school children as assessors, we managed to assess up to 96 school Table 3 Demographic data of the assessors 2) The analysis of inter-rater reliability between the three assessor-groups (professionals, medical students and peers) demonstrated peers and medical students are capable of reliably assessing BLS skills. We could not show any systematic rating-difference between professional-and student-assessors. But failure-rates of school children rated by peers were higher. The severe examination by peers led to a low-risk of school children passing the exam unjustified when rated by peers.
There are several limitations of the study. First of all the dropout rate was high. One hundred ninety-eight assessments could not be included in the analysis because  school children forgot their written informed consent or were only evaluated by one assessor. We cannot estimate the consequence of the missing assessments on the results. The development of the assessment-checklist and the assessor-training was performed by one professional-trainer only. Probably a consensus-process like the DELPHIprocess [13] would have enforced the content-validity of the assessment checklist. The fact that we used frameof-reference training for all assessors, the scoring of the assessment checklist is closely related to our reference-rater. This has to be considered before generalising the results. The CPR performance assessed with MiniAnne was not compared to performance assessed with another instrument of assessment e.g. high-quality-feedback mannequin, thus allowing to estimate of dependency of pass-rates and reliability of assessors from the assessment instrument. The "clicking" sound of the MiniAnne did not only offer feedback for the assessors but also for the examinees and probably improved the observed compression quality. The influence of supportive behaviour of the assistant rescuer on performance of the examinee during the two-rescuer scenario could be reduced by using a "standardized" roleplay for the assistant rescuer.
The summative assessment scores performance of compression-only CPR. According to current literature, we focused on chest-compression as the key determinate to improve survival. We excluded mouth-to-mouth ventilation from the summative assessment, because in the city of Hamburg the emergency medical service arriving time is in 79 % within 8 min and cardiac origin of arrest is very high [14]. In different locations mouth-to-mouth ventilation may be more important for patient outcome [15].

Conclusion
Using this assessment and integrating peers (trained school children) and medical students as assessors gives the opportunity to assess hands-on skills of school children, with high reliability.