Codebook for rating clinical communication skills based on the Calgary-Cambridge Guide

The aim of the study was to confirm the validity and reliability of the Observation Scheme-12, a measurement tool for rating clinical communication skills. The study is a sub-study of an intervention study using audio recordings to assess the outcome of communication skills training. This paper describes the methods used to validate the assessment tool Observation Scheme-12 by operationalizing the crude 5-point scale into specific elements described in a codebook. Reliability was tested by calculating intraclass correlation coefficients for interrater and intrarater reliability. The validation of the Observation Scheme-12 produced a rating tool with 12 items, each with 0 to 5 described micro-skills. For each item, the codebook describes the criteria for assigning a rating from 0 to 4, depending on how successfully the different micro-skills were performed (or, for one item, on the number of jargon words used). When reliability was tested for the overall score, the intraclass correlation coefficient was 0.74 for interrater reliability and 0.86 for intrarater reliability. An intraclass correlation coefficient greater than 0.5 was observed for 10 of 12 items. The development of a codebook as a supplement to the assessment tool Observation Scheme-12 enables an objective rating of audiotaped clinical communication with acceptable reliability. The Observation Scheme-12 can be used to assess communication skills based on the Calgary-Cambridge Guide.

tool. However, during teaching sessions, it has been used as a guide to assess the specific communication skills performed and to provide systematic and structured feedback.
With the introduction of teaching programmes, many assessment tools have been developed [10][11][12], including tools based on the C-CG [13][14][15][16][17][18][19][20][21][22]. The tools differ in the number of items, response scales, settings, and aims of the assessment. One tool used three items, as the aim of that study was to assess agenda making [17]. Another tool excluded items measuring the beginning and closing of the consultation [16]. The most common use of the C-CG as an assessment tool is to evaluate communication throughout the consultation [13,14,18,20]. Some tools have been developed for an Objective Structured Clinical Examination (OSCE) [15,19], while others have been developed for rating audio or video recordings of the consultation [13,22]. The tools have been used in different countries [14,17,21].
In Denmark, an assessment tool based on the C-CG was developed by two of the co-authors (JA and PK) [20] with the purpose of comparing medical students' self-efficacy in communication skills to the observed ratings using simulated patients and an examiner during an OSCE [20]. The questionnaire was a useful and reliable tool for measuring communication skills based on the C-CG. As the questionnaire was familiar to the authors and had been tested in a Danish setting, we decided to confirm its validity and reliability before using it in an intervention study in which audio recordings were to be rated in a pre-post design. The questionnaire was named Observation Scheme-12 (OS-12).
The aim of the study was to confirm the validity and reliability of Observation Scheme-12, a measurement tool for rating clinical communication skills.

Setting
The study was part of an intervention study investigating the impact of the implementation of communication skills training based on C-CG at a large regional hospital in Denmark ("Clear Cut Communication with Patient") [23]. The consultations occurred at the interdisciplinary outpatient clinic at the Spine Centre of Southern Denmark, Lillebaelt Hospital.

Study sample
During the period from 2014 to 2015, 51 HCPs were asked to audio record 10 encounters before and after participating in the communication skills training. All audio recordings documented individual consultations between patients presenting with back or neck pain and a medical doctor, nurse, physiotherapist or chiropractor. Patients were informed about the purpose of the study at the beginning of the consultation and asked whether they wanted to participate. The HCPs turned on the audio recorder after the patients had provided informed consent.

Assessment tool
The OS-12 contains 12 items covering the following six domains: initiating the session, gathering information, building the relationship, explanation and planning, providing structure, and closing the session. Each item was rated on a 5-point scale with the following levels of quality: 0 -'Poor', 1 -'Fair', 2 -'Good', 3 -'Very good', and 4 -'Excellent'. Consequently, the overall score ranged from 0 to 48 points.

Content validation
A panel of four researchers and three teachers was selected to judge the ability of the OS-12 to measure the construct of the provided communication skills training. The researchers had been involved in developing the communication skills training program, "Clear Cut Communication with Patient", based on the C-CG, and the teachers were trained as communication trainers in the program.

Codebook development
The codebook was developed by rating 23 audio recordings from seven HCPs (Table 1 describes the characteristics of the included patients and HCPs). The codebook described how points should be allocated, including how to distinguish between similar scores. The coders divided the micro-skills from each item into four groups to systematize and quantify the points to be allocated. As the full length of some consultations had not been recorded, the option of rating an item as "not applicable" was added.
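The aggregation rule that follows from this scheme (12 items rated 0 to 4, with "not applicable" items excluded) can be sketched as below; the data layout and use of `None` for "not applicable" are illustrative assumptions, not the authors' implementation:

```python
# Sketch of OS-12 overall-score aggregation (hypothetical data layout).
# Each of the 12 items is rated 0-4, so a fully rated consultation
# scores between 0 and 48; items coded "not applicable" (None here)
# are excluded from the total.

def overall_score(item_ratings):
    """Sum the 0-4 item ratings, skipping 'not applicable' (None) items."""
    rated = [r for r in item_ratings if r is not None]
    for r in rated:
        if not 0 <= r <= 4:
            raise ValueError(f"rating out of range: {r}")
    return sum(rated)

# Example: the recording stopped before the closing of the consultation,
# so the last two items were coded "not applicable".
ratings = [3, 2, 4, 1, 2, 3, 2, 3, 1, 2, None, None]
print(overall_score(ratings))  # prints 23
```

The `None` convention simply mirrors the "not applicable" response option described above; any representation that keeps such items out of the sum would do.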

Coding procedure
Two of the authors (EI and HP) coded the recordings. These authors are an experienced medical doctor and an experienced nurse, respectively. The nurse had completed the same communication skills training programme as the participating HCPs and the medical doctor had experience in teaching communication skills to medical students.
The coders listened to the audio recordings while making notes on a handwritten form of the OS-12 before transferring the results into a SurveyXact solution, an online data management system. The coders found no need for transcriptions of the audio recordings as they manually wrote important sentences and described how micro-skills were demonstrated to support the points given.

Outcome measures and statistical analysis
The OS-12 is intended to measure communication throughout the consultation; therefore, our primary measure of reliability was the overall score, calculated by adding the scores for the 12 items. Reliability was assessed by calculating the intraclass correlation coefficient (ICC) [24]. The ICC for interrater reliability was based on a two-way random-effects model with absolute agreement [25]. The ICC for intrarater reliability was also based on a two-way model, but with mixed effects [25]. The ICC for each item was calculated to investigate whether some items had a lower correlation than others. The statistical analysis was conducted using the STATA/IC 15.0 software package.
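The interrater coefficient described above is the classic ICC(2,1) of Shrout and Fleiss (two-way random effects, absolute agreement, single rater), which can be computed from the ANOVA mean squares. A minimal sketch in Python follows; this is an illustration of the statistic, not the authors' STATA code:

```python
import numpy as np

def icc_2_1(X):
    """ICC(2,1): two-way random-effects, absolute agreement, single rater.
    X is an (n targets x k raters) array of ratings."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()  # between targets
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((X - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                # mean square, targets
    msc = ss_cols / (k - 1)                # mean square, raters
    mse = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two raters in perfect agreement on four consultations yield an ICC of 1.0:
perfect = [[3, 3], [1, 1], [4, 4], [2, 2]]
print(icc_2_1(perfect))  # prints 1.0
```

Because the model treats raters as random and penalizes systematic rater offsets (the `msc` term), a rater who scores consistently higher than the other lowers the ICC even when rank ordering is preserved, which is the appropriate behaviour for absolute-agreement reliability.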

Results
Audio recordings from 30 HCPs were included. See Table 1 for the characteristics.

Content validation
The panel of researchers and teachers determined that every item was relevant and matched the communication skills training based on the C-CG. In addition, they suggested adding micro-skills from the C-CG to increase the understanding of the items. The selection of micro-skills was based on the teachers' experience from the first training courses, and micro-skills were included if both researchers and teachers agreed that they were essential to the item. For some items, it was decided to merge two micro-skills from the C-CG because they were considered to be connected. In item 1, "Identifies problems the patient wishes to address", the micro-skill "making an opening question" was merged with "listening actively", as the panel decided that HCPs had to give the patient space to answer if they used an opening question. In addition, the panel found it difficult to negotiate an agenda without screening for further issues; therefore, those two micro-skills were also merged.
The results of the content validation are shown in Table 2, which also presents the codebook with an overview of the criteria for the points allocated to each item of the OS-12. The codebook is based on an assessment of the demonstrated micro-skills and other types of behaviour as they appeared in the audio recordings. Before using the OS-12 and the codebook, an understanding of the micro-skills as described in the C-CG [9] is necessary, as the coding procedure relies on the raters' ability to identify these micro-skills.

Codebook development
Four items were more troublesome for the coders to describe than others; details regarding the coding of these items are provided below.
Item 3, "Uses easily understood language, avoids jargon", does not contain any micro-skills. Consequently, the coders decided to allocate points according to the number of medical terms used. However, while some words were clearly medical jargon, for example "cerebrum", "column" or the question "how is your general condition?", other words were more difficult to classify as medical jargon, such as "prognosis", "paracetamol" and the very commonly used word "functioning". The coders concluded that the use of medical jargon was acceptable as long as the words [...]
General coding rules from the codebook: be very precise about coding the demonstrated skills in the domains in which they occurred; record zero points if an item is not apparent; if the audio recording stops before all information is provided, code items 8, 9 and 10 as "not applicable"; code structure only if the audio recording did not stop during "Initiating the session" and "Gathering information".
Item 4, "Uses appropriate non-verbal behaviour", was challenging to rate from audio recordings rather than video. The following distinctions were made between the four micro-skills. The tone of voice of the HCP was used to assess a "calm speaking pace", whereas "pausing" meant that the HCP allowed silence during the conversation. Points for "no interruptions" were given when the HCP listened to the patient without interrupting or finishing the patient's sentences. Finally, "leaves space for the patient to talk" was present when the HCP allowed patients to tell their stories and enabled them to talk about their worries and concerns.
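The count-based rating for item 3 amounts to a step function from jargon-word count to the 0-4 scale. A sketch of such a mapping follows; the cut-offs here are purely hypothetical placeholders, since the actual criteria are those given in the codebook (Table 2):

```python
# Hypothetical mapping from jargon-word count to the 0-4 rating for
# item 3 ("Uses easily understood language, avoids jargon").
# The real cut-offs are defined in the codebook (Table 2); these
# thresholds are placeholders for illustration only.

JARGON_CUTOFFS = [(0, 4), (2, 3), (4, 2), (6, 1)]  # (max count, rating)

def rate_jargon(jargon_count):
    """Return a 0-4 rating given the number of unexplained jargon words."""
    for max_count, rating in JARGON_CUTOFFS:
        if jargon_count <= max_count:
            return rating
    return 0  # heavy jargon use: 'Poor'

print(rate_jargon(1), rate_jargon(7))  # prints 3 0
```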
In item 7, "Attends to timekeeping, and keeps the interview on track", the coders listened for the ability of the HCP to structure the consultation according to four C-CG domains: initiating the session, gathering information, explanation and planning, and closing the session. When the HCP demonstrated proficiency in these four domains, two points were given. Thus, if the coders disagreed on whether the HCP convincingly demonstrated the four domains, they also disagreed on item 7.
Coding item 9, "Checks the patient's understanding", proved to be difficult, as the micro-skills were rarely demonstrated. The use of a summary, an essential part of the first micro-skill, was occasionally performed by the HCP, but very few HCPs had the patients summarize the information or confirmed that the patients had understood the information provided to them. The last micro-skill, "Asks patients what other information would be helpful, address patient's needs for information", was often demonstrated at the end of the consultation and was sometimes difficult to differentiate from the micro-skill "Finally checks that the patient agrees and is comfortable with the plan" from item 12, as some HCPs asked "are there any uncertainties?" or "anything else we need to talk about?" when closing the consultation. Consequently, it was specified in the codebook to give points only if the demonstrated micro-skill occurred in the right domain.

Interrater reliability
The main outcome measurement for the ICC was the overall score, and the codebook resulted in good interrater reliability (IRR), with an ICC of 0.74 (95% CI 0.52-0.85) (Table 3). The ICC was greater than 0.5 for 10 items, while the ICCs for two items, "Attends to timekeeping, and keeps the interview on track" and "Checks patient's understanding", were below this threshold. Items 1 and 2 were rated in 82 of 83 cases, as the audio recorder was not turned on at the beginning of the consultation on one occasion. Items 11 and 12 were rated in 80 of 83 cases, as the audio recorder stopped before the closing of the consultation in three cases.

Intrarater reliability
After an interval of 3 months, one of the authors (EI) rerated 20 audio recordings. The intrarater ICC for the overall score was 0.86 (95% CI 0.64-0.94).

Discussion
In this study, we present the validation and the process of developing a codebook to establish reliability in rating clinical communication skills using the OS-12 assessment tool. Based on guidelines [26], good interrater reliability (0.74) and excellent intrarater reliability (0.86) were observed for the overall score when the codebook was used alongside the OS-12 assessment tool.
Only a few other studies have reported the IRR when using assessment tools based on the C-CG. Simmenroth- [...] [21]. In 2014 [27], the same group reported poor to fair reliability (ICCs ranging from 0.05 to 0.57) on individual items from the C-CG. Thus, coding communication is difficult, and despite the codebook we were not able to obtain a sufficient ICC (> 0.4) [26] for item 7, "Attends to timekeeping and keeps the interview on track", or item 9, "Checks patient's understanding". The two coders allocated two points for item 7 if the interview was structured according to the C-CG, including initiating the session, gathering information, explanation and planning, and closing the session. However, if the coders disagreed on the successful fulfilment of other items, such as item 2, "Clarifies the patient's prior knowledge and desire for information", or item 12, "Summarizes the session briefly and clarifies the plan of care", they also disagreed on item 7, making item 7 sensitive to disagreement on other items (data not shown).
When the coders talked about item 9, they defined the meaning of "checking for patient's understanding" and the micro-skills related to this item. They concluded that the HCPs must confirm that the patient understood the information provided in the consultation. However, because the raters did not have access to the patients' nonverbal responses, they were unable to easily assess whether the patients understood the information. HCPs may have accepted a nod as an acknowledgement that the patient understood the explanation. Only a few HCPs explicitly asked patients to repeat or summarize the information provided. Generally, HCPs asked a simple closing question, e.g., "Do you understand?" or "Do you have any questions?", and accepted a yes or a no, respectively, as verification of the patient's understanding, making the judgement of whether the patient actually understood the information difficult. The confirmation of a patient's understanding is a well-known challenge, as HCPs have been shown to overestimate and rarely thoroughly confirm the patient's understanding [28]. Likewise, patients overestimate what they understand or do not express their lack of understanding [29].
The insufficient ICCs for items 7 and 9 point to the well-known problem of a low ICC when items have low scores or low variance, as minor disagreements then have a greater impact on the IRR [24,30]. However, this problem was not observed in the present study, and it is worth discussing whether items with a low ICC should be excluded. Nevertheless, the OS-12 is based on the C-CG and therefore builds on the assumption that every item is essential and relevant to the consultation. Consequently, no items were excluded, and we recommend using the "not applicable" response option only in the case of technical difficulties or similar situations. In this study, none of the items were coded "not applicable" if the entire encounter was recorded.
We used a 5-point scale in the codebook because it was tested in the original study [20]. Other researchers have used two-point [17,31], three-point [14,19], four-point [13,18] or five-point scales [27] when rating communication skills based on the C-CG. We recommend maintaining the 5-point scale when utilizing the OS-12, as all micro-skills are divided into groups of five.
The two coders had similar characteristics (e.g., training, experience, and gender) and previous experience in coding [32]. However, they had different professional backgrounds (e.g., a nurse and a doctor). According to other studies [33], coders with the same gender, professional background, and coding experience generate a higher IRR. In the present study, a decision was made to have coders from different professional disciplines rate the audio recordings, because the recordings were obtained from an interdisciplinary clinic with different HCPs represented.
The fact that the encounters were audio recorded instead of video recorded was a limitation of the study, resulting in an incomplete rating of non-verbal communication. Without access to visual documentation of the encounter, it was impossible to assess how body language and the interaction between the HCP and the patient affected the relationship. However, in order to assess parts of the non-verbal communication, we chose to rate a calm speaking pace, no interruptions of the patient, leaving space for the patient to talk, and pausing. The audio solution was chosen because it was the most feasible method in that setting. A second limitation was that the OS-12 did not include every micro-skill from the C-CG. The C-CG contains 73 different micro-skills [9], and in this study the expert group selected those given the highest priority in the training course. Consequently, the OS-12 reflects the selected skills, and the coding tool has to be used with this limitation in mind. Furthermore, as the C-CG is a generic communication skills teaching strategy, the OS-12 may be used to code these skills in other countries and settings where communication skills training is based on the C-CG. However, studies are required to investigate whether similar results can be obtained elsewhere; when the OS-12 is applied in other settings and countries, validation is recommended, including careful consideration of which micro-skills were given priority in the specific training course.

Conclusions
The utilization of a codebook as a supplement to the OS-12 assessment tool fosters an objective rating of clinical communication skills. It provides acceptable interrater and intrarater reliabilities for the overall score when audio recordings are coded separately by two raters. The OS-12 can be used to assess the communication skills of HCPs and evaluate communication throughout the HCP-patient encounter. The OS-12 is particularly recommended as an assessment tool if communication is based on the Calgary-Cambridge Guide.