Aim, participants, and setting
We aimed to determine whether medical students share an understanding of the qualities inherent to high-quality and low-quality written comments and to identify the features that distinguish high-quality from low-quality comments for clinical medical students. Because the primary aim of the study was to evaluate a shared culture among medical students, a purposive sampling strategy was used to select medical students in their third and fourth years of training. Sampling in this way ensured that participants had experience receiving evaluative feedback in written form. Recruitment took place at an Ivy League medical school in Northern New England, and the total sample included 22 students. Samples of 20 to 30 individuals are recommended for reliability at or above 0.90 in studies using cultural consensus analysis [24]; our sample therefore falls within this range. More than half of the participants were men (n = 13); 49% of the classes sampled were male. The average age of participants was 26.3 years (range 22–33 years). Participants had experience reading and interpreting written comments because, at our institution, evaluations, including written comments, are made available to students upon completion of each clinical clerkship; clinical clerks at our institution are thus accustomed to reading written comments about their own performance outside the context of the clinical encounter. As at the majority of medical schools nationwide, our school uses oral, formative feedback to improve student performance over the course of the clerkship, whereas written, summative feedback is decontextualized and provided either in the form of grade narratives or raw evaluations weeks after the clinical experience.
Design and data collection
Participants completed a pile sorting activity to assess the helpfulness of each comment. Written comments were drawn from assessments of medical students written by supervising faculty clinicians two years prior; this time lag ensured that no participating student would evaluate a comment about their own performance.
After the research team read and reviewed all written comments, the comments were segmented into “meaning units”: phrases or paragraphs that convey a single central meaning based on their content and context [23]. Meaning units were generated by the research assistant and then reviewed and validated by the senior author. After segmenting the comments, we randomly selected 62 segmented comments for inclusion in the pile sorting activity.
Pile sorting is a rigorous qualitative technique used within cognitive anthropology to examine how participants perceive items to be related [24]. Data collection for pile sorting followed the two-step process outlined by Bernard and Ryan [25] and Weller [26]. First, participants were asked to sort items (written comments) into piles based on their perceived similarity. Specifically, participants were given cards, each containing one written comment, and were asked to sort the cards into two piles: “unhelpful” and “helpful.” Comments perceived as helpful were thus sorted into one pile, and comments perceived as unhelpful into the other.
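To illustrate how such two-pile sorts can be recorded and aggregated for analysis, the following is a minimal sketch in Python; the sort data are hypothetical placeholders, and Anthropac performs the equivalent aggregation internally.

```python
import numpy as np

# Hypothetical two-pile sorts: one row per participant, one column per
# written comment; 1 = sorted into the "helpful" pile, 0 = "unhelpful".
sorts = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
])

# Aggregate similarity matrix: for each pair of comments, the proportion
# of participants who placed both comments in the same pile.
same_pile = sorts[:, :, None] == sorts[:, None, :]
similarity = same_pile.mean(axis=0)
```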
Second, participants were asked to describe their piles in their own words. Following the sorting process, participants therefore took part in a semi-structured qualitative interview to elicit their reasons for sorting each comment as helpful or unhelpful. Follow-up qualitative interviews are an essential component of pile-sort data collection because the descriptive answers (participant comments) obtained in the interview can be used to interpret the sort data [25]. Open-ended questions were asked to understand what qualities contributed to a comment being perceived as helpful versus unhelpful, as well as the perceived general characteristics of helpful and unhelpful comments. Each student provided participant comments on twenty randomly chosen written comments. We also performed a member check with a small sample of participants to obtain their reflections on the findings and to gain additional insight. This study was approved by the institutional review boards of the medical school and the academic medical center.
Data analysis
Data analysis proceeded in several stages: first to determine which written comments were perceived as helpful, and then to determine why those comments were perceived as helpful. We first performed cluster analysis and then analyzed the written comments in each cluster using both qualitative and statistical analysis.
Cluster analysis
First, results from the pile sorts were analyzed using Visual Anthropac: Pile Sort [27]. Anthropac analyzes data along a given domain (in this case, the quality of written comments) and determines the degree of informant knowledge within a particular group. Specifically, Anthropac aggregates individual participants' sorts into a similarity matrix that quantifies, for each pair of items (i.e., written comments), the percentage of participants who sorted them into the same pile. Multidimensional scaling (MDS) is a non-metric means of visualizing the level of similarity of individual cases in a dataset and is also known as perceptual mapping. MDS derives its underlying dimensions from respondents' judgments about the similarity of pairs of items and does not depend on researchers' judgments. MDS converts similarity data in matrix form (such as the aggregate similarity matrix described above) into a two-dimensional visual representation of the "distance" between sorted items. Using MDS to analyze the aggregate similarity matrix, we created a data display that visually maps, in two-dimensional space, how written comments were sorted across all student participants [25]. The MDS map was then layered using the cluster analysis module to facilitate the identification of specific groups of items that students judged to be similar. In cluster analysis, items that share, on average, higher degrees of similarity are outlined visually into groups on the MDS map. Cluster analysis thus permitted the research team to clearly demarcate written comments that were perceived as similar across the sample of participants.
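A minimal sketch of the equivalent MDS and clustering steps, assuming Python with scikit-learn and SciPy rather than Anthropac; the similarity values are hypothetical placeholders in the form produced by the earlier sketch.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical aggregate similarity matrix: proportion of participants
# who sorted each pair of comments into the same pile.
similarity = np.array([
    [1.0, 0.2, 0.8, 0.9],
    [0.2, 1.0, 0.3, 0.1],
    [0.8, 0.3, 1.0, 0.7],
    [0.9, 0.1, 0.7, 1.0],
])

# MDS expects dissimilarities, so invert the similarities.
dissimilarity = 1.0 - similarity

# Two-dimensional non-metric MDS map of the sorted comments.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
coords = mds.fit_transform(dissimilarity)

# Hierarchical clustering of the same dissimilarities; the resulting
# labels can be overlaid on the MDS map to outline groups of comments.
condensed = dissimilarity[np.triu_indices_from(dissimilarity, k=1)]
labels = fcluster(linkage(condensed, method="average"),
                  t=2, criterion="maxclust")
```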
Using the cultural consensus module in Anthropac, the degree of shared knowledge was calculated to determine the level of internal validity in the data [28]. The strength of the cultural consensus is evidenced by the eigenvalues of the agreement pattern, which serve as a goodness-of-fit indicator that a single factor (the cultural consensus of the group) is present in the pattern of responses. A ratio of the first to the second eigenvalue of three or greater indicates that a group shares a common culture and consensus [29].
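A simplified sketch of this eigenvalue criterion in Python follows; the formal consensus model in Anthropac additionally corrects informant agreement for guessing and estimates individual competence, which this sketch omits, and the sort data are again hypothetical.

```python
import numpy as np

# Hypothetical two-pile sorts, as in the earlier sketch (rows = participants).
sorts = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1],
])

# Informant-by-informant agreement: proportion of comments each pair of
# participants sorted into the same pile.
agreement = (sorts[:, None, :] == sorts[None, :, :]).mean(axis=2)

# Eigendecompose the symmetric agreement matrix and compare the two
# largest eigenvalues; a first-to-second ratio of 3:1 or greater is the
# conventional threshold for a single shared consensus.
eigvals = np.linalg.eigvalsh(agreement)[::-1]  # descending order
ratio = eigvals[0] / eigvals[1]
consensus = ratio >= 3.0
```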
Qualitative analysis
To identify the characteristics associated with helpful comments, data elicited during the semi-structured interviews were transcribed and imported into Microsoft Excel for coding [30]. The coding process identified the specific words participants used to describe how and why they evaluated written comments as helpful or unhelpful. The analysis of participants' comments helped the research team determine what contributed to the identification of a written comment as high or low quality. Participants' comments were read, and the words they used to describe their evaluations of written comments were coded. When participants made explicit reference to an example of a written comment, this information was also coded.
Statistical analysis
The analysis of participants' comments was supplemented with statistical analysis to identify key patterns and characteristics associated with helpful written comments. The length of written comments was compared across clusters using one-way analysis of variance. The frequency with which comments were deemed helpful was analyzed by chi-square. To determine the role of valence in the perceived quality of comments, the authors (HR and WG) performed independent structured coding of the data, grouping each comment into a “positive,” “negative,” or “neutral” category. The individual codings were compared and a consensus assignment given; a kappa of >0.9 indicated high agreement. Valence within a cluster was treated as a categorical variable and analyzed by chi-square. In conducting statistical tests, our purpose was not to achieve statistical generalizability but to systematically identify key differences within our sample.
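A minimal sketch of these three tests, assuming Python with SciPy and scikit-learn; all values shown are hypothetical placeholders, not study data.

```python
import numpy as np
from scipy.stats import f_oneway, chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Hypothetical comment lengths (in words), grouped by cluster.
lengths_cluster1 = [42, 55, 61, 38, 47]
lengths_cluster2 = [20, 17, 25, 30, 22]
f_stat, p_length = f_oneway(lengths_cluster1, lengths_cluster2)

# Chi-square on how often comments in each cluster were sorted as
# helpful vs. unhelpful (rows = clusters; columns = helpful, unhelpful).
contingency = np.array([[90, 10],
                        [35, 65]])
chi2, p_freq, dof, expected = chi2_contingency(contingency)

# Cohen's kappa for the two coders' independent valence assignments.
coder_hr = ["positive", "negative", "neutral", "positive", "neutral"]
coder_wg = ["positive", "negative", "neutral", "negative", "neutral"]
kappa = cohen_kappa_score(coder_hr, coder_wg)
```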