We hypothesized that adding explicit descriptors to a VRS used in a PRO instrument would decrease the variance of PRO scores and improve correlation with known predictors, without unduly affecting time to questionnaire completion. We found, instead, that the additional descriptors substantially increased time to completion, particularly for the first questionnaires completed by the patient, without any beneficial effects on PRO properties. We have accordingly switched back to simple descriptors without the additional language for use in clinical practice. We did so even though the instrument with the additional descriptors would no doubt meet the typical criteria for validation of a PRO instrument.
Our study is a rare example of health-related psychometric research comparing two versions of the same item. It is not unusual for entire instruments to be compared. For instance, Hjermstad et al. reported a systematic review of 54 papers comparing VRS, visual analog scales (VAS) and numerical rating scales (NRS) of pain [8]. Another common approach is to see whether a shorter version of a questionnaire can be used in place of a longer one. El-Baalbaki et al., for instance, compared the 15-item short-form McGill Pain Questionnaire (MPQ-SF) to a single-item NRS pain measure in patients with systemic sclerosis. They concluded that there was not much advantage to the MPQ-SF and that the NRS should be used instead due to its lower patient burden [9]. A similar type of study compares fixed questionnaires to those with computer-adaptive testing [10]. Studies have also compared modes of administration – electronic versus paper [11] or interview versus self-administration [12] – or different recall periods – for instance, shorter recall periods and those of up to 4 weeks are generally comparable for fatigue [13], urinary function [14] or physical functioning [15].
That said, there are few quantitative analyses comparing versions of the same health-related questionnaire with alternative wording choices. Most typically, a questionnaire is developed, from initial focus groups with patients to external validation, with quantitative comparison restricted to item selection. To illustrate this point, we chose, pretty much at random, the Anaphylaxis Quality of Life Scale for Adults [16]. The investigators interviewed some patients newly diagnosed with anaphylaxis and analyzed the transcripts for themes. Following further discussion with psychologists and allergy specialists, the investigators developed a 28-item prototype scale with five response options: never / rarely / sometimes / most of the time / always. This was administered to 115 participants, with factor analysis used to create three domains (social, emotional, limitations) and to remove seven items that did not correlate well with other items. The investigators found that the resulting scale correlated well with other measures of quality of life and recommended its use for research and clinical practice. However, at no point did the authors quantitatively compare different wordings. For instance, the item “Having anaphylaxis stops me getting on with my life” is included in the scale because it correlated reasonably well with other items, not because it was demonstrated to be superior to alternatives such as, say, “I feel I cannot plan for the future because of my anaphylaxis” or “Because of my anaphylaxis, my life isn’t where it should be”. Similarly, the response options “never / rarely / sometimes / most of the time / always” were never compared with alternatives such as “strongly agree / agree / neutral / disagree / strongly disagree”.
Of interest, in their review of pain instruments [8], Hjermstad et al. explicitly recommend this sort of research: “Whether the variability in anchors and response options directly influences the numerical scores needs to be empirically tested.” We have found only a few examples. Cook et al. undertook a modeling study suggesting that two or three response options on an NRS were too few, five were adequate and 11 were unlikely to be of additional benefit [17]. Similar findings have been reported in the general psychometric literature, for example, for personality assessment scales [18].
Our experience demonstrates that comparative research on PROs can be conducted easily and inexpensively when piggy-backed on electronic PROs implemented as part of routine clinical care. We were able to analyze data on over 50,000 questionnaires with zero costs for research data collection. The cost of the research is minor, being restricted to investigator meetings, regulatory administration (for the IRB waiver) and statistical analysis.
The size of our study contrasts with that of prospective research specifically conducted to investigate psychometric questions, which rarely includes more than 1000 respondents. This can have substantial implications for methodologic research. Take a study where patients received one of two different scales. Detecting a 0.05 standard deviation (SD) difference between the scales would require approximately 12,500 subjects for 80% power. Such a difference is far from trivial: a trial of a novel treatment with 80% power to detect a moderate effect size of 0.3 SD would have power of only 65% if using an inferior scale that resulted in a 0.25 SD difference between groups.
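The arithmetic behind these figures can be reproduced with a short sketch. It is not the authors' calculation. It assumes a two-sided test at alpha = 0.05, two equally sized groups and the standard normal approximation for comparing two means, none of which is stated in the text. The function names `n_per_group` and `power_for` are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution

def n_per_group(effect_sd, power=0.80, alpha=0.05):
    """Approximate subjects per group needed to detect a difference of
    `effect_sd` standard deviations (normal approximation, two-sided test)."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_sd ** 2

def power_for(effect_sd, n_group, alpha=0.05):
    """Approximate power with `n_group` subjects per group, ignoring the
    negligible chance of rejecting in the wrong direction."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(effect_sd * sqrt(n_group / 2) - z_alpha)

# Total sample size to detect a 0.05 SD difference between two scales.
total_small = 2 * ceil(n_per_group(0.05))

# A trial sized for 80% power at 0.3 SD, analyzed with a scale that
# attenuates the observed difference to 0.25 SD.
n_trial = n_per_group(0.30)
attenuated = power_for(0.25, n_trial)

print(total_small, round(attenuated, 2))
```

Under these assumptions, the total sample size comes to about 12,500 and the attenuated power to about 65%, consistent with the figures quoted above.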
A possible limitation of our study is the relatively high rate of unanswered items. This is expected as, first, not all patients have access to the patient portal and, second, many patients stop answering the daily questionnaires before the final one is sent at 10 days because they have fully recovered and are not experiencing any post-operative symptoms at that point. The rate of missing data is slightly lower for the additional descriptors, likely due to increased use of the portal over time. However, there is no reasonable mechanism by which missing data could have an important effect on our main finding that use of additional descriptors did not improve the association between symptom scores and known predictors thereof.
While we have removed the additional descriptors for the Recovery Tracker, we would caution against any over-interpretation of our findings. It would be unsound to make a general conclusion that additional descriptors for symptom states are unhelpful. First, the value of additional descriptors may depend on mode of administration. Specifically, about 70% of responses were completed using a mobile phone, where the small screen would favor a shorter response option. Second, additional descriptors may have greater or lesser utility depending on chronicity or type of symptom. For instance, the additional descriptors were particularly problematic for nausea. This may be because nausea tends to come and go during the course of a day, whereas pain has a more constant level of severity. Indeed, the poorer properties of the item with additional descriptors may be related to a focus on severity rather than duration: the original item was “how often do you have nausea?”. Third, there may be better descriptors for severity than those based on the mental intrusiveness of a symptom, and better descriptors for interference than those based on difficulty with everyday activities. One obvious explanation for our findings is that the additional descriptors led to additional variation: for instance, patients may have varied in how they interpreted “generally ignore” compared to “ignore at times”. Hence further research might examine alternative descriptors less open to variations in interpretation. Research might also examine whether additional descriptors might be of value in situations where patients experience only one symptom at a time, as it is plausible that perceptions of how much a symptom can be ignored depend on the presence of other symptoms.
In conclusion, adding descriptors to a verbal rating scale of post-operative symptoms did not improve scale properties in patients undergoing ambulatory cancer surgery. We recommend further comparative psychometric research using data from PROs collected as part of routine clinical care.