Background

Gastroesophageal reflux disease (GERD) occurs when stomach contents flow back into the esophagus. It happens when the valve located between the esophagus and the stomach does not close properly. The most frequent symptoms are acidity and acid reflux. Other less frequent but associated symptoms are heartburn with no clear cause, panting, sore throat and cough, among others [1, 2].

GERD can be classified into four severity levels, ranging from the appearance of edema and erythema, causing some degree of esophageal erosion, up to esophageal ulcers or Barrett’s esophagus. Consumption of alcohol or carbonated drinks, obesity and smoking are known to be GERD risk factors [3].

According to the DIGEST international study, approximately 7.7% of the population suffers from GERD [4]. According to the current consensus definition, “GERD should be used to include all individuals who are exposed to the risk of physical complications from gastroesophageal reflux, or who experience clinically significant impairment of health related well-being (quality of life) due to reflux related symptoms, after adequate reassurance of the benign nature of their symptoms” [5]. Furthermore, it is commonly accepted that self-reporting is one of the main sources of diagnosis [6], and patients should report experiencing symptoms at least twice a week [2, 7] for a diagnosis of GERD.

It is important to note that the impairments caused by GERD symptoms are highly variable and may affect quality of life even when there are no endoscopic findings [2]. Patients tend to adopt eating behaviors that prevent or attenuate their symptoms. The Agency for Healthcare Research and Quality reports that the most frequent treatments are antacids (neutralizing stomach acids) and type 2 histamine receptor antagonists (H2RA) or proton pump inhibitors (PPI), both reducing the production of stomach acid [8, 9]. The impact of GERD symptoms on patients’ health-related quality of life (HRQoL) is usually ascertained by means of patient-reported outcome measures (PROMs) such as the Gastrointestinal Short Form Questionnaire (GSF-Q) [10].

HRQoL measures are particularly important for GERD sufferers given their diagnostic capabilities, while they also reveal important issues to health service providers for several reasons. First, HRQoL has been shown to have a direct relation with mortality, hospitalization and consumption of clinical resources. Second, it has been shown to have a low to moderate relation with other disease-specific indicators, hence contributing complementary information for assessing clinical impairment [11]. Presently, HRQoL has been identified as a clinical target in itself, both in patients with limited life expectancy and for therapies directed towards disease coping or symptom accommodation, as much as for biological improvement (as is the case for most chronic diseases). Preference-based measures (PBMs) play a central role in these evaluations. They allow patients to describe the impact of ill health and have an associated “utility” score for each health-state description. These utility scores can then be used to calculate quality-adjusted life-years (QALYs), which is an outcome metric used in many economic evaluations of potential health benefits [12].

In the past, clinical studies did not always include a PBM. Often they included one or more of the many PROMs that are not full PBMs because they do not have an associated, preference-based scoring system. On the other hand, PROMs have proved to be very sensitive to variations in patient health conditions, and this is one of the reasons for their extended use in clinical studies. Furthermore, when a major research need is to compare results with those of other pathologies or comorbidities, disease-specific PROMs cannot be used, and generic HRQoL instruments should be preferred. The most popular generic instruments (such as the SF-6D, EQ-5D and HUI3) offer the possibility of computing the utility score associated with each health condition (as captured by the instrument attribute profile), reflecting the population preference towards each health state in a situation of uncertainty. This feature allows them to be used in computing QALYs and in health economics in general.

A disease-specific PROM instrument is usually preferred in research about a particular disorder, and generic instruments are avoided, because the latter do not properly capture the impact of the different levels of disease symptomatology on patients’ HRQoL. In addition, there is evidence suggesting that generic measurements might have poor sensitivity to change in some health conditions, such as GERD or other non-threatening illnesses, or are incapable of discriminating well between patients using different drugs to treat their health problems [13, 14]. In such cases, the usual strategy is to map the specific measurements onto a generic instrument, allowing further comparison with other studies in which the specific instruments may not be pertinent or are otherwise unavailable (e.g., retrospective databases) [15, 16].

In line with this approach, since 2008 NICE’s preferred measure of health-related quality of life in adults has been the EQ-5D, used to derive utility values for health economic evaluations (see Guide to the methods of technology appraisal 2013, at https://www.nice.org.uk/process/pmg9/chapter/foreword).

The aim of the present study was to obtain the mapping algorithms needed for translating the specific HRQoL measure obtained by the GSF-Q into two of the most popular preference-based generic instruments, the SF-6D and the EQ-5D-3L. As a secondary benefit, we will be able to assess which of the generic instruments is more suitable for capturing HRQoL deterioration due to GERD conditions.

Methods

Study design

The present study is a secondary analysis carried out using the data gathered for the cultural validation of the GSF-Q into Spanish [17]. The original study was developed to ensure adequate estimation of psychometric properties, and was designed as an observational study that would provide a rich data set, not only for instrument validation but particularly for mapping studies, beyond what could be obtained in controlled clinical trials. This was a cross-sectional, single time point assessment design. The original sample design was intended to ensure representativeness of three strata: gender, age (< 45, ≥ 45 years) and symptom severity (Savary-Miller: 0-I, ≥ II). Patients were selected at random among those seeking care, covering each sample stratum. Scales were administered in a single visit. Patients were over 18 years of age, able to read Spanish, and signed an informed consent form. The Ethics Committee of one of the participating centers in the validation study was responsible for approving the study design. Clinicians were recruited at random, in proportion to the geographical extension and service demand in the Spanish Autonomous Communities. The study recruited 510 gastroenterologists, who were each requested to provide 4 to 8 subjects. Additional data on the study design may be found elsewhere [17].

Participants

The final sample was composed of 3405 patients, of whom 2251 completed all the questionnaires and the sociodemographic and clinical data. Half of the participants were women (49.8%), 63.9% were obese, 40.1% smokers, 42.8% consumed alcohol, and 46.5% consumed carbonated beverages. GERD was diagnosed in 80% of cases, 46.3% were under PPI treatment, 16.5% used H2RA, and 25.3% used antacids. It should be mentioned that 48.4% were on treatment for at least one other comorbidity (Table 1). All patients had signed informed consent forms, and the Helsinki declaration guidelines were met.

Table 1 Sample sociodemographic and clinical descriptors

Instruments

Three questionnaires were used to measure HRQoL: the two most popular generic ones and a GERD-specific instrument.

The Gastrointestinal Short Form Questionnaire (GSF-Q) [6, 7] was used to measure the impact of GERD symptoms on HRQoL. The questionnaire is composed of six items, plus two filter items. The first four gauge the impact of GERD symptoms during the most recent week (upper abdomen pain, breastbone pain, limited eating, heartburn) using a 5-point Likert scale (0 = Never, 4 = All of the time). The last two inquire about the number of days per week with daytime or nighttime disturbances (0–7 days). The total score is obtained by adding up individual item scores, and it is customary to rescale it into a 0–100 severity scale. A higher score represents a higher impact on HRQoL, and scores are usually interpreted by comparison with population norms [17].
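The scoring rule just described can be sketched as follows. This is a minimal illustration, not the instrument's official scoring routine: the function names are hypothetical, and the 0–100 rescaling is assumed to be a simple linear transformation of the raw total (maximum 4 × 4 + 2 × 7 = 30).

```python
# Sketch of GSF-Q scoring as described in the text.
# Function names are illustrative; linear 0-100 rescaling is an assumption.

MAX_TOTAL = 4 * 4 + 2 * 7  # four 0-4 items plus two 0-7 day counts = 30


def gsfq_total(symptom_items, disturbance_days):
    """Add the four 0-4 symptom items and the two 0-7 day counts."""
    if len(symptom_items) != 4 or len(disturbance_days) != 2:
        raise ValueError("expected 4 symptom items and 2 day counts")
    if any(not 0 <= s <= 4 for s in symptom_items):
        raise ValueError("symptom items must lie in 0..4")
    if any(not 0 <= d <= 7 for d in disturbance_days):
        raise ValueError("day counts must lie in 0..7")
    return sum(symptom_items) + sum(disturbance_days)


def gsfq_rescaled(total):
    """Linearly rescale a raw total (0-30) onto a 0-100 severity scale."""
    return 100.0 * total / MAX_TOTAL
```

For example, a patient answering 2 on every symptom item and reporting 3 disturbed days and 4 disturbed nights would have a raw total of 15, which this sketch rescales to 50.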

EuroQol-5 Dimension-3 Levels (EQ-5D-3L) [18, 19] is a generic, preference-based HRQoL instrument. It gathers the level of deterioration for 5 attributes (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression) using 3-level items (1 = none, 2 = some problems, 3 = a lot of problems). Each combination of levels creates a health profile, with a total of 243 possible health states, although not all of them are equally likely. Profile [11111] corresponds to perfect health and profile [33333] represents the worst possible health state. Based on population preference ranking, health states are translated to a social utility value using a multi-attribute utility function (MAUF). Different MAUFs are used for different countries, mainly using estimates based on Time Trade-Off (TTO) and Visual Analogue Scale (VAS) methods [20]. The basic form of the EQ-5D-3L MAUF is:

$$ u_i = 1 - \left( q + \sum_{j=1}^{5} \sum_{k=1}^{3} b_{jk} D_{ijk} + b_{N3}\, N3_i \right) $$

where the utility/preference value for health state i (\( u_i \)) is obtained by subtracting from 1 the health state disutility (\( {\overline{u}}_i \)). Disutility is obtained by weighting (\( b_{jk} \)) the deterioration level k attained in dimension \( D_j \), plus an interaction term (\( N3_i \)), which adds a constant \( b_{N3} \) when any of the dimensions reaches its maximum deterioration level, plus a constant (q). It should be noted that \( b_{j1} = 0 \) for the first level of any dimension (k = 1), which represents no deterioration in that dimension [21].
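To make the structure of this utility function concrete, the following sketch evaluates it for a given profile. All weights are hypothetical placeholders chosen for illustration and do not correspond to any published value set. The sketch also assumes that the constant q applies only to states other than [11111], so that perfect health keeps a utility of 1; the equation as printed does not state this.

```python
# Illustrative EQ-5D-3L utility function. ALL WEIGHTS ARE HYPOTHETICAL and do
# not reproduce any published country value set.

HYPOTHETICAL_Q = 0.05      # constant q (assumed to apply only when state != 11111)
HYPOTHETICAL_B_N3 = 0.20   # added once when any dimension is at level 3
HYPOTHETICAL_B = [         # b_jk per dimension j and level k; level 1 is always 0
    {1: 0.0, 2: 0.05, 3: 0.30},  # mobility
    {1: 0.0, 2: 0.05, 3: 0.20},  # self-care
    {1: 0.0, 2: 0.05, 3: 0.10},  # usual activities
    {1: 0.0, 2: 0.10, 3: 0.35},  # pain/discomfort
    {1: 0.0, 2: 0.05, 3: 0.25},  # anxiety/depression
]


def eq5d3l_utility(profile):
    """Return the utility of a profile given as a 5-digit string, e.g. '11121'."""
    levels = [int(ch) for ch in profile]
    if len(levels) != 5 or any(level not in (1, 2, 3) for level in levels):
        raise ValueError("profile must be five digits, each 1, 2 or 3")
    if all(level == 1 for level in levels):
        return 1.0  # perfect health: no decrement at all (assumption, see lead-in)
    disutility = HYPOTHETICAL_Q
    disutility += sum(weights[level] for weights, level in zip(HYPOTHETICAL_B, levels))
    if 3 in levels:
        disutility += HYPOTHETICAL_B_N3  # N3 interaction term
    return 1.0 - disutility
```

With these placeholder weights, profile 11121 loses the constant plus the level-2 pain weight, while 33333 accumulates every level-3 weight plus the N3 term and falls below zero.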

The Medical Outcomes Study Short Form-6 Dimension (SF-6D) [22, 23] is a generic, preference-based HRQoL instrument derived from the 36-item MOS SF-36 [24]. It gathers the level of deterioration for 6 dimensions (physical functioning, role limitations, social functioning, pain, mental health, and vitality) using a recoding of 11 specific items into 4 to 6 levels. A total of 18,000 health profiles are possible, with the profile [111111] corresponding to perfect health and [645655] representing the worst possible health state. Different MAUFs have been estimated for deriving preference utilities in different countries, with the peculiarity that no severity (interaction) constant is used. As in the previous case, a value of 0 is assigned to the first level for each dimension/attribute.
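The profile counts quoted for both instruments follow from multiplying the number of levels per dimension. A minimal check, assuming the SF-6D level counts can be read off the worst-state profile [645655] (variable and function names are illustrative):

```python
from math import prod

# Number of response levels per dimension.
EQ5D3L_LEVELS = [3, 3, 3, 3, 3]
# Assumption: SF-6D level counts taken from the worst-state profile [645655].
SF6D_LEVELS = [6, 4, 5, 6, 5, 5]


def n_profiles(levels_per_dimension):
    """Count the distinct health profiles an instrument can describe."""
    return prod(levels_per_dimension)
```

Here `n_profiles(EQ5D3L_LEVELS)` gives 243 and `n_profiles(SF6D_LEVELS)` gives 18,000, matching the figures in the text.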

Statistical analyses

The first step consisted of checking the unidimensionality of the GSF-Q items and, if met, deriving an overall index of severity due to GERD. This severity index was used to sort generic health states (EQ-5D-3L or SF-6D) when their corresponding profiles differed only in the permutation of one severity level, e.g., [11112] vs. [11121]. A first approach was to estimate a unidimensional latent variable model, assuming the latent variable to be continuous and the items/indicators to be ordinal, using the WLSMV estimation method. A second approach was to decompose each k-category item into a series of k dummy variables (0 = No, 1 = Yes), coding lower-level dummy categories as fulfilled (1) when a particular item level was reached. A Partial Credit model [25] (an extension of the Rasch model) was then fitted using ML estimation. In this way, estimated category thresholds could be compared across items and the monotonic distribution of item step thresholds could be checked. Observed EQ-5D and SF-6D mean utility scores were compared using a standard t-test and bootstrap estimates, in order to avoid the influence of skewness and extreme utility values.
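The dummy decomposition used in the second approach can be illustrated with a short sketch. It assumes that response categories are coded from 0 and that a dummy is set to 1 for every category at or below the observed response; the function name is hypothetical, and the model fitting itself (done in Mplus in the study) is not reproduced here.

```python
def cumulative_dummies(response, n_categories):
    """Decompose an ordinal response into n_categories cumulative dummies.

    Dummy m is 1 when the response has reached category m (response >= m),
    so all lower categories count as fulfilled. Categories are assumed to be
    coded 0 .. n_categories - 1.
    """
    if not 0 <= response < n_categories:
        raise ValueError("response outside the item's category range")
    return [1 if response >= level else 0 for level in range(n_categories)]
```

For a 5-category symptom item, a response of 2 would be coded as `[1, 1, 1, 0, 0]`. Under this coding the first dummy is always 1, so it carries no information by itself.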

Once a summary GERD-specific severity index was obtained, this index was mapped onto each of two utility values (separately), and several models were tested (see below) in order to predict the utility value associated to each GERD severity condition.

Disutility values (\( d_i = 1 - u_i \)) were modeled, instead of utility values, for several reasons. First, the data mass usually concentrates around milder health states, and low disutilities will fall closer to the axis origin. Second, it is always possible to estimate a model without the intercept term, anchoring 0 value disutilities (perfect health) at the 0 GERD severity value. Since GERD is not necessarily a disabling condition, and in order to attenuate the impact of possible comorbidities on the disutility value for each individual, disutilities were aggregated, using the mean value, by GERD severity, before modelling.
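The aggregation step can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes that patients are grouped by identical severity score values, which the text does not specify.

```python
from collections import defaultdict
from statistics import mean


def mean_disutility_by_severity(records):
    """Average disutility (1 - utility) within each GERD severity value.

    records: iterable of (severity, utility) pairs, one per patient.
    Returns a dict {severity: mean disutility}, ordered by severity.
    Grouping on exact severity values is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for severity, utility in records:
        groups[severity].append(1.0 - utility)
    return {severity: mean(values) for severity, values in sorted(groups.items())}
```

Averaging within severity groups means that patient-level variation unrelated to GERD, such as comorbidity effects, tends to cancel out before the regression is fitted.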

The regression models tested were linear, quadratic and cubic trend models, using density function values, and Tobit and Probit models, using cumulative distribution values. The following covariates were tested for inclusion: age (decades), BMI (low, normal and overweight), GERD diagnosis (yes), smoking, alcohol consumption, carbonated drinks consumption, PPI treatment, H2RA treatment, antacid treatment, and treatment for comorbidities. In order to anchor the best possible health states in both instruments, the GERD severity factor scores were rescaled into the range 0–1, and regression models were fit through the origin.
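As an illustration of a cubic trend fitted through the origin, the sketch below solves the ordinary least-squares normal equations for the model d = b1·s + b2·s² + b3·s³ in plain Python. It omits the covariates and the Tobit and Probit variants, all names are hypothetical, and it is not the SPSS procedure used in the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_cubic_through_origin(severity, disutility):
    """Least-squares fit of d = b1*s + b2*s^2 + b3*s^3 (no intercept)."""
    rows = [[s, s ** 2, s ** 3] for s in severity]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * d for r, d in zip(rows, disutility)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def predict_disutility(coefs, s):
    """Predicted disutility at rescaled severity s (0 at s = 0 by construction)."""
    b1, b2, b3 = coefs
    return b1 * s + b2 * s ** 2 + b3 * s ** 3
```

Because there is no intercept, a severity of 0 always maps to a disutility of 0, which is the anchoring property described above.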

Along with the statistical significance of regression coefficients, model goodness-of-fit (GOF) was assessed using \( R^2 \), mean absolute error (MAE) and mean absolute percentage error (MAPE). MAE and MAPE were computed overall and by quintile group based on severity scores to assess local GOF at the different levels of severity. Bootstrap estimates for model coefficient standard errors were also obtained to avoid the influence of outlier observations in the assessment of parameter significance levels. General internationally-accepted guidelines proposed for instrument mapping were followed [13].
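The two error measures and their breakdown by severity quintile can be sketched as follows. The function names are hypothetical, and the sketch assumes rank-based quintile groups of (nearly) equal size; the exact grouping rule used in the study is not stated.

```python
def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed must be non-zero)."""
    total = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


def by_quintile(severity, observed, predicted, metric):
    """Apply a metric within five severity groups, lowest to highest.

    Groups are formed by ranking cases on severity and cutting the ranking
    into five parts (an assumption). Requires at least five cases.
    """
    order = sorted(range(len(severity)), key=lambda i: severity[i])
    n = len(order)
    results = []
    for q in range(5):
        idx = order[q * n // 5:(q + 1) * n // 5]
        results.append(metric([observed[i] for i in idx],
                              [predicted[i] for i in idx]))
    return results
```

Note that MAPE divides by the observed value, so the same absolute error weighs more heavily in the mildest quintile, where observed disutilities are closest to zero.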

All analyses were conducted using the SPSS for Windows statistical software, version 22.0 and Mplus 7.

Results

GSF-Q scores ranged between 0 and 30, with a mean value of 10.54 (SD = 5.94). GERD severity summary scores (factor scores) ranged between −1.40 and 1.88, with a mean value of 0 (SD = 0.636) and a symmetric distribution (skewness = 0.021, SE = 0.052).

At the individual level, SF-6D mean utility scores (\( M_{SF} = 0.656 \), \( SD_{SF} = 0.207 \)) were significantly lower than EQ-5D-3L scores (\( M_{EQ} = 0.744 \), \( SD_{EQ} = 0.206 \)), both under asymptotic assumptions (t = −27.54, p < 0.001) and using 10,000 bootstrap samples (difference 95% CI = (−0.093, −0.081)), suggesting that slightly higher utilities were obtained with the EQ-5D. As expected, both utility scores showed marked negative skewness (SF-6D: skewness = −0.784, SE = 0.052; EQ-5D-3L: skewness = −1.049, SE = 0.052), with a high correlation between them (r = 0.733, p < 0.001).
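A bootstrap confidence interval for a mean difference of this kind could be obtained along the following lines. This is a sketch under stated assumptions, not the procedure actually run in SPSS: it treats the two utilities as paired within patients and uses the simple percentile method, and the function name and seed are illustrative.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of the paired differences x - y.

    Assumptions of this sketch: x[i] and y[i] belong to the same patient,
    and the interval is the plain percentile interval of the resampled means.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Resample patients with replacement and record each resample's mean.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Because the interval is read from the empirical distribution of resampled means, it does not rely on the utilities being normally distributed, which matters given the skewness reported above.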

The first eigenvalue of the correlation matrix was λ = 3.55 and all further eigenvalues were below 1. The confirmatory factor analysis for the 1-dimension solution (assuming variables to be ordinal) attained good GOF indexes, with CFI = 0.951 and TLI = 0.918. Figure 1 shows the cumulative distribution for rescaled factor scores, exhibiting a smooth ogive distribution with no evident changes in curvature. This figure may be used as normative data to obtain percentiles from severity scores. Figure 2 represents the response category thresholds for each item with respect to the latent normal severity score. In this figure, severity scores are expressed in standard deviations from the mean latent severity of 0 and, for each GSF-Q item, partial credit thresholds for each step rating response are plotted, showing a rather even spread and separation of rating categories for the first four items, and a displacement of the category thresholds above the mean severity for the last two items of daytime and nighttime limitations. This latter result is in accordance with the smaller weight received by the last two items in computing the factor score.

Fig. 1 GSF-Q severity score cumulative distribution

Fig. 2 GSF-Q item thresholds assuming unidimensionality (Partial Credit Model)

Fig. 3 SF-6D (top) and EQ-5D-3L (bottom) observed (blue) and predicted (green) utility values vs. GERD severity for the linear (left), quadratic (center) and cubic (right) models

The resulting equation needed for computing re-scaled estimated factor scores from observed GSF-Q items scores may be expressed as follows:

$$ {\widehat{f}}_i=\left(0.183{x}_{1i}+0.204{x}_{2i}+0.100{x}_{3i}+0.174{x}_{4i}+0.047{x}_{5i}+0.044{x}_{6i}+1.4025\right)\times 0.30479 $$

where \( x_1 \) to \( x_4 \) are the scores in the first 4 GSF-Q items (0 = Never, 4 = Always), \( x_5 \) is the number of days with disability, \( x_6 \) is the number of nights with GERD problems, and 1.4025 and 0.30479 are scaling constants moving the factor scores into the 0–1 range.
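The equation can be evaluated directly, as in the sketch below, which applies it exactly as printed (the function and constant names are illustrative). One caveat: the text does not say whether the item scores enter as raw or centered values. With raw scores, the printed expression runs from about 0.43 (all items 0) to about 1.43 (all items at their maximum), a span of almost exactly 1, so centered inputs may be intended; the sketch makes no correction for this.

```python
# Weights and scaling constants copied from the equation in the text.
ITEM_WEIGHTS = (0.183, 0.204, 0.100, 0.174, 0.047, 0.044)
SHIFT = 1.4025
SCALE = 0.30479


def rescaled_factor_score(items):
    """Evaluate the printed equation for six GSF-Q item scores.

    items: (x1, x2, x3, x4, x5, x6), where x1-x4 are 0-4 symptom ratings and
    x5, x6 are 0-7 day/night counts. Raw (uncentered) scores are assumed.
    """
    if len(items) != 6:
        raise ValueError("expected six GSF-Q item scores")
    weighted_sum = sum(w * x for w, x in zip(ITEM_WEIGHTS, items))
    return (weighted_sum + SHIFT) * SCALE
```

The first two items (upper abdomen pain and breastbone pain) carry the largest weights per scale point, while each additional day or night with problems adds comparatively little.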

The EQ-5D-3L proved to be considerably less sensitive to GERD severity. Only 78 (32%) of the 243 possible EQ-5D-3L profiles were observed, and 17 (7%) of them gathered more than 90% of patients. Table 2 shows the most frequent EQ-5D-3L profiles observed in our sample. In the case of SF-6D utility scores, 975 (5.4%) of the 18,000 possible health states were observed; 35 (0.2%) profiles had a prevalence above 5/1000, gathering only 25.5% of cases.

Table 2 Most prevalent EQ-5D-3L and SF-6D health state profiles, associated utilities, and frequencies (cases, percentages and cumulative percentages; partial listing)

The best-fitting model for mapping GSF-Q onto SF-6D disutilities was a cubic model including the variables GERD severity (linear, quadratic and cubic), age (in decades), gender, BMI group (low, normal, and overweight), and being treated for comorbidities (see Table 3). The model GOF was good (\( R^2 \) = 0.888), with MAE = 0.092 and MAPE = 27.9% (Table 4; Fig. 3).

Table 3 Estimated model coefficients
Table 4 Estimated model goodness of fit statistics

The best-fitting model for mapping GSF-Q onto EQ-5D-3L disutilities was the cubic model including GERD severity (linear, quadratic and cubic), age (in decades), gender, and being treated for comorbidities. BMI group was not significant, and the following GOF statistics were obtained: \( R^2 \) = 0.831, MAE = 0.086 and MAPE = 37.0%.

Discussion

Specific HRQoL instruments are the preferred choice for measuring patients’ perceptions of their health condition because of their high sensitivity to changes due to disease management and treatment suitability. However, mapping specific HRQoL into generic utility scores can present methodological problems. Despite the good psychometric properties of instruments like the GSF-Q for measuring the impact of GERD on patients’ daily lives [10, 17, 26], GERD is a relatively mild disabling disease, as compared to other possible health states measured by generic instruments. Besides, it is difficult to instruct patients to restrict their thinking to only one specific disease-related disability, isolating their judgments from other comorbidities that might be present, or from the impact of normal disabilities associated with aging, when responding to generic instruments. The final result is that generic instruments might capture the effects of other disabilities and limitations which are not directly related to the specific disease being mapped.

One possible strategy for avoiding these problems would be to design a preference-choice experiment with health condition vignettes derived from the specific instrument [27]. Unfortunately, marginal disutilities could be expected to be oversized if other, very severe health conditions are included as anchors. Another possibility would be to describe specific health conditions only by the set of generic health profiles that are prevalent and meaningful in the particular disease, and to map only those conditions. This approach could be used when observed distributions like the one obtained for the EQ-5D-3L are found (see Table 2), in which a reduced number of health states gather the majority of patients. However, very large samples would be needed to obtain representative results, and the approach could be cumbersome when the number of possible health states is very large, as happened with the SF-6D (Table 2).

For the time being, directly mapping specific health states onto generic utility values seems to remain the best option, and special care should be taken, by aggregating generic utility values over specific severity scores, in order to smooth out the impact of non-specific effects on the mapping estimates. The present paper reports the first study mapping GSF-Q onto two of the most widely used generic HRQoL instruments. In fact, our study could be considered to have high ecological validity due to the large sample used and its ample representativeness.

In our study, GERD was found to be a rather mild condition, with mean utility values of 0.656 (SF-6D) and 0.744 (EQ-5D-3L). In fact, the most prevalent health-attribute level reported was the first (no deterioration), in both generic instruments, except for the attributes/dimensions of pain and mental health (see Table 5). Even the scaling of the response levels of the GSF-Q itself suggests that the third response level (L2 in Fig. 2) was selected by patients located above the mean of the latent (error-free) severity score for all items, except for the number of days with problems. These results are in agreement with the usual GERD diagnostic criterion, which states that stomach problems should be present more than 2 days a week in order to be consistent with GERD [7].

Table 5 Percentage of responses by dimension level for each dimension/attribute of the EQ-5D-3L and SF-6D generic instruments

SF-6D utility scores were shown to be more sensitive to GERD severity than those obtained from the EQ-5D-3L. The distribution of the former was more spread out, with less likelihood of ceiling effects, and did not exhibit a gap between perfect health, u(11111) = 1, and the next highest value, as was the case with the latter, u(11121) = 0.79. The observed cumulative distribution function of SF-6D disutility scores was more uniform; the distribution function of EQ-5D-3L disutilities was steeper (especially in the milder health states), and the distributions did not cross over within their ranges.

GSF-Q scores showed good unidimensional behavior, which allowed summarization of GERD-related severity in a single score using factor analysis weights. Unidimensionality analyses endorsed the possibility of summarizing the different GERD symptoms in an aggregated overall score, also obtaining an adequate scaling of response levels. In our case, this strategy should be preferred over one using item-response dummy coding in the regression models, since it avoids deciding how to aggregate item response levels [28] and minimizes the possible impact of covariates on particular response levels.

For each of the generic instruments, the best-fitting model was selected. In both cases, the model including GSF-Q severity (observed, squared and cubed), age, gender, and being treated for comorbidities attained the best fit, and the SF-6D model additionally included BMI. The signs of the regression coefficients were in accordance with predicting a higher disutility as GSF-Q severity scores increase. The inclusion of significant covariates in all models suggests that the loss in HRQoL may be influenced not only by GERD symptoms but also by the comorbidities present. That is to say, GERD symptoms may not be very prominent when assessing HRQoL using a generic instrument if other conditions are present, such as aging, being treated for comorbidities and being overweight.

\( R^2 \) values were within the range 0.885–0.888 for the SF-6D model, and within 0.827–0.831 for the EQ-5D-3L model. Overall, MAPE = 27.9% for predicting SF-6D and MAPE = 37.0% for predicting EQ-5D-3L when using predictions derived from the cubic model. Computing predicted SF-6D disutility MAPE by GSF-Q severity quintile groups, MAPE ranged between 33.6% for Q1 and 23.3% for Q3, while for predicted EQ-5D-3L disutility, MAPE ranged between 33.6% for Q1 and 28.9% for Q3 (see Table 4). As expected, the error magnitude was smaller near the location of the centroid, while it was particularly high when predicting EQ-5D-3L disutilities using the linear model (up to 62.3% in Q1).

Some additional covariates, like smoking and drinking carbonated beverages or alcohol, approached statistical significance, but all models were kept as parsimonious as possible, and only statistically significant predictors were included (p < 0.05). Bootstrap estimates were generated, based on 1000 samples with replacement, obtaining parameter estimate bias smaller than 0.002 in absolute value and significance levels \( \widehat{p}\le 0.002 \).

Mapping disease-specific instruments onto generic health-related measures is a common methodological strategy due to the high sensitivity of specific instruments and the wide generalizability of generic measures. Mapping the GERD-specific GSF-Q scores onto generic utilities (SF-6D and EQ-5D-3L) was shown to be possible, attaining adequate goodness-of-fit values. In both cases, the best-fitting model was the most complex one: the model based on GSF-Q severity, raised to the cubic power, and including generic covariates (age, gender, BMI and treatment for comorbidities). However, the model for predicting EQ-5D-3L disutilities did not include BMI as a statistically significant covariate.

The use of cubic prediction models requires special care, since small variations in the cubed predictors can entail excessively large predicted values, and predictor values outside the range of the observed data used for estimation can produce unreasonable predictions. In our case this precaution is unnecessary, given that all GERD severity values are scaled within the 0–1 range (any value will lie inside the range of values used for estimation), and possible covariate values are limited to the observed repertoire.

In our study, we found that utility values associated with GERD-specific conditions were rather high, suggesting that this disease is not very disabling (in general). Nevertheless, patients with utility values as low as SF-6D = −0.3150 and EQ-5D-3L = −0.0757 were observed, although they were not always associated with the worst GSF-Q severity scores. Given the reduced number of prevalent health states obtained for the generic instruments (especially for the EQ-5D-3L), the question arises whether some characteristic or “natural” disease-related health states could be identified for each generic instrument, discarding other comorbidity-influenced health states. From a nosological point of view, it is quite tempting to think that GERD would not entail a high deterioration in mobility, but it could be the case that bed-ridden people are very likely to develop GERD. One possible way to minimize the impact of comorbidities, when measuring specific health conditions with a generic instrument, would be to use a set of instructions asking patients to assess their overall health condition while thinking only of their specific disease.

Limitations

The present study has been carried out with a Spanish population, and we cannot ensure that other cultural or eating habits would not distort our results.

Conclusions

In the present study, two methods are presented that allow the mapping of specific GERD severity scores, obtained with the GSF-Q, onto generic HRQoL values, as measured by the SF-6D and EQ-5D-3L instruments. In both cases, the cubic model attains the best fit.

Mapping is an approach that enables utilities to be predicted for the calculation of quality-adjusted life-years when no preference-based information has been elicited. This simplifies health economic evaluations, since data from preference-based instruments are not required. The results of this study will make it possible to carry out economic evaluations in gastroesophageal reflux disease, which will support future decisions as new treatment alternatives reach the market.