Incorporating small-area estimation into mediation analyses with areal datasets

The topic of mediation analysis has received considerable attention in recent years, especially in the field of epidemiology. With mediation analyses, health researchers aim to understand the mechanism through which an exposure, intervention, or characteristic affects a health outcome. While mediation analyses are often performed with individual-level data, there is a growing body of literature using mediation analysis methods with areal datasets (e.g. O’Connor et al., 2018, Moss et al., 2017, Chien et al., 2017, Drewnowski et al., 2014, Palacio and Tamariz, 2021, Ransome et al., 2016, Nicholson et al., 2017). In health research, areal datasets are characterized by aggregated measurements corresponding to a finite set of small, non-overlapping geographic regions. Areal datasets may include variables such as disease counts, average household income values, or the per capita density of fast-food restaurants for a collection of counties, ZIP codes, or census tracts (Moraga, 2019). Studies in which areal datasets are used to examine exposure–outcome relationships are known as ecological studies (Wakefield and Lyons, 2010).

There are several reasons why conducting mediation analyses with an ecological study design might be of interest. First, researchers might be interested in better understanding reasons for an area-level disparity, rather than a disparity at the individual level. Pertinent questions may include: does the density of fast-food restaurants explain any of the relationship between a ZIP code’s socioeconomic environment and cancer incidence? How about access to healthcare? In this work, we will particularly focus on the latter, assessing if healthcare availability explains any of the relationship between ZIP code-level socioeconomic environment and colorectal cancer incidence in the state of Iowa.

Second, researchers might be interested in pathways leading to individual-level disparities, yet individual-level data may not be available for important variables in the analysis. Individual-level data can be costly to collect, and they may not be representative of all of the individuals in the population of interest. While it is widely understood that conclusions from ecological studies should not be interpreted at the individual level due to the potential for ecological bias (Wakefield and Lyons, 2010, Lawson, 2018, Shaddick and Zidek, 2015), such studies can be used to assess geographic disparities and can serve as a useful starting point for generating causal hypotheses at the individual level.

When conducting mediation analyses with areal datasets, age-adjusted rates (AARs) are a frequently selected outcome measure (e.g. Moss et al., 2017, Nicholson et al., 2017, O’Connor et al., 2018). AARs are useful for comparing a health outcome across two or more regions. AARs (incidence AARs in the following example) are computed using the direct standardization procedure. This procedure entails stratifying the number of new cases and the population size in each region by age group. For each age group, a rate is calculated by dividing the number of cases in that group by the corresponding population size and multiplying the result by 100,000 to express it per 100,000 people. An AAR is obtained by taking a weighted average of the age-group-specific rates, where each weight is the proportion of individuals in that age group in some standard population. Because every region is weighted by the same standard age distribution, the confounding effect of age distribution on disease incidence in a region is largely reduced (Klein and Schoenborn, 2001, Buescher, 2010). AARs can then be compared across regions and used to assess the absolute and relative magnitudes of a health problem across different regions (Curtin, 1995).
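The direct standardization steps described above can be sketched in a few lines of Python. This is only an illustration: the function name `age_adjusted_rate`, the three age groups, and all counts are hypothetical and are not taken from the Iowa data.

```python
def age_adjusted_rate(cases, population, standard_population, per=100_000):
    """Directly standardized rate for a single region.

    cases[i], population[i]: new cases and population in age group i.
    standard_population[i]: size of age group i in the chosen standard population.
    """
    if not (len(cases) == len(population) == len(standard_population)):
        raise ValueError("all inputs must have one entry per age group")

    total_standard = sum(standard_population)
    aar = 0.0
    for d, n, s in zip(cases, population, standard_population):
        age_specific_rate = d / n * per   # rate per 100,000 in this age group
        weight = s / total_standard       # share of the standard population
        aar += weight * age_specific_rate
    return aar


# Hypothetical region with three age groups (made-up numbers).
cases = [2, 10, 30]
population = [10_000, 8_000, 4_000]
standard_population = [50_000, 30_000, 20_000]

aar = age_adjusted_rate(cases, population, standard_population)
crude = sum(cases) / sum(population) * 100_000  # unadjusted rate for comparison
print(round(aar, 1), round(crude, 1))
```

In this made-up example the age-specific rates are 20, 125, and 750 per 100,000 and the standard weights are 0.5, 0.3, and 0.2. The AAR is therefore 197.5 per 100,000, compared with a crude rate of about 190.9.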

Numerous approaches have been used in the literature to conduct mediation analyses aimed at understanding why AARs differ by a regional characteristic. Two broad approaches used in the literature are what we call the “Calculation before mediation” (C-BM) approach and the “Small-area estimation before mediation” (SAE-BM) approach. This paper additionally proposes the “Small-area estimation within mediation” (SAE-WM) approach, which employs small-area estimation techniques in the outcome model of the mediation analysis.

Both Moss et al. (2017) and Nicholson et al. (2017) used a version of the method that we will refer to as the “Calculation before mediation” or C-BM approach. The C-BM approach is characterized by calculating AARs for each areal unit directly from the raw data before performing the mediation analysis. These AARs are then treated as a continuous outcome variable in a series of regression models, often linear, and the product of coefficients method (Preacher and Hayes, 2008) or another mediation analysis method is employed to obtain an estimate of the indirect effect. The use of linear regression models in this setting has several limitations. First, it does not account for any residual spatial correlation that may be present in the AAR outcome model. Second, AARs are measures derived from count variables and would never arise from a continuous data-generating process.
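As a rough illustration of the product of coefficients method with a continuous outcome, the sketch below fits two ordinary least squares models with NumPy on simulated data. The variable names, the simulated effect sizes, and the continuous mediator are assumptions made for this example. It does not reproduce the models used in the cited studies, and it ignores spatial correlation entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical number of areal units

# Simulated data (all values are made up for illustration).
x = rng.normal(size=n)                                   # exposure, e.g. a socioeconomic index
m = -0.5 * x + rng.normal(size=n)                        # mediator
y = 200 + 5.0 * x - 20.0 * m + rng.normal(0, 10, size=n)  # outcome, treated as continuous


def ols(design, response):
    """Least squares coefficients for response ~ design (design includes the intercept)."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


ones = np.ones(n)

# Mediator model: m ~ x. The slope a is the exposure-to-mediator path.
a = ols(np.column_stack([ones, x]), m)[1]

# Outcome model: y ~ x + m. c_prime is the direct effect, b the mediator-to-outcome path.
_, c_prime, b = ols(np.column_stack([ones, x, m]), y)

# Total effect model: y ~ x.
total_effect = ols(np.column_stack([ones, x]), y)[1]

indirect_effect = a * b
direct_effect = c_prime
print(indirect_effect, direct_effect, total_effect)
```

With linear models fitted by least squares to the same sample, the total effect equals the direct effect plus the product `a * b`, which is why the product is interpreted as the indirect effect. In practice an interval for the indirect effect would also be needed, for example from bootstrapping, which this sketch omits.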

When working with count data from small geographic regions, AARs estimated directly from the raw data can also be subject to substantial variability. This variability results from both the small population counts in the denominators and sampling variability (Shaddick and Zidek, 2015). This instability is one downside of the C-BM approach, and it becomes more pronounced when working with data on a finer spatial scale, such as counts at the ZIP code level compared to counts at the county level. Numerous methods have been developed in the area of disease mapping to reduce this variability and produce reliable small area estimates (SAEs) of disease risk (Lawson, 2018). Many of these methods employ Bayesian hierarchical models and assume that disease risk is spatially correlated. The spatial correlation is captured through a spatial random effect term with a conditional autoregressive (CAR) prior distribution. SAEs derived from these models lead to more precise estimates of disease risk than those calculated from the raw data (Shaddick and Zidek, 2015).
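To see why raw rates become unstable as populations shrink, consider a small sketch that assumes case counts follow a Poisson distribution, so that the variance of a count equals its mean. The Poisson assumption, the function name `raw_rate_sd`, and the numbers are illustrative and are not taken from the paper.

```python
import math


def raw_rate_sd(true_rate_per_100k, population, per=100_000):
    """Standard deviation of a raw rate (per 100,000) under a Poisson count model."""
    expected_cases = true_rate_per_100k / per * population
    # Under the Poisson assumption, the SD of the count is sqrt(expected count);
    # dividing by the population and rescaling gives the SD of the rate.
    return math.sqrt(expected_cases) / population * per


# Same hypothetical true rate (50 per 100,000), two population sizes.
small_area = raw_rate_sd(50, 500)      # e.g. a small ZIP code
large_area = raw_rate_sd(50, 50_000)   # e.g. a county
print(small_area, large_area)
```

Under these assumptions the area with 500 residents has a rate standard deviation of about 100 per 100,000, twice the true rate itself, while the area with 50,000 residents has a standard deviation of about 10 per 100,000.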

The second approach found in the literature is what we refer to as the “Small-area estimation before mediation” or SAE-BM approach. SAEs have been created for diverse health outcomes and geographies, and many are publicly available for download by researchers (e.g. Institute for Health Metrics and Evaluation (IHME), 2016). The SAE-BM approach is characterized by using a pre-existing SAE of the AAR as the outcome variable for the regions of interest and performing the mediation analysis by treating this estimated AAR as a continuous outcome. For example, O’Connor et al. (2018) utilized the SAE-BM approach to identify county-level variables that explain the relationship between county-level median household income and county-level age-adjusted cancer mortality rates in the United States.

Several challenges exist with using SAEs as an outcome variable in mediation analyses. While this approach is utilized in the literature (e.g. O’Connor et al., 2018, Drewnowski et al., 2014), the model used to create the SAEs can affect the results of secondary regression analyses (Kong and Zhang, 2020). For example, if SAEs are created from variables that are also of interest in the mediation analysis, biases may arise. Researchers have cautioned against using an outcome variable that was derived from a variable which is included as a covariate in the new regression model (Ogburn et al., 2021, Kong and Zhang, 2020). This approach can introduce a substantial amount of bias into an analysis, potentially creating artificial or inflated relationships between the derived estimates and the compositional variables, but its influence has not been assessed in a causal mediation analysis using SAEs. Furthermore, SAE models may omit important contextual factors of a neighborhood or region, and while random effects for each small area may capture some of this local variation, they will not capture all of it. This can lead to biased SAEs being used as outcome variables in regression models (Kong and Zhang, 2020). The use of SAEs as an outcome variable could also lead to very narrow confidence or credible intervals because the outcome variable in the regression model is itself a mean with less variability than a realization of the data. Finally, treating SAEs as a continuous variable does not reflect the true data-generating process of AARs, which arise from a set of age-group-specific counts.

The final approach that we consider is the “Small-area estimation within mediation” or SAE-WM approach. We propose this novel approach to overcome a number of the aforementioned limitations of the C-BM and SAE-BM approaches, allowing for estimation of the AARs and the mediation analysis to occur simultaneously. In this proposal, a Bayesian SAE model for counts is utilized as the outcome model in the mediation analysis, integrating the small-area estimation and mediation steps into a single analysis.

The objectives of this paper are to introduce the SAE-WM method as an alternative approach to conducting mediation analyses with areal datasets while illustrating challenges that may arise when using the existing approaches. This work was motivated by the need to select an appropriate statistical method for understanding Iowa’s unexplained high cancer rates. In particular, age-adjusted colorectal cancer incidence rates from 2014–2018 were higher in ZIP codes with lower median household incomes, with variability of rates and risk factors across regions of the state. Based on this, the research question that motivated this work is as follows:

Does ZIP code-level healthcare availability explain any of the relationship between ZIP code-level socioeconomic environment and colorectal cancer incidence AARs in the state of Iowa?

We first contribute to the range of statistical methods available for rigorously addressing these types of questions and then apply the proposed SAE-WM approach to explore the impact of healthcare availability, as defined by the presence of a hospital in a given ZIP code.

The remainder of this paper is organized as follows. In Section 2, we provide important background information on mediation analysis, small-area estimation, and relevant notation. In Section 3, we describe the methods, including additional details on the novel SAE-WM approach, and introduce the simulation study design. In Section 4, we present the results of the simulation study to highlight the advantages of the SAE-WM approach to mediation analyses with areal datasets and to illustrate the pitfalls that may arise when conducting mediation analyses using the C-BM and SAE-BM approaches. We then highlight the SAE-WM method’s utility in practice in Section 5 by applying it to our motivating application. Finally, in Section 6, we discuss the strengths, limitations, and practical recommendations for conducting mediation analyses with areal data, particularly in the context of AAR outcomes.
