Ecological inference is a statistical problem in which aggregate-level
data are used to make inferences about individual-level behavior.
In this article, we conduct a theoretical and empirical study of Bayesian and
likelihood inference for 2 x 2 ecological tables by applying
the general statistical framework of incomplete data. We first show
that the ecological inference problem can be decomposed into three
factors: distributional effects, which address the possible
misspecification of parametric modeling assumptions about the
unknown distribution of missing data; contextual effects, which
represent the possible correlation between missing data and observed
variables; and aggregation effects, which are directly related
to the loss of information caused by data aggregation. We then
examine how these three factors affect inference and offer new
statistical methods to address each of them. To deal with
distributional effects, we propose a nonparametric Bayesian model
based on a Dirichlet process prior which relaxes common parametric
assumptions. We also identify the statistical adjustments necessary
to account for contextual effects. Finally, while little can be
done to cope with aggregation effects, we offer a method to quantify
the magnitude of such effects in order to formally assess their
severity. We use simulated and real data sets to empirically
investigate the consequences of these three factors and to evaluate
the performance of our proposed methods.
C code, along with an
easy-to-use R interface, is publicly available for implementing our
proposed methods.