
Priors for Data Pathologies

Derek Miller edited this page Nov 6, 2018 · 6 revisions

This page describes several data pathologies. Each description discusses a prior on the coefficients that might represent the behavior. The goal of specifying a prior is described in Kaipio and Somersalo's book Statistical and Computational Inverse Problems. They write,

[T]he prior probability distribution should be concentrated on those values of x that we expect to see and assign a clearly higher probability to them than to those that we do not expect to see.

That said, prior specification is full of nuance and complexity. Here we detail our thoughts on ways to look at constructing priors for data pathologies in the context of discrete choice experiments for conjoint analysis. General recommendations for specifying priors can be found on the Stan Prior wiki.

Attribute Non-Attendance

  1. Blind Respondent. The respondent totally ignores certain features, as if they couldn't see them. This violates the independence of irrelevant alternatives (IIA) assumption, since they might make different decisions if they were aware of the other features. A possible way to model this scenario is to randomly exclude certain variables or combinations of variables from the analysis (that is, from the design matrix X); they can't just be zeroed out because they are technically not in the consideration set. This means fitting many models, each on a different subset of the variables.

  2. Respondent Apathy. The respondent is aware of the other features but doesn't care about them. In this case, we include all variables since they are all in the consideration set, but we expect more of the posterior betas to be zero than we would under the standard multivariate normal (MVN) prior. These extra zero coefficients could be modeled with a multivariate Cauchy or multivariate Student-t prior, but empirically these do little to identify pathological respondents.
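The "fit on subsets" idea for the blind respondent can be sketched as below. This is only an illustration under stated assumptions: each column of X is treated as one attribute (real conjoint designs usually code an attribute as several columns), and `attribute_subsets`, `max_ignored`, and the random design matrix are hypothetical names and data, not part of the original analysis.

```python
from itertools import combinations

import numpy as np


def attribute_subsets(n_attributes, max_ignored):
    """Yield tuples of attribute indices that stay visible when up to
    `max_ignored` attributes are dropped. At least one attribute is kept."""
    all_idx = range(n_attributes)
    # Never ignore every attribute, whatever max_ignored is.
    largest = min(max_ignored, n_attributes - 1)
    for k in range(largest + 1):
        for ignored in combinations(all_idx, k):
            yield tuple(i for i in all_idx if i not in ignored)


# Hypothetical design matrix: 6 alternatives x 3 attributes (one column each).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

subsets = list(attribute_subsets(X.shape[1], max_ignored=1))
# One reduced design matrix per subset; each would get its own model fit.
reduced = [X[:, list(keep)] for keep in subsets]

for keep, X_sub in zip(subsets, reduced):
    print(keep, X_sub.shape)
```

The number of subsets grows combinatorially with the number of attributes, which is why the text above speaks of fitting many models.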

Notes

  • Priors that could potentially approximate this behavior: multivariate Student-t, multivariate Cauchy, Laplace, Horseshoe.
  • In particular, the Horseshoe or the Finnish Horseshoe seems to work well for ANA.
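To see why a Horseshoe-type prior suits attribute non-attendance, it can help to compare prior draws directly. The sketch below assumes the basic horseshoe form (a standard normal scaled by a half-Cauchy local scale) with the global scale `tau` fixed at 1 purely for illustration; the helper names are hypothetical, and this is not the model the authors fit.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 100_000
tau = 1.0  # global scale, fixed here for illustration (an assumption)

# Standard normal prior draws for a single coefficient.
normal_draws = rng.normal(size=n_draws)

# Basic horseshoe: beta = tau * lambda * z, with lambda ~ half-Cauchy(0, 1).
local_scale = np.abs(rng.standard_cauchy(size=n_draws))
horseshoe_draws = tau * local_scale * rng.normal(size=n_draws)


def mass_near_zero(draws, eps=0.1):
    """Fraction of draws with absolute value below eps."""
    return np.mean(np.abs(draws) < eps)


def mass_in_tails(draws, cut=3.0):
    """Fraction of draws with absolute value above cut."""
    return np.mean(np.abs(draws) > cut)


print("near zero:", mass_near_zero(normal_draws), mass_near_zero(horseshoe_draws))
print("in tails: ", mass_in_tails(normal_draws), mass_in_tails(horseshoe_draws))
```

The horseshoe draws put more mass both very close to zero and far out in the tails, which matches the intuition of "ignored" coefficients sitting near zero while attended ones remain free to be large.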

Screening Behavior

  1. I'd Rather Die. In this case, the respondent is totally against one or more levels of a certain feature and will avoid it at all costs. If that level shows up in all of the alternatives, the respondent considers the feature irrelevant and makes their choice accordingly. This behavior corresponds to a coefficient of negative infinity, which is itself pathological, so we haven't found a prior that can account for this kind of extreme behavior without breaking the sampler.

  2. I'd Rather Not. In this case, the respondent is averse to certain feature levels (the coefficients take on large negative values) and, whenever possible, avoids choosing alternatives with these levels present. A prior for this scenario will not capture coefficients of infinite magnitude, and it may lead to a nonidentifiable posterior.

  3. Thresholding. The respondent can tolerate some levels of an attribute, but once the level crosses a threshold, the respondent chooses the option with this undesirable level with probability zero. This behavior is similar to the I'd Rather Not scenario, but instead of focusing on how much (dis)utility a respondent gets from the presence of a feature, they directly assign a probability of 0 to choosing options with that feature present beyond a certain threshold. This sort of thresholding behavior may manifest as a discontinuity in the distribution of the coefficients. However, alternative specifications may be able to mitigate the discontinuity.
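The scenarios above can be illustrated with choice probabilities, assuming a multinomial logit model (a common choice for discrete choice experiments, but an assumption here). The function `choice_probabilities` and the two small design matrices are hypothetical. The sketch shows that a large negative coefficient drives the choice probability toward zero without reaching it, and that the coefficient stops mattering when the disliked level appears in every alternative.

```python
import numpy as np


def choice_probabilities(X, beta):
    """Multinomial logit choice probabilities for a single choice task."""
    utilities = X @ np.asarray(beta, dtype=float)
    utilities = utilities - utilities.max()  # subtract max for numerical stability
    expu = np.exp(utilities)
    return expu / expu.sum()


# Hypothetical task: 3 alternatives, 2 binary features.
# Column 0 is the disliked level; only alternative 0 carries it.
X_one = np.array([[1.0, 1.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])

# Same task, but the disliked level appears in every alternative.
X_all = np.array([[1.0, 1.0],
                  [1.0, 1.0],
                  [1.0, 0.0]])

for b in (-1.0, -5.0, -20.0):
    p_one = choice_probabilities(X_one, [b, 0.5])
    p_all = choice_probabilities(X_all, [b, 0.5])
    print(f"b={b:6.1f}  P(alt 0 | one)={p_one[0]:.3e}  P(all)={np.round(p_all, 3)}")
```

Because the logit probabilities depend only on utility differences, the disliked level cancels out when it is shared by all alternatives, which mirrors the "irrelevant feature" remark in the I'd Rather Die case.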

Notes

  • Priors that model this behavior: mixture of Gaussians, ReLU or Leaky ReLU* applied to Gaussian coefficients, skew normal.

Respondent Quality

  1. Speeders and Cheaters. The respondent always selects the nth alternative, regardless of their actual preferences.

  2. Hypersensitivity. The respondent only selects alternatives with the maximum or minimum level of some feature.
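As a rough illustration of the first pattern, the sketch below flags respondents whose chosen position never changes across tasks. It is a simple descriptive check, not a prior or a model, and the function name, data layout (respondents by tasks, entries are chosen positions), and example data are all hypothetical.

```python
import numpy as np


def always_same_position(choices):
    """Return a boolean array marking respondents who pick the same
    alternative position in every task.

    `choices` has shape (n_respondents, n_tasks); each entry is the
    index of the chosen alternative."""
    choices = np.asarray(choices)
    # Compare every task's choice with the respondent's first choice.
    return np.all(choices == choices[:, [0]], axis=1)


# Hypothetical data: 3 respondents, 5 choice tasks each.
choices = np.array([
    [2, 2, 2, 2, 2],  # always the third alternative
    [0, 1, 2, 1, 0],  # varied choices
    [1, 1, 1, 1, 0],  # nearly constant, but not flagged
])

flags = always_same_position(choices)
print(flags)
```

A respondent with genuinely strong preferences could also be flagged when the design happens to place their preferred profile in the same position, so a flag on its own is only suggestive.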

Notes

  • WIP

Comments

* Leaky ReLU stands for Leaky Rectified Linear Unit. Each coefficient is passed through a thresholding function. In the case of screening behavior, we've found that the function f(x) = -x^2 if x <= -1, and f(x) = x otherwise, provides a good fit. Admittedly, this is more ad hoc than principled, but it tends to work reasonably well.
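A minimal sketch of the thresholding function described above, applied to Gaussian draws. The function name `screening_transform` and the use of standard normal draws are assumptions for illustration; only the piecewise definition comes from the text.

```python
import numpy as np


def screening_transform(x):
    """Apply f(x) = -x^2 if x <= -1, else x, elementwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= -1.0, -(x ** 2), x)


# The function is continuous at x = -1, where both branches equal -1.
print(screening_transform([-3.0, -1.0, -0.5, 2.0]))

# Applied to Gaussian coefficient draws, it stretches the left tail only.
rng = np.random.default_rng(1)
raw = rng.normal(size=10_000)
transformed = screening_transform(raw)
print("min before:", raw.min(), "min after:", transformed.min())
print("max before:", raw.max(), "max after:", transformed.max())
```

Values above -1 pass through unchanged, while values below -1 are pushed further negative, giving an asymmetric distribution with a much heavier left tail.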