Funneling

Filed under bayesian, python.

Bayesian hierarchical (or multilevel) models have two or more layers of random variables. Commonly, variables from one layer are drawn from a parent distribution whose shape depends on the variables from the layer above it. This imposition of structure on the random variables causes partial pooling (or shrinkage), which can lead to hierarchical models having drastically improved out-of-sample predictive accuracy compared to their non-hierarchical counterparts. For a real-world demonstration of this, see Gelman and Stern (2006).

Unfortunately, it can be difficult to sample the posterior distributions of hierarchical models properly, especially when there are few data, due to a phenomenon sometimes called funneling. In this post, I demonstrate funneling using some simple models and show how it can be remedied via reparameterization.

While the funneling problem and its solution have been written about before, in my opinion, they are not well known enough among Bayesian modelers. For instance, in their introductory textbook on Bayesian modeling for cognitive scientists, Lee and Wagenmakers (2013) present numerous models that will likely suffer from funneling if applied to psychological experiments containing few data.

Neal’s funnel

Neal (2003) provides the classic demonstration of funneling. This paper was written to show the inadequacies of older Markov chain Monte Carlo (MCMC) sampling methods, such as Gibbs sampling (Geman & Geman, 1984). However, as we shall see momentarily, even contemporary MCMC sampling methods struggle to properly sample from Neal’s funnel.

Suppose we have the following ten normally distributed random variables:

\[\begin{equation} v\sim\textrm{Normal}\left(0, 3\right)\\ x_i\sim\textrm{Normal}\left(0,e^v\right)\textrm{ for }i=0\textrm{ to }8 \end{equation}\]

(Actually, we don’t need \(x_1\) to \(x_8\) to demonstrate funneling, but this is what Neal used, so we’ll stick with it.) Let’s simulate some data and visualize this probability distribution using Python. (Thanks to Junpeng Lao for creating the notebook where I got most of this code.)

"""Generate data from Neal's funnel distribution.

"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
from scipy.stats import norm


def main():

    # set up figure
    fs = rcParams["figure.figsize"]
    rcParams["figure.figsize"] = (fs[0], fs[0] / 2)
    rcParams["lines.linewidth"] = 2
    rcParams["font.size"] = 14

    # generate data
    np.random.seed(0)
    k = 9
    n = 10000
    v = norm.rvs(0, 3, n)
    x = norm.rvs(0, np.exp(v / 2), (k, n))

    # plot data
    fig, axes = plt.subplots(1, 2, constrained_layout=True)
    ax = axes[0]
    ax.scatter(x[0], v, marker=".", alpha=0.05, rasterized=True)
    ax.set_xlim(-20, 20)
    ax.set_ylim(-9, 9)
    ax.set_xlabel("$x_0$")
    ax.set_ylabel("$v$")

    # plot analytic log-likelihood
    ax = axes[1]
    r = 500
    x, v = np.meshgrid(np.linspace(-20, 20, r), np.linspace(-9, 9, r))
    logp = norm.logpdf(v, 0, 3) + norm.logpdf(x, 0, np.exp(v / 2))
    ax.imshow(logp, vmin=-7.5, vmax=-2.5, cmap="viridis", origin="lower")
    ax.set_yticks([])
    ax.set_yticklabels([])
    ax.set_xticks(np.linspace(0, 499, 5))
    ax.set_xticklabels(np.linspace(-20, 20, 5).astype(int))
    ax.set_xlabel("$x_0$")

    # save
    plt.savefig("../images/neals-funnel-a.svg", bbox_inches=0, transparent=True)


if __name__ == "__main__":
    main()

Random samples (left) and log likelihood (right) of two parameters from Neal’s funnel, \(x_0\) and \(v\).

Now let’s try to sample from this distribution using PyMC3, a Bayesian modeling package for Python (Salvatier et al., 2016). We will definine our random variables and assign them the correct prior distributions, but will not provide any data with which to update the priors. Thus, the model’s prior and posterior distributions are identical. When we sample from the posterior distribution using MCMC, we should obtain samples that are extremely similar to the data we simulated earlier.

For continuous random variables, PyMC3 uses no-U-turn sampling (NUTS; Hoffman & Gelman, 2014), a form of adaptive Hamiltonian Monte Carlo (HMC; Neal, 2012), by default. This sampling method is considered to be state of the art and is considerably more efficient than older MCMC sampling methods such as Gibbs.

"""Generate data and sample from Neal's funnel distribution.

"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
import pymc3 as pm
from scipy.stats import norm


def main():

    with pm.Model():

        # set up figure
        fs = rcParams["figure.figsize"]
        rcParams["figure.figsize"] = (fs[0], fs[0] / 2)
        rcParams["lines.linewidth"] = 2
        rcParams["font.size"] = 14

        # simulate data
        np.random.seed(0)
        k = 9
        n = 10000
        v = norm.rvs(0, 3, n)
        x = norm.rvs(0, np.exp(v / 2), (k, n))

        # plot simulated data
        fig, axes = plt.subplots(
            1, 2, constrained_layout=True, sharex=True, sharey=True
        )
        ax = axes[0]
        ax.scatter(x[0], v, marker=".", alpha=0.05, rasterized=True)
        ax.set_xlim(-20, 20)
        ax.set_ylim(-9, 9)
        ax.set_xlabel("$x_0$")
        ax.set_ylabel("$v$")

        # set up model
        v_ = pm.Normal("v", mu=0, sd=3)
        x_ = pm.Normal("x", mu=0, sd=pm.math.exp(v_ / 2), shape=k)

        # sample and save samples
        trace = pm.sample(n, chains=1)
        v_samples = trace["v"][:]
        x_samples = trace["x"][:].T

        # plot samples
        ax = axes[1]
        ax.scatter(
            x_samples[0], v_samples, marker=".", alpha=0.05, rasterized=True, color="r"
        )
        ax.set_xlabel("$x_0$")

        # save
        plt.savefig("../images/neals-funnel-b.svg", bbox_inches=0, transparent=True)


if __name__ == "__main__":
    main()

The same random samples as before (left) and the posterior samples from the Bayesian model (right).

Clearly something has gone wrong. Posterior samples appear to have been drawn from the top part of the funnel only; small values of \(v\) are not represented. As mentioned earlier, Neal (2003) proposed this example to demonstrate the problems of older MCMC sampling methods. However, as demonstrated above, NUTS also fails to sample Neal’s funnel correctly (see Betancourt & Girolami, 2013).

A more realistic example

Neal’s funnel is a somewhat unrealistic distribution to propose as a model for any real-world data-analysis problem. Let’s define a distribution that more closely resembles something we might actually use:

\[\begin{equation} \mu\sim\textrm{Normal}\left(0,1\right)\\ \sigma\sim\textrm{Half-Cauchy}\left(1\right)\\ y_{i}\sim\textrm{Normal}\left(\mu, \sigma^2\right)\textrm{ for }i=0\textrm{ to }n \end{equation}\]

Here, \(y_0\) to \(y_n\) could plausibly be some observed data drawn from the same distribution whose mean and variance are unknown. For the sake of simplicity we will set \(n\) to 1. As demonstrated in by the code and figure below, funneling occurs here too.

"""Generate data from a more realistic hierarchical distribution.

"""
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
from matplotlib import rcParams
from scipy.stats import norm, halfcauchy


def main():

    # generate data
    np.random.seed(0)
    n = 1
    m = 10000
    mu = norm.rvs(0, 1, m)
    sigma = halfcauchy.rvs(0, 1, m)
    y = norm.rvs(mu, sigma, (n, m))

    # set up model
    with pm.Model():

        mu_ = pm.Normal("mu", 0, 1)
        sigma_ = pm.HalfCauchy("sigma", 1)
        y_ = pm.Normal("y", mu_, sigma_, shape=n)

        # sample and save samples
        trace = pm.sample(m, chains=1)
        mu_samples = trace["mu"][:]
        sigma_samples = trace["sigma"][:]
        y_samples = trace["y"].T[:]

    # plot 2-D figures
    sc = 5
    fs = rcParams["figure.figsize"]
    rcParams["figure.figsize"] = (fs[0], fs[0])
    rcParams["lines.linewidth"] = 2
    rcParams["font.size"] = 14
    fig, axes = plt.subplots(2, 2, constrained_layout=True, sharex=True)

    ax = axes[0, 0]
    ax.scatter(y[0], mu, marker=".", alpha=0.05, rasterized=True)
    ax.set_xlim(-sc, sc)
    ax.set_ylim(-sc, sc)
    ax.set_ylabel("$\mu$")

    ax = axes[0, 1]
    ax.scatter(
        y_samples[0], mu_samples, marker=".", alpha=0.05, rasterized=True, color="r"
    )
    ax.set_ylim(-sc, sc)
    ax.set_yticklabels([])

    ax = axes[1, 0]
    ax.scatter(y[0], sigma, marker=".", alpha=0.05, rasterized=True)
    ax.set_ylim(0, sc / 2)
    ax.set_xlabel("$y_0$")
    ax.set_ylabel("$\sigma$")
    ax = axes[1, 1]
    ax.scatter(
        y_samples[0], sigma_samples, marker=".", alpha=0.05, rasterized=True, color="r"
    )
    ax.set_ylim(0, sc / 2)
    ax.set_yticklabels([])
    ax.set_xlabel("$y_0$")

    # save
    plt.savefig("../images/realistic-funnel-a.svg", bbox_inches=0, transparent=True)


if __name__ == "__main__":
    main()

Random samples from the new distribution (blue) and posterior samples from its Bayesian model (red).

One can perhaps see that the problem lies in the sampling of the hyperparameters, the higher-level random variables that control the shapes of the prior distributions on the lower-level variables. In both examples, small values of those variables dictating the scale of the prior, \(v\) in Neal’s funnel and \(\sigma\) above, are not represented.

Reparameterization

As explained by Betancourt and Girolami (2013), the problem is caused by the fact that random variables within hierarchical models are necessarily highly dependent on one another when the data are sparse. (Actually, Betancourt and Girolami describe this as a correlation between variables. Perhaps this is the correct term in a mathematical sense, but in the empirical sciences, correlation usually implies a linear relationship, and the kinds of dependencies we see in hierarchical models are typically not linear. For example, the Pearson product–moment coefficient of \(v\) and \(x_0\) in Neal’s funnel is 0.) Although modern MCMC sampling methods do better than older ones, the performance of all methods is degraded in the presence of such dependencies.

As more data are added, dependencies between random variables in the joint posterior distribution are reduced, which allows MCMC sampling to better explore it. So long as a hierarchical model is provided with enough data, funneling is minimized. But how many data are sufficient? This is a difficult question to answer, but as this blog post by Thomas Wiecki shows, funneling occurs in the hierarchical analysis of fairly typical datasets. Moreover, in almost all real-life data-analysis situations, researchers do not have the luxury of collecting more data. Therefore, a more general solution is required.

It has long been known that correlations between random variables in hierarchical models can be reduced by adopting non-centered parameterizations (NCPs; Papaspiliopoulos et al., 2007). A successful NCP of Neal’s funnel is

\[\begin{equation} v\sim\textrm{Normal}\left(0, 3\right)\\ \tilde{x_i}\sim\textrm{Normal}\left(0,1\right)\textrm{ for }i=0\textrm{ to }8\\ x_i=e^v\tilde{x_i} \end{equation}\]

There are two differences between the NCP and the original, centered parameterization: first, \(x_0\) through \(x_8\) are now deterministic rather than stochastic random variables, meaning that MCMC sampling methods do not update them directly; second, the new stochastic random variables, denoted by \(\tilde{x_i}\), are independent of \(v\). This NCP of Neal’s funnel is presented in PyMC3 code below, along with a figure that demonstrates independence between the stochastic random variables and lack of funneling.

"""Generate data and sample from Neal's funnel distribution.

"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
import pymc3 as pm
from scipy.stats import norm


def main():

    with pm.Model():

        # set up figure
        fs = rcParams["figure.figsize"]
        rcParams["figure.figsize"] = (fs[0], fs[0] / 2)
        rcParams["lines.linewidth"] = 2
        rcParams["font.size"] = 14

        # simulate data
        np.random.seed(0)
        k = 9
        n = 10000
        v = norm.rvs(0, 3, n)
        x = norm.rvs(0, np.exp(v / 2), (k, n))

        # set up model
        v_ = pm.Normal("v", mu=0, sd=3)
        xt_ = pm.Normal("xt", mu=0, sd=1, shape=k)
        x_ = pm.Deterministic("x", pm.math.exp(v_ / 2) * xt_)

        # sample and save samples
        trace = pm.sample(n, chains=1)
        v_samples = trace["v"][:]
        xt_samples = trace["xt"][:].T
        x_samples = trace["x"][:].T

        # plot samples
        # plot simulated data
        fig, axes = plt.subplots(1, 2, constrained_layout=True)
        ax = axes[0]
        ax.scatter(
            xt_samples[0], v_samples, marker=".", alpha=0.05, rasterized=True, color="r"
        )
        ax.set_xlim(-3.5, 3.5)
        ax.set_ylim(-9, 9)
        ax.set_xlabel(r"$\tilde{x}_0$")
        ax.set_ylabel("$v$")
        ax = axes[1]
        ax.scatter(
            x_samples[0], v_samples, marker=".", alpha=0.05, rasterized=True, color="r"
        )
        ax.set_xlabel("$x_0$")
        ax.set_xlim(-20, 20)
        ax.set_ylim(-9, 9)

        # save
        plt.savefig("../images/neals-funnel-c.svg", bbox_inches=0, transparent=True)


if __name__ == "__main__":
    main()

Posterior samples from the NCP of Neal’s funnel. As shown in the left panel, the dependencies between the stochastic random variables are gone. As shown in the right panel, the joint distribution is sampled properly.

(Strangely, in other definitions of the NCP of Neal’s funnel, such as the one found in the Stan user manual, \(v\) is also a deterministic variable where \(\tilde{v}\sim\textrm{Normal}\left(0,1\right)\), and \(v=3\tilde{v}\). But intuitively one can see that this extra step is not necessary because it doesn’t make any difference to the dependencies between the stochastic random variables. Perhaps it is included for historical reasons, much like the additional unnecessary lower-level variables in Neal’s funnel.)

Similarly, our second example distribution can be reparameterized as

\[\begin{equation} \mu\sim\textrm{Normal}\left(0,1\right)\\ \sigma\sim\textrm{Half-Cauchy}\left(1\right)\\ \tilde{y}_{i}\sim\textrm{Normal}\left(0, 1\right)\textrm{ for }i=0\textrm{ to }n\\ y=\mu+\tilde{y}_{i}\sigma \end{equation}\]

Again, the NCP removes the dependencies between all the stochastic random variables, and reduces funneling, as shown below.

"""Generate data from a more realistic hierarchical distribution.

"""
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
from matplotlib import rcParams
from scipy.stats import norm, halfcauchy


def main():

    # generate data
    np.random.seed(0)
    n = 1
    m = 10000
    mu = norm.rvs(0, 1, m)
    sigma = halfcauchy.rvs(0, 1, m)
    y = norm.rvs(mu, sigma, (n, m))

    # set up model
    with pm.Model():

        mu_ = pm.Normal("mu", 0, 1)
        sigma_ = pm.HalfCauchy("sigma", 1)
        yt_ = pm.Normal("yt", 0, 1, shape=n)
        pm.Deterministic("y", mu_ + yt_ * sigma_)
        # y_ = pm.Normal("y", mu_, sigma_, shape=n)

        # sample and save samples
        trace = pm.sample(m, chains=1)
        mu_samples = trace["mu"][:]
        sigma_samples = trace["sigma"][:]
        yt_samples = trace["yt"].T[:]
        y_samples = trace["y"].T[:]

    # plot 2-D figures
    sc = 5
    fs = rcParams["figure.figsize"]
    rcParams["figure.figsize"] = (fs[0], fs[0])
    rcParams["lines.linewidth"] = 2
    rcParams["font.size"] = 14
    fig, axes = plt.subplots(2, 2, constrained_layout=True)

    ax = axes[0, 0]
    ax.scatter(
        yt_samples[0], mu_samples, marker=".", alpha=0.05, rasterized=True, color="r"
    )
    ax.set_xlim(-sc, sc)
    ax.set_ylim(-sc, sc)
    ax.set_ylabel("$\mu$")
    ax.set_xticklabels([])

    ax = axes[0, 1]
    ax.scatter(
        y_samples[0], mu_samples, marker=".", alpha=0.05, rasterized=True, color="r"
    )
    ax.set_xlim(-sc, sc)
    ax.set_ylim(-sc, sc)
    ax.set_yticklabels([])
    ax.set_xticklabels([])

    ax = axes[1, 0]
    ax.scatter(
        yt_samples[0], sigma_samples, marker=".", alpha=0.05, rasterized=True, color="r"
    )
    ax.set_xlim(-sc, sc)
    ax.set_ylim(0, sc / 2)
    ax.set_xlabel(r"$\tilde{y}_0$")
    ax.set_ylabel("$\sigma$")

    ax = axes[1, 1]
    ax.scatter(
        y_samples[0], sigma_samples, marker=".", alpha=0.05, rasterized=True, color="r"
    )
    ax.set_xlim(-sc, sc)
    ax.set_ylim(0, sc / 2)
    ax.set_yticklabels([])
    ax.set_xlabel("$y_0$")

    # save
    plt.savefig("../images/realistic-funnel-b.svg", bbox_inches=0, transparent=True)


if __name__ == "__main__":
    main()

Posterior samples from the NCP of the second example distribution.

References

Betancourt, M. J., & Girolami, M. (2013). Hamiltonian Monte Carlo for hierarchical models. arXiv:1312.0906.

Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 60(4), 328–331. 10.1198/000313006X152649

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6), 721–741. 10.1109/TPAMI.1984.4767596

Hoffman, M. D., & Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J Mach Learn Res, 15, 1593–1623.

Lee, M. D., & Wagenmakers, E. (2013). Bayesian cognitive modeling: A practical course. Cambridge University Press.

Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3), 705–767. 10.1214/aos/1056562461

Neal, R. M. (2012). MCMC using Hamiltonian dynamics. arXiv:1206.1901.

Papaspiliopoulos, O., Roberts, G. O., & Sköld, M. (2007). A General Framework for the Parametrization of Hierarchical Models. Statistical Science, 22(1), 59–73. 10.1214/088342307000000014

Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Comput Science, 2, e55. 10.7717/peerj-cs.55

Version history

Originally posted August 17, 2019.
Shortened the title on March 01, 2020.
Fixed strange new mathjax/kramdown bug on July 15, 2020.

“Bayesian SEM IV: Generalized SEM,” May 19, 2020.
“Bayesian SEM III: Univariate paramaterization,” May 19, 2020.
“Bayesian SEM II: Hierarchical latent variables,” May 15, 2020.
“Measuring working-memory capacity,” Mar 16, 2020.
All posts filed under bayesian, python.

The Cracked Bassoon