Bayesian SEM III:
Structural equation modeling (SEM; Bollen, 1989) is a somewhat tricky statistical technique because it involves multivariate distributions. Conventional SEM is also restrictive, since generalizations beyond multivariate normality are not straightforward. If data aren’t approximately multivariate normal—perhaps they are counts of successful trials in an experiment or Likert-style ratings on a questionnaire—researchers are faced with a very daunting modeling task. Most of the time, the issue is simply ignored and multivariate normality is assumed anyway!
Within the Bayesian framework, it may be possible to recast an SEM problem as one in which all random variables are given univariate priors. This approach could be advantageous when dealing with non-normal data, because it is much more straightforward to create generalizations of univariate normal models than multivariate normal models. However, the downside is that the univariate model may be much larger than the multivariate model due to having more latent variables, all of which need to be estimated explicitly, making posterior sampling slower and more difficult. Nevertheless, for scientists with small data sets, lots of time on their hands, and/or access to powerful computers, it might be worth the extra effort.
Here, I convert my previous hierarchical model of the classic Holzinger and Swineford (1939) data set from a multivariate model to a purely univariate model.
Original (multivariate) version
Here are the key equations that describe the previous Bayesian SEM that was placed on the standardized Holzinger and Swineford (1939) data:
\[\begin{equation} \boldsymbol{Y}\sim\mathrm{MvNormal}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)\\ \boldsymbol{\mu}=\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\alpha}\\ \boldsymbol{\Sigma}=\boldsymbol{\Lambda}(\boldsymbol{I}-\boldsymbol{\Gamma})^{-1}\boldsymbol{\Psi}(\boldsymbol{I}-\boldsymbol{\Gamma}^\mathrm{T})^{-1}\boldsymbol{\Lambda}^\mathrm{T}+\boldsymbol{\Theta} \end{equation}\]where \(\boldsymbol{Y}\) is the data matrix with shape \(n \times p\), where \(n\) is the number of cases and \(p\) is the number of items; \(\boldsymbol{\mu}\) is a \(p\)-length vector of item means; \(\boldsymbol{\nu}\) is a \(p\)-length vector of item intercepts; \(\boldsymbol{\alpha}\) is an \(m\)-length vector of factor intercepts; \(\boldsymbol{\Lambda}\) is the sparse loading matrix of shape \(p \times m\), where \(m\) is the number of latent variables; \(\boldsymbol{I}\) is the \(m \times m\) identity matrix; \(\boldsymbol{\Gamma}\) is the \(m \times m\) sparse path matrix; \(\boldsymbol{\Psi}\) is the \(m \times m\) matrix of latent-variable residual covariances; and \(\boldsymbol{\Theta}\) is the \(p \times p\) matrix of item residual covariances. The model assumed that the latent variables had unit variances and no residual correlations, so \(\boldsymbol{\Psi}\) was an identity matrix, and that the items had no residual correlations, so \(\boldsymbol{\Theta}\) was a diagonal matrix.
Figure: Relationships between variables in the hierarchical model.
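To make the covariance structure concrete, here is a small NumPy sketch of how \(\boldsymbol{\Sigma}\) is assembled. The matrices are made-up toy values, not the Holzinger and Swineford estimates.

```python
import numpy as np

# toy dimensions and made-up parameter values (p items, m latent variables)
p, m = 4, 2
Lambda = np.array([[0.5, 0.0],          # sparse loading matrix (p x m)
                   [0.6, 0.0],
                   [0.0, 0.7],
                   [0.0, 0.4]])
Gamma = np.array([[0.0, 0.0],           # sparse path matrix (m x m)
                  [0.8, 0.0]])
Psi = np.eye(m)                         # latent residual covariances (identity here)
Theta = np.diag([0.3, 0.4, 0.2, 0.5])   # diagonal item residual covariances

I = np.eye(m)
B = np.linalg.inv(I - Gamma)            # (I - Gamma)^{-1}; note B.T == inv(I - Gamma.T)
Sigma = Lambda @ B @ Psi @ B.T @ Lambda.T + Theta
print(Sigma)                            # the implied p x p item covariance matrix
```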
Here is a Python function to construct the model in PyMC3:
from itertools import product

import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.tensor.nlinalg import matrix_dot, matrix_inverse


def bcfam(items, factors, paths, nu_sd=2.5, alpha_sd=2.5, d_beta=2.5):
    r"""Constructs a Bayesian CFA model in "multivariate form".

    Args:
        items (np.array): Data.
        factors (np.array): Factor design matrix.
        paths (np.array): Paths design matrix.
        nu_sd (:obj:`float`, optional): Standard deviation of normal prior on item
            intercepts.
        alpha_sd (:obj:`float`, optional): Standard deviation of normal prior on factor
            intercepts.
        d_beta (:obj:`float`, optional): Scale parameter of half-Cauchy prior on item
            residual standard deviations.

    Returns:
        None: Places model in context.

    """
    # get numbers of cases, items, and factors
    n, p = items.shape
    p_, m = factors.shape
    assert p == p_, "Mismatch between data and factor-loading matrices"

    # priors on item intercepts
    nu = pm.Normal(name=r"$\nu$", mu=0, sd=nu_sd, shape=p, testval=items.mean(axis=0))

    # priors on factor intercepts
    alpha = pm.Normal(name=r"$\alpha$", mu=0, sd=alpha_sd, shape=m, testval=np.zeros(m))

    # priors on factor loadings
    l = np.asarray(factors).sum()
    lam = pm.Normal(name=r"$\lambda$", mu=0, sd=1, shape=l, testval=np.ones(l))

    # loading matrix
    Lambda = tt.zeros(factors.shape)
    k = 0
    for i, j in product(range(p), range(m)):
        if factors[i, j] == 1:
            Lambda = tt.inc_subtensor(Lambda[i, j], lam[k])
            k += 1
    pm.Deterministic(name=r"$\Lambda$", var=Lambda)

    # item means
    mu = nu + matrix_dot(Lambda, alpha)

    # item residual covariance matrix
    d = pm.HalfCauchy(
        name=r"$\sqrt{\theta}$", beta=d_beta, shape=p, testval=items.std(axis=0)
    )
    Theta = tt.diag(d) ** 2

    # factor residual covariance matrix (identity: unit variances, no correlations)
    Psi = I = np.eye(m)

    # priors on paths
    g = np.asarray(paths).sum()
    gam = pm.Normal(name=r"$\gamma$", mu=0, sd=1, shape=g, testval=np.ones(g))

    # path matrix
    Gamma = tt.zeros(paths.shape)
    k = 0
    for i, j in product(range(m), range(m)):
        if paths[i, j] == 1:
            Gamma = tt.inc_subtensor(Gamma[i, j], gam[k])
            k += 1
    pm.Deterministic(name=r"$\Gamma$", var=Gamma)

    # implied item covariance matrix
    Sigma = (
        matrix_dot(
            Lambda,
            matrix_inverse(I - Gamma),
            Psi,
            matrix_inverse(I - Gamma.T),
            Lambda.T,
        )
        + Theta
    )

    # observations
    pm.MvNormal(name="$Y$", mu=mu, cov=Sigma, observed=items, shape=items.shape)
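A minimal usage sketch: the function only builds the model, so it must be called inside a `pm.Model()` context before sampling. Here `items`, `factors`, and `paths` are assumed to be the standardized data and 0/1 design matrices described above, and the sampler settings are illustrative rather than the exact ones used for the results below.

```python
import pymc3 as pm

with pm.Model():
    bcfam(items, factors, paths)
    trace = pm.sample(draws=600, tune=1000, chains=2)  # illustrative settings
```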
For the Grant-White school, this took about 6 minutes to generate 1200 posterior samples and produced the following factor loadings:
|  | Spatial | Verbal | Speed | Memory |
| --- | --- | --- | --- | --- |
| Visual | 0.42 | 0 | 0 | 0 |
| Cubes | 0.3 | 0 | 0 | 0 |
| Paper | 0.33 | 0 | 0 | 0 |
| Flags | 0.42 | 0 | 0 | 0 |
| General | 0 | 0.61 | 0 | 0 |
| Paragrap | 0 | 0.62 | 0 | 0 |
| Sentence | 0 | 0.63 | 0 | 0 |
| Wordc | 0 | 0.52 | 0 | 0 |
| Wordm | 0 | 0.64 | 0 | 0 |
| Addition | 0 | 0 | 0.46 | 0 |
| Code | 0 | 0 | 0.48 | 0 |
| Counting | 0 | 0 | 0.49 | 0 |
| Straight | 0 | 0 | 0.53 | 0 |
| Wordr | 0 | 0 | 0 | 0.36 |
| Numberr | 0 | 0 | 0 | 0.36 |
| Figurer | 0 | 0 | 0 | 0.41 |
| Object | 0 | 0 | 0 | 0.43 |
| Numberf | 0 | 0 | 0 | 0.43 |
| Figurew | 0 | 0 | 0 | 0.33 |
Here are the path weights:
|  | g |
| --- | --- |
| Spatial | 1.32 |
| Verbal | 0.89 |
| Speed | 1.02 |
| Memory | 1.1 |
Univariate version
Notice that the latent variables do not appear in the equations for the multivariate model; they were marginalized out. We can represent the same model explicitly in terms of latent variables as follows. Let \(\boldsymbol{y}\) denote the rows in \(\boldsymbol{Y}\), such that \(\boldsymbol{Y}=\left[\boldsymbol{y}_0, \boldsymbol{y}_1, \dots, \boldsymbol{y}_{n-1}\right]^\mathrm{T}\),
\[\begin{equation} \boldsymbol{y}=\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\eta}+\boldsymbol{\epsilon}\\ \boldsymbol{\eta}=\boldsymbol{\alpha}+\boldsymbol{\Gamma}\boldsymbol{\eta}+\boldsymbol{\zeta}\\ \boldsymbol{\epsilon}\sim\mathrm{MvNormal}\left(0, \boldsymbol{\Theta}\right)\\ \boldsymbol{\zeta}\sim\mathrm{MvNormal}\left(0, \boldsymbol{\Psi}\right) \end{equation}\]where the \(m\)-length vector \(\boldsymbol{\eta}\) contains the latent variables. Because \(\boldsymbol{\eta}\) is defined in terms of itself, we must do some rearranging if we actually want to use this model. (I got this from Jake Westfall over at Cross Validated.)
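Collecting the \(\boldsymbol{\eta}\) terms on the left-hand side gives

\[\begin{equation} \left(\boldsymbol{I}-\boldsymbol{\Gamma}\right)\boldsymbol{\eta}=\boldsymbol{\alpha}+\boldsymbol{\zeta} \end{equation}\]and inverting \(\boldsymbol{I}-\boldsymbol{\Gamma}\) yields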
\[\begin{equation} \boldsymbol{\eta} = (\boldsymbol{I} - \boldsymbol{\Gamma})^{-1} (\boldsymbol{\alpha}+\boldsymbol{\zeta}) \end{equation}\]Since \(\boldsymbol{\Theta}\) is diagonal, we can let \(\boldsymbol{\theta}=\mathrm{diag}\left(\boldsymbol{\Theta}\right)\) be a \(p\)-length vector of variances without loss of generality and express \(\boldsymbol{\epsilon}\) as a collection of univariate random variables, \(\boldsymbol{\epsilon}\sim\mathrm{Normal}\left(0, \boldsymbol{\theta}\right)\). We can now express \(\boldsymbol{y}\) as univariate random variables:
\[\begin{equation} \boldsymbol{y}\sim\mathrm{Normal}\left(\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\eta}, \boldsymbol{\theta}\right) \end{equation}\]The model is not yet completely univariate because \(\boldsymbol{\zeta}\) is still defined as multivariate normal. Luckily, since \(\boldsymbol{\Psi}\) is diagonal, we can let \(\boldsymbol{\psi}=\mathrm{diag}\left(\boldsymbol{\Psi}\right)\) and
\[\begin{equation} \boldsymbol{\zeta}\sim\mathrm{Normal}\left(0, \boldsymbol{\psi}\right) \end{equation}\]Now we have a purely univariate version of the previous model.
The univariate model is completely defined, but it is not practical. If we were to code it exactly as-is, we would need to iterate over every \(\boldsymbol{y}\) with a Python `for` loop, which would create separate nodes in the graph for every case and be terribly inefficient. We need to get everything back into matrix form. Through some not-very-impressive trial and error, I came up with

\[\begin{equation} \boldsymbol{Y}\sim\mathrm{Normal}\left(\boldsymbol{M},\boldsymbol{S}\right) \end{equation}\]
where \(\boldsymbol{M}\) and \(\boldsymbol{S}\) are the elementwise prior means and standard deviations on \(\boldsymbol{Y}\). The former is given by
\[\begin{equation} \boldsymbol{M}=\boldsymbol{V}+\left(\boldsymbol{A}+\boldsymbol{Z}\right)\left(\boldsymbol{I}-\boldsymbol{\Gamma}^\mathrm{T}\right)^{-1}\boldsymbol{\Lambda}^\mathrm{T} \end{equation}\]where \(\boldsymbol{V}\) is an \(n \times p\) matrix whose every row is \(\boldsymbol{\nu}^\mathrm{T}\) (i.e., \(\boldsymbol{\nu}\) broadcast over cases); \(\boldsymbol{A}\) is an \(n \times m\) matrix whose every row is \(\boldsymbol{\alpha}^\mathrm{T}\); and \(\boldsymbol{Z}=\left[\boldsymbol{\zeta}_0, \boldsymbol{\zeta}_1, \dots, \boldsymbol{\zeta}_{n-1}\right]^\mathrm{T}\). The matrix \(\boldsymbol{S}\) is the \(n \times p\) matrix obtained by broadcasting the item residual standard deviations, \(\sqrt{\boldsymbol{\theta}}\), over cases.
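As a shape check, here is a small NumPy sketch of this matrix form. The arrays are made-up, and in the actual model below the parameters are random variables rather than fixed arrays.

```python
import numpy as np

n, p, m = 3, 4, 2                      # cases, items, latent variables (toy sizes)
nu = np.zeros(p)                       # item intercepts; broadcasts to V (n x p)
alpha = np.zeros(m)                    # factor intercepts; broadcasts to A (n x m)
Z = np.random.normal(size=(n, m))      # factor residuals, one row per case
Lambda = np.array([[0.5, 0.0],
                   [0.6, 0.0],
                   [0.0, 0.7],
                   [0.0, 0.4]])
Gamma = np.array([[0.0, 0.0],
                  [0.8, 0.0]])
I = np.eye(m)

# M = V + (A + Z)(I - Gamma^T)^{-1} Lambda^T, with V and A handled by broadcasting
M = nu + (alpha + Z) @ np.linalg.inv(I - Gamma.T) @ Lambda.T
print(M.shape)                         # (3, 4), i.e. n x p
```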
Here is a Python function to construct the univariate model in PyMC3:
def bcfau(items, factors, paths, nu_sd=2.5, alpha_sd=2.5, d_beta=2.5):
    r"""Constructs a Bayesian CFA model in "univariate form" by directly estimating the
    factors.

    Args:
        items (np.array): Data.
        factors (np.array): Factor design matrix.
        paths (np.array): Paths design matrix.
        nu_sd (:obj:`float`, optional): Standard deviation of normal prior on item
            intercepts.
        alpha_sd (:obj:`float`, optional): Standard deviation of normal prior on factor
            intercepts.
        d_beta (:obj:`float`, optional): Scale parameter of half-Cauchy prior on item
            residual standard deviations.

    Returns:
        None: Places model in context.

    """
    # get numbers of cases, items, and factors
    n, p = items.shape
    p_, m = factors.shape
    assert p == p_, "Mismatch between data and factor-loading matrices"

    # priors on item intercepts
    nu = pm.Normal(name=r"$\nu$", mu=0, sd=nu_sd, shape=p, testval=items.mean(axis=0))

    # priors on factor intercepts
    alpha = pm.Normal(name=r"$\alpha$", mu=0, sd=alpha_sd, shape=m, testval=np.zeros(m))

    # priors on factor loadings
    l = np.asarray(factors).sum()
    lam = pm.Normal(name=r"$\lambda$", mu=0, sd=1, shape=l, testval=np.ones(l))

    # loading matrix
    Lambda = tt.zeros(factors.shape)
    k = 0
    for i, j in product(range(p), range(m)):
        if factors[i, j] == 1:
            Lambda = tt.inc_subtensor(Lambda[i, j], lam[k])
            k += 1
    pm.Deterministic(name=r"$\Lambda$", var=Lambda)

    # priors on paths
    g = np.asarray(paths).sum()
    gam = pm.Normal(name=r"$\gamma$", mu=0, sd=1, shape=g, testval=np.ones(g))

    # path matrix
    Gamma = tt.zeros(paths.shape)
    k = 0
    for i, j in product(range(m), range(m)):
        if paths[i, j] == 1:
            Gamma = tt.inc_subtensor(Gamma[i, j], gam[k])
            k += 1
    pm.Deterministic(name=r"$\Gamma$", var=Gamma)

    # priors on factor residuals (one row of zetas per case)
    zeta = pm.Normal(name=r"$\zeta$", mu=0, sigma=1, shape=(n, m), testval=0)

    # implied item means; the latent variables enter via (alpha + zeta)(I - Gamma^T)^{-1}
    I = np.eye(m)
    M = nu + matrix_dot(alpha + zeta, matrix_inverse(I - Gamma.T), Lambda.T)

    # item residual standard deviations
    S = pm.HalfCauchy(
        name=r"$\sqrt{\theta}$", beta=d_beta, shape=p, testval=items.std(axis=0)
    )

    # observations
    pm.Normal(name="$Y$", mu=M, sigma=S, observed=items, shape=items.shape)
I have not found a good way to prevent latent variables from flipping (discussed in my first post in this series), so I just sampled with one chain for twice as long as the multivariate model. Surprisingly, it didn’t take too long!
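A minimal sketch of that sampling setup, reusing the same `items`, `factors`, and `paths` arrays as before; the draw count is illustrative.

```python
import pymc3 as pm

with pm.Model():
    bcfau(items, factors, paths)
    trace = pm.sample(draws=2400, tune=1000, chains=1)  # one chain, illustrative length
```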
Here are the loadings:
|  | Spatial | Verbal | Speed | Memory |
| --- | --- | --- | --- | --- |
| Visual | 0.43 | 0 | 0 | 0 |
| Cubes | 0.3 | 0 | 0 | 0 |
| Paper | 0.33 | 0 | 0 | 0 |
| Flags | 0.42 | 0 | 0 | 0 |
| General | 0 | 0.61 | 0 | 0 |
| Paragrap | 0 | 0.62 | 0 | 0 |
| Sentence | 0 | 0.63 | 0 | 0 |
| Wordc | 0 | 0.52 | 0 | 0 |
| Wordm | 0 | 0.63 | 0 | 0 |
| Addition | 0 | 0 | 0.46 | 0 |
| Code | 0 | 0 | 0.48 | 0 |
| Counting | 0 | 0 | 0.49 | 0 |
| Straight | 0 | 0 | 0.53 | 0 |
| Wordr | 0 | 0 | 0 | 0.36 |
| Numberr | 0 | 0 | 0 | 0.36 |
| Figurer | 0 | 0 | 0 | 0.41 |
| Object | 0 | 0 | 0 | 0.43 |
| Numberf | 0 | 0 | 0 | 0.43 |
| Figurew | 0 | 0 | 0 | 0.33 |
And here are the path weights:
|  | g |
| --- | --- |
| Spatial | 1.31 |
| Verbal | 0.89 |
| Speed | 1.02 |
| Memory | 1.1 |
These agree with the multivariate estimates to within about 0.01, suggesting that the multivariate and univariate models are indeed equivalent. Small discrepancies are to be expected given the randomness of Markov chain Monte Carlo sampling.
Quick discussion
To be clear, we have not actually gained anything from recasting this particular model from a multivariate normal model to a univariate normal model. The univariate version is bigger (though, it seems, not actually slower) and flipping is even more difficult to prevent. Recasting is simply an intermediate step towards creating generalized structural equation models to deal with non-normal data.
Recasting this model was not very difficult because the residual item and latent variable covariance matrices were both diagonal. If they are not diagonal, extra work is needed to make them so. Specifically, one needs to add additional latent variables to capture the non-zero off-diagonal values. I may write more about this in the future.
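To sketch the idea (a standard construction, not specific to this post): a residual covariance between items \(i\) and \(j\) can be absorbed by giving both items loadings on one extra standard-normal latent variable \(\delta\),

\[\begin{equation} \epsilon_i=\lambda_i\delta+e_i,\quad \epsilon_j=\lambda_j\delta+e_j,\quad \delta\sim\mathrm{Normal}\left(0,1\right) \end{equation}\]which implies \(\mathrm{Cov}\left(\epsilon_i,\epsilon_j\right)=\lambda_i\lambda_j\) while keeping every random variable univariate.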
References
Bollen, K. A. (1989). Structural equations with latent variables. Wiley.
Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs, 48, 91.
Version history
- Originally posted May 19, 2020.