Bayesian SEM III:
Structural equation modeling (SEM; Bollen, 1989) is a somewhat tricky statistical technique because it involves multivariate distributions. Conventional SEM is also restrictive, since generalizations beyond multivariate normality are not straightforward. If data aren’t approximately multivariate normal—perhaps they are counts of successful trials in an experiment or Likert-style ratings on a questionnaire—researchers are faced with a very daunting modeling task. Most of the time, the issue is simply ignored and multivariate normality is assumed anyway!
Within the Bayesian framework, it may be possible to recast an SEM problem as one in which all random variables are given univariate priors. This approach could be advantageous when dealing with non-normal data, because it is much more straightforward to create generalizations of univariate normal models than multivariate normal models. However, the downside is that the univariate model may be much larger than the multivariate model due to having more latent variables, all of which need to be estimated explicitly, making posterior sampling slower and more difficult. Nevertheless, for scientists with small data sets, lots of time on their hands, and/or access to powerful computers, it might be worth the extra effort.
Here, I convert my previous hierarchical model of the classic Holzinger and Swineford (1939) data set from a multivariate model to a purely univariate model.
Original (multivariate) version
Here are the key equations that describe the previous Bayesian SEM that was placed on the standardized Holzinger and Swineford (1939) data:
\[\begin{equation} \boldsymbol{Y}\sim\mathrm{MvNormal}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)\\ \boldsymbol{\mu}=\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\alpha}\\ \boldsymbol{\Sigma}=\boldsymbol{\Lambda}(\boldsymbol{I}-\boldsymbol{\Gamma})^{-1}\boldsymbol{\Psi}(\boldsymbol{I}-\boldsymbol{\Gamma}^\mathrm{T})^{-1}\boldsymbol{\Lambda}^\mathrm{T}+\boldsymbol{\Theta} \end{equation}\]where \(\boldsymbol{Y}\) is the data matrix with shape \(n \times p\), where \(n\) is the number of cases and \(p\) is the number of items; \(\boldsymbol{\mu}\) is a \(p\)-length vector of item means; \(\boldsymbol{\nu}\) is a \(p\)-length vector of item intercepts; \(\boldsymbol{\alpha}\) is an \(m\)-length vector of factor intercepts; \(\boldsymbol{\Lambda}\) is the sparse loading matrix of shape \(p \times m\), where \(m\) is the number of latent variables; \(\boldsymbol{I}\) is the \(m \times m\) identity matrix; \(\boldsymbol{\Gamma}\) is the \(m \times m\) sparse path matrix; \(\boldsymbol{\Psi}\) is the \(m \times m\) matrix of latent-variable residual covariances; and \(\boldsymbol{\Theta}\) is the \(p \times p\) matrix of item residual covariances. The model assumed that the latent variables had unit variances and no residual correlations, so \(\boldsymbol{\Psi}\) was an identity matrix, and that the items had no residual correlations, so \(\boldsymbol{\Theta}\) was a diagonal matrix.
Figure: Relationships between variables in the hierarchical model.
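To make the covariance structure concrete, here is a small NumPy sketch of how \(\boldsymbol{\Sigma}\) is assembled. The matrices are made-up toy values, not the Holzinger and Swineford estimates.

```python
import numpy as np

# toy dimensions and made-up parameter values (p items, m latent variables)
p, m = 4, 2
Lambda = np.array([[0.5, 0.0],          # sparse loading matrix (p x m)
                   [0.6, 0.0],
                   [0.0, 0.7],
                   [0.0, 0.4]])
Gamma = np.array([[0.0, 0.0],           # sparse path matrix (m x m)
                  [0.8, 0.0]])
Psi = np.eye(m)                         # latent residual covariances (identity here)
Theta = np.diag([0.3, 0.4, 0.2, 0.5])   # diagonal item residual covariances

I = np.eye(m)
B = np.linalg.inv(I - Gamma)            # (I - Gamma)^{-1}; note B.T == inv(I - Gamma.T)
Sigma = Lambda @ B @ Psi @ B.T @ Lambda.T + Theta
print(Sigma)                            # the implied p x p item covariance matrix
```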
Here is a Python function to construct the model in PyMC3:
from itertools import product

import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.tensor.nlinalg import matrix_dot, matrix_inverse


def bcfam(items, factors, paths, nu_sd=2.5, alpha_sd=2.5, d_beta=2.5):
    r"""Constructs a Bayesian CFA model in "multivariate form".

    Args:
        items (np.array): Data.
        factors (np.array): Factor design matrix.
        paths (np.array): Paths design matrix.
        nu_sd (:obj:`float`, optional): Standard deviation of normal prior on item
            intercepts.
        alpha_sd (:obj:`float`, optional): Standard deviation of normal prior on factor
            intercepts.
        d_beta (:obj:`float`, optional): Scale parameter of half-Cauchy prior on item
            residual standard deviations.

    Returns:
        None: Places model in context.

    """
    # get numbers of cases, items, and factors
    n, p = items.shape
    p_, m = factors.shape
    assert p == p_, "Mismatch between data and factor-loading matrices"

    # priors on item intercepts
    nu = pm.Normal(name=r"$\nu$", mu=0, sd=nu_sd, shape=p, testval=items.mean(axis=0))

    # priors on factor intercepts
    alpha = pm.Normal(name=r"$\alpha$", mu=0, sd=alpha_sd, shape=m, testval=np.zeros(m))

    # priors on factor loadings
    l = np.asarray(factors).sum()
    lam = pm.Normal(name=r"$\lambda$", mu=0, sd=1, shape=l, testval=np.ones(l))

    # loading matrix
    Lambda = tt.zeros(factors.shape)
    k = 0
    for i, j in product(range(p), range(m)):
        if factors[i, j] == 1:
            Lambda = tt.inc_subtensor(Lambda[i, j], lam[k])
            k += 1
    pm.Deterministic(name=r"$\Lambda$", var=Lambda)

    # item means
    mu = nu + matrix_dot(Lambda, alpha)

    # item residual covariance matrix
    d = pm.HalfCauchy(
        name=r"$\sqrt{\theta}$", beta=d_beta, shape=p, testval=items.std(axis=0)
    )
    Theta = tt.diag(d) ** 2

    # factor residual covariance matrix (identity: unit variances, no correlations)
    Psi = I = np.eye(m)

    # priors on paths
    g = np.asarray(paths).sum()
    gam = pm.Normal(name=r"$\gamma$", mu=0, sd=1, shape=g, testval=np.ones(g))

    # path matrix
    Gamma = tt.zeros(paths.shape)
    k = 0
    for i, j in product(range(m), range(m)):
        if paths[i, j] == 1:
            Gamma = tt.inc_subtensor(Gamma[i, j], gam[k])
            k += 1
    pm.Deterministic(name=r"$\Gamma$", var=Gamma)

    # implied item covariance matrix
    Sigma = (
        matrix_dot(
            Lambda,
            matrix_inverse(I - Gamma),
            Psi,
            matrix_inverse(I - Gamma.T),
            Lambda.T,
        )
        + Theta
    )

    # observations
    pm.MvNormal(name="$Y$", mu=mu, cov=Sigma, observed=items, shape=items.shape)
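A minimal usage sketch: the function only builds the model, so it must be called inside a `pm.Model()` context before sampling. Here `items`, `factors`, and `paths` are assumed to be the standardized data and 0/1 design matrices described above, and the sampler settings are illustrative rather than the exact ones used for the results below.

```python
import pymc3 as pm

with pm.Model():
    bcfam(items, factors, paths)
    trace = pm.sample(draws=600, tune=1000, chains=2)  # illustrative settings
```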
For the Grant-White school, this took about 6 minutes to generate 1200 posterior samples and produced the following factor loadings:
|  | Spatial | Verbal | Speed | Memory |
| --- | --- | --- | --- | --- |
| Visual | 0.42 | 0 | 0 | 0 |
| Cubes | 0.3 | 0 | 0 | 0 |
| Paper | 0.33 | 0 | 0 | 0 |
| Flags | 0.42 | 0 | 0 | 0 |
| General | 0 | 0.61 | 0 | 0 |
| Paragrap | 0 | 0.62 | 0 | 0 |
| Sentence | 0 | 0.63 | 0 | 0 |
| Wordc | 0 | 0.52 | 0 | 0 |
| Wordm | 0 | 0.64 | 0 | 0 |
| Addition | 0 | 0 | 0.46 | 0 |
| Code | 0 | 0 | 0.48 | 0 |
| Counting | 0 | 0 | 0.49 | 0 |
| Straight | 0 | 0 | 0.53 | 0 |
| Wordr | 0 | 0 | 0 | 0.36 |
| Numberr | 0 | 0 | 0 | 0.36 |
| Figurer | 0 | 0 | 0 | 0.41 |
| Object | 0 | 0 | 0 | 0.43 |
| Numberf | 0 | 0 | 0 | 0.43 |
| Figurew | 0 | 0 | 0 | 0.33 |
Here are the path weights:
|  | g |
| --- | --- |
| Spatial | 1.32 |
| Verbal | 0.89 |
| Speed | 1.02 |
| Memory | 1.1 |
Univariate version
Notice that the latent variables do not appear in the equations for the multivariate model; they were marginalized out. We can represent the same model explicitly in terms of latent variables as follows. Let \(\boldsymbol{y}\) denote the rows in \(\boldsymbol{Y}\), such that \(\boldsymbol{Y}=\left[\boldsymbol{y}_0, \boldsymbol{y}_1, \dots, \boldsymbol{y}_{n-1}\right]^\mathrm{T}\),
\[\begin{equation} \boldsymbol{y}=\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\eta}+\boldsymbol{\epsilon}\\ \boldsymbol{\eta}=\boldsymbol{\alpha}+\boldsymbol{\Gamma}\boldsymbol{\eta}+\boldsymbol{\zeta}\\ \boldsymbol{\epsilon}\sim\mathrm{MvNormal}\left(0, \boldsymbol{\Theta}\right)\\ \boldsymbol{\zeta}\sim\mathrm{MvNormal}\left(0, \boldsymbol{\Psi}\right) \end{equation}\]where the \(m\)-length vector \(\boldsymbol{\eta}\) contains the latent variables. Because \(\boldsymbol{\eta}\) is defined in terms of itself, we must do some rearranging if we actually want to use this model. (I got this from Jake Westfall over at Cross Validated.)
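Collecting the \(\boldsymbol{\eta}\) terms on the left-hand side gives

\[\begin{equation} \left(\boldsymbol{I}-\boldsymbol{\Gamma}\right)\boldsymbol{\eta}=\boldsymbol{\alpha}+\boldsymbol{\zeta} \end{equation}\]and inverting \(\boldsymbol{I}-\boldsymbol{\Gamma}\) yields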
\[\begin{equation} \boldsymbol{\eta} = (\boldsymbol{I} - \boldsymbol{\Gamma})^{-1} (\boldsymbol{\alpha}+\boldsymbol{\zeta}) \end{equation}\]Since \(\boldsymbol{\Theta}\) is diagonal, we can let \(\boldsymbol{\theta}=\mathrm{diag}\left(\boldsymbol{\Theta}\right)\) be a \(p\)-length vector of variances without loss of generality and express \(\boldsymbol{\epsilon}\) as a collection of univariate random variables, \(\boldsymbol{\epsilon}\sim\mathrm{Normal}\left(0, \boldsymbol{\theta}\right)\). We can now express \(\boldsymbol{y}\) as univariate random variables:
\[\begin{equation} \boldsymbol{y}\sim\mathrm{Normal}\left(\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\eta}, \boldsymbol{\theta}\right) \end{equation}\]The model is not yet completely univariate because \(\boldsymbol{\zeta}\) is still defined as multivariate normal. Luckily, since \(\boldsymbol{\Psi}\) is diagonal, we can let \(\boldsymbol{\psi}=\mathrm{diag}\left(\boldsymbol{\Psi}\right)\) and
\[\begin{equation} \boldsymbol{\zeta}\sim\mathrm{Normal}\left(0, \boldsymbol{\psi}\right) \end{equation}\]Now we have a purely univariate version of the previous model.
The univariate model is completely defined, but it is not practical. If we were to code it exactly as-is, we would need to iterate over every \(\boldsymbol{y}\) with a Python `for` loop, which would create separate nodes in the graph for every case and be terribly inefficient. We need to get everything back into matrix form. Through some not-very-impressive trial and error, I came up with

\[\begin{equation} \boldsymbol{Y}\sim\mathrm{Normal}\left(\boldsymbol{M},\boldsymbol{S}\right) \end{equation}\]
where \(\boldsymbol{M}\) and \(\boldsymbol{S}\) are the elementwise prior means and standard deviations on \(\boldsymbol{Y}\). The former is given by
\[\begin{equation} \boldsymbol{M}=\boldsymbol{V}+\left(\boldsymbol{A}+\boldsymbol{Z}\right)\left(\boldsymbol{I}-\boldsymbol{\Gamma}^\mathrm{T}\right)^{-1}\boldsymbol{\Lambda}^\mathrm{T} \end{equation}\]where \(\boldsymbol{V}\) is an \(n \times p\) matrix whose every row is \(\boldsymbol{\nu}^\mathrm{T}\) (i.e., \(\boldsymbol{\nu}\) broadcast over cases); \(\boldsymbol{A}\) is an \(n \times m\) matrix whose every row is \(\boldsymbol{\alpha}^\mathrm{T}\); and \(\boldsymbol{Z}=\left[\boldsymbol{\zeta}_0, \boldsymbol{\zeta}_1, \dots, \boldsymbol{\zeta}_{n-1}\right]^\mathrm{T}\). The matrix \(\boldsymbol{S}\) is the \(n \times p\) matrix obtained by broadcasting the item residual standard deviations, \(\sqrt{\boldsymbol{\theta}}\), over cases.
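As a shape check, here is a small NumPy sketch of this matrix form. The arrays are made-up, and in the actual model below the parameters are random variables rather than fixed arrays.

```python
import numpy as np

n, p, m = 3, 4, 2                      # cases, items, latent variables (toy sizes)
nu = np.zeros(p)                       # item intercepts; broadcasts to V (n x p)
alpha = np.zeros(m)                    # factor intercepts; broadcasts to A (n x m)
Z = np.random.normal(size=(n, m))      # factor residuals, one row per case
Lambda = np.array([[0.5, 0.0],
                   [0.6, 0.0],
                   [0.0, 0.7],
                   [0.0, 0.4]])
Gamma = np.array([[0.0, 0.0],
                  [0.8, 0.0]])
I = np.eye(m)

# M = V + (A + Z)(I - Gamma^T)^{-1} Lambda^T, with V and A handled by broadcasting
M = nu + (alpha + Z) @ np.linalg.inv(I - Gamma.T) @ Lambda.T
print(M.shape)                         # (3, 4), i.e. n x p
```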
Here is a Python function to construct the univariate model in PyMC3:
def bcfau(items, factors, paths, nu_sd=2.5, alpha_sd=2.5, d_beta=2.5):
    r"""Constructs a Bayesian CFA model in "univariate form" by directly estimating the
    factors.

    Args:
        items (np.array): Data.
        factors (np.array): Factor design matrix.
        paths (np.array): Paths design matrix.
        nu_sd (:obj:`float`, optional): Standard deviation of normal prior on item
            intercepts.
        alpha_sd (:obj:`float`, optional): Standard deviation of normal prior on factor
            intercepts.
        d_beta (:obj:`float`, optional): Scale parameter of half-Cauchy prior on item
            residual standard deviations.

    Returns:
        None: Places model in context.

    """
    # get numbers of cases, items, and factors
    n, p = items.shape
    p_, m = factors.shape
    assert p == p_, "Mismatch between data and factor-loading matrices"

    # priors on item intercepts
    nu = pm.Normal(name=r"$\nu$", mu=0, sd=nu_sd, shape=p, testval=items.mean(axis=0))

    # priors on factor intercepts
    alpha = pm.Normal(name=r"$\alpha$", mu=0, sd=alpha_sd, shape=m, testval=np.zeros(m))

    # priors on factor loadings
    l = np.asarray(factors).sum()
    lam = pm.Normal(name=r"$\lambda$", mu=0, sd=1, shape=l, testval=np.ones(l))

    # loading matrix
    Lambda = tt.zeros(factors.shape)
    k = 0
    for i, j in product(range(p), range(m)):
        if factors[i, j] == 1:
            Lambda = tt.inc_subtensor(Lambda[i, j], lam[k])
            k += 1
    pm.Deterministic(name=r"$\Lambda$", var=Lambda)

    # priors on paths
    g = np.asarray(paths).sum()
    gam = pm.Normal(name=r"$\gamma$", mu=0, sd=1, shape=g, testval=np.ones(g))

    # path matrix
    Gamma = tt.zeros(paths.shape)
    k = 0
    for i, j in product(range(m), range(m)):
        if paths[i, j] == 1:
            Gamma = tt.inc_subtensor(Gamma[i, j], gam[k])
            k += 1
    pm.Deterministic(name=r"$\Gamma$", var=Gamma)

    # priors on factor residuals (one row of zetas per case)
    zeta = pm.Normal(name=r"$\zeta$", mu=0, sigma=1, shape=(n, m), testval=0)

    # implied item means; the latent variables enter via (alpha + zeta)(I - Gamma^T)^{-1}
    I = np.eye(m)
    M = nu + matrix_dot(alpha + zeta, matrix_inverse(I - Gamma.T), Lambda.T)

    # item residual standard deviations
    S = pm.HalfCauchy(
        name=r"$\sqrt{\theta}$", beta=d_beta, shape=p, testval=items.std(axis=0)
    )

    # observations
    pm.Normal(name="$Y$", mu=M, sigma=S, observed=items, shape=items.shape)
I have not found a good way to prevent latent variables from flipping (discussed in my first post in this series), so I just sampled with one chain for twice as long as the multivariate model. Surprisingly, it didn’t take too long!
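A minimal sketch of that sampling setup, reusing the same `items`, `factors`, and `paths` arrays as before; the draw count is illustrative.

```python
import pymc3 as pm

with pm.Model():
    bcfau(items, factors, paths)
    trace = pm.sample(draws=2400, tune=1000, chains=1)  # one chain, illustrative length
```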
Here are the loadings:
|  | Spatial | Verbal | Speed | Memory |
| --- | --- | --- | --- | --- |
| Visual | 0.43 | 0 | 0 | 0 |
| Cubes | 0.3 | 0 | 0 | 0 |
| Paper | 0.33 | 0 | 0 | 0 |
| Flags | 0.42 | 0 | 0 | 0 |
| General | 0 | 0.61 | 0 | 0 |
| Paragrap | 0 | 0.62 | 0 | 0 |
| Sentence | 0 | 0.63 | 0 | 0 |
| Wordc | 0 | 0.52 | 0 | 0 |
| Wordm | 0 | 0.63 | 0 | 0 |
| Addition | 0 | 0 | 0.46 | 0 |
| Code | 0 | 0 | 0.48 | 0 |
| Counting | 0 | 0 | 0.49 | 0 |
| Straight | 0 | 0 | 0.53 | 0 |
| Wordr | 0 | 0 | 0 | 0.36 |
| Numberr | 0 | 0 | 0 | 0.36 |
| Figurer | 0 | 0 | 0 | 0.41 |
| Object | 0 | 0 | 0 | 0.43 |
| Numberf | 0 | 0 | 0 | 0.43 |
| Figurew | 0 | 0 | 0 | 0.33 |
And here are the path weights:
|  | g |
| --- | --- |
| Spatial | 1.31 |
| Verbal | 0.89 |
| Speed | 1.02 |
| Memory | 1.1 |
These agree with the multivariate estimates to within about 0.01, suggesting that the multivariate and univariate models are indeed equivalent. Small discrepancies are to be expected given the randomness of Markov chain Monte Carlo sampling.
Quick discussion
To be clear, we have not actually gained anything from recasting this particular model from a multivariate normal model to a univariate normal model. The univariate version is bigger (though, it seems, not actually slower) and flipping is even more difficult to prevent. Recasting is simply an intermediate step towards creating generalized structural equation models to deal with non-normal data.
Recasting this model was not very difficult because the residual item and latent variable covariance matrices were both diagonal. If they are not diagonal, extra work is needed to make them so. Specifically, one needs to add additional latent variables to capture the non-zero off-diagonal values. I may write more about this in the future.
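To sketch the idea (a standard construction, not specific to this post): a residual covariance between items \(i\) and \(j\) can be absorbed by giving both items loadings on one extra standard-normal latent variable \(\delta\),

\[\begin{equation} \epsilon_i=\lambda_i\delta+e_i,\quad \epsilon_j=\lambda_j\delta+e_j,\quad \delta\sim\mathrm{Normal}\left(0,1\right) \end{equation}\]which implies \(\mathrm{Cov}\left(\epsilon_i,\epsilon_j\right)=\lambda_i\lambda_j\) while keeping every random variable univariate.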
References
Bollen, K. A. (1989). Structural equations with latent variables. Wiley.
Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs, 48, 91.
Version history
- Originally posted May 19, 2020.