Bayesian SEM IV: Binomial models
So far in this series, I have assumed that the Holzinger and Swineford (1939) data are normally distributed random variables. However, looking closely at the data reveals that they are test scores; that is, counts of correct responses. Assuming that test scores are normally distributed is incorrect in at least two ways: (1) test scores must be integers; and (2) they have lower and upper bounds (0 and whatever the maximum score on a particular test was, respectively). These incorrect assumptions may or may not have influenced the previous results.
A typical way to analyze count data such as these is logistic regression. Under logistic regression, the data are distributed as
\[\begin{equation} \boldsymbol{y}\sim\mathrm{Binomial}\left(\boldsymbol{\pi}, \boldsymbol{k}\right) \end{equation}\]where \(\boldsymbol{\pi}\) is the probability of a correct response on a question on the test and \(\boldsymbol{k}\) is the maximum possible score. Using the same principles as logistic regression, we can connect the data to latent variables in a structural equation model as follows:
\[\begin{equation} \boldsymbol{\pi}=\mathrm{logistic}\left(\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\eta}\right)\\ \boldsymbol{\eta} = (\boldsymbol{I} - \boldsymbol{\Gamma})^{-1} (\boldsymbol{\alpha}+\boldsymbol{\zeta})\\ \boldsymbol{\zeta}\sim\mathrm{Normal}\left(0, \boldsymbol{\psi}\right) \end{equation}\]See the previous posts for definitions of these variables. I have omitted the equations needed to convert these vectors to matrices, as described in the last post. This model does not estimate any residual item or latent-variable correlations, so any such relationships must be captured by the latent variables. There is also no residual item variance: as in logistic regression, all item-level error comes from the binomial process itself.
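To make the link concrete, here is a small numpy sketch (not from the original post) of this generative process, with made-up dimensions and parameter values:

```python
# A made-up example: n = 4 people, p = 3 items, m = 2 factors, and
# arbitrary parameter values, just to show the shapes involved.
import numpy as np
from scipy.special import expit  # the logistic function

rng = np.random.default_rng(0)
n, p, m = 4, 3, 2
k = np.array([10, 15, 20])                  # maximum possible score on each item

nu = np.zeros(p)                            # item intercepts
Lambda = np.array([[0.8, 0.0],              # factor loadings
                   [0.6, 0.0],
                   [0.0, 0.7]])
Gamma = np.array([[0.0, 0.0],               # paths between latent variables
                  [0.5, 0.0]])
alpha = np.zeros(m)                         # factor intercepts
zeta = rng.normal(size=(n, m))              # factor residuals (psi = I here)

# eta = (I - Gamma)^{-1} (alpha + zeta), applied row-wise
eta = (alpha + zeta) @ np.linalg.inv(np.eye(m) - Gamma.T)

# pi = logistic(nu + Lambda eta), an (n, p) matrix of success probabilities
pi = expit(nu + eta @ Lambda.T)

# simulated test scores: integers bounded below by 0 and above by k
y = rng.binomial(k, pi)
print(y)
```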
It doesn’t require much work to convert the previous PyMC3 (Salvatier et al., 2016) code to create this model instead:
from itertools import product

import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.tensor.nlinalg import matrix_dot, matrix_inverse


def bcfab(items, factors, paths, nu_sd=2.5, alpha_sd=2.5):
    r"""Constructs a Bayesian CFA model in "binomial form".

    Args:
        items (np.array): Data.
        factors (np.array): Factor design matrix.
        paths (np.array): Paths design matrix.
        nu_sd (:obj:`float`, optional): Standard deviation of normal prior on item
            intercepts.
        alpha_sd (:obj:`float`, optional): Standard deviation of normal prior on factor
            intercepts.

    Returns:
        None: Places model in context.

    """
    # get numbers of cases, items, and factors
    n, p = items.shape
    p_, m = factors.shape
    assert p == p_, "Mismatch between data and factor-loading matrices"

    # priors on item intercepts
    nu = pm.Normal(name=r"$\nu$", mu=0, sd=nu_sd, shape=p, testval=np.zeros(p))

    # priors on factor intercepts
    alpha = pm.Normal(name=r"$\alpha$", mu=0, sd=alpha_sd, shape=m, testval=np.zeros(m))

    # priors on factor loadings
    l = np.asarray(factors).sum()
    lam = pm.Normal(name=r"$\lambda$", mu=0, sd=1, shape=l, testval=np.zeros(l))

    # loading matrix: scatter the free loadings into their positions
    Lambda = tt.zeros(factors.shape)
    k = 0
    for i, j in product(range(p), range(m)):
        if factors[i, j] == 1:
            Lambda = tt.inc_subtensor(Lambda[i, j], lam[k])
            k += 1
    pm.Deterministic(name=r"$\Lambda$", var=Lambda)

    # priors on paths
    g = np.asarray(paths).sum()
    gam = pm.Normal(name=r"$\gamma$", mu=0, sd=1, shape=g, testval=np.zeros(g))

    # path matrix: scatter the free paths into their positions
    Gamma = tt.zeros(paths.shape)
    k = 0
    for i, j in product(range(m), range(m)):
        if paths[i, j] == 1:
            Gamma = tt.inc_subtensor(Gamma[i, j], gam[k])
            k += 1
    pm.Deterministic(name=r"$\Gamma$", var=Gamma)

    # priors on factor residuals
    zeta = pm.Normal(name=r"$\zeta$", mu=0, sigma=1, shape=(n, m), testval=0)

    # success probabilities: pi = logistic(nu + Lambda eta),
    # with eta = (I - Gamma)^{-1} (alpha + zeta) computed row-wise
    I = np.eye(m)
    Pi = pm.math.sigmoid(nu + matrix_dot(
        matrix_dot((alpha + zeta), matrix_inverse(I - Gamma.T)), Lambda.T
    ))

    # observations: binomial likelihood with each item's observed maximum as k
    pm.Binomial(name="$Y$", p=Pi, n=items.max(axis=0), observed=items, shape=items.shape)
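For completeness, here is a rough sketch of how this function might be used. The toy scores, design matrices, and sampler settings below are made up for illustration; they are not the Holzinger and Swineford data or settings from the earlier posts.

```python
# Illustrative usage only: toy scores for 100 cases, 3 items, and 2 factors.
import numpy as np
import pymc3 as pm

rng = np.random.default_rng(0)
items = rng.integers(0, 21, size=(100, 3))    # integer test scores between 0 and 20
factors = np.array([[1, 0], [1, 0], [0, 1]])  # which items load on which factors
paths = np.array([[0, 0], [1, 0]])            # which factors predict which

with pm.Model():
    bcfab(items, factors, paths)        # places priors and the binomial likelihood in context
    trace = pm.sample(1000, tune=1000)  # NUTS sampling; settings are illustrative

print(pm.summary(trace))
```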
Unfortunately, the binomial model takes much, much longer to sample, going up from 5 minutes to over an hour on my Mac. I’m not entirely sure where this slowdown comes from. It could be the additional step of applying the logistic function elementwise or the calculation of the binomial log probabilities. If anyone reading this has any suggestions for speeding up sampling, please let me know!
Here are the factor loadings for the Grant-White school under the binomial model:
| | Spatial | Verbal | Speed | Memory |
|---|---|---|---|---|
| Visual | -0.28 | 0 | 0 | 0 |
| Cubes | -0.22 | 0 | 0 | 0 |
| Paper | -0.19 | 0 | 0 | 0 |
| Flags | -0.8 | 0 | 0 | 0 |
| General | 0 | -0.39 | 0 | 0 |
| Paragrap | 0 | -0.47 | 0 | 0 |
| Sentence | 0 | -0.53 | 0 | 0 |
| Wordc | 0 | -0.31 | 0 | 0 |
| Wordm | 0 | -0.61 | 0 | 0 |
| Addition | 0 | 0 | 0.46 | 0 |
| Code | 0 | 0 | 0.35 | 0 |
| Counting | 0 | 0 | 0.26 | 0 |
| Straight | 0 | 0 | 0.3 | 0 |
| Wordr | 0 | 0 | 0 | -0.36 |
| Numberr | 0 | 0 | 0 | -0.19 |
| Figurer | 0 | 0 | 0 | -0.27 |
| Object | 0 | 0 | 0 | -0.45 |
| Numberf | 0 | 0 | 0 | -0.48 |
| Figurew | 0 | 0 | 0 | -0.47 |
Interestingly, these loadings are very different from those under the normal model. The signs are not important in themselves (the latent variables for Spatial, Verbal, and Memory have simply flipped; see the quick check after the path table below), but the rank ordering of the loadings within each factor has changed dramatically. The same is true of the paths:
| | g |
|---|---|
| Spatial | -0.9 |
| Verbal | -0.99 |
| Speed | 0.85 |
| Memory | -0.91 |
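To see why the sign flips are harmless, note that negating a factor's loadings and negating the factor itself leaves \(\boldsymbol{\Lambda}\boldsymbol{\eta}\), and therefore \(\boldsymbol{\pi}\), unchanged. A quick numerical check with made-up numbers:

```python
# Made-up numbers: flipping the sign of a factor's loadings and of the factor
# itself leaves the success probabilities (and hence the likelihood) unchanged.
import numpy as np
from scipy.special import expit

nu = np.array([0.1, -0.2])          # item intercepts
Lambda = np.array([[0.5], [0.7]])   # loadings on a single factor
eta = np.array([[1.3]])             # one case's latent-variable value

pi_original = expit(nu + eta @ Lambda.T)
pi_flipped = expit(nu + (-eta) @ (-Lambda).T)
print(np.allclose(pi_original, pi_flipped))  # True
```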
References
Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs, 48, 91.
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, e55. https://doi.org/10.7717/peerj-cs.55
Version history
- Originally posted May 19, 2020.
Related posts
- “Bayesian SEM III: Univariate parameterization,” May 19, 2020.
- “Bayesian SEM II: Hierarchical latent variables,” May 15, 2020.
- “Measuring working-memory capacity,” Mar 16, 2020.
- “Bayesian SEM I: Confirmatory factor analysis,” Jan 19, 2020.
- All posts filed under bayesian, python.