Bayesian SEM IV: Generalized SEMTags: bayesian | python
So far in this series, I have assumed that the Holzinger and Swineford (1939) data are normally distributed random variables. However, looking closely at the data reveals that they are test scores; that is, counts of correct responses. Assuming that test scores are normally distributed is incorrect in at least two ways: (1) test scores must be integers; and (2) they have have lower and upper bounds (0 and whatever the maximum score on a particular test was, respectively). These incorrect assumptions may or may not have influenced the previous results.
A typical way to analyze count data is using logistic regression. Under logistic regression, the data are distributed as
where is the probability of a correct response on a question on the test and is the maximum possible score. Using the same principles as logistic regression, we can connect the data to latent variables in a structural equation model as follows:
See the previous posts for definitions of these variables. I have omitted the equations necessary to convert these vectors to matrices, as described in the last post. This model does not estimate any residual item or latent variable correlations, so any such relationships must be captured by latent variables. Also, there is no residual item variance: this is a feature of logisitc regression, in which all error comes from the Bernoulli process.
def bcfab(items, factors, paths, nu_sd=2.5, alpha_sd=2.5): r"""Constructs a Bayesian CFA model in "binomial form". Args: items (np.array): Data. factors (np.array): Factor design matrix. paths (np.array): Paths design matrix. nu_sd (:obj:`float`, optional): Standard deviation of normal prior on item intercepts. alpha_sd (:obj:`float`, optional): Standard deviation of normal prior on factor intercepts. Returns: None: Places model in context. """ # get numbers of cases, items, and factors n, p = items.shape p_, m = factors.shape assert p == p_, "Mismatch between data and factor-loading matrices" # priors on item intercepts nu = pm.Normal(name=r"$\nu$", mu=0, sd=nu_sd, shape=p, testval=np.zeros(p)) # priors on factor intercepts alpha = pm.Normal(name=r"$\alpha$", mu=0, sd=alpha_sd, shape=m, testval=np.zeros(m)) # priors on factor loadings l = np.asarray(factors).sum() lam = pm.Normal(name=r"$\lambda$", mu=0, sd=1, shape=l, testval=np.zeros(l)) # loading matrix Lambda = tt.zeros(factors.shape) k = 0 for i, j in product(range(p), range(m)): if factors[i, j] == 1: Lambda = tt.inc_subtensor(Lambda[i, j], lam[k]) k += 1 pm.Deterministic(name=r"$\Lambda$", var=Lambda) # priors on paths g = np.asarray(paths).sum() gam = pm.Normal(name=r"$\gamma$", mu=0, sd=1, shape=g, testval=np.zeros(g)) # path matrix Gamma = tt.zeros(paths.shape) k = 0 for i, j in product(range(m), range(m)): if paths[i, j] == 1: Gamma = tt.inc_subtensor(Gamma[i, j], gam[k]) k += 1 pm.Deterministic(name=r"$\Gamma$", var=Gamma) # priors on factor residuals zeta = pm.Normal(name=r"$\zeta$", mu=0, sigma=1, shape=(n, m), testval=0) # latent variables I = np.eye(m) Pi = pm.math.sigmoid(nu + matrix_dot( matrix_dot((alpha + zeta), matrix_inverse(I - Gamma.T)), Lambda.T )) # observations pm.Binomial(name="$Y$", p=Pi, n=items.max(axis=0), observed=items, shape=items.shape)
Unfortunately, the binomial model takes much, much longer to sample, going up from 5 minutes to over an hour on my Mac. I’m not entirely sure where this slowdown comes from. It could be the additional step of applying the logistic function elementwise or the calculation of the binomial log probabilities. If anyone reading this has any suggestions for speeding up sampling, please let me know!
Here are the factor loadings for the Grant-White school under the binomial model:
Interestingly, these loadings are very different to those under the normal model. The sign isn’t important (the latent variables have just flipped for Spatial, Verbal, and Memory), but the rank ordering of loadings within a factor have changed dramatically. The same is true of the paths:
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Comput Science, 2, e55. 10.7717/peerj-cs.55
- Originally posted May 19, 2020.