The Cracked Bassoon

Bayesian SEM IV:

Generalized SEM
Filed under bayesian, python.

So far in this series, I have assumed that the Holzinger and Swineford (1939) data are normally distributed random variables. However, looking closely at the data reveals that they are test scores; that is, counts of correct responses. Assuming that test scores are normally distributed is incorrect in at least two ways: (1) test scores must be integers; and (2) they have have lower and upper bounds (0 and whatever the maximum score on a particular test was, respectively). These incorrect assumptions may or may not have influenced the previous results.

A typical way to analyze count data is using logistic regression. Under logistic regression, the data are distributed as

\[\begin{equation} \boldsymbol{y}\sim\mathrm{Binomial}\left(\boldsymbol{\pi}, \boldsymbol{k}\right) \end{equation}\]

where \(\boldsymbol{\pi}\) is the probability of a correct response on a question on the test and \(\boldsymbol{k}\) is the maximum possible score. Using the same principles as logistic regression, we can connect the data to latent variables in a structural equation model as follows:

\[\begin{equation} \boldsymbol{\pi}=\mathrm{logistic}\left(\boldsymbol{\nu}+\boldsymbol{\Lambda}\boldsymbol{\eta}+\boldsymbol{\epsilon}\right)\\ \boldsymbol{\eta} = (\boldsymbol{I} - \boldsymbol{\Gamma})^{-1} (\boldsymbol{\alpha}+\boldsymbol{\zeta})\\ \boldsymbol{\zeta}\sim\mathrm{Normal}\left(0, \boldsymbol{\psi}\right) \end{equation}\]

See the previous posts for definitions of these variables. I have omitted the equations necessary to convert these vectors to matrices, as described in the last post. This model does not estimate any residual item or latent variable correlations, so any such relationships must be captured by latent variables. Also, there is no residual item variance: this is a feature of logisitc regression, in which all error comes from the Bernoulli process.

It doesn’t require much work to convert the previous PyMC3 (Salvatier et al., 2016) code to create this model instead:

def bcfab(items, factors, paths, nu_sd=2.5, alpha_sd=2.5):
    r"""Constructs a Bayesian CFA model in "binomial form".

        items (np.array): Data.
        factors (np.array): Factor design matrix.
        paths (np.array): Paths design matrix.
        nu_sd (:obj:`float`, optional): Standard deviation of normal prior on item
        alpha_sd (:obj:`float`, optional): Standard deviation of normal prior on factor

        None: Places model in context.

    # get numbers of cases, items, and factors
    n, p = items.shape
    p_, m = factors.shape
    assert p == p_, "Mismatch between data and factor-loading matrices"

    # priors on item intercepts
    nu = pm.Normal(name=r"$\nu$", mu=0, sd=nu_sd, shape=p, testval=np.zeros(p))

    # priors on factor intercepts
    alpha = pm.Normal(name=r"$\alpha$", mu=0, sd=alpha_sd, shape=m, testval=np.zeros(m))

    # priors on factor loadings
    l = np.asarray(factors).sum()
    lam = pm.Normal(name=r"$\lambda$", mu=0, sd=1, shape=l, testval=np.zeros(l))

    # loading matrix
    Lambda = tt.zeros(factors.shape)
    k = 0
    for i, j in product(range(p), range(m)):
        if factors[i, j] == 1:
            Lambda = tt.inc_subtensor(Lambda[i, j], lam[k])
            k += 1
    pm.Deterministic(name=r"$\Lambda$", var=Lambda)

    # priors on paths
    g = np.asarray(paths).sum()
    gam = pm.Normal(name=r"$\gamma$", mu=0, sd=1, shape=g, testval=np.zeros(g))

    # path matrix
    Gamma = tt.zeros(paths.shape)
    k = 0
    for i, j in product(range(m), range(m)):
        if paths[i, j] == 1:
            Gamma = tt.inc_subtensor(Gamma[i, j], gam[k])
            k += 1
    pm.Deterministic(name=r"$\Gamma$", var=Gamma)

    # priors on factor residuals
    zeta = pm.Normal(name=r"$\zeta$", mu=0, sigma=1, shape=(n, m), testval=0)

    # latent variables
    I = np.eye(m)
    Pi = pm.math.sigmoid(nu + matrix_dot(
        matrix_dot((alpha + zeta), matrix_inverse(I - Gamma.T)), Lambda.T

    # observations
    pm.Binomial(name="$Y$", p=Pi, n=items.max(axis=0), observed=items, shape=items.shape)

Unfortunately, the binomial model takes much, much longer to sample, going up from 5 minutes to over an hour on my Mac. I’m not entirely sure where this slowdown comes from. It could be the additional step of applying the logistic function elementwise or the calculation of the binomial log probabilities. If anyone reading this has any suggestions for speeding up sampling, please let me know!

Here are the factor loadings for the Grant-White school under the binomial model:

  Spatial Verbal Speed Memory
Visual -0.28 0 0 0
Cubes -0.22 0 0 0
Paper -0.19 0 0 0
Flags -0.8 0 0 0
General 0 -0.39 0 0
Paragrap 0 -0.47 0 0
Sentence 0 -0.53 0 0
Wordc 0 -0.31 0 0
Wordm 0 -0.61 0 0
Addition 0 0 0.46 0
Code 0 0 0.35 0
Counting 0 0 0.26 0
Straight 0 0 0.3 0
Wordr 0 0 0 -0.36
Numberr 0 0 0 -0.19
Figurer 0 0 0 -0.27
Object 0 0 0 -0.45
Numberf 0 0 0 -0.48
Figurew 0 0 0 -0.47

Interestingly, these loadings are very different to those under the normal model. The sign isn’t important (the latent variables have just flipped for Spatial, Verbal, and Memory), but the rank ordering of loadings within a factor have changed dramatically. The same is true of the paths:

Spatial -0.9
Verbal -0.99
Speed 0.85
Memory -0.91


Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs, 48, 91.

Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Comput Science, 2, e55. 10.7717/peerj-cs.55

Version history

Related posts