Derivation of the flexible cutoffs

On this page, we illustrate in detail how the flexible cutoffs of our tool have been generated. As we use R and several packages (most importantly lavaan; see Rosseel, 2012), advanced knowledge of R, these packages, and CBSEM is required for a full understanding. Please note that our entire code is optimized to speed up simulations on cloud-based servers; we therefore do not provide the full code here. If you have specific questions about the code, please contact us. Thank you.

Step 1: Generate a design

It all starts with a conceptual step: how many models (out of an unlimited number of possible models) should be covered by the flexible cutoffs? As described in the paper, we decided to vary the following factors:

  • Sample sizes (n): 100 to 1,000 in steps of 50
  • Latent variables (lv): 2 to 10
  • Manifest indicators per latent variable (i): 2 to 10
  • Factor loadings (fl): .7, .8 and .9
  • Non-normality (non): normal, moderate, severe; see step 4

Overall, this leads to 19 (n) x 9 (lv) x 9 (i) x 3 (fl) x 3 (non) = 13,851 CFA models that were used to calculate flexible cutoffs. Of course, one can always add more models, and we constantly work on adding more; keep in mind, however, that SEM is a computationally intensive procedure that requires each model to converge. Small models with small sample sizes are fast to simulate, but very large ones (ten latent variables with ten manifest indicators each) need substantially more time. Since we need a high number of replications (500 and more), this yields a rather complex simulation that takes days (on cloud servers in 2017) or even years (on standard computers). Hence, compromises were needed.

Using standard R functions (e.g., ‚expand.grid‘), one can easily generate a dataset of all possible designs. The following code generates the 13,851 designs. A simple for-loop (or, in the case of cloud-based computing, a parallel version such as ‚foreach‘) then iterates over this dataset. In each iteration, the relevant design characteristics (n, lv, i, fl, non) are read and applied to the subsequent steps, as shown in the sketch after the code. For example, assume that in the 50th iteration, flexible cutoffs should be simulated for a model with n = 250, four latent variables, four manifest indicators per latent variable, and an average factor loading of .8, based on normal data.

n <- seq(100, 1000, by = 50)              # 19 sample sizes
lv <- seq(2, 10)                          # 9 numbers of latent variables
i <- seq(2, 10)                           # 9 numbers of indicators per latent variable
fl <- c(.7, .8, .9)                       # 3 average factor loadings
non <- c("normal", "moderate", "severe")  # 3 levels of non-normality

designs <- expand.grid(n, lv, i, fl, non) # 13,851 designs in total
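A minimal sketch of such a loop (the column names are our addition for readability; the actual cloud-based code differs):

colnames(designs) <- c("n", "lv", "i", "fl", "non")

for (d in 1:nrow(designs)) {
    # read the design characteristics of the current iteration
    n <- designs$n[d]
    lv <- designs$lv[d]
    i <- designs$i[d]
    fl <- designs$fl[d]
    non <- designs$non[d]
    # ... steps 2 to 6 are then executed for this design ...
}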

Step 2: Determine the population model

In order to simulate data, a population model has to be determined. This model describes the covariance structure of the true population. It is based on the correlation between the latent variables (cor) and the average factor loading (fl).

In the following example code, ‚L‘ corresponds to the factor loading (fl) and ‚R‘ reflects the correlation between latent variables (cor). Please note that both are held constant: if, for example, fl = .8 is selected, L is set to .8 throughout. R is always .3.

‚lv‘ denotes a latent variable, ‚x‘ a manifest indicator. For example, ‚x1_1‘ reflects the first manifest indicator of the first latent variable ‚lv1‘, and ‚x3_3‘ would denote the third indicator of the third latent variable ‚lv3‘. The example uses standard lavaan syntax (‚=~‘ defines a measurement model, ‚~~‘ a covariance, here the correlation between two latent variables).

population.model <- '
    lv1 =~ L*x1_1 + L*x1_2 + L*x1_3 + L*x1_4 + L*x1_5 + L*x1_6 + L*x1_7 + L*x1_8 + L*x1_9 + L*x1_10
    lv2 =~ L*x2_1 + L*x2_2 + L*x2_3 + L*x2_4 + L*x2_5 + L*x2_6 + L*x2_7 + L*x2_8 + L*x2_9 + L*x2_10
    lv3 =~ L*x3_1 + L*x3_2 + L*x3_3 + L*x3_4 + L*x3_5 + L*x3_6 + L*x3_7 + L*x3_8 + L*x3_9 + L*x3_10
    lv4 =~ L*x4_1 + L*x4_2 + L*x4_3 + L*x4_4 + L*x4_5 + L*x4_6 + L*x4_7 + L*x4_8 + L*x4_9 + L*x4_10
    lv5 =~ L*x5_1 + L*x5_2 + L*x5_3 + L*x5_4 + L*x5_5 + L*x5_6 + L*x5_7 + L*x5_8 + L*x5_9 + L*x5_10
    lv6 =~ L*x6_1 + L*x6_2 + L*x6_3 + L*x6_4 + L*x6_5 + L*x6_6 + L*x6_7 + L*x6_8 + L*x6_9 + L*x6_10
    lv7 =~ L*x7_1 + L*x7_2 + L*x7_3 + L*x7_4 + L*x7_5 + L*x7_6 + L*x7_7 + L*x7_8 + L*x7_9 + L*x7_10
    lv8 =~ L*x8_1 + L*x8_2 + L*x8_3 + L*x8_4 + L*x8_5 + L*x8_6 + L*x8_7 + L*x8_8 + L*x8_9 + L*x8_10
    lv9 =~ L*x9_1 + L*x9_2 + L*x9_3 + L*x9_4 + L*x9_5 + L*x9_6 + L*x9_7 + L*x9_8 + L*x9_9 + L*x9_10
    lv10 =~ L*x10_1 + L*x10_2 + L*x10_3 + L*x10_4 + L*x10_5 + L*x10_6 + L*x10_7 + L*x10_8 + L*x10_9 + L*x10_10
    lv10 ~~ R*lv1 + R*lv2 + R*lv3 + R*lv4 + R*lv5 + R*lv6 + R*lv7 + R*lv8 + R*lv9
    lv9 ~~ R*lv1 + R*lv2 + R*lv3 + R*lv4 + R*lv5 + R*lv6 + R*lv7 + R*lv8
    lv8 ~~ R*lv1 + R*lv2 + R*lv3 + R*lv4 + R*lv5 + R*lv6 + R*lv7
    lv7 ~~ R*lv1 + R*lv2 + R*lv3 + R*lv4 + R*lv5 + R*lv6
    lv6 ~~ R*lv1 + R*lv2 + R*lv3 + R*lv4 + R*lv5
    lv5 ~~ R*lv1 + R*lv2 + R*lv3 + R*lv4
    lv4 ~~ R*lv1 + R*lv2 + R*lv3
    lv3 ~~ R*lv1 + R*lv2
    lv2 ~~ R*lv1'

Further, it is important to note that the population model is identical for all model sizes, i.e., for any number of indicators (i) and latent variables (lv), as the lavaan function ‚cfa‘ automatically ignores variables that are not part of the model being estimated. Standardized coefficients (standardized factor loadings and correlations) are assumed.
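For illustration, the constants could be set by simple text substitution before the syntax is passed to ‚simulateData‘ in step 4. This is a sketch under our assumptions, not necessarily the original implementation:

# replace the placeholders 'L' and 'R' with the design's factor
# loading (fl) and the constant correlation of .3
pop.model.d <- gsub("L*", paste0(fl, "*"), population.model, fixed = TRUE)
pop.model.d <- gsub("R*", "0.3*", pop.model.d, fixed = TRUE)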

Step 3: Determine the target model

Once the population model is determined, our approach specifies the CFA model that is estimated in each iteration of the loop: the target model. For example, if we want to estimate flexible cutoffs for a model with four latent variables and four manifest indicators per latent variable, the target model can be written in standard lavaan syntax as:

cfa.4.4 <- '
lv1.4 =~ x1_1 + x1_2 + x1_3 + x1_4
lv2.4 =~ x2_1 + x2_2 + x2_3 + x2_4
lv3.4 =~ x3_1 + x3_2 + x3_3 + x3_4
lv4.4 =~ x4_1 + x4_2 + x4_3 + x4_4'

Again, ‚lv‘ denotes a latent variable, ‚x‘ a manifest indicator. Hence, ‚lv1.4‘ states that the first latent variable is estimated based on four corresponding indicators (‚x1_1‘ to ‚x1_4‘). Naming the variables in this manner helped to generalize the code: for example, when writing the model for ten latent variables with four indicators each, the code for ‚lv1.4‘ can be recycled, as illustrated in the sketch below. Please note that the target model is always correctly specified; hence, flexible cutoffs do not require assumptions about misspecified models. As noted in step 2, the ‚cfa‘ function in lavaan automatically ignores variables of the population model that are not needed.
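Thanks to this naming convention, the target-model syntax can also be generated programmatically for any combination of lv and i. The helper ‚make.target‘ below is our own illustration, not part of the original code:

make.target <- function(lv, i) {
    # one measurement-model line per latent variable, each with i indicators
    paste(sapply(1:lv, function(f) {
        paste0("lv", f, ".", i, " =~ ",
               paste0("x", f, "_", 1:i, collapse = " + "))
    }), collapse = "\n")
}
cat(make.target(4, 4))  # reproduces the cfa.4.4 model above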

Step 4: Generate the data

Based on the population model and the target model, data of size n (e.g., n = 250) can be simulated. The random data generated are always standardized (mean = 0, standard deviation = 1), which corresponds to the standardized version of the population model. However, there is one challenge: the possible non-normality of the data. We therefore made use of the ‚simulateData‘ function from lavaan, which allows defining (multivariate) kurtosis (‚ku‘) and skewness (‚sk‘). When normal data are assumed, kurtosis and skewness are set to the standard value of 0. If a moderate deviation from normality is assumed, kurtosis is set to 3.5 and skewness to 1; for a severe deviation, to 7 and 2 (see our paper for the reasoning behind these levels). This leads to a certain issue: especially for more severe non-normality and smaller samples, even correctly specified models may fail to converge due to severely distorted random data in the simulated dataset (particularly violating the requirement of a positive semi-definite covariance matrix).

To overcome this issue, we applied a workaround that ensures enough replications (at least 500): the number of models simulated is the preset number of replications (500) times a multiplier derived from our experience with non-convergence. The multiplier is at least 3 (1,500 runs) and increases up to 14. Thus, in order to safely obtain 500 valid replications, up to 7,000 models are estimated. To limit data size, the simulated data itself is not saved. However, we used a fixed vector of 50,000 random seeds (termed ‚random.seeds‘) to allow exact replication.
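In pseudo-R, this over-sampling logic might look as follows; the meta-seed and the way the seed vector is generated are our assumptions for illustration:

nReplications <- 500                    # valid replications needed per design
multiplier <- 3                         # at least 3, up to 14 for severely
                                        # non-normal data and small samples
nRep <- nReplications * multiplier      # number of models actually simulated

set.seed(42)                            # hypothetical meta-seed
random.seeds <- sample.int(1e7, 50000)  # 50,000 seeds for replicability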

The following snippet shows the basic principle of simulating data and estimating the model. Please note that ‚nRep‘ is based on the multiplier (= 500 replications * multiplier) and that we used the ‚MLM‘ estimator from lavaan (scaled chi-square values). No other changes to the defaults of the ‚cfa‘ function were made.

for (r in 1:nRep) {
    # simulate one standardized dataset of size n from the population model
    # ('r' as loop index avoids overwriting the design factor 'i')
    x <- simulateData(
      model = population.model,
      model.type = "cfa",
      sample.nobs = n,
      skewness = sk,
      kurtosis = ku,
      seed = random.seeds[r]
    )
    # estimate the target model; replications that throw an error
    # return NULL and are skipped
    res <- tryCatch(
      cfa(
        model = target.model,
        data = x,
        estimator = "MLM",
        warn = FALSE
      ),
      error = function(e) NULL
    )
    # ... store the fit statistics (see step 5) ...
}

Step 5: Store the model results

Each model result is then stored in a dataset (a matrix in R). For instance, we store the chi-square value, the degrees of freedom, CFI, TLI, RMSEA, and SRMR (see page Documentation). To make sure that the cutoffs for all models are based on the same amount of data, only the first 500 valid results are retained.
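A sketch of this step, assuming a pre-allocated numeric matrix ‚results‘ and the fitted object ‚res‘ from step 4; the measure names follow lavaan's ‚fitMeasures‘ for the MLM estimator:

# store the (scaled) fit statistics of one converged replication
if (!is.null(res) && lavInspect(res, "converged")) {
    results[r, ] <- fitMeasures(res,
      c("chisq.scaled", "df.scaled", "cfi.scaled",
        "tli.scaled", "rmsea.scaled", "srmr"))
}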

Step 6: Obtain quantiles to represent flexible cutoffs

Based on the non-aggregated dataset from step 5, flexible cutoffs are estimated as empirical quantiles. We apply the standard R function ‚quantile‘ with type = 8, which yields approximately median-unbiased quantiles independent of the distribution. Given a specific uncertainty level ‚alpha‘ (.1, .05, .01, .001), the cutoff is then obtained as follows:

if (gof) {
    co <- quantile(sel, alpha, type = 8)
} else {
    co <- quantile(sel, 1 - alpha, type = 8)
}

The flexible cutoff (‚co‘) is estimated as the empirical quantile of the vector of values of a given index (‚sel‘), given a predefined uncertainty. Please note that we linked each index with the attribute of being a goodness-of-fit index (‚gof‘ is TRUE) or a badness-of-fit index (‚gof‘ is FALSE). For example, CFI (gof is TRUE) uses alpha directly (we want the lower quantile, i.e., the lower bound of the confidence interval), whereas SRMR (gof is FALSE) accordingly uses 1 - alpha (we want the upper quantile).
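For example, the flexible cutoff for CFI at alpha = .05 is the 5% quantile of the 500 simulated CFI values; the variable names are illustrative:

alpha <- .05
sel <- results[, "cfi.scaled"]        # 500 simulated CFI values from step 5
co <- quantile(sel, alpha, type = 8)  # lower 5% quantile = flexible cutoff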

Finally, the cutoffs of all indices covered (see tool) are written to a dataset together with the designs investigated (see step 1). For database reasons, every alpha level has its own row in that dataset. Hence, the 13,851 models are available for all four alpha levels (.1, .05, .01, and .001), yielding 55,404 rows in the final dataset. This dataset is exported to the SQL database on which the tool is based.

Further readings

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. http://www.jstatsoft.org/v48/i02/