Appendix M: Methods of Evaluating Measurement Bias

This appendix describes the two main methods for evaluating bias in the Conners Adult ADHD Rating Scales 2nd Edition (CAARS™ 2): invariance tests and mean group difference tests. Results of these methods are presented in chapter 10, Fairness, and chapter 11, CAARS 2–Short.

Invariance Testing

Invariance testing broadly refers to the degree to which a test's measurement properties generalize across groups. When a test is invariant across groups, it measures the same construct in the same way, regardless of group membership (Vandenberg & Lance, 2000). Establishing measurement invariance is a necessary step in applying test scores to specific groups: without evidence of invariance, the same score in different groups may not indicate the same result, making interpretation extremely difficult and problematic. Measurement invariance can be established using different analyses, including confirmatory factor analysis (CFA; from a classical test theory [CTT] framework) and differential test functioning (DTF; from an item response theory [IRT] framework). The results of each type of analysis provide slightly different information. Note that because invariance testing relies on modeled data (i.e., estimating the population rather than describing the sample), larger sample sizes are required, and a greater range of responses is desired. Therefore, the Total Sample (as described in the Standardization Phase in chapter 6, Development) is used for these analyses, because it includes a considerable number of individuals from the general population as well as clinical cases.

Measurement Invariance

In a CTT framework, multiple-group CFAs are used to evaluate measurement invariance (MI). In MI analyses, the overall structure of the assessment is tested for equivalence, including the relationship between items and factors and how the factors relate to each other. CFAs showing similar fit across groups are indicative of measurement invariance and increased fairness. As with any factor analysis, MI analyses require a sufficiently large sample size to achieve stable estimates. For this reason, when group comparisons were limited by small sample sizes, smaller groups were combined into larger groups, where possible and reasonable, to facilitate the analyses.

When MI is analyzed using CFA, it is tested along a continuum from weaker to stronger invariance (with stronger invariance indicating a fairer test). For the CAARS 2 Content Scales, MI was explored by testing the following four models, in order of ascending stringency, as outlined in recommendations for assessments with ordered categorical items (Wu & Estabrook, 2016):

  1. Configural invariance. Groups have the same factor structure.

  2. Weak invariance. Groups have the same factor structure and thresholds. In a threshold model, the ordered categories are a discrete representation of a continuous latent variable. Thresholds are the values that represent the transition from one ordinal category to the next (Muthén, 1984). For example, on a four-point response scale, three thresholds divide the latent continuum into the four observed categories.

  3. Strong invariance. Groups have the same factor structure, thresholds, and factor loadings. Factor loadings represent the strength of the relationship between the items and the latent factor.

  4. Strict invariance. Groups have the same factor structure, thresholds, factor loadings, and intercepts. Intercepts represent the expected value of an item when the latent variable is equal to zero. This level is sometimes called scalar/strong invariance and is required for comparing mean scores across groups.

Testing begins with the weakest form of invariance (configural invariance) and moves sequentially through the remaining models as long as invariance is upheld at each level. A meaningful deterioration in model fit indicates differences in the measurement model between groups. Goodness-of-fit statistics are used to assess meaningful change in fit indices when comparing levels of MI. Although there are some common criteria for continuous variables (e.g., a decrease of greater than .01 in the Comparative Fit Index [CFI; Bentler, 1990] has been proposed as meaningful change; Cheung & Rensvold, 2002), there is no consensus regarding what constitutes meaningful change in fit statistics when using ordinal variables. Therefore, many recommend considering change in multiple fit indices when investigating MI (Svetina et al., 2019). For the CAARS 2 analyses, the Satorra-Bentler scaled chi-square (χ²) difference test (Satorra, 2000) and the change in Comparative Fit Index (∆CFI) were used. However, because the chi-square test can be sensitive to large sample sizes (i.e., finding significant effects that are purely statistical artifacts; Tanaka, 1987), more weight was given to change in other fit statistics.

If a meaningful change occurs during the process of testing MI, testing stops, and more stringent models are not evaluated. For example, if there is a meaningful deterioration in model fit between the weak model (second level of testing) and the strong model (third level of testing), then MI is not established, and the last level is not tested. MI is established when there is no meaningful decline in fit across all MI levels. Analyses of MI in the CAARS 2 were conducted in R using the lavaan package (version 0.6-6; Rosseel, 2012; R Core Team, 2013).
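To make this sequence concrete, the following is a minimal illustrative sketch of how the first two levels of such a comparison can be specified in lavaan. It is not the CAARS 2 production code: the one-factor model, item names, data frame (dat), and grouping variable (group) are hypothetical, and fully implementing the Wu and Estabrook (2016) identification conventions typically involves additional constraints (e.g., via the semTools package).

  # Illustrative sketch only; the model, item names, and data are hypothetical.
  library(lavaan)

  model <- 'factor1 =~ item1 + item2 + item3 + item4'
  items <- c("item1", "item2", "item3", "item4")

  # Level 1 (configural): same structure, parameters free across groups
  fit_configural <- cfa(model, data = dat, group = "group",
                        ordered = items, estimator = "WLSMV")

  # Level 2 (weak): thresholds additionally constrained to equality
  fit_weak <- cfa(model, data = dat, group = "group",
                  ordered = items, estimator = "WLSMV",
                  group.equal = "thresholds")

  # Scaled chi-square difference test and change in CFI
  lavTestLRT(fit_configural, fit_weak)
  fitMeasures(fit_configural, "cfi.scaled") - fitMeasures(fit_weak, "cfi.scaled")

Each successive level would add its constraints to group.equal (e.g., adding "loadings", then "intercepts"), with testing stopping at the first meaningful deterioration in fit.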

Differential Test Functioning

Differential test functioning (DTF) measures whether the overall test score is truly measuring the same construct for different groups by comparing test response functions (or test characteristic curves) for each group (Chalmers et al., 2016). This analysis is an IRT-based approach to testing invariance between groups. If a DTF analysis reveals differential functioning (i.e., test characteristic curves that are unique rather than identical for each group), the validity of the measure would be compromised, highlighting the potential for bias in the measurement process. DTF was assessed in the CAARS 2 Content Scales by examining whether there were significant demographic group differences in test functioning for each scale, independent of the larger model, and whether the significant differences were practically or clinically meaningful. It is worth noting that differential item functioning (DIF) was assessed during the item selection phase of the CAARS 2 (see chapter 6, Development), and items that displayed statistically significant and practically meaningful DIF were removed from the final item pool. DTF involves a visual inspection of test characteristic curves for each group, which can be summarized with an effect size statistic. Visual inspection was carried out for all Content Scales and all group comparisons, and examples are provided in chapter 10, Fairness. Effect size statistics are provided to summarize all the DTF results. Within these analyses, meaningful differences were defined as those with at least a small effect size, using the expected test score standardized difference (ETSSD) statistic. The ETSSD can be interpreted using Cohen's (1988) rules of thumb for small (|ETSSD| ≥ 0.20), medium (|ETSSD| ≥ 0.50), and large (|ETSSD| ≥ 0.80) effect sizes (Meade, 2010). Analyses of DTF with the CAARS 2 were conducted in R using the mirt package (version 1.33.1; Chalmers, 2012).
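As an illustration only, a DTF comparison of this general form might be sketched in mirt as follows. The data frame (dat), grouping vector (grp), and choice of anchor items are hypothetical and do not reproduce the CAARS 2 analysis code.

  # Illustrative sketch only; data, group labels, and anchors are hypothetical.
  library(mirt)

  # Two-group graded response model; a subset of items (here the first
  # three) is anchored so that the groups share a common latent metric
  mod <- multipleGroup(dat, model = 1, group = grp, itemtype = "graded",
                       invariance = c("free_means", "free_var",
                                      colnames(dat)[1:3]))

  # Visual inspection: expected total score (test characteristic) curves
  plot(mod, type = "score")

  # Test-level effect sizes, including Meade's (2010) ETSSD
  empirical_ES(mod, DIF = FALSE)

Anchoring a subset of items is one common way to place the groups on a common metric; the anchor set shown here is arbitrary and purely for illustration.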

Mean Group Differences

To assess whether demographic group membership meaningfully influences the obtained score on the CAARS 2, the effects of demographic group membership were analyzed by comparing mean differences across groups. Note that mean differences observed for subgroups of a population do not directly address measurement bias in the way that MI and DTF analyses do, but they do offer insight into the applied use of the test and whether adverse effects may occur through the intended use. If a test is invariant across groups and there are observed differences in the scores between groups, there is greater certainty that the scores reflect population differences rather than measurement bias. That is, the test may be equivalent in its measurement of the given construct, but it may be sensitive enough to detect real-world differences occurring independently of the test or due to an unmeasured variable.

For the CAARS 2, mean group differences were calculated using a demographically matched, 1:1, randomly drawn subsample of the Normative Sample in order to maintain equal variance between the groups and provide balanced sample sizes that improve interpretability. Analyses of variance (ANOVAs) were conducted to compare group means (specifically, this method was used for the analyses of gender, race/ethnicity, and country of residence). In these instances, the samples were first matched on the combination of gender, education level (EL), language(s) spoken, clinical status, race/ethnicity, and age (except when one of these covariates was the target variable of interest). Note that matched sampling was not possible for the analyses of EL due to sample sizes; therefore, for the EL analysis, the entire Normative Sample was used, covariates were statistically controlled, and analyses of covariance (ANCOVAs) were conducted.
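The following is a simplified sketch of one way such a 1:1 matched random draw could be implemented in base R. The data frame (norm), its column names, and the two-country comparison are hypothetical, and the sketch is illustrative rather than a description of the actual CAARS 2 sampling procedure.

  # Illustrative sketch only; the data frame and column names are hypothetical.
  set.seed(1)

  # Composite matching key built from the demographic covariates
  norm$key <- interaction(norm$gender, norm$education, norm$age_band,
                          norm$race_ethnicity, drop = TRUE)

  a <- subset(norm, country == "USA")
  b <- subset(norm, country == "Canada")

  # For each key present in both groups, randomly draw equal numbers
  keys <- intersect(a$key, b$key)
  matched <- do.call(rbind, lapply(keys, function(k) {
    ga <- a[a$key == k, , drop = FALSE]
    gb <- b[b$key == k, , drop = FALSE]
    n  <- min(nrow(ga), nrow(gb))
    rbind(ga[sample(nrow(ga), n), ], gb[sample(nrow(gb), n), ])
  }))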

ANOVAs and ANCOVAs were conducted in R via the stats package (version 3.6.1; R Core Team, 2013). The CAARS 2 scale T-scores were treated as dependent variables, and the target demographic group membership was the independent variable in each analysis. Additional demographic characteristics were included as statistically controlled covariates via ANCOVA (where applicable). A conservative alpha level of p < .01 was used to control for Type I errors that could arise from multiple comparisons. In addition to significance levels, measures of effect size (Cohen's d and η²) are reported. A description of the interpretation of these effect size measures is provided in Interpreting Correlations and Effect Sizes in chapter 9, Validity.
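For illustration, analyses of this form could be run with base R as sketched below. The data frames (matched, norm), variable names, and group labels are hypothetical, carried over from the matching sketch above.

  # Illustrative sketch only; variables and group labels are hypothetical.
  # ANOVA on the matched subsample: T-score by target group
  fit_anova <- aov(t_score ~ country, data = matched)
  summary(fit_anova)

  # ANCOVA for education level; covariates are entered before the target
  # variable so they are controlled in the sequential sums of squares
  fit_ancova <- aov(t_score ~ age + gender + education, data = norm)
  summary(fit_ancova)

  # Eta-squared from the ANOVA sums of squares
  ss <- summary(fit_anova)[[1]][["Sum Sq"]]
  eta_sq <- ss[1] / sum(ss)

  # Cohen's d for a two-group comparison, using the pooled SD
  x  <- matched$t_score[matched$country == "USA"]
  y  <- matched$t_score[matched$country == "Canada"]
  sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
             (length(x) + length(y) - 2))
  d  <- (mean(x) - mean(y)) / sp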
