
Chapter 10: Methods of Evaluating Measurement Bias



Potential measurement bias was assessed by examining differences across groups defined by the following demographic characteristics: gender, race/ethnicity, country of residence, and parental education level (PEL). Two main methods of evaluating bias were employed: invariance tests and mean group difference tests.

Invariance Testing

Invariance testing broadly refers to the degree to which a test generalizes across groups. When a test is invariant across groups, the test measures the same construct in the same way (Vandenberg & Lance, 2000), regardless of demographic group membership. Establishing measurement invariance is a necessary step before applying test scores to specific groups: without evidence of invariance, the same score obtained in different groups does not necessarily indicate the same result, making interpretation difficult or misleading. Measurement invariance can be established using different analyses, including confirmatory factor analysis (CFA; from a classical test theory [CTT] framework) and differential test functioning (DTF; from an item response theory [IRT] framework). The results of each type of analysis provide slightly different information. Note that because invariance testing relies on modeled data (i.e., estimating the population, rather than describing the sample), larger sample sizes are required, and greater variability of responses is desired. Therefore, the Total Sample (as described in the Standardization Phase in chapter 6, Development) is used for these analyses, because it includes a considerable number of youth from the general population, as well as clinical cases.

Measurement Invariance

In a CTT framework, multiple-group CFAs are used to evaluate measurement invariance (MI). In MI analyses, the overall structure of the assessment is tested for equivalence, including the relationship between items and factors and how the factors relate to each other. As with factor analysis, MI analyses require a sufficiently large sample size to achieve stable estimates. For this reason, where possible and reasonable, smaller groups were combined into larger groups when sample sizes would otherwise have limited group comparisons.

When MI is analyzed using CFA, it is tested along a continuum from weaker to stronger invariance. For the Conners 4 scales, MI was explored by testing the following four models, in order of ascending stringency, as outlined in recommendations for assessments with ordered categorical items (Wu & Estabrook, 2016):

  1. Configural invariance. Groups have the same factor structure.
  2. Threshold invariance. Groups have the same factor structure and thresholds. In a threshold model, the ordered categories are a discrete representation of a continuous latent variable. Thresholds are the values that represent the transition from one ordinal category to the next (Muthén, 1984). For example, an item with four ordered response options has three thresholds, each marking the point on the latent continuum at which a respondent moves from one response category to the next.
  3. Threshold and loading invariance. Groups have the same factor structure, thresholds, and factor loadings. Factor loadings represent the strength of the relationship between the items and the latent factor.
  4. Threshold, loading, and intercept invariance. Groups have the same factor structure, thresholds, factor loadings, and intercepts. Intercepts are the expected value of the item when the latent variable is equal to zero. This level is sometimes called scalar/strong invariance and is required for comparing mean scores across groups.

Testing begins with the weakest form of invariance (configural invariance) and moves sequentially through the remaining models as long as invariance is upheld at each level. A meaningful deterioration in model fit would indicate differences in the measurement model between groups. Goodness-of-fit statistics are used to assess whether there is a meaningful change in fit when comparing levels of MI. While there are common criteria for continuous variables (e.g., a decrease of greater than .01 in the Comparative Fit Index [CFI; Bentler, 1990] is proposed as meaningful change; Cheung & Rensvold, 2002), there is no consensus regarding what constitutes meaningful change in fit statistics when using ordinal variables. Therefore, it is recommended to consider change in multiple fit indices when investigating MI (Svetina et al., 2019). The Satorra-Bentler scaled chi-square (χ²) difference test was used for comparing change in the chi-square statistic between nested models (Satorra, 2000). However, because the chi-square test can be sensitive to large sample sizes (i.e., finding significant effects that are purely statistical artifacts; Tanaka, 1987), more weight was given to change in other fit statistics. If a meaningful change occurs during the process of testing MI, testing stops, and more stringent models are not tested. For example, if there is a meaningful deterioration in model fit between the threshold model (second level of testing) and the threshold and loading model (third level of testing), then MI is not established, and the last level is not tested. MI is established when there is no meaningful decline in fit across all MI levels. Analyses of MI were conducted in R using the lavaan package (version 0.6-6; Rosseel, 2012; R Core Team, 2013).
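To illustrate the sequence, the following sketch fits nested multiple-group CFA models in lavaan. The data frame (dat), grouping variable (gender), items, and two-factor structure are hypothetical placeholders rather than the actual Conners 4 models, and the group.equal shortcuts shown only approximate the Wu and Estabrook (2016) identification conditions (which can be generated exactly with the semTools function measEq.syntax).

    # A minimal sketch of sequential MI testing with lavaan; all data
    # objects and the factor structure are hypothetical placeholders.
    library(lavaan)

    model <- '
      FactorA =~ item1 + item2 + item3
      FactorB =~ item4 + item5 + item6
    '
    items <- paste0("item", 1:6)

    # 1. Configural: same structure, parameters free across groups
    fit_config <- cfa(model, data = dat, group = "gender",
                      ordered = items, estimator = "WLSMV")

    # 2. Threshold invariance
    fit_thresh <- cfa(model, data = dat, group = "gender",
                      ordered = items, estimator = "WLSMV",
                      group.equal = "thresholds")

    # 3. Threshold and loading invariance
    fit_load <- cfa(model, data = dat, group = "gender",
                    ordered = items, estimator = "WLSMV",
                    group.equal = c("thresholds", "loadings"))

    # 4. Threshold, loading, and intercept (scalar) invariance
    fit_scalar <- cfa(model, data = dat, group = "gender",
                      ordered = items, estimator = "WLSMV",
                      group.equal = c("thresholds", "loadings",
                                      "intercepts"))

    # Scaled chi-square difference tests between nested models, with
    # change in other (scaled) fit indices weighted more heavily
    lavTestLRT(fit_config, fit_thresh, fit_load, fit_scalar,
               method = "satorra.2000")
    sapply(list(fit_config, fit_thresh, fit_load, fit_scalar),
           fitMeasures, fit.measures = c("cfi.scaled", "rmsea.scaled"))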

To test MI in the Conners 4, the Content Scales and Impairment & Functional Outcome Scales were assessed as separate models, as outlined in Internal Structure, in chapter 9, Validity. The non-ADHD-specific DSM Symptom Scales (DSM Oppositional Defiant Disorder Symptoms and DSM Conduct Disorder Symptoms) were assessed independently as single-factor models. Some items on the DSM Oppositional Defiant Disorder Symptoms and DSM Conduct Disorder Symptoms scales had very low levels of endorsement because they describe severe behaviors that are rarely reported in a largely general population sample. Dividing the Total Sample into demographic subgroups therefore produced sparse data that limited the interpretability of MI analyses; when necessary, items on these scales were recoded from four response options into two or three response options, depending on the pattern of endorsement, with the goal of preserving variability while permitting CFA models to be estimated properly.
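As a purely hypothetical illustration of such recoding (the variable name and cut points are placeholders, not the actual Conners 4 decisions):

    # Inspect the endorsement pattern of a sparse 4-option item (0-3)
    table(dat$item_cd)

    # Collapse rarely endorsed upper categories into three options
    # (0, 1, 2+), preserving variability while allowing estimation
    dat$item_cd_rec3 <- pmin(dat$item_cd, 2)

    # Or into two options (0 vs. any endorsement) when responses are sparser
    dat$item_cd_rec2 <- as.integer(dat$item_cd > 0)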

Differential Test Functioning

Differential test functioning (DTF) assesses whether the overall test score measures the same construct for different groups by comparing test response functions (or test characteristic curves) for each group (Chalmers et al., 2016). This analysis is an IRT-based approach to testing invariance between groups. If a DTF analysis reveals differential functioning (i.e., test characteristic curves that differ across groups rather than being identical), the validity of the measure is compromised, highlighting the potential for bias in the measurement process. DTF was assessed in the Conners 4 scales by examining whether there were significant demographic group differences in test functioning for each scale, independent of the larger model, and whether the significant differences were practically or clinically meaningful. Note that differential item functioning (DIF) was assessed during the item selection phase of the Conners 4 (see chapter 6, Development).

DTF involves a visual inspection of test characteristic curves for each group, which can be summarized with an effect size statistic. Visual inspections were carried out for all scales and for all group comparisons, and an example is provided within this chapter for reference; however, for ease of presentation, only the effect size statistics are provided within this chapter to summarize the results. Within the DTF analyses, meaningful differences were defined as those with at least a small effect size using the expected test score standardized difference (ETSSD) statistic. The ETSSD can be interpreted using Cohen’s (1988) rules of thumb for small (ETSSD ≥ |0.20|), medium (ETSSD ≥ |0.50|), and large (ETSSD ≥ |0.80|) effect sizes (Meade, 2010). Analyses of DTF were conducted in R using the mirt package (version 1.33.1; Chalmers, 2012). As in the MI analyses, dividing the Total Sample into demographic subgroups resulted in sparse data for the DSM Oppositional Defiant Disorder Symptoms and DSM Conduct Disorder Symptoms scales, which limited the interpretability of DTF analyses; when necessary, items on these scales were recoded in the same manner described above for the MI analyses.
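The following sketch shows how a DTF analysis of this kind could be run in mirt. The data frame (dat), item set (items), anchor items, and grouping variable (group) are hypothetical placeholders, and the anchoring strategy is deliberately simplified.

    # Fit a graded response model in two groups. A hypothetical set of
    # anchor items is constrained equal across groups so that group means
    # and variances can be freely estimated and the metrics linked.
    library(mirt)

    anchors <- c("item1", "item2")
    mod <- multipleGroup(dat[items], model = 1, group = dat$group,
                         itemtype = "graded",
                         invariance = c(anchors, "free_means", "free_var"))

    # Visual inspection: expected test score (test characteristic) curves
    plot(mod, type = "score")

    # Test-level differential functioning statistics
    DTF(mod)

    # Meade (2010) effect sizes, including the ETSSD
    empirical_ES(mod, DIF = FALSE)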

Mean Group Differences

To examine the generalizability of the obtained scores, the effects of demographic group membership were analyzed by comparing group means. Mean group differences assess whether demographic group membership of the rated youth meaningfully influences the obtained score. Note that mean differences observed for subgroups of a population do not directly address measurement bias in the way that MI and DTF analyses do, but they do offer insight into the applied use of the test and whether adverse effects may occur through its intended use. If a test is invariant across groups and there are observed differences in scores between groups, there is greater certainty that the scores reflect population differences rather than measurement bias. That is, the test may be equivalent in its measurement of the given construct, yet sensitive enough to detect real-world differences occurring independently of the test or due to an unmeasured variable.

Mean group differences were calculated using a demographically matched, 1:1, randomly drawn subsample of the Normative Sample in order to maintain equal variance between the groups and provide balanced sample sizes that would improve interpretability. Analyses of variance (ANOVAs) were conducted to compare group means; this method was used for the analyses of gender, race/ethnicity, and country of residence. In these instances, the samples were first matched on the combination of gender, PEL, language(s) spoken, clinical status, and age (except when one of these covariates was the target variable of interest). Note that matched sampling was not possible for the analyses of PEL due to insufficient sample sizes; therefore, for the PEL analysis, the full Normative Samples were used, covariates were statistically controlled, and analyses of covariance (ANCOVAs) were conducted.

ANOVAs and ANCOVAs were conducted in R via the stats package (version 3.6.1; R Core Team, 2013). The Conners 4 scale T-scores were treated as dependent variables, and the target demographic group membership was the independent variable in each analysis. Additional demographic characteristics were included as statistically controlled covariates via ANCOVA (where applicable). A conservative alpha level of .01 was used to control for Type I errors that could arise from multiple comparisons. In addition to significance levels, measures of effect size (Cohen’s d and η²) are included. A description of the interpretation of these effect size measures is provided in Interpreting Correlations and Effect Sizes in chapter 9, Validity.
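A minimal sketch of these analyses in base R, assuming a hypothetical matched data frame (matched) with a T-score column (tscore) and a two-level grouping factor (group), plus an unmatched normative data frame (norms) with hypothetical covariates for the ANCOVA variant:

    # ANOVA on the matched subsample: T-score by target group
    fit <- aov(tscore ~ group, data = matched)
    summary(fit)                               # evaluate at alpha = .01

    # Eta-squared from the ANOVA sums of squares
    ss <- summary(fit)[[1]][["Sum Sq"]]
    eta_sq <- ss[1] / sum(ss)

    # Cohen's d for a two-group comparison, using the pooled SD
    m <- tapply(matched$tscore, matched$group, mean)
    s <- tapply(matched$tscore, matched$group, sd)
    n <- tapply(matched$tscore, matched$group, length)
    sp <- sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (sum(n) - 2))
    d <- (m[1] - m[2]) / sp

    # ANCOVA variant (as used for PEL): covariates entered before the
    # target grouping variable
    fit_ancova <- aov(tscore ~ age + gender + pel, data = norms)
    summary(fit_ancova)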

