CAARS 2 Manual

Chapter 6: Standardization Phase


Standardization Phase

The results of the pilot studies described above led to a pared-down pool of psychometrically sound, broadly accessible, and culturally fair items selected to align with the test’s conceptual framework. The subsequent standardization phase of CAARS 2 involved the analysis of data from independent samples to finalize the test’s underlying factor structure, identify the best-performing items, determine scoring algorithms for the various scales and indices, and finally, assess the resulting test’s performance with respect to classification accuracy statistics and other psychometric considerations. In addition, samples from the general population, individuals diagnosed with ADHD, individuals with other clinical diagnoses, and simulation (“faking”) samples were all used to evaluate the accuracy of the newly added CAARS 2 validity scales.

The samples used in the standardization phase are described below. In response to market research findings that respondents, particularly those from lower-income and historically marginalized racial/ethnic groups, increasingly complete scales like the CAARS 2 on mobile devices rather than on computers (Pew Research Center, 2019), the digital administration of the CAARS 2 was changed to one item per screen for standardization. In contrast to pilot testing, in which participants saw five items per screen, the one-item-per-screen format is more mobile friendly because it reduces the need to scroll or rotate the screen to view items properly.

Similar to the pilot phase, the items were digitally administered remotely via an email link, with an option for in-person support for individuals with limited literacy or comfort with technology. Individuals who participated in this phase of data collection provided informed consent and received a small monetary compensation for their efforts.

Samples

A total of 2,232 self-reported ratings and 2,152 observer-reported ratings were collected during the standardization phase and are included in what will be referred to as the Total Sample. Individuals were recruited online and completed the test through an emailed link. Table 6.4 displays the demographic characteristics of the rated individuals in the Total Sample for the Self-Report and Observer forms, as well as the demographic characteristics of the raters who completed the Observer form. The Total Sample includes individuals from the general population who did not have a clinical diagnosis (N = 1,793 for Self-Report; N = 1,837 for individuals rated by Observers) and individuals with one or more confirmed clinical diagnoses. The latter group consisted of individuals with ADHD only or ADHD plus co-occurring disorders (N = 255 for Self-Report; N = 170 for Observer) and individuals diagnosed with disorders other than ADHD (N = 184 for Self-Report; N = 145 for Observer). Self- and Observer-report ratings of the same individual were obtained in approximately 200 cases from the total clinical sample (that is, an individual rated themselves and was also rated by an Observer).

Clinical cases were recruited directly through clinicians, who provided details of the diagnoses to confirm their status, as well as through social media channels. Individuals recruited through social media channels were administered the Structured Clinical Interview for DSM-5 Disorders (SCID-5; First et al., 2016) by SCID-trained MHS Data Collection Coordinators to confirm their self-reported diagnostic status. It was common for individuals with clinical diagnoses to report having been diagnosed with more than one mental health condition (50.3% for Self-Report; 52.1% for Observer). Individuals with a clinical disorder other than ADHD reported having been diagnosed with (in order of decreasing frequency) Generalized Anxiety Disorder, Major Depressive Disorder, Social Anxiety Disorder, Post-Traumatic Stress Disorder, Obsessive-Compulsive Disorder, Bipolar Disorder (I or II), and Substance Abuse Disorder.

In order to meet test standards for fairness, samples in the standardization phase included detailed demographic data so that the CAARS 2 could subsequently be evaluated for measurement invariance, differential item and test functioning, and mean differences with regard to the demographic variables of gender, race/ethnicity, country of residence, and education level (see chapter 10, Fairness, for more information about the nature and results of statistical tests of measurement bias as well as the fairness properties of the CAARS 2). Carefully selected subsets of the Total Sample were used to create the Normative Samples and ADHD Reference Samples (see chapter 7, Standardization, for detailed sample descriptions).

Of particular importance in meeting the development goal regarding fairness, the Normative Samples were selected to match the most recently available census data for the U.S. and Canada, stratified by age, gender, race/ethnicity, region, and education level. This representation ensured that the updated demographic compositions of these broader populations were reflected in the normative samples, making them representative, diverse, and inclusive. Moreover, the representation of individuals with mental health disorders included in the Normative Samples now more closely approximates the true prevalence rates for those disorders (see chapter 7, Standardization, for more details).

As described in Conceptualization and Initial Planning in this chapter, a development goal was to include ADHD Reference Samples. During the standardization phase, these samples were created from the larger clinical sample by gathering ratings of individuals with a confirmed diagnosis of ADHD (see chapter 7, Standardization, for more details; individuals with co-occurring diagnoses were not excluded). The ADHD Reference Samples (stratified by age and gender) provide clinicians with the flexibility to compare a given set of scores to what is typical for individuals diagnosed with ADHD as well as to the general population. The ability to evaluate whether an individual’s results are more typical of the general or ADHD population, as well as to specify where the results fall within the distribution of scores for a given sample (e.g., in the upper third of the ADHD population), adds considerably to the precision and utility of this assessment measure (for more details on the composition of the Normative and ADHD Reference Samples, see chapter 7, Standardization).

These Normative and ADHD Reference Samples are used in the majority of analyses described throughout this manual, whereas the Total Samples (all General Population and Clinical cases combined) were used for factor analyses and IRT modeling during the standardization phase. The samples in this phase, as a whole and as subsets, were used to determine the final set of items included in the CAARS 2 Response Style Analysis (specifically, Pace, Negative Impression Index, and Inconsistency Index), Associated Clinical Concern Items, Content Scales, DSM Symptom Scales, and Impairment & Functional Outcome Items.


Table 6.4. Demographic Characteristics: CAARS 2 Standardization Phase Self-Report and Observer Total Samples

Demographic | Self-Report (N, %) | Observer: Rated Individuals (N, %) | Observer: Raters (N, %)
Gender Male 1,031 46.2 1,022 47.5 764 35.5
Female 1,190 53.3 1,125 52.3 1,386 64.4
Other 11 0.5 5 0.2 2 0.1
U.S. Race/Ethnicity Hispanic 229 10.3 232 10.8 238 11.1
Asian 92 4.1 69 3.2 59 2.7
Black 187 8.4 200 9.3 185 8.6
White 1,336 59.9 1,281 59.5 1,292 60.0
Other 43 1.9 41 1.9 42 2.0
Canadian Race/Ethnicity Not a visible minority 286 12.8 262 12.2 271 12.6
Visible minority 59 2.6 67 3.1 65 3.0
U.S. Region Northeast 341 15.3 337 15.7 328 15.2
Midwest 430 19.3 417 19.4 409 19.0
South 710 31.8 662 30.8 708 32.9
West 406 18.2 407 18.9 371 17.2
Canadian Region Central 208 9.3 216 10.0 224 10.4
East 24 1.1 25 1.2 25 1.2
West 113 5.1 88 4.1 87 4.0
Education Level No high school diploma 154 6.9 173 8.0 68 3.2
High school diploma/GED 559 25.0 624 29.0 509 23.7
Some college or associate degree 758 34.0 666 30.9 809 37.6
Bachelor's degree or higher 473 21.2 418 19.4 520 24.2
Graduate or professional degree 288 12.9 271 12.6 246 11.4
Diagnosis ADHD Inattentive 114 5.1 65 3.0 -- --
ADHD Hyperactive/Impulsive 10 0.4 9 0.4 -- --
ADHD Combined 131 5.9 96 4.5 -- --
Other clinical diagnoses 184 8.2 145 6.7 -- --
No Diagnosis 1,793 80.3 1,837 85.4 -- --
Relation to individual being rated Spouse -- -- -- -- 620 28.8
Friend -- -- -- -- 530 24.6
Other Family Member -- -- -- -- 975 45.3
Other -- -- -- -- 27 1.3
Length of relationship 1–5 months -- -- -- -- 14 0.7
6–11 months -- -- -- -- 16 0.7
1–3 years -- -- -- -- 140 6.5
More than 3 years -- -- -- -- 1,982 92.1
How well does the rater know the individual being rated? Moderately Well -- -- -- -- 141 6.6
Very Well -- -- -- -- 2,011 93.4
How often does the rater interact with the individual being rated? Monthly -- -- -- -- 150 7.0
Weekly -- -- -- -- 608 28.3
Daily -- -- -- -- 1,394 64.8
Age in years M (SD) 47.5 (19.3) 47.7 (19.7) 43.8 (15.8)
Total N 2,232 2,152 2,152

Analyses and Results

Response Style Analysis: Item Selection and Score Creation

During the standardization phase, the validity scale explorations evolved into a larger goal of offering a multifaceted Response Style Analysis. Samples and analyses related to the finalization of the Negative Impression Index and Inconsistency Index are described in this section, along with the introduction of a response time metric (Pace). Collectively, these metrics provide a comprehensive set of indicators that can be used to understand a rater’s response style and how it may inform the interpretation of results1.

Negative Impression Index

The Negative Impression Index was developed through the review and revision of pilot study items created for Negative Impression and ADHD Symptom Validity Test (SVT) purposes (see Pilot Phase earlier in this chapter). The purpose of this index is to identify extremely negative response styles that can accompany attempts to present an unfavorable impression (e.g., exaggerated descriptions of problems) and/or indicate problems that rarely occur at the level endorsed by individuals with ADHD. To capture these response tendencies, the preliminary pool of Negative Impression Index items comprised items related to creating generally negative impressions, items describing ADHD-like symptoms that are mistakenly assumed to be aspects of ADHD, and items infrequently endorsed by individuals with ADHD. These items were evaluated in the general population and ADHD samples, as well as in independent simulation samples collected during the standardization phase. Similar to the pilot phase, separate simulation samples of “Fake Bad Responders” (N = 102 for Self-Report; N = 94 for Observer) and “Fake ADHD Responders” (N = 100 for Self-Report; N = 94 for Observer) were collected.

Extreme responses to the Negative Impression and ADHD SVT items were compared among the general population, ADHD, and simulation samples. Items were identified as candidates for the Negative Impression Index if the response patterns of the simulation samples were markedly different from those of the general population and the ADHD samples. In particular, items were retained as candidates for the Negative Impression Index when item-level scores of 2 (“Pretty much true; Often/Quite a bit”) or 3 (“Completely true; Very often/Always”) occurred at least 25% more often for the simulated samples (the individuals pretending to have ADHD; “faking bad”) than for individuals with a genuine ADHD diagnosis2. The items also were required to be endorsed at low levels by individuals in the general population (viz., by fewer than 5% of the sample with an item-level score of 2 or 3). Additionally, items from the larger CAARS 2 item pool (e.g., items written for the Content Scales and DSM Symptom scales) were included as candidate items for the Negative Impression Index if they were endorsed at an item-level score of 3 by fewer than 5% of the individuals in the ADHD sample (i.e., identified as infrequent response items for this important clinical sample). The item-level scores for selected candidate items were summed to create a raw score, and the distribution of raw scores within each sample was compared. Note that item-level scores of 1 (“Just a little true; Occasionally”) did not contribute to the raw score; only item-level scores of 2 and 3 were counted toward the raw score as these responses were observed to be rare and extreme and therefore more likely to be indicative of impression management.
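To make the scoring rule above concrete, the following is a minimal sketch in R of the Negative Impression Index raw score logic as described: item-level scores of 2 or 3 are summed, and scores of 0 or 1 contribute nothing. The item names are hypothetical placeholders, not the actual CAARS 2 items.

```r
# Minimal sketch of the Negative Impression Index raw score logic described above.
# Column names (ni_1 ... ni_6) are hypothetical placeholders, not actual CAARS 2 items.
ni_items <- c("ni_1", "ni_2", "ni_3", "ni_4", "ni_5", "ni_6")

negative_impression_raw <- function(responses, items = ni_items) {
  scores <- as.numeric(responses[items])
  # Item-level scores of 0 or 1 do not contribute; only 2s and 3s are summed.
  sum(scores[scores >= 2], na.rm = TRUE)
}

# Example: three items rated 2 yield a raw score of 6, which meets the
# Self-Report cut-off described later in this section.
example <- c(ni_1 = 2, ni_2 = 2, ni_3 = 2, ni_4 = 1, ni_5 = 1, ni_6 = 0)
negative_impression_raw(example)  # 6
```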

Various combinations of items identified using this method were tested to explore the set of items and raw scores that would capture the largest proportion of the Fake Bad and Fake ADHD samples while capturing very few “Honest Responders” (i.e., individuals in the general population and ADHD samples who were instructed to respond to all items honestly). To evaluate how well the set of items could distinguish between groups, several key statistics that summarize classification accuracy were calculated, using the approach outlined by Kessel and Zimmerman (1993). The following accuracy statistics were calculated (note that the term “test” can refer to a scale, index, or measure of any kind, and the definitions provided are broad with multiple examples, as these statistics are used for multiple purposes throughout this manual; an illustrative computation follows the list):

  • Overall Correct Classification Rate. The percentage of correct group classifications made (e.g., accurately distinguishing Honest Responders from those instructed to fake bad or fake ADHD).

  • Sensitivity (a.k.a., true positive rate). The ability of a test to correctly detect target cases in a population (e.g., the percentage of individuals feigning ADHD accurately identified at or above a given cut-off score, or the proportion of randomly generated data predicted to be random responders [see Inconsistency Index later in this chapter]). Higher values indicate a better ability of a test to correctly detect target cases in a population.

  • Specificity (a.k.a., true negative rate). The ability of a test to correctly identify non-target cases (e.g., the proportion of Honest Responders correctly classified as such because they scored below a given cut-off score, or, often, the percentage of general population cases predicted to belong to the general population group). Higher values indicate a greater ability to correctly identify the absence of a feature.

  • Positive Predictive Value. The percentage of individuals identified by the test as belonging to the target group who actually belong to that group (e.g., the proportion of individuals flagged as faking ADHD symptoms who are in fact members of the Fake ADHD sample; the proportion of individuals predicted as having ADHD who, based on a known diagnosis, genuinely have ADHD). Higher values indicate greater accuracy.

  • Negative Predictive Value. The percentage of cases identified by the test as not belonging to the target group who truly do not belong to it (e.g., the proportion of individuals scoring below the cut-off who are genuinely diagnosed with ADHD rather than feigning ADHD). Higher values indicate greater accuracy.

  • Kappa. A statistic, ranging from -1.0 to +1.0, that assesses how accurately group membership is predicted while correcting for chance agreement. Values below 0 indicate less-than-chance prediction, a value of 0 indicates prediction at chance level, values above 0 denote greater-than-chance prediction, and a value of exactly 1 represents perfect classification.
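As a minimal illustration of how these statistics are derived, the sketch below computes them in R from a 2×2 cross-classification of true group membership and test classification (where “flagged” means scoring at or above a cut-off). The counts in the example call are arbitrary placeholders, not CAARS 2 results.

```r
# Illustrative computation of the classification accuracy statistics defined above,
# from a 2x2 cross-classification of true group membership and test classification.
classification_stats <- function(tp, fp, tn, fn) {
  n <- tp + fp + tn + fn
  observed_agreement <- (tp + tn) / n
  # Chance agreement for kappa, from the marginal proportions
  p_yes <- ((tp + fp) / n) * ((tp + fn) / n)
  p_no  <- ((tn + fn) / n) * ((tn + fp) / n)
  chance_agreement <- p_yes + p_no
  list(
    overall_correct_classification = observed_agreement,
    sensitivity = tp / (tp + fn),           # true positive rate
    specificity = tn / (tn + fp),           # true negative rate
    positive_predictive_value = tp / (tp + fp),
    negative_predictive_value = tn / (tn + fn),
    kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
  )
}

# Example with arbitrary placeholder counts (tp = target cases flagged, etc.)
classification_stats(tp = 60, fp = 80, tn = 2068, fn = 42)
```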

A final set of six Negative Impression Index items was selected for each of the respondent types (i.e., six for Self-Report and six for Observer; see appendix A). These items were selected by examining classification accuracy statistics for combinations of items to identify the best-performing set. Classification accuracy statistics are presented in Table 6.5 for Self-Report and Observer, showing the ability of the Negative Impression Index to accurately identify individuals as belonging to either (a) the Fake ADHD group versus a sample of individuals instructed to respond honestly (i.e., general population and ADHD samples combined; N = 2,241 for Self-Report and 2,152 for Observer), (b) the Fake ADHD group versus the sample of individuals genuinely diagnosed with ADHD (N = 255 for Self-Report and 170 for Observer), or (c) the Fake Bad group versus a sample of individuals instructed to respond honestly.

The optimal cut-off score based on these item sets for identifying potentially invalid response patterns was 6 raw score points for Self-Report and 7 raw score points for Observer. The index was designed to be more specific than sensitive, so that the risk of false positive results (i.e., flagging an individual as possibly presenting an exaggeratedly negative impression when in fact they were responding honestly) is very low. In other words, a priority in the creation of this index was to minimize the risk of misclassifying an honest responder as a faker. Using the optimal raw score cut-offs, the correct classification rate for distinguishing between Fake ADHD and genuine ADHD was 82.6% for Self-Report and 79.9% for Observer. Additionally, as seen in Table 6.5, 79.4% of the Self-Report Fake Bad sample and 85.1% of the Observer Fake Bad sample were detected by the Negative Impression Index at the selected raw score cut-offs, indicating that the index can be effective not only for identifying individuals specifically feigning ADHD (i.e., Fake ADHD samples) but also for identifying those making more general efforts to appear worse off than they really are (i.e., Fake Bad samples).


Table 6.5. Classification Accuracy Statistics: CAARS 2 Self-Report and Observer

Classification Accuracy Statistic | Self-Report: Fake ADHD vs. Honest Responders; Fake ADHD vs. Genuine ADHD; Fake Bad vs. Honest Responders | Observer: Fake ADHD vs. Honest Responders; Fake ADHD vs. Genuine ADHD; Fake Bad vs. Honest Responders
Overall Correct Classification (%) 94.7 82.6 95.5 94.3 79.9 95.6
Sensitivity (%) 59.0 59.0 79.4 53.2 53.2 85.1
Specificity (%) 96.3 91.8 96.3 96.1 94.7 96.1
Positive Predictive Value (%) 41.3 73.8 49.1 37.0 84.7 48.5
Negative Predictive Value (%) 98.1 85.1 99.0 97.9 78.5 99.3
Kappa .46 .54 .58 .41 .52 .60
Note. Honest Raters = General Population, ADHD, and other clinical samples (N = 2,241 for Self-Report and 2,152 for Observer); Genuine ADHD = ADHD Reference Sample (N = 255 for Self-Report and 170 for Observer); Fake ADHD = individuals from the Fake ADHD simulation study (N = 102 for Self-Report and 94 for Observer); Fake Bad = individuals from the Fake Bad simulation study (N = 100 for Self-Report and 94 for Observer).

The cut-off scores for the Negative Impression Index were then cross-validated in two separate samples: a naive or “Uncoached” group (N = 54 for Self-Report; N = 52 for Observer) and a “Coached” group (N = 57 for Self-Report; N = 51 for Observer). Participants in this validation study provided consent, were debriefed in full about the purpose of the study, and were excluded if they had received a clinical diagnosis. Both samples completed the entire CAARS 2, which included the Negative Impression Index items. Participants in the Coached sample were instructed to respond as if they had ADHD symptoms and were provided with a list of symptoms similar to what one might find through an informal internet search for symptoms of ADHD. The Uncoached sample was given the same instructions regarding faking ADHD but was not provided with a list of symptoms to assist with their false responses. Raw scores from the selected set of six Negative Impression Index items for both Self-Report and Observer were calculated, and their distributions were examined. Analyses revealed that responses from the Coached and Uncoached samples were extremely similar, such that the two groups could reasonably be combined into a single Fake ADHD group, regardless of the instructions received.

Overall, results supported a raw score cut-off greater than or equal to 6 points for Self-Report, which identified 65.4% of individuals in this validation sample instructed to fake ADHD symptoms. For Observer, a raw score cut-off of greater than or equal to 7 was confirmed, which flagged 43.3% of individuals instructed to fake ADHD symptoms. In contrast, fewer than 4% of the honest responders were flagged for both Self-Report and Observer when these cut-offs were employed.

Classification accuracy statistics were calculated using the selected cut-offs to compare the Fake ADHD cross-validation sample with honest responders from the general population (N = 2,241 for Self-Report and 2,152 for Observer) as well as with honest responders from the genuine ADHD sample (N = 255 for Self-Report and 170 for Observer). The results are presented in Table 6.6. The overall correct classification rate for distinguishing genuine from faked ADHD is 84.7% for Self-Report and 75.8% for Observer. These results support the use of the Negative Impression Index for its intended purpose of assessing reported symptom validity. In general, classification accuracy statistics for predicting feigning depend on the prevalence of feigning in the population, which limits how broadly the statistics can be applied; that prevalence can vary widely depending on the purpose of the evaluation and the setting. Therefore, for a more nuanced examination of the Negative Impression Index classification accuracy, the Positive Predictive and Negative Predictive Values for several different base rates (including the 50% feigning base rate assumed in Table 6.6) are provided in Table 6.7.
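The dependence of predictive values on base rate follows directly from Bayes' theorem. The sketch below shows the standard conversion of sensitivity and specificity into Positive and Negative Predictive Values under assumed base rates of feigning; the sensitivity and specificity values used here are placeholders, not the Table 6.6 entries.

```r
# Sketch of how positive and negative predictive values vary with the assumed
# base rate of feigning, holding sensitivity and specificity constant.
predictive_values <- function(sensitivity, specificity, base_rate) {
  tp <- sensitivity * base_rate            # true positives per unit population
  fn <- (1 - sensitivity) * base_rate      # missed feigners
  tn <- specificity * (1 - base_rate)      # honest responders not flagged
  fp <- (1 - specificity) * (1 - base_rate)  # honest responders flagged
  data.frame(base_rate = base_rate,
             ppv = tp / (tp + fp),
             npv = tn / (tn + fn))
}

# Placeholder sensitivity/specificity evaluated at several base rates
predictive_values(sensitivity = 0.65, specificity = 0.96,
                  base_rate = c(0.10, 0.25, 0.50))
```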

Note that the sensitivity of the Negative Impression Index is higher for Self-Report than for Observer. As noted previously, the Negative Impression Index is intentionally designed to value specificity over sensitivity, to avoid misidentifying results as potentially invalid. Consequently, the absence of an elevated score on the Negative Impression Index, especially for Observers, may or may not indicate valid symptom reporting, but the presence of an elevated score is a very strong indicator that the responses are atypical and potentially invalid. Recall that obtaining information from multiple raters is valuable when gathering information with the CAARS 2 (see chapter 4, Interpretation), as Observers differ in terms of their perspective, insight, knowledge, and potential biases (which may be captured in the Negative Impression Index) depending on the nature and duration of their relationship to the individual being rated.

Inconsistency Index

The Inconsistency Index was created to identify a pattern of inconsistent responding that could impact the interpretation of CAARS 2 results (e.g., careless/random responding, comprehension difficulties, or unusual interpretations of subtle wording differences). To develop this index, correlations between CAARS 2 items within the clinical sample were inspected. Items that were both highly correlated (ideally above r = .70 but acceptable if above r = .65) and conceptually very similar with respect to their content were selected as item pairs. Correlations for these item pairs were also inspected within the General Population Sample (see Standardization Phase earlier in this chapter for sample descriptions) to ensure that strong associations were still present. For this sample, moderate, positive correlations (i.e., r = .55) were deemed acceptable for item-pair retention, given the expectation that the lower variability of responses in general, as opposed to clinical, populations would produce generally lower inter-item correlations. Through this review and item inspection process, seven item pairs that were both statistically and conceptually closely related were identified for both the Self-Report and Observer forms. Correlations ranged from .68 to .75 for Self-Report and from .72 to .80 for Observer (all correlation coefficients are significant, p < .001; see appendix A for the item pairs). Although the two items within each pair were drawn from the same Content Scale, the pairs collectively represent a broad range of content areas.

Next, the difference in item-level scores for each of these item pairs was calculated (e.g., if Item A was rated 3 and Item B was rated 1, the difference would be 2 points), and the absolute differences greater than 1 point were summed to create the raw score for the Inconsistency Index. Item-pair score differences of 1 point or less were not counted, as they were not sufficiently discrepant to indicate meaningful inconsistency. To simulate the behavior of an individual who is truly responding randomly, randomly generated data sets of CAARS 2 responses (created in R software) were used to determine a raw score cut-off at which inconsistent responding could be confidently identified.
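A minimal sketch of this scoring rule and of the random-data approach is shown below. The item-pair names are hypothetical placeholders (the actual pairs are listed in appendix A), and the uniform random responses simply illustrate how a candidate cut-off can be evaluated against truly random responding.

```r
# Minimal sketch of the Inconsistency Index raw score: for each designated item pair,
# an absolute difference of 2 or more points contributes to the raw score; differences
# of 0 or 1 point do not. The seven pair names below are hypothetical placeholders.
item_pairs <- lapply(1:7, function(i) paste0("pair", i, c("_a", "_b")))

inconsistency_raw <- function(responses, pairs = item_pairs) {
  diffs <- sapply(pairs, function(p) abs(responses[[p[1]]] - responses[[p[2]]]))
  sum(diffs[diffs > 1])  # only discrepancies greater than 1 point are summed
}

# Randomly generated response sets (uniform over the 0-3 scale), as described above,
# show how often truly random responding would meet or exceed a candidate cut-off.
set.seed(1)
random_data <- matrix(sample(0:3, 1000 * 14, replace = TRUE), ncol = 14,
                      dimnames = list(NULL, unlist(item_pairs)))
random_raw <- apply(random_data, 1, function(r) inconsistency_raw(as.list(r)))
mean(random_raw >= 4)  # proportion of random records flagged at a raw score cut-off of 4
```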

The distribution of raw scores within the random datasets was then compared against the distribution of raw scores observed in the General Population and ADHD Reference Samples, as respondents in both of those samples were considered to be making a good effort when responding to items. A cut-off was selected to minimize false positives; that is, priority was given to minimizing how often individuals who were actually providing good effort were flagged for inconsistent responding. For Self-Report, Inconsistency Index raw scores greater than or equal to 4 successfully detected 81.5% of the simulated random data while capturing only 6.2% of the General Population Sample, 5.9% of individuals diagnosed with ADHD Inattentive Presentation, and 4.3% of individuals diagnosed with ADHD Combined Presentation. For Observer, raw scores greater than or equal to 4 successfully detected 78.9% of the simulated random data while capturing only 4.7% of the General Population Sample, 0.0% of individuals diagnosed with ADHD Inattentive Presentation, and 5.6% of individuals with a diagnosis of ADHD Combined Presentation.

The classification accuracy of the Inconsistency Index was then explored. As seen in Table 6.8, the overall correct classification rate was 87.1% for Self-Report and 87.9% for Observer. The classification accuracy statistics were derived from the General Population Sample (N = 1,663 for Self-Report and 1,773 for Observer) in contrast to the computer-generated data used to simulate an equally large sample of random responses. As previously noted, the Inconsistency Index is designed to be more specific than it is sensitive so as to reduce the risk of over-identifying responders providing their best effort. In light of that objective, these results support the valid use of the Inconsistency Index for the identification of a random, careless, or inconsistent response style.

Omitted Items

Ideally, raters provide responses to all items on a scale, as this yields the most thorough information about the domain or domains the scale covers. In practice, however, omitted items require decisions about how, or whether, to score the affected scales. The development team for the CAARS 2 determined that scales on which the number of omitted items fell beneath a specified threshold (generally less than 10% of the items on the scale) could be prorated and scored, whereas scales with omissions exceeding that threshold would not be scored. Appendix B provides further information about how omitted responses are handled in the CAARS 2.
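The sketch below illustrates one common approach to proration under the 10% threshold described above: the sum of completed items is scaled up to the full item count. This is an illustration only; the exact CAARS 2 omitted-item rules are given in appendix B.

```r
# Hedged sketch of prorated scoring for a scale with omitted (NA) items.
# The exact CAARS 2 rules are in appendix B; this shows one common proration approach.
prorate_scale <- function(item_scores, max_missing_prop = 0.10) {
  n_total <- length(item_scores)
  n_missing <- sum(is.na(item_scores))
  if (n_missing / n_total >= max_missing_prop) {
    return(NA)  # too many omissions: the scale is not scored
  }
  # Scale the sum of completed items up to the full item count
  round(sum(item_scores, na.rm = TRUE) * n_total / (n_total - n_missing))
}

prorate_scale(c(2, 3, 1, NA, 2, 2, 3, 1, 0, 2, 1, 2))  # one omission out of 12 items
```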

Pace

Data related to the rate of responding were examined as another way to describe response style. Response time per item, which can be summed to reflect total duration, was captured during all phases of development for online completion of the CAARS 2. To calculate pace, the total number of items on the full-length CAARS 2 was divided by the duration of the test in minutes, yielding a rate of responding expressed as items per minute. The distribution of this pace indicator was explored in the Normative Sample as well as in the ADHD Reference Sample. Both samples had similar ranges for pace across both Self-Report and Observer; therefore, the larger Normative Sample was selected for deriving a cut-off. Unusually slow or fast rates of responding were defined as paces more than 2.5 standard deviations below or above the Normative Sample’s mean. This range was selected because 2.5 standard deviations below the mean approaches 0 items per minute, which is a natural boundary for this indicator. A pace more than 2.5 standard deviations above the mean can be interpreted as unusually fast, while a pace more than 2.5 standard deviations below the mean can be interpreted as unusually slow. For Self-Report, 1.2% of the Combined Gender Normative Sample was identified as unusually fast (i.e., a pace greater than 17 items per minute), and 0.1% of the Combined Gender Normative Sample was identified as unusually slow (i.e., a pace of less than 1 item per minute). For Observer, 1.0% and 0.1% of the Combined Gender Normative Samples were flagged as unusually fast and unusually slow, respectively. Details about the exact guidelines and applied use of this Response Style metric can be found in Step 1: Examine Response Style Analysis in chapter 4, Interpretation.
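The sketch below illustrates the pace calculation and the 2.5-standard-deviation flagging rule. The item count and the simulated normative pace distribution are placeholders; the operational cut-offs are the values reported above and in chapter 4.

```r
# Sketch of the pace metric and the 2.5-SD flagging rule described above.
pace_flags <- function(duration_minutes, n_items, reference_pace) {
  pace <- n_items / duration_minutes                 # items completed per minute
  m <- mean(reference_pace)
  s <- sd(reference_pace)
  data.frame(pace = pace,
             unusually_fast = pace > m + 2.5 * s,
             unusually_slow = pace < m - 2.5 * s)
}

set.seed(2)
norm_pace <- rgamma(1000, shape = 9, rate = 1.3)  # simulated normative paces (items/min)
# n_items = 100 is a placeholder for the full-length CAARS 2 item count
pace_flags(duration_minutes = c(4, 30), n_items = 100, reference_pace = norm_pace)
```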

Associated Clinical Concern Items: Item Selection and Scoring

The Associated Clinical Concern Items established during the pilot phase were retained. These single-item screeners were meant to identify individuals with any lifetime history of either suicidal thoughts or self-injurious behaviors, as well as those with symptoms of anxiety or depression (item content refers to sadness/emptiness). These items highlight concerns that warrant clinical attention by identifying acute risks to an individual’s safety or considerations relevant to differential diagnosis or treatment planning.

Given the potentially critical implications for safety, the development team agreed that any endorsement (i.e., any response other than “Never”) of the suicidality or self-injury items warrants serious attention (i.e., norm-referenced information or scores are not needed to determine the need for follow-up3). Note that the suicidality item was revised during the development of the CAARS 2. During the standardization phase, the Self-Report version of this item read: “I have thought about killing myself.” After data were collected, the item was modified to more closely align with the version of the item for the Observer form by asking about suicide attempts as well as ideation. The modified item now reads: “I have thought about or attempted suicide.” Results presented in this manual pertain to the initial version of this suicidality item administered during standardization (see Clinical Group Differences in chapter 9, Validity, for more information).

The response frequencies of the anxiety/worry and sadness/emptiness (or sadness only, for Observer) screening items were examined to ensure that those diagnosed with Anxiety and Depression, respectively, were endorsing them at greater rates than those without these diagnoses (e.g., general population, individuals with ADHD but without co-occurring Depression or Anxiety diagnoses). Based on the response frequencies within nearly all of the normative age groups, item responses greater than or equal to 2 (“Pretty much true; Often/Quite a bit”) for the anxiety/worry and sadness/emptiness screening items are considered elevated (i.e., relatively infrequent within the general population, corresponding approximately to the upper quartile of the distribution). The one exception occurs with individuals aged 70 years or older, for whom responses greater than or equal to 1 (“Just a little true; Occasionally”) are considered elevated for the sadness/emptiness screening item (based on response frequencies in this age group; see appendix F).

Content Scales: Item Selection and Scoring

Using the data collected during the standardization phase, analyses were conducted to evaluate item and scale functioning for the CAARS 2 Content Scales. Analyses relied on both classical test theory (CTT) and item response theory (IRT) methodologies and were conducted using packages in R (version 3.6.1; R Foundation for Statistical Computing, 2019). Items were marked for review if they met any of the following criteria:

  • Poor discrimination between intended groups, as measured by Cliff’s delta (Cliff, 1993; Romano et al., 2006), such as an item about ADHD-related symptoms that did not demonstrate a notable mean difference between individuals with and without an ADHD diagnosis (an illustrative computation of Cliff’s delta follows this list).

  • Significant associations with demographic characteristics (e.g., items that differed between construct-irrelevant groups, such as White vs. Hispanic individuals).

  • Items that did not correlate well with their intended factor (e.g., item-total correlations and factor loadings ≤ .40).

  • Items with low precision and information, as derived from IRT models.
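As referenced in the first criterion above, Cliff's delta is a nonparametric effect size based on pairwise comparisons between two groups. The sketch below computes it in R; the two response vectors are hypothetical placeholders, not CAARS 2 data.

```r
# Illustrative computation of Cliff's delta for item-level discrimination between two
# groups (e.g., ADHD vs. general population item responses).
cliffs_delta <- function(x, y) {
  # Proportion of (x, y) pairs where x > y minus the proportion where x < y
  comparisons <- outer(x, y, FUN = "-")
  (sum(comparisons > 0) - sum(comparisons < 0)) / length(comparisons)
}

adhd_item    <- c(3, 2, 3, 2, 1, 3, 2, 2)  # hypothetical item responses, ADHD group
general_item <- c(1, 0, 1, 2, 0, 1, 0, 1)  # hypothetical item responses, general population
cliffs_delta(adhd_item, general_item)
```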

All test items were evaluated against the aforementioned criteria by examining descriptive statistics (such as item-level means and response frequency distributions), inter-item correlations, demographic group differences in item-level responses and raw scale scores, fit to IRT models, and tests for measurement bias (i.e., differential item functioning [DIF] using IRT) and local independence (for details about methodology, see appendix M). Items with significant DIF results were reviewed for meaningful effect sizes and inspected graphically to characterize the nature of the DIF effect. In all instances, these items presented only minor concerns; a very small proportion (i.e., less than 10%) of each scale contained DIF items, and the DIF effects were quite small. These items were retained because they were deemed not to compromise the quality of the full-length scales (they remained flagged for the purposes of developing the CAARS 2–Short and CAARS 2–ADHD Index; see chapters 11 and 12, respectively, for more details).

Confirmatory factor analyses (CFA) were conducted at the item level to cross-validate the latent structure of the item pool and its alignment with the theoretical framework (for more details on the CFAs conducted, see Internal Structure in chapter 9, Validity). These models provided information about inter-item relationships and each item’s importance to the factors. The identified factors replicated the results of the exploratory factor analyses (EFAs) from the pilot phase.

After these statistical reviews, items contributing to each factor were carefully examined from a clinical perspective to determine whether factor labels reasonably captured the intended content. The majority of factor labels were retained, with two exceptions. The Emotional Lability factor was renamed Emotional Dysregulation to represent general difficulties with regulating emotional experiences and to better reflect the additional items regarding rapid and exaggerated changes in mood/emotion. The Problems with Self-Concept factor was renamed Negative Self-Concept to capture the negativity embedded in this factor. Competing models were tested, and the best-fitting model was a five-factor solution: Inattention/Executive Dysfunction, Hyperactivity, Impulsivity, Emotional Dysregulation, and Negative Self-Concept.

Items that were flagged for not meeting the specified criteria or for general poor or inconsistent performance were reviewed by the development team for construct relevance and clinical significance. Through this process, the content scale item count was reduced to 72 (from 120 items for Self-Report and 118 items for Observer).

The final items were assigned to Content Scales in alignment with the final CFA model. Raw scores were calculated for each Content Scale and were then converted to T-scores and percentiles (for more details on the process used to convert raw scores into standardized scores, please see Standardization Procedures in chapter 7, Standardization).
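The sketch below shows a simple linear T-score conversion (T = 50 + 10z) and an empirical percentile for a raw score relative to a normative distribution. This is an illustration under that assumption only; the actual CAARS 2 standardization procedure is described in chapter 7 and may involve normalization or smoothing across normative groups.

```r
# Hedged sketch of converting raw Content Scale scores to T-scores and percentiles,
# assuming a simple linear T-score; the actual CAARS 2 procedure is in chapter 7.
raw_to_standard <- function(raw, norm_raw) {
  z <- (raw - mean(norm_raw)) / sd(norm_raw)
  data.frame(t_score = round(50 + 10 * z),
             percentile = round(100 * ecdf(norm_raw)(raw)))
}

set.seed(3)
norm_raw <- rpois(1000, lambda = 8)   # placeholder normative raw score distribution
raw_to_standard(raw = c(8, 20), norm_raw = norm_raw)
```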

DSM Symptom Scales: Item Selection and Scoring

In addition to meeting all the previously described requirements of a successful item (see Content Scales: Item Selection and Scoring earlier in this chapter), items for the DSM Symptom Scales were also required to effectively capture the corresponding DSM-5 Criterion A symptom. In some instances, multiple items were required to capture a complex DSM criterion. For example, DSM Criterion 1b describes difficulty sustaining attention and cites multiple examples; this criterion was initially represented by a general item (“Difficulty staying focused”) and specific items (e.g., “Has trouble staying focused during non-work activities”). During the standardization phase, the sets of items created to represent each DSM criterion were reviewed. When a single item had the best statistics, it was reviewed clinically for goodness of fit with the DSM criterion. When multiple items intended to represent a given DSM criterion met item performance standards, the decision as to which item(s) to retain was based on (a) relative empirical strengths (e.g., more pronounced clinical group differences; stronger factor loadings), (b) clinical judgment regarding alignment with DSM content, and (c) impact on scale brevity. By the end of this process, each of the 18 DSM Criterion A symptoms of ADHD was represented by a single item or pair of items on the CAARS 2.

The final DSM ADHD Inattentive Symptoms Scale is composed of a subset of items from the CAARS 2 Inattention/Executive Dysfunction Content Scale, whereas the final DSM ADHD Hyperactive/Impulsive Symptoms Scale is composed of a subset of items from both the CAARS 2 Hyperactivity and Impulsivity Content Scales. The DSM ADHD Inattentive Symptoms Scale and DSM Hyperactive/Impulsive Symptoms Scale were combined to create the DSM Total ADHD Symptoms Scale. The item responses for each CAARS 2 DSM Symptom Scale were summed to create raw scores, which were converted to T-scores and percentiles. For more details on converting raw scores into standardized scores, please see Standardization Procedures in chapter 7, Standardization.

In addition to creating standardized, norm-referenced scores for the DSM Symptom Scales, a raw count of symptoms was also calculated. Given that the DSM criteria specify that symptoms within the Inattentive and the Hyperactive/Impulsive domains must be present often, a rating equivalent to or greater than often (i.e., item-level endorsement greater than or equal to 2 on the CAARS 2) must be made for a given CAARS 2 DSM item to count toward the Symptom Count. Most DSM criteria are represented by a single CAARS 2 item. However, some DSM symptom criteria are represented by a pair of items governed by either an “either item” or a “both items” endorsement rule (depending on the nature of the DSM criterion and the wording of the CAARS 2 items). For example, for Criteria 1a, 2a, 2e, and 2i, an endorsement of either one of the two items with an item response of 2 (“Pretty much true; Often/Quite a bit”) or 3 (“Completely true; Very Often/Always”) is sufficient to count. In contrast, for DSM Criterion 1d, both items must be endorsed to count as meeting the criterion. Note that in computing the CAARS 2 DSM ADHD Inattentive Symptom Count, DSM Criterion 1d is included in the Symptom Count if at least one of the two items is endorsed with an item response of 2 (“Pretty much true; Often/Quite a bit”) or 3 (“Completely true; Very Often/Always”), so long as the other item in the pair is endorsed at or above a minimum item response of 1 (“Just a little true; Occasionally”). For details about the applied use of Symptom Counts, please see chapter 4, Interpretation.
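The sketch below expresses the Symptom Count rules just described as a small R function. The item names and the mapping of items to criteria are hypothetical; the actual assignments appear in appendix A.

```r
# Sketch of the DSM Symptom Count logic described above. The item-to-criterion mapping
# is hypothetical; see appendix A for the actual CAARS 2 item assignments.
symptom_met <- function(responses, items, rule = c("single", "either", "both")) {
  rule <- match.arg(rule)
  scores <- as.numeric(responses[items])
  switch(rule,
    single = scores[1] >= 2,                        # one item rated Often or higher
    either = any(scores >= 2),                      # e.g., Criteria 1a, 2a, 2e, 2i
    both   = any(scores >= 2) && all(scores >= 1)   # e.g., Criterion 1d
  )
}

# Example: a two-item criterion evaluated under the "either" rule vs. the "both" rule
responses <- c(item_x = 2, item_y = 0)
symptom_met(responses, c("item_x", "item_y"), rule = "either")  # TRUE
symptom_met(responses, c("item_x", "item_y"), rule = "both")    # FALSE
```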

CAARS 2–ADHD Index: Item Selection and Scoring

After the final CAARS 2 items were confirmed, development of the CAARS 2–ADHD Index began. Machine learning was used to identify items for the new index: all CAARS 2 items were included in an analysis to select the items that most efficiently distinguished between individuals with and without ADHD. Twelve items were selected for the index, and probability scores were determined using data from the Normative and ADHD Reference Samples (for more information, see chapter 12, CAARS 2–ADHD Index). In addition, the ADHD Index is now available as a stand-alone form that can be used as a quick screener for ADHD, identifying those in need of a more comprehensive evaluation.
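The specific algorithm used for the CAARS 2–ADHD Index is documented in chapter 12 and is not reproduced here. Purely as an illustration of how a classifier can select a sparse item subset and produce probability scores, the sketch below uses penalized (lasso) logistic regression via the glmnet package on simulated data; this is an assumed stand-in technique, not the method actually used.

```r
# Illustration only: a lasso logistic regression that selects a sparse item subset and
# yields probability scores. Data, item count, and method are placeholders/assumptions.
library(glmnet)

set.seed(4)
n <- 500; p <- 72                                          # placeholder candidate items
x <- matrix(sample(0:3, n * p, replace = TRUE), n, p)      # simulated item responses
adhd <- rbinom(n, 1, plogis(rowSums(x[, 1:5]) / 10 - 2))   # simulated diagnostic status

fit <- cv.glmnet(x, adhd, family = "binomial", alpha = 1)  # lasso keeps few items
selected_items <- which(coef(fit, s = "lambda.min")[-1] != 0)
probability_scores <- predict(fit, newx = x, s = "lambda.min", type = "response")
```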

Impairment & Functional Outcome Items: Item Selection and Scoring

Impairment & Functional Outcome Items were also analyzed in the standardization phase. Criteria for rejection from the final CAARS 2 were (a) lack of meaningful group differences (specifically, the absence of higher means in clinical groups compared to the general population) and the presence of meaningful differences across demographic subgroups (such as gender and race); (b) lack of variability in the distribution of response options (e.g., an item was unsuccessful if an overwhelming proportion of the General Population Sample selected 3 [“Completely true; Very Often/Always”], indicating that the item did not capture impairment unique to having a clinical diagnosis); and (c) inter-item correlations indicating redundancy (e.g., r > .90). Because this type of item was new to the CAARS 2, some constructs were represented by two items, with the intention of using the data to guide selection of the better-performing item for the final CAARS 2. Flagged items were reviewed by the development team for psychometric performance alongside utility in a clinical setting. Statistical and clinical review of the items, together with the desire to balance brevity and comprehensiveness, reduced the set from 22 to 13 Impairment & Functional Outcome Items on the final Self-Report and Observer forms.

Because the Impairment & Functional Outcome Items each represent a different construct, the development team agreed that it was not logical to create a combined scale score. Instead, individual item responses are reported, along with a norm-based determination of whether a response can be considered elevated. The cumulative frequency distributions of these items were calculated for each normative group (available by age and gender; see chapter 7, Standardization, for details about this reference sample). After a review of these item-level distributions, an Impairment & Functional Outcome Item response was deemed “Elevated,” and thus potential evidence of impairment, when it was endorsed at a level that is infrequent in the CAARS 2 Normative Sample (i.e., when fewer than 25% of individuals in the normative group provided a rating at the same level or higher, placing the response in the upper quartile of the distribution). Appendix F displays the frequency distributions of the normative samples for the Impairment & Functional Outcome Items.
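The sketch below illustrates this norm-referenced elevation rule: a response is flagged when the proportion of the normative group rating at the same level or higher falls below 25%. The normative item distribution is simulated; the actual distributions are in appendix F.

```r
# Sketch of the norm-referenced "Elevated" determination for an Impairment & Functional
# Outcome Item. Normative data here are simulated placeholders (see appendix F).
is_elevated <- function(response, norm_responses, threshold = 0.25) {
  prop_at_or_above <- mean(norm_responses >= response)
  prop_at_or_above < threshold
}

set.seed(5)
norm_item <- sample(0:3, 1000, replace = TRUE, prob = c(.55, .25, .15, .05))
is_elevated(2, norm_item)  # TRUE: roughly 20% of this simulated group rated 2 or higher
```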


1 Omitted Items are also provided as part of the Response Style Analysis. Given that this metric did not stem from item development and instead relied on behavioral data, it is not discussed here. Please see chapter 4, Interpretation, for a discussion of this metric.

2 Some of the items considered in this step were positively worded and therefore reverse-scored (e.g., an endorsement of 0 [“Not true at all; Never/Rarely”] was scored as 3 points).

3 Although clinicians are urged to follow up with any endorsement of these two critical items regardless of normative frequencies, interested readers can review response distributions in appendix F.
