CAARS 2 Manual

Chapter 11: Reliability


Reliability

Reliability refers to the degree to which scores are free from measurement error, and it is evaluated by examining the consistency of measurements obtained across different administrations or parts of the instrument (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Multiple indicators of reliability are provided for scores from the CAARS 2–Short, including internal consistency, test information, test-retest reliability, and inter-rater reliability. See chapter 8, Reliability, for a detailed description of each type of reliability discussed in this section.

Internal Consistency and Standard Error of Measurement

Internal consistency estimates for the CAARS 2–Short are presented in Tables 11.10 and 11.11 for the Normative and ADHD Reference Samples (see chapter 7, Standardization, for a description of the Normative and ADHD Reference Samples; see chapter 8, Reliability, for a more in-depth description of the coefficients in this section).

The reliability coefficients presented in Tables 11.10 and 11.11 indicate that the CAARS 2–Short meets or exceeds guidelines for internal consistency for all age groups. Across all age groups and genders in the Normative Sample, the median omega of the CAARS 2–Short scale scores was .91 (ranging from .80 to .97) for Self-Report and .93 (ranging from .78 to .98) for Observer. For the ADHD Reference Sample, the median omega was .89 (ranging from .84 to .92) for Self-Report and .92 (ranging from .87 to .95) for Observer. In summary, multiple metrics indicate that the CAARS 2–Short scale scores provide consistent and reliable estimates of the constructs being measured.

Internal consistency values provide a measure of a test’s reliability, and they also have a practical application: the calculation of the standard error of measurement (SEM). SEM values were calculated for the CAARS 2–Short using the SD of the reference sample’s T-scores (i.e., SD = 10), along with the omega reliability coefficient (SEM = SD × √(1 − ω)). Overall, the median SEM was 3.07 for Self-Report and 2.69 for Observer in the Normative Sample, and 3.28 for Self-Report and 2.76 for Observer in the ADHD Reference Sample (see Tables 11.10 and 11.11). The low SEM values reported here indicate that observed scores contain little measurement error relative to the T-score SD of 10.
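The relationship between a reliability coefficient and the SEM can be illustrated with a short Python sketch. It assumes the standard formula SEM = SD × √(1 − reliability) with the T-score SD of 10; the function name is hypothetical and is not part of any CAARS 2 scoring software. Because the tabled SEM values were presumably computed from unrounded coefficients, values recomputed from the rounded omegas may differ slightly from those in the tables.

```python
import math


def standard_error_of_measurement(reliability: float, sd: float = 10.0) -> float:
    """Return SEM = SD * sqrt(1 - reliability).

    `sd` defaults to 10, the standard deviation of the T-score metric.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Illustrative use with rounded omega values of the size reported in this section.
for omega in (0.80, 0.91, 0.97):
    sem = standard_error_of_measurement(omega)
    print(f"omega = {omega:.2f} -> SEM = {sem:.2f} T-score points")
```

The sketch makes the direction of the relationship explicit: as omega approaches 1, the SEM shrinks toward 0.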

Practical applications of SEM for the CAARS 2–Short include the calculation of confidence intervals (see chapter 4, Interpretation, for a description of how confidence intervals can be used in the interpretation of results). For the CAARS 2–Short, 90% and 95% confidence intervals were calculated based on the SEM for each scale. (Note that confidence intervals for the Principal Normative Sample are automatically included in the digital reports; see Scoring Options in chapter 3, Administration and Scoring. See appendix C for confidence intervals based on SEM for the ADHD Reference Samples.)
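As a rough illustration of how an SEM translates into a confidence interval, the following Python sketch builds a symmetric interval around an observed T-score using a normal-theory multiplier. This is an assumption made for illustration: the exact procedure used in the digital reports (for example, any rounding or centering on an estimated true score) is described elsewhere in the manual and may differ. The function name and the example score are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(t_score: float, sem: float, level: float = 0.95):
    """Return (lower, upper) bounds of a symmetric interval: T +/- z * SEM."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must be between 0 and 1")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sem
    return (t_score - half_width, t_score + half_width)


# Hypothetical observed T-score of 65 with an SEM of 3.07.
for level in (0.90, 0.95):
    lower, upper = confidence_interval(65, 3.07, level)
    print(f"{int(level * 100)}% CI: {lower:.1f} to {upper:.1f}")
```

A smaller SEM yields a narrower interval at the same confidence level, which is why scales with higher omega values support more precise score interpretation.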

Table 11.10. Internal Consistency and Standard Error: CAARS 2–Short Self-Report Normative Samples

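| CAARS 2–Short Scale | Number of Items | Age Group | Combined Gender N | Combined Gender α | Combined Gender ω | Combined Gender SEM | Male N | Male α | Male ω | Male SEM | Female N | Female α | Female ω | Female SEM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Inattention/Executive Dysfunction | 12 | 18-24 | 110 | .94 | .94 | 2.38 | 55 | .94 | .94 | 2.45 | 55 | .95 | .95 | 2.30 |
| | | 25-29 | 110 | .97 | .97 | 1.83 | 55 | .97 | .97 | 1.84 | 55 | .97 | .97 | 1.74 |
| | | 30-39 | 220 | .95 | .96 | 2.12 | 110 | .96 | .96 | 2.10 | 110 | .95 | .95 | 2.16 |
| | | 40-49 | 220 | .96 | .96 | 1.89 | 110 | .96 | .96 | 2.02 | 110 | .97 | .97 | 1.75 |
| | | 50-59 | 220 | .96 | .96 | 2.06 | 110 | .95 | .95 | 2.18 | 110 | .97 | .97 | 1.81 |
| | | 60-69 | 220 | .96 | .96 | 2.07 | 110 | .95 | .95 | 2.23 | 110 | .97 | .97 | 1.84 |
| | | 70+ | 220 | .94 | .94 | 2.48 | 94 | .94 | .94 | 2.35 | 126 | .93 | .93 | 2.62 |
| Hyperactivity | 7 | 18-24 | 110 | .88 | .89 | 3.36 | 55 | .85 | .86 | 3.76 | 55 | .90 | .91 | 3.04 |
| | | 25-29 | 110 | .91 | .92 | 2.90 | 55 | .92 | .92 | 2.79 | 55 | .91 | .91 | 2.92 |
| | | 30-39 | 220 | .89 | .89 | 3.30 | 110 | .88 | .89 | 3.34 | 110 | .90 | .90 | 3.09 |
| | | 40-49 | 220 | .88 | .89 | 3.34 | 110 | .87 | .87 | 3.57 | 110 | .90 | .91 | 3.07 |
| | | 50-59 | 220 | .89 | .89 | 3.24 | 110 | .85 | .86 | 3.71 | 110 | .93 | .93 | 2.60 |
| | | 60-69 | 220 | .90 | .90 | 3.15 | 110 | .88 | .88 | 3.45 | 110 | .92 | .92 | 2.84 |
| | | 70+ | 220 | .87 | .87 | 3.63 | 94 | .87 | .87 | 3.55 | 126 | .86 | .87 | 3.66 |
| Impulsivity | 7 | 18-24 | 110 | .90 | .90 | 3.14 | 55 | .90 | .90 | 3.14 | 55 | .90 | .91 | 3.02 |
| | | 25-29 | 110 | .90 | .91 | 3.07 | 55 | .89 | .89 | 3.34 | 55 | .92 | .93 | 2.74 |
| | | 30-39 | 220 | .88 | .88 | 3.47 | 110 | .89 | .89 | 3.36 | 110 | .86 | .87 | 3.66 |
| | | 40-49 | 220 | .88 | .88 | 3.42 | 110 | .88 | .88 | 3.45 | 110 | .89 | .89 | 3.31 |
| | | 50-59 | 220 | .89 | .89 | 3.35 | 110 | .86 | .86 | 3.74 | 110 | .91 | .91 | 2.95 |
| | | 60-69 | 220 | .90 | .90 | 3.14 | 110 | .87 | .87 | 3.55 | 110 | .92 | .93 | 2.70 |
| | | 70+ | 220 | .86 | .86 | 3.76 | 94 | .87 | .87 | 3.56 | 126 | .83 | .84 | 4.05 |
| Emotional Dysregulation | 6 | 18-24 | 110 | .92 | .92 | 2.87 | 55 | .93 | .93 | 2.63 | 55 | .91 | .91 | 3.07 |
| | | 25-29 | 110 | .89 | .89 | 3.30 | 55 | .89 | .89 | 3.25 | 55 | .89 | .89 | 3.30 |
| | | 30-39 | 220 | .92 | .92 | 2.75 | 110 | .93 | .93 | 2.61 | 110 | .91 | .92 | 2.91 |
| | | 40-49 | 220 | .92 | .92 | 2.82 | 110 | .90 | .90 | 3.10 | 110 | .94 | .94 | 2.46 |
| | | 50-59 | 220 | .90 | .91 | 3.06 | 110 | .88 | .88 | 3.43 | 110 | .93 | .93 | 2.62 |
| | | 60-69 | 220 | .92 | .92 | 2.79 | 110 | .89 | .89 | 3.30 | 110 | .95 | .95 | 2.31 |
| | | 70+ | 220 | .89 | .89 | 3.28 | 94 | .91 | .92 | 2.90 | 126 | .87 | .87 | 3.55 |
| Negative Self-Concept | 5 | 18-24 | 110 | .88 | .88 | 3.51 | 55 | .86 | .86 | 3.74 | 55 | .87 | .87 | 3.55 |
| | | 25-29 | 110 | .92 | .92 | 2.86 | 55 | .90 | .91 | 3.02 | 55 | .93 | .93 | 2.65 |
| | | 30-39 | 220 | .91 | .91 | 2.99 | 110 | .91 | .91 | 3.03 | 110 | .91 | .91 | 2.98 |
| | | 40-49 | 220 | .90 | .90 | 3.14 | 110 | .89 | .89 | 3.25 | 110 | .91 | .91 | 2.94 |
| | | 50-59 | 220 | .88 | .88 | 3.41 | 110 | .88 | .89 | 3.39 | 110 | .89 | .90 | 3.21 |
| | | 60-69 | 220 | .88 | .88 | 3.41 | 110 | .83 | .85 | .42 | 110 | .92 | .92 | 2.80 |
| | | 70+ | 220 | .81 | .82 | 4.25 | 94 | .83 | .84 | .37 | 126 | .78 | .80 | 4.50 |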
Note. α = alpha. ω = omega. SEM = standard error of measurement.

Table 11.11. Internal Consistency and Standard Error: CAARS 2–Short Observer Normative Samples

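| CAARS 2–Short Scale | Number of Items | Age Group | Combined Gender N | Combined Gender α | Combined Gender ω | Combined Gender SEM | Male N | Male α | Male ω | Male SEM | Female N | Female α | Female ω | Female SEM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Inattention/Executive Dysfunction | 12 | 18-24 | 110 | .95 | .95 | 2.26 | 55 | .94 | .94 | 2.38 | 55 | .95 | .96 | 2.11 |
| | | 25-29 | 110 | .97 | .97 | 1.83 | 55 | .96 | .96 | 2.00 | 55 | .97 | .97 | 1.69 |
| | | 30-39 | 220 | .97 | .97 | 1.63 | 110 | .98 | .98 | 1.55 | 110 | .97 | .97 | 1.71 |
| | | 40-49 | 220 | .97 | .97 | 1.80 | 110 | .97 | .97 | 1.69 | 110 | .96 | .96 | 1.96 |
| | | 50-59 | 220 | .97 | .97 | 1.86 | 110 | .97 | .97 | 1.83 | 110 | .97 | .97 | 1.80 |
| | | 60-69 | 220 | .96 | .96 | 2.10 | 110 | .96 | .96 | 2.09 | 110 | .95 | .95 | 2.22 |
| | | 70+ | 220 | .95 | .95 | 2.22 | 94 | .95 | .95 | 2.25 | 126 | .95 | .95 | 2.17 |
| Hyperactivity | 7 | 18-24 | 110 | .92 | .92 | 2.82 | 55 | .89 | .90 | 3.19 | 55 | .94 | .94 | 2.46 |
| | | 25-29 | 110 | .94 | .95 | 2.31 | 55 | .94 | .95 | 2.33 | 55 | .75 | .78 | 4.64 |
| | | 30-39 | 220 | .95 | .95 | 2.24 | 110 | .94 | .94 | 2.36 | 110 | .96 | .96 | 2.02 |
| | | 40-49 | 220 | .93 | .93 | 2.67 | 110 | .93 | .93 | 2.56 | 110 | .91 | .91 | 3.02 |
| | | 50-59 | 220 | .92 | .92 | 2.78 | 110 | .92 | .92 | 2.77 | 110 | .92 | .92 | 2.74 |
| | | 60-69 | 220 | .91 | .91 | 2.95 | 110 | .91 | .91 | 2.94 | 110 | .90 | .91 | 3.04 |
| | | 70+ | 220 | .91 | .92 | 2.90 | 94 | .87 | .87 | 3.54 | 126 | .93 | .93 | 2.56 |
| Impulsivity | 7 | 18-24 | 110 | .92 | .92 | 2.85 | 55 | .91 | .91 | 3.05 | 55 | .93 | .93 | 2.60 |
| | | 25-29 | 110 | .93 | .93 | 2.69 | 55 | .89 | .89 | 3.25 | 55 | .96 | .96 | 1.99 |
| | | 30-39 | 220 | .94 | .94 | 2.51 | 110 | .94 | .94 | 2.42 | 110 | .93 | .93 | 2.57 |
| | | 40-49 | 220 | .92 | .92 | 2.76 | 110 | .92 | .92 | 2.88 | 110 | .93 | .93 | 2.69 |
| | | 50-59 | 220 | .93 | .93 | 2.61 | 110 | .94 | .94 | 2.35 | 110 | .91 | .92 | 2.87 |
| | | 60-69 | 220 | .90 | .90 | 3.11 | 110 | .90 | .90 | 3.14 | 110 | .90 | .90 | 3.18 |
| | | 70+ | 220 | .91 | .91 | 2.96 | 94 | .89 | .89 | 3.28 | 126 | .93 | .93 | 2.72 |
| Emotional Dysregulation | 6 | 18-24 | 110 | .93 | .93 | 2.66 | 55 | .91 | .91 | 2.98 | 55 | .94 | .94 | 2.43 |
| | | 25-29 | 110 | .94 | .94 | 2.50 | 55 | .94 | .94 | 2.48 | 55 | .94 | .94 | 2.45 |
| | | 30-39 | 220 | .93 | .93 | 2.64 | 110 | .95 | .95 | 2.28 | 110 | .91 | .91 | 2.97 |
| | | 40-49 | 220 | .95 | .95 | 2.24 | 110 | .95 | .95 | 2.28 | 110 | .95 | .95 | 2.16 |
| | | 50-59 | 220 | .95 | .96 | 2.11 | 110 | .96 | .96 | 1.97 | 110 | .95 | .95 | 2.23 |
| | | 60-69 | 220 | .92 | .92 | 2.79 | 110 | .92 | .92 | 2.81 | 110 | .91 | .92 | 2.91 |
| | | 70+ | 220 | .93 | .93 | 2.69 | 94 | .93 | .93 | 2.66 | 126 | .93 | .93 | 2.67 |
| Negative Self-Concept | 5 | 18-24 | 110 | .86 | .87 | 3.57 | 55 | .84 | .85 | 3.88 | 55 | .88 | .88 | 3.41 |
| | | 25-29 | 110 | .89 | .90 | 3.17 | 55 | .87 | .89 | 3.36 | 55 | .92 | .92 | 2.80 |
| | | 30-39 | 220 | .89 | .89 | 3.25 | 110 | .89 | .89 | 3.25 | 110 | .89 | .90 | 3.22 |
| | | 40-49 | 220 | .89 | .89 | 3.27 | 110 | .89 | .90 | 3.21 | 110 | .89 | .89 | 3.25 |
| | | 50-59 | 220 | .84 | .84 | 3.97 | 110 | .79 | .79 | 4.53 | 110 | .85 | .86 | 3.74 |
| | | 60-69 | 220 | .89 | .89 | 3.26 | 110 | .88 | .88 | 3.39 | 110 | .91 | .91 | 3.02 |
| | | 70+ | 220 | .86 | .87 | 3.67 | 94 | .89 | .90 | 3.12 | 126 | .82 | .83 | 4.14 |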
Note. α = alpha. ω = omega. SEM = standard error of measurement.

Test Information

Test information functions were estimated with the mirt package in R (Chalmers, 2012) using the Total Sample (i.e., all General Population and all clinical cases combined; see chapter 6, Development, for a description of this sample), so that as much information as possible was available for estimating these functions. As seen in Figure 11.3, the Self-Report and Observer forms of the CAARS 2–Short demonstrate high precision across the range of the trait being measured for the Inattention/Executive Dysfunction scale (this scale is presented as an example; figures for all other scales are provided in appendix L). The peak of the curve for the scales is approximately 2 SD above the mean, and the test information curve is broad, such that precision is sufficiently high for a wide range of scores along the continuum of each construct. The high degree of precision and small degree of error provide strong evidence for the reliability of the CAARS 2–Short scales.

Figure 11.3. Test Information Functions for Inattention/Executive Dysfunction: CAARS 2–Short

Test-Retest Reliability

The test-retest reliability of the CAARS 2–Short was assessed by computing the correlation of scores obtained on two separate administrations over a 2- to 4-week interval (14 to 30 days) within a subset of individuals from the General Population sample (N = 88 for Self-Report and N = 61 for Observer; refer to appendix J for demographic characteristics of the test-retest samples). Measures of relatively stable constructs (like ADHD) are expected to have high correlations, indicating little change in scores from one administration to the next.

The obtained correlations, as well as those corrected for variation (Bryant & Gokhale, 1972), are provided in Tables 11.14 and 11.15. Corrected correlations ranged from .78 to .95 for Self-Report and .76 to .90 for Observer (all p < .001). As further evidence of score stability over the course of the retest period, mean scores from each time point are closely aligned (as seen in the very small Cohen’s d values). The stable nature of the scores, as demonstrated by the test-retest reliability coefficients, provides assurance that changes observed in CAARS 2–Short scores over time are due to true changes, as opposed to imprecise measurement.
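The Cohen’s d values that accompany the means can be approximated from the tabled summary statistics. The Python sketch below assumes d is the Time 2 mean minus the Time 1 mean, divided by the root mean square of the two SDs; the manual does not state which variant of d was used, so this formula and the function name are assumptions. Because the tabled means and SDs are rounded, recomputed values may not match every table entry exactly.

```python
import math


def cohens_d(mean_1: float, sd_1: float, mean_2: float, sd_2: float) -> float:
    """Standardized mean difference (second minus first) using a pooled SD.

    The pooled SD is the root mean square of the two SDs, which assumes
    equal group sizes (true for a test-retest design).
    """
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2.0)
    return (mean_2 - mean_1) / pooled_sd


# Hyperactivity, Table 11.14: Time 1 M = 47.4, SD = 7.4; Time 2 M = 47.7, SD = 7.9.
d = cohens_d(47.4, 7.4, 47.7, 7.9)
print(round(d, 2))  # 0.04
```

A positive value indicates a higher mean at Time 2 than at Time 1, consistent with the sign convention given in the table notes.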

Table 11.14. Test-Retest Reliability: CAARS 2–Short Self-Report

| CAARS 2–Short Scale | Obtained r | Corrected r | Time 1 M | Time 1 Mdn | Time 1 SD | Time 2 M | Time 2 Mdn | Time 2 SD | Cohen's d |
|---|---|---|---|---|---|---|---|---|---|
| Inattention/Executive Dysfunction | .85 | .95 | 47.5 | 46 | 7.0 | 48.2 | 47 | 7.5 | 0.00 |
| Hyperactivity | .77 | .90 | 47.4 | 45 | 7.4 | 47.7 | 45 | 7.9 | 0.04 |
| Impulsivity | .86 | .92 | 48.0 | 46 | 8.0 | 48.3 | 46 | 9.2 | 0.04 |
| Emotional Dysregulation | .90 | .87 | 49.4 | 46 | 10.7 | 49.4 | 47 | 10.6 | 0.00 |
| Negative Self-Concept | .83 | .78 | 49.7 | 47 | 10.2 | 50.6 | 49 | 11.5 | 0.08 |
Note. N = 88. Time between administrations = 2 to 4 weeks (14 to 30 days). All correlations significant, p < .001. Guidelines for interpreting |r|: very weak < .20; weak = .20 to .39; moderate = .40 to .59; strong = .60 to .79; very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80. Positive d-ratio values indicate higher scores at Time 2 than Time 1.

Table 11.15. Test-Retest Reliability: CAARS 2–Short Observer

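| CAARS 2–Short Scale | Obtained r | Corrected r | Time 1 M | Time 1 Mdn | Time 1 SD | Time 2 M | Time 2 Mdn | Time 2 SD | Cohen's d |
|---|---|---|---|---|---|---|---|---|---|
| Inattention/Executive Dysfunction | .85 | .90 | 49.2 | 47 | 8.8 | 49.2 | 46 | 8.9 | 0.01 |
| Hyperactivity | .89 | .89 | 48.5 | 45 | 9.7 | 48.8 | 45 | 10.7 | 0.03 |
| Impulsivity | .71 | .79 | 48.1 | 45 | 8.6 | 48.1 | 46 | 9.0 | 0.00 |
| Emotional Dysregulation | .81 | .82 | 49.1 | 45 | 10.0 | 49.0 | 47 | 9.6 | -0.01 |
| Negative Self-Concept | .82 | .76 | 49.9 | 46 | 11.0 | 50.0 | 47 | 11.0 | 0.01 |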
Note. N = 61. Time between administrations = 2 to 4 weeks (14 to 30 days). All correlations significant, p < .001. Guidelines for interpreting |r|: very weak < .20; weak = .20 to .39; moderate = .40 to .59; strong = .60 to .79; very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80. Positive d-ratio values indicate higher scores at Time 2 than Time 1.

Inter-Rater Reliability

Inter-rater reliability refers to the degree of agreement between two raters who are rating the same individual. Two inter-rater studies were conducted with the CAARS 2: (a) two Observers of the same rater type rated the same individual (e.g., two friends), and (b) two different types of raters rated the same individual (e.g., self-report/observer, or parent/friend).

Study 1: Two Observers, Same Rater Type. In the first inter-rater study, dyads of Observers (N = 29) completed the CAARS 2–Short about the same individual (refer to appendix J for the demographic characteristics of the inter-rater samples).

The obtained inter-rater correlation coefficients, as well as those corrected for range restriction, are provided in Table 11.16. Results of the inter-rater study were indicative of moderate to strong levels of consistency within rater dyads across all CAARS 2–Short scales (corrected r ranged from .40 to .69, all p < .001). Table 11.16 also shows the means, medians, and standard deviations for each rater, highlighting the alignment of average scores between the two raters. This pattern of results indicates that different raters can provide different perspectives and is a reminder of the value of obtaining ratings from multiple raters and evaluating multiple sources of information when conducting a comprehensive assessment.

Table 11.16. Inter-Rater Reliability Study 1 (Two Observers, Same Rater Type): CAARS 2–Short

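| CAARS 2–Short Scale | Obtained r | Corrected r | Rater 1 M | Rater 1 Mdn | Rater 1 SD | Rater 2 M | Rater 2 Mdn | Rater 2 SD | Cohen's d |
|---|---|---|---|---|---|---|---|---|---|
| Inattention/Executive Dysfunction | .69 | .69 | 59.2 | 59 | 10.5 | 59.2 | 58 | 9.4 | 0.01 |
| Hyperactivity | .53 | .41 | 55.7 | 54 | 11.7 | 57.5 | 56 | 11.8 | 0.16 |
| Impulsivity | .52 | .40 | 57.0 | 57 | 11.8 | 55.6 | 49 | 11.7 | -0.12 |
| Emotional Dysregulation | .75 | .57 | 57.4 | 56 | 12.3 | 56.4 | 53 | 13.3 | -0.08 |
| Negative Self-Concept | .71 | .50 | 60.5 | 62 | 13.5 | 61.7 | 62 | 12.7 | 0.09 |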
Note. N = 29. Reported p-values are for corrected r. Guidelines for interpreting |r|: very weak < .20; weak = .20 to .39; moderate = .40 to .59; strong = .60 to .79; very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80.

Study 2: Different Types of Raters. In the second inter-rater study, comparisons were made across different types of raters, including comparison of Self-Report and Observer report, as well as comparison of different types of Observers (e.g., spouse and friend). Because the CAARS 2–Short Self-Report and Observer forms measure similar constructs, similarity in scores across the different types of Observers, as well as between Observer ratings and self-reported ratings, would provide additional evidence of the reliability of the test scores. Some degree of similarity in ratings is expected between informants, given that they are rating the same individual on the same constructs. Some incongruence is also expected, because different informants see the individual in different contexts and may have different perceptions of or experiences with the individual’s behavior.

For the CAARS 2–Short, correlation coefficients (with corrections for range restriction) were calculated between scores for the following pairs of raters: (a) Self-Report and Observer (N = 211), and (b) two observers with differing relations to the individual (N = 47; all ratings completed within a 30-day period). Dyads with two observers were required to represent different relationship types for Study 2 (e.g., a spouse/romantic partner and a friend, but not two friends). Refer to appendix J for the demographic characteristics of the raters and the individual being rated.

Results were similar to those found for the full-length CAARS 2 (see Inter-Rater Reliability in chapter 8, Reliability) and are provided in Table 11.17 (Self-Report/Observer) and Table 11.18 (Observer/Observer). The corrected correlations were weak to moderate in the Self-Report/Observer dyads (ranging from r = .35 to .44, p < .001), as well as in the Observer/Observer dyads (ranging from r = .25 to .48, p < .001).
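The descriptive labels applied to correlations and effect sizes in this section follow the interpretation guidelines printed in the table notes. A minimal Python sketch of those guidelines is given below; the function names are hypothetical, and the cut points are taken directly from the table notes (applied to absolute values).

```python
def interpret_r(r: float) -> str:
    """Label |r| using the guidelines given in the table notes."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


def interpret_d(d: float) -> str:
    """Label Cohen's |d| using the guidelines given in the table notes."""
    size = abs(d)
    if size < 0.20:
        return "negligible"
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "medium"
    return "large"


# Lowest and highest corrected correlations reported for the Self-Report/Observer dyads.
print(interpret_r(0.35), interpret_r(0.44))  # weak moderate
```

Applying the same functions to the d column of a table gives the effect size label for each mean difference.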

Table 11.17. Inter-Rater Reliability Study 2 (Different Rater Types: Self-Report and Observer): CAARS 2–Short

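| CAARS 2–Short Scale | Obtained r | Corrected r | Self-Report M | Self-Report Mdn | Self-Report SD | Observer M | Observer Mdn | Observer SD | Cohen's d |
|---|---|---|---|---|---|---|---|---|---|
| Inattention/Executive Dysfunction | .58 | .42 | 64.4 | 64 | 12.5 | 62.1 | 61 | 12.4 | -0.18 |
| Hyperactivity | .52 | .35 | 61.1 | 61 | 12.7 | 58.0 | 57 | 12.9 | -0.25 |
| Impulsivity | .51 | .37 | 61.1 | 61 | 12.5 | 56.3 | 55 | 11.9 | -0.39 |
| Emotional Dysregulation | .50 | .40 | 59.3 | 60 | 11.8 | 57.5 | 56 | 11.3 | -0.16 |
| Negative Self-Concept | .54 | .44 | 60.3 | 62 | 10.2 | 63.3 | 63 | 12.8 | 0.26 |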
Note. N = 211. All correlations significant at p < .001. Guidelines for interpreting |r|: very weak < .20; weak = .20 to .39; moderate = .40 to .59; strong = .60 to .79; very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80. Positive d-ratio values indicate higher scores for the Observer than Self-Report.

Table 11.18. Inter-Rater Reliability Study 2 (Different Rater Types: Two Types of Observers): CAARS 2–Short

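| CAARS 2–Short Scale | Obtained r | Corrected r | Observer 1 M | Observer 1 Mdn | Observer 1 SD | Observer 2 M | Observer 2 Mdn | Observer 2 SD | Cohen's d |
|---|---|---|---|---|---|---|---|---|---|
| Inattention/Executive Dysfunction | .64 | .48 | 58.4 | 56 | 12.7 | 60.6 | 61 | 11.9 | 0.18 |
| Hyperactivity | .43 | .25 | 56.8 | 55 | 13.1 | 60.2 | 58 | 14.0 | 0.26 |
| Impulsivity | .50 | .40 | 55.3 | 54 | 10.7 | 57.6 | 55 | 12.3 | 0.21 |
| Emotional Dysregulation | .60 | .44 | 56.9 | 56 | 12.4 | 60.8 | 62 | 12.5 | 0.32 |
| Negative Self-Concept | .53 | .39 | 59.0 | 57 | 11.5 | 60.8 | 60 | 13.0 | 0.14 |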
Note. N = 47. All obtained correlations significant at p < .01; p-values in table above refer to the corrected correlation coefficients. Guidelines for interpreting |r|: very weak < .20; weak = .20 to .39; moderate = .40 to .59; strong = .60 to .79; very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80. Positive d-ratio values indicate higher scores for Observer 2 than Observer 1.

Overall, these findings suggest weak to moderate agreement between Self-Report and Observer ratings, with Self-Report yielding higher mean scores on four of the five scales (all but Negative Self-Concept; see Table 11.17). A number of factors may reduce the level of agreement between different raters, including setting differences, level of insight, and nature of the relationship. Individuals also have unique insight into their own experiences and behaviors that may not be readily evident to observers, especially for behaviors that may be more inwardly felt than outwardly expressed. These results highlight the importance of examining information from multiple sources when conducting a comprehensive assessment.
