Chapter 8: Reliability

CAARS 2 Manual

Chapter 8: Inter-Rater Reliability

Inter-rater reliability refers to the degree of agreement between two raters who are rating the same individual. Estimates of inter-rater reliability help describe levels of consistency between raters, typically indexed by a correlation coefficient like Pearson’s r (LeBreton & Senter, 2008). Two inter-rater studies were conducted with the CAARS 2: (a) two Observers of the same rater type (e.g., two friends) rated the same individual; and (b) two different types of raters (e.g., self-report/observer, or parent/friend) rated the same individual.

Study 1: Two Observers, Same Rater Type. In the first inter-rater study, dyads of Observers (N = 29) completed the CAARS 2 about the same individual. The dyads consisted of either two relatives (79.3%; most frequently a parent and a sibling, two different parents, or two different children) or two friends (20.7%). Raters were paired based on similarity in the setting in which they could observe the rated individual (e.g., siblings, parents, and children may have similar exposure to home life, whereas friends may see the individual primarily in social settings). Ratings were obtained from the dyads on two separate administrations (all ratings completed within a 30-day period). Demographic characteristics of these samples are provided in Appendix J.

Scores on the CAARS 2 scales for the paired raters were compared via Pearson’s correlations, which range from -1 to 1, with higher values indicating greater consistency or agreement between raters. The obtained inter-rater reliability coefficients and those corrected for range variation (Bryant & Gokhale, 1972), as well as the means, medians, and standard deviations for each rater in the dyad, are provided in Table 8.7. Results indicated moderate to strong consistency within rater dyads across all CAARS 2 scales (i.e., corrected r ranged from .40 to .70). Because agreement was substantial but not perfect, this pattern of results also indicates that different raters can provide different perspectives; it is a reminder of the value of obtaining ratings from multiple raters and evaluating multiple sources of information when conducting a comprehensive assessment.
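
As a rough illustration of the two coefficients described above, the following Python sketch computes Pearson's r for paired ratings and then applies a range-variation correction. The correction shown is an assumption: the manual cites Bryant & Gokhale (1972) without stating a formula, so the sketch substitutes the familiar Thorndike Case 2 formula, applied once per rater, with an assumed reference standard deviation of 10. The scores and the names `pearson_r` and `range_corrected_r` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def range_corrected_r(r, sd_1, sd_2, ref_sd=10.0):
    """ASSUMPTION: Thorndike Case 2 correction, applied once per rater.

    Neither the formula nor ref_sd is taken from the manual.
    """
    for sd in (sd_1, sd_2):
        u = ref_sd / sd  # ratio of reference SD to sample SD
        r = r * u / sqrt(1 - r ** 2 + (r * u) ** 2)
    return r


# Hypothetical scores for six dyads (not data from the study).
rater_1 = [48, 55, 61, 67, 74, 80]
rater_2 = [50, 52, 65, 63, 78, 75]

obtained = pearson_r(rater_1, rater_2)
corrected = range_corrected_r(obtained, stdev(rater_1), stdev(rater_2))
print(f"obtained r = {obtained:.2f}, corrected r = {corrected:.2f}")
```

Under this assumed formula, a sample SD larger than the reference SD lowers the coefficient, and a sample SD equal to the reference SD leaves it unchanged.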

Table 8.7. Inter-Rater Reliability Study 1 (Two Observers, Same Rater Type): CAARS 2

Scale		Obtained r	Corrected r	p	Rater 1 M	Rater 1 Mdn	Rater 1 SD	Rater 2 M	Rater 2 Mdn	Rater 2 SD	Cohen's d
Content Scales	Inattention/Executive Dysfunction	.70	.70	< .001	59.9	60	10.2	61.0	61	9.8	0.11
	Hyperactivity	.59	.45	.019	55.1	53	12.0	58.2	57	12.2	0.26
	Impulsivity	.55	.40	.039	56.1	55	12.5	54.8	51	12.1	-0.11
	Emotional Dysregulation	.84	.64	< .001	57.7	57	13.3	56.9	53	13.8	-0.06
	Negative Self-Concept	.76	.59	.001	60.3	61	13.3	61.3	61	12.2	0.08
DSM Symptom Scales	ADHD Inattentive Symptoms	.62	.59	.001	59.5	59	10.6	60.3	60	10.1	0.07
	ADHD Hyperactive/Impulsive Symptoms	.53	.40	.039	55.1	52	11.7	57.0	55	12.2	0.16
	Total ADHD Symptoms	.61	.49	.010	57.9	54	10.5	59.3	56	10.7	0.13

Note. N = 29. Reported p-values are for corrected r. Guidelines for interpreting |r|: very weak < .20, weak = .20 to .39, moderate = .40 to .59, strong = .60 to .79, very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80.
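
The interpretation guidelines in the note above translate directly into a small lookup, sketched here in Python. The function names are hypothetical, and the Cohen's d helper assumes a pooled SD of the form sqrt((SD1² + SD2²) / 2); the manual does not state which formula was used.

```python
from math import sqrt


def label_r(r):
    """Verbal label for |r|, following the guidelines in the table note."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"


def label_d(d):
    """Verbal label for |d|, following the guidelines in the table note."""
    size = abs(d)
    if size < 0.20:
        return "negligible"
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "medium"
    return "large"


def cohens_d(m_1, sd_1, m_2, sd_2):
    """ASSUMPTION: pooled SD = sqrt((sd_1^2 + sd_2^2) / 2)."""
    return (m_2 - m_1) / sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)


# Hyperactivity row of Table 8.7:
# Rater 1 M = 55.1, SD = 12.0; Rater 2 M = 58.2, SD = 12.2; corrected r = .45.
d = cohens_d(55.1, 12.0, 58.2, 12.2)
print(label_r(0.45), round(d, 2), label_d(d))
```

With Rater 2 as the second argument, a positive d means Rater 2 gave the higher mean score.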

Study 2: Different Types of Raters. In the second inter-rater study, comparisons were made across different types of raters, including comparisons of Self-Report and Observer ratings, as well as comparisons of different types of Observers (e.g., spouse and friend). Because the CAARS 2 Self-Report and Observer both measure the same constructs, similarity in scores across the different types of Observer raters, as well as between Observer ratings and self-reported ratings, would provide additional evidence of the reliability of the test scores. Some degree of similarity in ratings is expected between informants, given that they are rating the same individual on the same constructs. Nonetheless, a certain degree of incongruence is also expected, because different informants see the individual in different contexts and may have different perceptions of or experiences with the individual’s behavior.

For the CAARS 2, correlation coefficients (Pearson’s r; LeBreton & Senter, 2008) were calculated between scores for the following pairs of raters: (a) Observer and Self-Report (N = 211), and (b) two Observers with differing relations to the individual (N = 47; all ratings completed within a 30-day period). Dyads with two Observers were required to represent different relationship types for Study 2 (e.g., a spouse/romantic partner and a friend, but not two friends). Refer to Appendix J for the demographic characteristics of the raters and the individuals being rated.

The obtained correlation coefficients between different rater types, as well as those corrected for range variation (Bryant & Gokhale, 1972), are provided in Table 8.8 (Self-Report/Observer) and Table 8.9 (Observer/Observer). The corrected correlations were weak to moderate in the Self-Report/Observer dyads (median r = .44, ranging from r = .37 to .46, all p < .001), as well as in the Observer/Observer dyads (median r = .45, ranging from r = .26 to .58, with p-values ranging from < .001 to .080; see Table 8.9).
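
The medians and ranges quoted above can be recomputed from the corrected r columns of Tables 8.8 and 8.9. The short Python check below is only a worked illustration; the variable names are hypothetical, and the values are copied from the two tables.

```python
from statistics import median

# Corrected r values, in table row order.
corrected_self_observer = [.46, .37, .39, .44, .45, .44, .37, .45]      # Table 8.8
corrected_observer_observer = [.58, .26, .38, .46, .49, .50, .29, .44]  # Table 8.9

for name, values in (
    ("Self-Report/Observer", corrected_self_observer),
    ("Observer/Observer", corrected_observer_observer),
):
    print(
        f"{name}: median r = {median(values):.2f}, "
        f"range = {min(values):.2f} to {max(values):.2f}"
    )
```

With eight scales per table, each median is the average of the fourth and fifth values once the list is sorted.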

Table 8.8. Inter-Rater Reliability Study 2 (Different Rater Types: Self-Report and Observer): CAARS 2

Scale		Obtained r	Corrected r	Self-Report M	Self-Report Mdn	Self-Report SD	Observer M	Observer Mdn	Observer SD	Cohen's d
Content Scales	Inattention/Executive Dysfunction	.59	.46	65.5	67	11.8	62.8	62	11.7	-0.23
	Hyperactivity	.54	.37	61.5	61	12.8	57.8	56	12.5	-0.29
	Impulsivity	.53	.39	61.8	62	12.4	56.0	55	11.8	-0.47
	Emotional Dysregulation	.54	.44	59.4	61	11.6	57.5	57	11.3	-0.17
	Negative Self-Concept	.56	.45	60.3	61	10.5	63.2	63	12.7	0.26
DSM Symptom Scales	ADHD Inattentive Symptoms	.57	.44	64.8	65	11.7	62.7	63	12.0	-0.18
	ADHD Hyperactive/Impulsive Symptoms	.55	.37	62.3	63	13.3	57.4	55	12.1	-0.38
	Total ADHD Symptoms	.57	.45	64.4	65	12.1	60.7	61	11.3	-0.31

Note. N = 211. All correlations significant at p < .001. Guidelines for interpreting |r|: very weak < .20, weak = .20 to .39, moderate = .40 to .59, strong = .60 to .79, very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80. Positive Cohen's d values indicate higher scores for the Observer than for the Self-Report.

Table 8.9. Inter-Rater Reliability Study 2 (Different Rater Types: Two Types of Observers): CAARS 2

Scale		Obtained r	Corrected r	p	Observer 1 M	Observer 1 Mdn	Observer 1 SD	Observer 2 M	Observer 2 Mdn	Observer 2 SD	Cohen's d
Content Scales	Inattention/Executive Dysfunction	.69	.58	< .001	58.9	59	11.9	61.2	60	11.3	0.20
	Hyperactivity	.43	.26	.080	56.5	54	12.7	59.3	56	13.6	0.21
	Impulsivity	.51	.38	.011	55.4	53	11.5	57.1	54	12.6	0.14
	Emotional Dysregulation	.64	.46	.002	57.2	55	12.3	60.9	59	13.1	0.29
	Negative Self-Concept	.63	.49	< .001	58.5	57	11.7	60.1	59	12.3	0.13
DSM Symptom Scales	ADHD Inattentive Symptoms	.62	.50	< .001	58.6	56	12.0	61.2	61	11.3	0.22
	ADHD Hyperactive/Impulsive Symptoms	.44	.29	.054	56.3	52	12.7	58.4	55	12.7	0.17
	Total ADHD Symptoms	.53	.44	.003	57.9	57	11.6	60.5	61	11.1	0.23

Note. N = 47. All obtained correlations significant at p < .01; p-values in the table above refer to the corrected correlation coefficients. Guidelines for interpreting |r|: very weak < .20, weak = .20 to .39, moderate = .40 to .59, strong = .60 to .79, very strong ≥ .80. Guidelines for interpreting Cohen's |d|: negligible effect size < 0.20; small effect size = 0.20 to 0.49; medium effect size = 0.50 to 0.79; large effect size ≥ 0.80. Positive Cohen's d values indicate higher scores for Observer 2 than for Observer 1.

Overall, these findings suggest modest differences between Self-Report and Observer ratings, with self-reported ratings yielding higher scores on every scale except Negative Self-Concept (see Table 8.8). Observers of different types showed a similar pattern: correlations were modest, indicating some agreement between raters while leaving room for each rater to provide unique insight. The low-to-moderate agreement observed between Observer and Self-Report results may be due to a variety of factors, such as differences in setting (e.g., consistency of behaviors can vary), in level of insight (e.g., ability to observe internal processes, and/or one’s own self-awareness), and in the nature of the relationship (e.g., willingness to disclose may vary for one’s parent as opposed to one’s spouse).

Discrepancies between raters’ scores, as seen in these studies, emphasize the importance of consulting multiple sources to capture unique information that can reveal relevant differences. Self-report captures an individual’s unique insight into their own experiences and behaviors that may not be readily evident to observers, especially for behaviors that may be more inwardly felt than outwardly expressed (e.g., subjective feelings of restlessness). Two observers with different roles in the individual’s life may draw upon their experiences with the individual in dissimilar contexts or from different vantage points, which may reduce the similarity of their ratings. These results highlight the importance of examining information from multiple sources when conducting a comprehensive assessment.
