Conners 4 Manual

Chapter 8: Inter-Rater Reliability


Inter-Rater Reliability


Inter-rater reliability refers to the degree of agreement between two raters who are rating the same youth. Estimates of inter-rater reliability help to describe levels of consistency between raters, typically indexed by a correlation coefficient (in this case, Pearson’s r; LeBreton & Senter, 2008). Two inter-rater studies were conducted with the Conners 4: (1) two raters of the same rater type rated the same youth (i.e., two parents or two teachers), and (2) two raters of different rater types rated the same youth (i.e., parent/teacher, parent/self-report, or teacher/self-report).

Inter-Rater Reliability Study 1. In the first inter-rater study, dyads of parent raters (N = 68) completed the Conners 4 Parent, and teacher dyads (N = 34) completed the Conners 4 Teacher. Parent dyads consisted of either a mother and father (88.2%) or a mother and a non-biological father (11.8%) rating the same child. For teacher dyads, the two teachers in each pair typically provided instruction in different class content (e.g., math and literacy, or science and writing). Ratings were obtained from the parent and teacher dyads in two separate administrations over a 2- to 4-week interval (14 to 30 days). See appendix F for demographic characteristics of the youth being rated (Table F.3) and of the parent and teacher raters (Table F.4).

Scores on the Conners 4 scales for the paired raters were correlated. The reliability coefficients are Pearson's correlations, ranging from -1 to 1, with higher values indicating greater consistency or agreement between raters. Although there are several approaches to interpretation, the correlation coefficients (r) are categorized herein as follows: absolute values lower than .20 are classified as very weak; values of .20 to .39 are weak; values of .40 to .59 are moderate; values of .60 to .79 are strong; and absolute values of .80 or greater are very strong (Evans, 1996).
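The Evans (1996) bands described above can be expressed as a small lookup function. This is an illustrative sketch only; the function name is ours, not part of any Conners 4 scoring software.

```python
# Hypothetical helper: classify the absolute size of a Pearson correlation
# using the Evans (1996) bands described in the text.

def classify_r(r: float) -> str:
    """Return the descriptive label for a correlation coefficient."""
    a = abs(r)
    if a < 0.20:
        return "very weak"
    elif a < 0.40:
        return "weak"
    elif a < 0.60:
        return "moderate"
    elif a < 0.80:
        return "strong"
    else:
        return "very strong"

print(classify_r(0.58))   # moderate
print(classify_r(-0.85))  # very strong
```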

The obtained inter-rater reliability coefficients and those corrected for range restriction (Bryant & Gokhale, 1972), as well as the means, medians, and SDs for each dyad, are provided in Table 8.15 for the Parent inter-rater study and Table 8.16 for the Teacher inter-rater study. Results of the Parent inter-rater study indicated high levels of consistency between raters across all scales (corrected r ranged from .58 to .93, all p < .001), with slightly lower corrected correlations for the Anxious Thoughts and Family Life scales, evidence that parents of the same child can still have differing perspectives. As shown in Table 8.16, results for the teacher dyads showed moderate to strong associations (corrected r ranged from .38 to .69); this pattern is a reminder of the value of obtaining ratings from multiple teachers and evaluating multiple sources of information when conducting a comprehensive assessment.
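To illustrate what "corrected for range restriction" involves, the sketch below applies the standard Thorndike Case 2 correction, which adjusts an observed correlation upward when the sample's variability is narrower than the population's. This is an assumption for illustration: the manual cites Bryant and Gokhale (1972), whose exact procedure may differ, and the population SD of 10 assumes a T-score metric.

```python
import math

# Thorndike Case 2 correction for range restriction (illustrative; the
# manual's Bryant & Gokhale, 1972, procedure may differ in detail).
# sd_pop = 10.0 assumes the T-score metric (population SD of 10).

def correct_for_range_restriction(r_obs: float, sd_obs: float,
                                  sd_pop: float = 10.0) -> float:
    k = sd_pop / sd_obs  # ratio of population SD to observed sample SD
    return (r_obs * k) / math.sqrt(1 - r_obs**2 + (r_obs * k)**2)

# Example: an observed r of .55 in a sample whose SD (8) is narrower than 10
print(round(correct_for_range_restriction(0.55, 8.0), 2))  # 0.64
```

When the sample SD equals the population SD (k = 1), the formula returns the observed correlation unchanged.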

The consistency between raters was further evaluated by calculating the difference between scores for parent or teacher dyads in the inter-rater samples. Tables 8.17 and 8.18 present the percentage of the sample with score differences of 10 points or more (i.e., 1 SD or greater), as well as the absolute mean differences (i.e., the difference between ratings from pairs of parents or teachers, averaged across the sample). The results suggest high consistency between parent ratings (scores on the Conners 4 scales were less than 1 SD apart for 79.4% to 100.0% of parent dyads) and moderate to high consistency between teacher ratings (scores were less than 1 SD apart for 61.8% to 82.4% of teacher dyads). The absolute mean and SD differences were also close to zero (i.e., on average, there was almost no difference in scores between the raters, and the dispersion of scores was similar), providing further evidence of consistency between raters of the same type.
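The dyad-difference summaries described above can be computed as follows. The T-scores here are invented for illustration and are not Conners 4 data; the 10-point threshold corresponds to 1 SD on the T-score metric.

```python
from statistics import mean

# Invented T-scores for six hypothetical rater dyads (illustrative only).
rater_a = [52, 61, 47, 70, 55, 58]
rater_b = [49, 64, 50, 81, 54, 60]

abs_diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]

# Percentage of dyads whose scores are less than 1 SD (10 T-score points) apart
pct_within_1sd = 100 * sum(d < 10 for d in abs_diffs) / len(abs_diffs)

# Absolute mean difference, averaged across the sample
mean_abs_diff = mean(abs_diffs)

print(round(pct_within_1sd, 1))  # 83.3 (5 of 6 dyads within 1 SD)
print(round(mean_abs_diff, 2))   # 3.83
```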






Inter-Rater Reliability Study 2. In the second inter-rater study, comparisons were made across different types of raters. As the Conners 4 Parent, Teacher, and Self-Report all measure similar constructs, similarity in scores across the different types of raters would provide evidence of the reliability of the test scores. Although some degree of similarity in ratings is expected between informants, given that they are rating the same youth on the same constructs, a certain degree of incongruence between their ratings is also expected, because different informants see the youth in different contexts and may have dissimilar perceptions of, or experiences with, the youth's behavior.

For the Conners 4, correlation coefficients (Pearson's r; LeBreton & Senter, 2008) were calculated between scores for the following pairs of raters: (1) parent and teacher, (2) parent and self-report, and (3) teacher and self-report. To examine these relationships, parents, teachers, and youth provided ratings of the same youth (N = 62; all ratings completed within a 30-day period). Refer to appendix F, Table F.4, for the demographic characteristics of the youth being rated (in this study, the same youth was rated by themselves, a parent, and a teacher) and of the parent and teacher raters.
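For reference, Pearson's r for a pair of rater types is the covariance of their scores divided by the product of their standard deviations. The sketch below computes it directly on invented parent and self-report T-scores for the same youths (illustrative values only, not Conners 4 data).

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation: sample covariance over the product of sample SDs."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Invented T-scores for six youths rated by a parent and by themselves
parent = [48, 55, 62, 70, 51, 59]
youth  = [60, 50, 55, 68, 62, 49]

print(round(pearson_r(parent, youth), 2))  # 0.25, a weak correlation
```

A value in this range would fall in the "weak" band described earlier, consistent with the cross-informant pattern reported below.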

The obtained correlation coefficients between different rater types, as well as those corrected for range restriction (Bryant & Gokhale, 1972), are provided in Table 8.19 (Parent/Teacher), Table 8.20 (Parent/Self-Report), and Table 8.21 (Teacher/Self-Report).

For the Parent/Teacher dyads, the majority of the corrected correlations were moderate in size (median r = .46; range r = .28 to .79), with two scales (Depressed Mood, r = .36; Anxious Thoughts, r = .28) weakly correlated and one (DSM Conduct Disorder Symptoms) strongly correlated (r = .79). The strong correspondence for the DSM Conduct Disorder Symptoms scale likely reflects the infrequency of these severe behaviors: most raters of both types rarely report observing Conduct Disorder symptoms, and this shared infrequent endorsement accounts for the strong correlation between ratings on this scale.

The majority of corrected correlations were weak in the Parent/Self-Report dyads (median r = .30; range r = .19 to .72; DSM Conduct Disorder Symptoms r = .72), as well as in the Teacher/Self-Report dyads (median r = .23; range r = .16 to .61; DSM Conduct Disorder Symptoms r = .61). Notably, for both the Parent/Self-Report and Teacher/Self-Report dyads, Impulsivity scores were very weakly correlated (r = .16 and r = .11, respectively), as were scores for Peer Interactions (r = .12 and r = .16, respectively). This low agreement between raters may be due to a variety of reasons, one of which is setting differences: youth can report on their own behaviors across multiple settings, whereas parents and teachers can only report on behaviors observed in specific environments, such as home and school. Another reason for the low agreement on Impulsivity may be the youth's unique insight into their own impulsive behaviors, especially in adolescence, when these behaviors may be more inwardly felt than outwardly expressed; parents and teachers might be less aware of the effort needed to restrain such behaviors. Similarly, for Peer Interactions, youth may have insights into their friendships and experiences with peers that their parents and teachers do not have the opportunity to observe.





In addition, the congruence of scores was analyzed by examining percentages of the samples with similar scores. These results are summarized in Tables 8.22 to 8.24.

As seen in Table 8.22, scores in the Parent/Teacher sample were frequently congruent; that is, ratings of the same youth often differed by less than 1 SD (for 58.1% to 75.8% of the sample, across all scales). Notably, parents provided slightly higher ratings than teachers (as evidenced by the small negative mean differences across all scales). The results suggest that scores on the Conners 4 scales have moderate consistency across parent and teacher ratings, providing supporting evidence for inter-rater agreement, but also underscoring the importance of a comprehensive evaluation that engages multiple raters who draw upon different perspectives and experiences with the youth.

A similar pattern was seen in the comparisons between parent and self-reported ratings, as presented in Table 8.23. Parents and youth tended to provide largely similar ratings, yielding scores within 1 SD of each other for 56.5% to 87.0% of this sample. Additionally, the mean differences between their scores were modest, ranging from -0.2 to -4.6 points, with the higher ratings coming from the parent raters.

Similarly, for the Teacher/Self-Report comparisons, scores within 1 SD of each other were observed for 51.6% to 87.1% of this sample (see Table 8.24), with slightly higher scores coming from the Self-Report ratings.

Overall, these findings suggest moderate differences between the ratings of parents, teachers, and youth, with parents typically yielding the highest scores of the three rater types. Parents may have insight that teachers lack, such as into the youth's behavior at home, while teachers and youth rating themselves have more opportunity for peer comparison, providing a different perspective. Looking across rater types, these findings demonstrate moderate inter-rater reliability for the Conners 4 scales, capturing consistency in measurement across raters as well as the varying perspectives of raters with differing relationships to the youth being rated. Although a high degree of similarity is expected between raters, perfect agreement is improbable; the discrepancies observed in this study emphasize the importance of consulting multiple sources, each of whom draws on experiences with the youth in different contexts, when conducting a comprehensive assessment.





In addition to the differences already described for the Conners 4 scales, the Conners 4 Critical & Indicator Items were evaluated for differences among Parent, Teacher, and Self-Report scores using the same inter-rater sample (note that the Severe Conduct Critical Items are not presented, as these items are subsumed in the DSM Conduct Disorder Symptoms scale). The frequency of endorsement of the Critical & Indicator Items was examined for each rater group, with the expectation that varying perspectives and insight into sensitive topics may yield quite distinct patterns of responses. A Self-Harm Critical Item was considered endorsed when the response was greater than or equal to 1, "Just a little true (Occasionally)"; that is, any degree of endorsement flagged the item for inclusion in the frequency count. For the Sleep Problems Indicator, endorsement was defined as a response greater than or equal to 2, with the exception of the appearing/feeling tired item on the Self-Report, which requires a rating of 3, "Completely true (Very often/Always)," in accordance with how these behaviors are flagged within the Conners 4 reports (see chapter 4, Interpretation).
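The endorsement rules just described can be sketched as a small decision function. The function and item names are illustrative assumptions, not part of the Conners 4 scoring software; responses are assumed to be on the 0 to 3 scale implied by the anchors quoted above.

```python
# Illustrative sketch of the endorsement thresholds described in the text.
# Item keys and the function name are hypothetical, not from the manual.

def is_endorsed(item: str, response: int, form: str) -> bool:
    if item == "self_harm_critical":
        return response >= 1  # any degree of endorsement flags the item
    if item == "trouble_sleeping":
        return response >= 2
    if item == "tired":
        # Self-Report requires the maximum rating of 3;
        # Parent and Teacher use >= 2
        return response == 3 if form == "Self-Report" else response >= 2
    raise ValueError(f"unknown item: {item}")

print(is_endorsed("self_harm_critical", 1, "Parent"))  # True
print(is_endorsed("tired", 2, "Self-Report"))          # False
print(is_endorsed("tired", 2, "Teacher"))              # True
```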

As seen in Table 8.25, the Self-Harm Critical Items were endorsed more often on the Self-Report than by Parent and Teacher raters. Most notably, the Critical Item about deliberate self-harm was endorsed more than twice as often in Self-Report ratings (19.4%) as in Parent ratings (8.1%). These items reflect deeply sensitive and personal behaviors or attitudes, to the extent that a youth may not wish to disclose this information to a parent or teacher. The discrepancies between the reports of different rater types, specifically when inquiring about self-harm thoughts and behaviors, highlight the importance of collecting information from multiple informants.



Table 8.25. Endorsement of Critical Items & Indicator Items by Rater Type

Item Stem                         Parent (%)   Teacher (%)   Self-Report (%)

Self-Harm Critical Items
  Harming self deliberately          8.1           1.6           19.4
  Talking about suicide              8.1           4.8             –
  Planning or attempting suicide     4.8           3.2           14.5
  Thinking about harming self       29.0            –              –

Sleep Problems Indicator
  Having trouble sleeping           35.5          40.3             –
  Appearing/Feeling tired            6.5           8.1           21.0

Note. Self-Harm Critical Items are considered endorsed with an item response of ≥ 1, while Sleep Problems Indicator items are considered endorsed with an item response of ≥ 2 for trouble sleeping, an item response of ≥ 2 for tiredness for Parent and Teacher, and an item response of 3 for tiredness for Self-Report.

