Statistics used to index interrater similarity are prevalent in many areas of the social sciences, and multilevel research is one of the most common domains in which they are applied. Multilevel research spans multiple hierarchical levels, such as individuals, teams, departments, and the organization. Multilevel researchers use indices of interrater agreement and interrater reliability to answer three main research questions: (a) Does the nesting of lower-level units (e.g., employees) within higher-level units (e.g., work teams) produce non-independence of residuals, violating an assumption of the general linear model?; (b) Is there sufficient agreement among scores on measures collected from lower-level units (e.g., employees' perceptions of customer service climate) to justify aggregating the data to the higher level (e.g., team-level climate)?; and (c) Following data aggregation, how effectively do the higher-level unit means distinguish between those higher-level units (e.g., how reliably do team climate scores distinguish between teams)?
Interrater agreement and interrater reliability both refer to the extent to which lower-level data nested or clustered within a higher-level unit are similar to one another. Although closely related, interrater agreement and reliability differ in how similarity is defined. Interrater reliability refers to the relative consistency of the lower-level data: to what degree do the scores assigned by raters tend to correlate with one another? Interrater agreement, in contrast, refers to the consensus among the lower-level data points: estimates of interrater agreement are used to determine the extent to which ratings made by judges/observers can be considered interchangeable, or equivalent in terms of their values.
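The distinction is easiest to see with a small numerical illustration. The sketch below uses hypothetical ratings from two judges whose scores are perfectly consistent (they rise and fall together) yet never match in absolute value, so reliability is high while agreement is low; the judges, targets, and the constant 2-point offset are invented purely for illustration.

```python
# A minimal sketch with hypothetical ratings contrasting consistency with consensus.
import numpy as np

# Two judges rate the same five targets on a 1-7 scale.
judge_a = np.array([1, 2, 3, 4, 5])
judge_b = np.array([3, 4, 5, 6, 7])  # same rank order, but 2 points higher throughout

# Interrater reliability: relative consistency (do the ratings covary?).
consistency = np.corrcoef(judge_a, judge_b)[0, 1]   # 1.0 -> perfectly consistent

# Interrater agreement: absolute consensus (are the values interchangeable?).
mean_abs_diff = np.abs(judge_a - judge_b).mean()     # 2.0 -> the judges never give the same score

print(f"correlation (reliability): {consistency:.2f}")
print(f"mean absolute difference (agreement): {mean_abs_diff:.2f}")
```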
Thus, although interrater agreement and reliability both estimate the similarity of ratings made by judges/observers, they define interrater similarity in slightly different ways, and the two families of statistics are suited to different types of research questions. The first research question, the issue of non-independence, is typically addressed with an intraclass correlation statistic that is a function of both interrater reliability and agreement; in the context of non-independence, however, the intraclass correlation is most often interpreted as an effect size. The second multilevel research question, whether there is adequate agreement to aggregate lower-level data to a higher level, requires a measure of interrater agreement, because the researcher is looking for consensus among raters. Finally, the third multilevel research question, concerning the reliability of the higher-level means, requires a different variant of the intraclass correlation, which is likewise a function of both interrater reliability and agreement. Multilevel research therefore requires researchers to apply interrater agreement and/or reliability statistics appropriately to their data and to follow best practices for calculating and interpreting these statistics.
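As one way to see how the three questions map onto different indices, the sketch below simulates hypothetical team data and computes the statistics most commonly used for each purpose in the multilevel literature: ICC(1) as an effect size for non-independence, a single-item rwg index for within-team agreement, and ICC(2) for the reliability of the team means. The simulated data, team sizes, and variance components are invented for illustration, and published guidelines should be consulted for cutoffs and for the multi-item and unbalanced-group extensions of these formulas.

```python
# A minimal sketch, assuming balanced teams and a single-item 5-point measure.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 20 teams, 6 members each, climate ratings on a 1-5 scale.
k, n = 20, 6
team_effect = rng.normal(0, 0.6, size=(k, 1))                 # true team-level differences
ratings = np.clip(np.round(3 + team_effect + rng.normal(0, 0.8, size=(k, n))), 1, 5)

# One-way ANOVA mean squares: the building blocks of ICC(1) and ICC(2).
grand_mean = ratings.mean()
team_means = ratings.mean(axis=1)
ms_between = n * ((team_means - grand_mean) ** 2).sum() / (k - 1)
ms_within = ((ratings - team_means[:, None]) ** 2).sum() / (k * (n - 1))

# Question (a): ICC(1), the proportion of variance in individual ratings
# attributable to team membership (an effect size for non-independence).
icc1 = (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)

# Question (c): ICC(2), the reliability of the team means.
icc2 = (ms_between - ms_within) / ms_between

# Question (b): single-item rwg, comparing observed within-team variance with the
# variance expected if members responded randomly (uniform null: (A**2 - 1) / 12
# for A response options).
sigma_eu = (5 ** 2 - 1) / 12
rwg = 1 - ratings.var(axis=1, ddof=1) / sigma_eu

print(f"ICC(1) = {icc1:.2f}, ICC(2) = {icc2:.2f}, median rwg = {np.median(rwg):.2f}")
```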