In qualitative research, inter-rater reliability (IRR) is a measure of, or a conversation around, the “consistency or repeatability” with which multiple coders apply codes to qualitative data (Trochim, 2006). In qualitative coding, IRR is measured primarily to assess how consistently a code system is applied. It also shows where consistency is acceptable, what that looks like, and where it is unacceptable, which in turn guides the steps that may be taken to increase consistency.
There are two main challenges in understanding and attaining acceptable inter-rater reliability. The first has been referred to as the unitization problem (Campbell et al., 2013): what unit will be used to compare the application of the codes being tested for inter-rater reliability? A unit can be anything from an entire document to a single character, but it is important to select a unit that can be appropriately coded and compared across coders.
For example, let’s say a team wishes to compare coding at the level of single characters, each taken from a document and coded by multiple coders. The big challenge here is that single characters are devoid of context. At the other extreme, a team could compare the coding of an entire semi-structured interview as a single unit. The problem with coding an entire interview is that nuanced differences in coding are likely to be missed, and the exercise essentially categorizes the whole document, which is closer to the use of descriptors within Dedoose than to coding. Between these extremes, the nature of semi-structured interviews means that units are not naturally clear and “require the subjective interpretation of a coder” (Campbell et al., 2013, p. 302). As such, different coders may select different segments of text as a unit. For example, one coder might code a paragraph with multiple codes, while another might select multiple sentences within that paragraph and apply individual codes to each. Units selected in this manner are hard to compare directly, because the segments themselves differ from coder to coder. The way IRR is calculated within the Dedoose Training Center closely follows the method suggested by Campbell et al. (2013). This process requires that one researcher code the qualitative data and that the others then code the same excerpts created by the initial coder. Providing sufficient context for these units is also critical so that the later coders have enough information to make good coding decisions, and this is often where the Training Center design falls short.
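Once every coder has coded the same excerpts, agreement on a given code can be summarized with a chance-corrected statistic. The following is a minimal sketch of Cohen's kappa for two coders and one code. It is our illustration only and is not taken from Campbell et al. (2013) or from the Training Center itself. The function name, the code name "Barriers", and the ratings are all hypothetical, with 1 meaning the coder applied the code to that excerpt and 0 meaning they did not.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two coders' decisions over the same ordered excerpts."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both coders must rate the same non-empty set of excerpts.")
    n = len(ratings_a)
    # Observed agreement: share of excerpts where both coders made the same decision.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each coder's own label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical data: did each coder apply the code "Barriers" to excerpts 1-10?
first_coder = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
second_coder = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
kappa = cohens_kappa(first_coder, second_coder)
print(round(kappa, 3))
```

In this made-up example the coders agree on 8 of 10 excerpts, but because each applies the code to 6 of 10 excerpts, a good share of that agreement would be expected by chance, so kappa comes out well below the raw 80% agreement.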
As reliability is calculated, the second challenge emerges: the code system’s discriminant capability (Campbell et al., 2013). If sufficient reliability is not achieved on the first evaluation, the natural next step is to improve the coding scheme’s discriminant capability or to reduce coding errors. Coding errors can appear for a variety of reasons: the code system itself may need refinement, a principal investigator (PI) may have a better understanding of the coding structure and theoretical aim of the project than another coder, both may be true, or other factors may be at play. Regardless, this is an important issue that needs to be addressed, and calculating IRR will aid in finding the excerpts and codes that need to be reconciled. Campbell et al. address it through a process they call intercoder agreement; we suggest reading “Coding In-Depth Semistructured Interviews: Problems of Unitization and Intercoder Reliability and Agreement” for more information.
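To make the reconciliation step concrete, here is a small sketch that lists, for each code, the excerpts on which two coders disagreed. It assumes each coder's work is available as a mapping from excerpt ID to the set of codes applied. The function name, excerpt IDs, and code names are hypothetical, and this is not a description of how Dedoose reports disagreements.

```python
def find_disagreements(coding_a, coding_b):
    """Return, per code, the excerpt IDs on which two coders disagreed.

    Each argument maps an excerpt ID to the set of code names applied to it.
    """
    disagreements = {}
    for excerpt_id in sorted(set(coding_a) | set(coding_b)):
        codes_a = coding_a.get(excerpt_id, set())
        codes_b = coding_b.get(excerpt_id, set())
        # Symmetric difference: codes applied by exactly one of the two coders.
        for code in codes_a ^ codes_b:
            disagreements.setdefault(code, []).append(excerpt_id)
    return disagreements


# Hypothetical excerpts and code names, for illustration only.
pi_coding = {"ex1": {"Barriers"}, "ex2": {"Barriers", "Support"}, "ex3": {"Support"}}
second_coding = {"ex1": {"Barriers"}, "ex2": {"Support"}, "ex3": {"Barriers"}}

to_reconcile = find_disagreements(pi_coding, second_coding)
for code, excerpts in sorted(to_reconcile.items()):
    print(code, excerpts)
```

A code that shows up against many excerpts in a listing like this may point to a definition that needs refinement, while scattered disagreements may be better resolved through discussion between the coders.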
Taken together, it is important to understand that calculating IRR is just one part of the process of developing a reliable code system and not necessarily an indicator of a coder’s success or failure in understanding how to apply a code system. As Syed & Nelson (2015, p. 375) write, “we highlight how establishing reliability must be seen as an evolving process, rather than simply a focus on the end product”. There are other issues to consider as well. One is the standard for how much of a data set should be tested for IRR. Is an acceptably high IRR score for 10% of the files in a project enough? Should more or less be tested? Ultimately, this is up to the researcher, who should be prepared to defend their practices. As Campbell et al. (2013, p. 295) remind us in writing about inter-rater reliability, “the concern is whether different coders would code the same data the same way”, and researchers should test IRR and reconcile coding errors until they are satisfied.
As we hope this article makes clear, calculating inter-rater reliability is often part of the code system development process and should be done iteratively. Given the variety in projects and the lack of widely accepted methods for adding the rigor of reliability to a project, it is up to the researcher to think carefully about what the aims of their project are and how best to use reliability. A Kappa statistic may not be appropriate for every project. As you explore what fits your project best, we suggest reviewing blind coding and cloning coded documents for manual comparison.
Campbell, J. L., Quincy, C., Osserman, J., & Pedersen, O. K. (2013). Coding In-Depth Semistructured Interviews: Problems of Unitization and Intercoder Reliability and Agreement. Sociological Methods & Research, 42(3), 294-320.
Syed, M., & Nelson, S. C. (2015). Guidelines for establishing reliability when coding narrative data. Emerging Adulthood, 3(6), 375‐387.
Trochim, W. M. (2006). Reliability. Retrieved December 21, 2016, from http://www.socialresearchmethods.net/kb/reliable.php