Inter-rater reliability as a tool to reduce bias in surveys

Assessing inter-rater reliability improves field team training and our understanding of the context.

Part of designing a good survey instrument is minimizing bias, so that the data collected by one enumerator is comparable to the data collected by another. But no matter how much effort researchers put into instrument design, the judgement of data collectors sometimes comes into play.

This is particularly true of observational assessments, where enumerators need to match what they see or hear to predetermined multiple choice options. Imagine, for example, a survey in which enumerators have to observe an agricultural plot to determine whether given farming practices are being implemented. Or consider an early childhood development study in which the enumerator needs to categorize a child's fine motor development.

When enumerators assess the same responses or observations differently, the data become biased. This bias can undermine the reliability of the survey and the validity of the findings.

We can measure how similar or dissimilar the judgement of enumerators is on a set of questions using what’s called inter-rater reliability, or IRR for short.

What is inter-rater reliability?

The concept is very simple. Assessing IRR involves deploying a pair of enumerators to collect data on the same observation using the same survey items. The data they collect can then be used to compare the extent of agreement between the enumerators (the raters). In some cases, where the variables of interest are stable over time (for example, someone's height), the measurements can be taken at different times. In others, for example when observing the types of tasks a toddler can perform, the ratings need to be taken at the same time to ensure they are comparable.

IRR can be measured through a set of statistical tools that are used to estimate the extent of agreement between two ratings. The choice and use of measures will depend on several factors, including the type of variable under consideration (categorical or continuous) and how pairs of raters are set up.
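To make this concrete, here is a minimal sketch of one common measure for categorical variables, Cohen's kappa, which adjusts the raw share of agreement between two raters for the agreement expected by chance. The function name and the ratings are hypothetical and purely illustrative; this is not a description of the tooling used on any particular project, and continuous variables would call for a different measure.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who scored the same observations."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: share of observations given the same category by both raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: computed from how often each rater used each category.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1:
        # Both raters used one identical category throughout; kappa is
        # undefined here, and this sketch simply reports full agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings: two enumerators observing the same ten plots.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this made-up example the two raters agree on 8 of 10 plots, but because about half of that agreement would be expected by chance, kappa comes out noticeably lower than the raw 80% agreement rate.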

Inter-rater reliability for quality assurance

Assessing inter-rater reliability and discussing the findings with our enumerators has become standard practice at Laterite for projects that involve observational assessments. We get three things out of it:

IRR highlights priorities for refresher training and feedback sessions. After field testing, items with high disagreement rates become the focus of our refresher training sessions. Scenarios that led to those disagreements are extensively discussed with the field team to deepen their understanding of the research instrument.
IRR is an opportunity to monitor enumerator performance. In many cases, we continue assessing inter-rater reliability after the pilot, during data collection. This ensures that the consistency of ratings is maintained or improved and allows supervisors to focus their attention and guidance on the enumerators who need it most.
IRR is a chance to learn from enumerators about how the instrument can be adapted to the local context. Laterite works with experienced enumerators who always come up with good suggestions on how to improve an instrument to fit the local context. Because they have been in the field and exposed to various types of respondents, they know which questions will not be well understood or which types of answers could be missing. In the same vein, when discussing disagreements across raters, we always come across additional insights about the context. As an example, we recently had a discussion about which types of everyday household objects could be considered handmade toys in Rwanda.
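As an illustration of how items might be prioritized for refresher training, the sketch below ranks survey items by the share of double-rated observations on which the two raters disagreed. The item names, the data layout and the function are hypothetical assumptions made for this example, not Laterite's actual workflow.

```python
def disagreement_rates(paired_ratings):
    """Rank items by the share of observations on which the two raters differ.

    `paired_ratings` maps an item name to a list of (rater_1, rater_2) tuples,
    one tuple per double-rated observation (an assumed layout for this sketch).
    """
    rates = {}
    for item, pairs in paired_ratings.items():
        if not pairs:
            continue  # skip items with no double-rated observations
        disagreements = sum(1 for a, b in pairs if a != b)
        rates[item] = disagreements / len(pairs)
    # Highest disagreement first, so the top of the list is the training priority.
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical pilot data: three observational items, four households each.
pilot = {
    "has_handmade_toys": [("yes", "no"), ("yes", "yes"), ("no", "yes"), ("no", "no")],
    "plot_is_mulched": [("yes", "yes"), ("no", "no"), ("yes", "yes"), ("yes", "no")],
    "child_stacks_blocks": [("yes", "yes"), ("no", "no"), ("yes", "yes"), ("no", "no")],
}

ranked = disagreement_rates(pilot)
for item, rate in ranked:
    print(f"{item}: {rate:.0%} disagreement")
```

Raw disagreement rates are easy to discuss with a field team, but they do not correct for chance agreement, so they are best read alongside a measure such as Cohen's kappa.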

The level of detail we get from looking at inter-rater reliability contributes to Laterite’s understanding of the context we work in and strengthens our ability to collect quality data. And that is why we believe that regularly using inter-rater reliability measures should be standard practice for researchers looking to increase the quality, reproducibility and consistency of their results.

This blog was written by Laura Le Saux, Senior Research Associate based in Kigali, Rwanda.

Suggested IRR resources:

Hallgren, K., 2012. Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial
Bujang, M.A., Baharum, N., 2017. Guidelines of the minimum sample size requirements for Cohen’s Kappa

NB: Assessing inter-rater reliability can have other uses, notably in the process of validating an instrument, which were not the focus of this post.