Estimating program participation with fuzzy merging

The research team shares their experience matching datasets to recover information on program participation

Self-reported participation in programs can prove inaccurate in certain contexts. This is particularly true in settings where respondents participate in multiple programs, offered by different providers, over time. Although these programs serve different objectives, participants may find it hard to tell the providers and the programs apart.

The good news is that there are ways to recover participation information that is as accurate as possible. We present a solution based on a case study where we estimate participation by using a combination of survey and administrative data.

Example from a data collection project in Rwanda

Here is the problem we faced: we recently ran a survey in Rwanda with participants of a program rolled out in 2018, about 5 years ago. Our estimate was that about 10% of these participants had also joined a more recent version of the same program in 2021. We came to this estimate by counting the number of people in the original program who lived in villages covered by the more recent version. However, when we ran the survey, 60% of respondents from the 2018 sample said they were also part of the 2021 intervention, which pointed to substantial over-reporting of participation. In our context, understanding the impact of the program is crucial. Incorrectly estimating the number of people who benefited from the recent version of the intervention might lead to a misinterpretation of the previous version’s impact. This could in turn affect the decision to scale the program up or down.

To address this discrepancy, we decided to cross-check whether participants included in our survey sample were also listed as participants in the administrative data of the 2021 program, provided by the implementing partner. This exercise requires care because names, locations and other identifiers can differ across records (due to spelling differences, differences in the order of names, the format of the data, etc.).

Fuzzy merging in our work

We did this cross-check using a fuzzy merging strategy. Fuzzy merging is a technique for identifying whether records from two different data sources refer to the same individuals, drawing on similarities in the information we have about them. Aside from finding perfect matches, fuzzy merging helps us deal with situations where the match is not perfect. It does this by providing an estimate of how similar a pair of records is across the two data sources.

For the purposes of this study, we compared observations from our survey with administrative data from the implementing partner. Both datasets contained information on the location of the household, the name of the head of household, and details of other household members. With this information, we were able to merge 15% of our survey observations directly with administrative entries with perfectly matched characteristics.
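To illustrate the direct merge step, here is a minimal sketch in Python. It is not the code we ran: the field names (`village`, `head`), the example records and the `normalise` helper are all assumptions made for illustration. The idea is simply to build a lookup of administrative records on a set of key fields and keep the survey records whose keys match exactly.

```python
def normalise(value):
    """Lower-case and collapse whitespace (an assumed, minimal cleaning step)."""
    return " ".join(str(value).lower().split())


def exact_merge(survey_rows, admin_rows, keys):
    """Split survey rows into exact matches and leftovers for fuzzy merging."""
    # Index administrative rows by their normalised key fields.
    # If several admin rows share a key, the first one is kept.
    admin_index = {}
    for row in admin_rows:
        key = tuple(normalise(row[k]) for k in keys)
        admin_index.setdefault(key, row)

    matched, unmatched = [], []
    for row in survey_rows:
        key = tuple(normalise(row[k]) for k in keys)
        if key in admin_index:
            matched.append((row, admin_index[key]))
        else:
            unmatched.append(row)
    return matched, unmatched


# Made-up records, for illustration only.
survey = [
    {"village": "Kabeza", "head": "Mukamana  Jeanne"},
    {"village": "Rugando", "head": "Habimana Eric"},
]
admin = [{"village": "kabeza", "head": "mukamana jeanne"}]

matched, unmatched = exact_merge(survey, admin, ["village", "head"])
print(len(matched), "exact match;", len(unmatched), "left for fuzzy merging")
```

Records that end up in `unmatched` are the ones that would be passed on to the fuzzy merging step.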

For the cases where the match was not perfect, we turned to fuzzy merging. After cleaning the data to ensure that the format was comparable between the two datasets, we used a fuzzy merging technique known as the bigram string comparator. This helps identify matches that are not identical but are close enough (imperfect string matches). The method describes the proximity between two observations by counting the number of bigrams (i.e. pairs of consecutive characters in a text variable) that two text values have in common and dividing by the average number of bigrams in the two values. It generates a score where 0 means no match at all and 1 indicates a perfect match. For observations with matching scores lower than 1 but at least one variable with a matching score above 0.95, we did a manual check to confirm whether the two observations were in fact the same individual. After this step, we identified another 9% of survey observations matching administrative entries.
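The scoring rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the exact routine we used: the function names, the field names and the example records are hypothetical, and the rule for flagging a pair for manual review is one possible reading of the 0.95 threshold.

```python
from collections import Counter


def bigrams(text):
    """Return every pair of consecutive characters in `text`."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_similarity(a, b):
    """Common bigrams divided by the average bigram count of the two strings."""
    grams_a, grams_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Neither string is long enough to contain a bigram:
        # fall back to an exact comparison (an assumption of this sketch).
        return 1.0 if a == b else 0.0
    # Counter intersection keeps, per bigram, the smaller of the two counts.
    common = sum((grams_a & grams_b).values())
    return common / (total / 2)


def needs_manual_review(record_a, record_b, fields, threshold=0.95):
    """Flag pairs that are not perfect but have a field scoring above `threshold`."""
    scores = {f: bigram_similarity(record_a[f], record_b[f]) for f in fields}
    perfect = all(score == 1.0 for score in scores.values())
    close = any(score > threshold for score in scores.values())
    return (not perfect) and close, scores


# Made-up, already cleaned records, for illustration only.
survey_row = {"head": "mukamana jeanne", "village": "kabeza"}
admin_row = {"head": "mukamana jeane", "village": "kabeza"}

flag, scores = needs_manual_review(survey_row, admin_row, ["head", "village"])
print(flag, scores)
```

Counting bigrams with multiplicity, as `Counter` does here, means repeated pairs of characters are only matched as many times as they appear in both strings. Other implementations may count distinct bigrams instead, which gives slightly different scores.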

In total, we estimated that 24% of survey respondents participated in the 2021 version of the program (15% from perfect matches plus 9% from fuzzy matches). This was far below the 60% who self-reported participation, but also well above our initial estimate of 10%. Some 5% of households that matched between our survey data and the implementing partner’s data had reported not participating in the 2021 program at all. It is also possible that some participants were missed, in cases where the administrative data had erroneous information or where different household members were captured in the administrative data compared to the survey data.

Importance of measuring program participation

Measuring participation is key for understanding a program’s effectiveness and whether the intended participants were reached. In this case, fuzzy merging helped us correct an error in survey responses about program participation that would otherwise have been missed. Our experience shows that fuzzy merging can improve data quality by reducing duplicates and identifying inconsistencies. However, it’s a process that requires patience, careful attention to detail, sound judgement and good administrative data.

For anyone interested in learning more about fuzzy merging, here are some helpful resources:

“Fuzzy Matching: Algorithms and Applications” by Christen, P. (2012)
“A survey of fuzzy matching algorithms for data cleansing” by Wang, X., & Jia, X. (2017)
“A comparison of string distance metrics for name-matching tasks” by Cohen, W. W., & Ravikumar, P. (2003)
“Fuzzy Matching in Python” by Martin, S. (2016)
“Fuzzy matching in R with stringdist” by van der Loo, M. P. J. (2014)
“Fuzzy merge” by Innovations for Poverty Action

This blog post was co-written by Julien Christen, Research Associate, and Kevin Kimenyi, Research Analyst, both from Laterite Rwanda.