Story
Inference in Regression Discontinuity Designs with Clustered Data
Key takeaway
Researchers now have a rigorous way to analyze regression discontinuity studies whose data come in clusters, such as policies evaluated across municipalities, electoral districts, or groups of facilities. The payoff is valid standard errors, and therefore more credible conclusions, from real-world studies.
Quick Explainer
The core idea is to extend the standard regression discontinuity (RD) framework to handle clustered data, where observations are correlated within clusters but independent across them. The authors propose a unified theoretical model that derives high-level conditions for the asymptotic normality of the local linear RD estimator, even with complex clustering patterns. They also introduce a novel "clustered nearest-neighbor" variance estimator that accounts for the clustering structure and outperforms existing approaches when the conditional expectation function has substantial curvature. This work provides practical guidance for practitioners on conducting valid inference in RD designs with diverse cluster sizes and dependence structures encountered in real-world applications.
Deep Dive
Overview
This technical deep dive summarizes the key findings from a preprint manuscript on regression discontinuity (RD) designs with clustered data. The paper introduces a general model-based framework for such settings and derives high-level conditions under which the standard local linear RD estimator is asymptotically normal. It further proposes a novel nearest-neighbor-type variance estimator and illustrates its properties across a diverse set of empirical applications.
Problem & Context
- Clustered sampling is prevalent in empirical RD designs, but formal theoretical results have been limited.
- Existing literature offers little guidance on the conditions under which RD estimators are asymptotically normal with clustered data, or how to conduct valid inference in such settings.
- The authors aim to provide a unified theory for local linear RD estimators under the diverse clustering patterns encountered in practice.
Methodology
- The authors consider a sharp RD design where the observed data is divided into G clusters, with n_g units in cluster g.
- Observations are independent across clusters but can be dependent within a cluster.
- The local linear RD estimator is defined as a weighted average of the outcome variable, where the weights depend on the running variable, kernel function, and bandwidth.
- High-level conditions are derived to ensure asymptotic normality of the RD estimator, expressed in terms of restrictions on the cluster sizes within the estimation window.
- Four stylized asymptotic frameworks are introduced to capture how the asymptotic behavior depends on the effective number of units per cluster, dependence structure of the running variable, and assumptions on the within-cluster covariance.
- A novel clustered nearest-neighbor (CNN) standard error is proposed, which accounts for the clustering structure and exploits independence between clusters.
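As a concrete illustration of the estimator described above, here is a minimal sketch of sharp local linear RD estimation. The triangular kernel, the bandwidth `h`, the cutoff `c`, and all variable names are assumptions made for this example; this is not the paper's implementation.

```python
import numpy as np

def local_linear_rd(x, y, c=0.0, h=1.0):
    """Sharp RD effect via separate local linear fits on each side of the
    cutoff c. Minimal sketch: triangular kernel, fixed bandwidth h."""
    def fit_side(mask):
        xs, ys = x[mask] - c, y[mask]
        w = np.maximum(1 - np.abs(xs) / h, 0)        # triangular kernel weights
        X = np.column_stack([np.ones_like(xs), xs])  # intercept + slope
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ ys)  # weighted least squares
        return beta[0]                               # intercept = fit at cutoff

    keep = np.abs(x - c) < h                         # estimation window
    return fit_side(keep & (x >= c)) - fit_side(keep & (x < c))

# Toy data with a true jump of 2 at the cutoff
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
y = 0.5 * x + 2.0 * (x >= 0) + rng.normal(0, 0.1, 5000)
tau_hat = local_linear_rd(x, y, c=0.0, h=0.5)
```

Fitting the two sides separately and taking the difference of the boundary intercepts is the standard way to read off the jump at the cutoff.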
Data & Experimental Setup
The theoretical analysis is complemented by four empirical applications that exemplify the different asymptotic frameworks:
- Motivating Example for Asymptotic Framework I: Evaluation of Mexico's disaster fund program, with around 1000 municipalities and 1.5 requests per municipality on average.
- Motivating Example for Asymptotic Framework II: Analysis of French two-round elections, with 2300 districts and 3 observations per district on average.
- Motivating Example for Asymptotic Framework III: Study of the causal effect of electoral defeat on political participation, with 250 observations per state on average.
- Motivating Example for Asymptotic Framework IV: Analysis of OSHA's workplace safety enforcement policy, with 707 peer groups of facilities and 16 facilities per group on average.
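These settings differ mainly in the number and size of clusters and in whether the running variable varies within clusters. A hypothetical simulation of clustered RD data in the spirit of Framework I (many small clusters, cluster-level running variable, within-cluster correlated errors) might look like the following; all parameter values and names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_clustered_rd(G, mean_size, tau=1.0, rho=0.5, seed=0):
    """Hypothetical clustered sharp-RD data: the running variable is shared
    within each cluster, and errors include a common cluster shock."""
    rng = np.random.default_rng(seed)
    sizes = 1 + rng.poisson(mean_size - 1, G)      # n_g >= 1 units per cluster
    cluster = np.repeat(np.arange(G), sizes)       # cluster id for each unit
    xg = rng.uniform(-1, 1, G)                     # cluster-level running variable
    x = xg[cluster]
    u = rng.normal(0, np.sqrt(rho), G)[cluster]    # shared within-cluster shock
    e = rng.normal(0, np.sqrt(1 - rho), x.shape)   # idiosyncratic noise
    y = 0.5 * x + tau * (x >= 0) + u + e
    return cluster, x, y

# Framework I style: ~1000 clusters averaging ~1.5 units each
# (cf. the Mexico disaster-fund example above)
cluster, x, y = simulate_clustered_rd(G=1000, mean_size=1.5)
```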
Results
- Under the high-level conditions, the local linear RD estimator is shown to be asymptotically normal, with a convergence rate that, depending on the clustering pattern, can be slower than in the i.i.d. case.
- The authors derive the exact limit of the conditional variance under additional assumptions on the covariance structure, showing it depends on both the variances of individual units and the covariances within clusters.
- The proposed CNN standard error is proven to be consistent under the high-level conditions, outperforming existing alternatives in finite samples when the curvature of the conditional expectation function is substantial.
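The exact construction of the CNN standard error is not spelled out in this summary, so the sketch below only illustrates the general idea under simplifying assumptions: residuals are formed from same-side nearest neighbors (in the spirit of nearest-neighbor variance estimation), and cross products are retained only within clusters, exploiting independence across clusters. The function name, the choice `J=3`, and the weighted-sum form of the estimator are all hypothetical, not the paper's formula.

```python
import numpy as np

def cnn_variance(x, y, w, cluster, J=3):
    """Illustrative clustered nearest-neighbor variance estimate for a
    weighted-sum estimator sum_i w_i * y_i. Not the paper's exact formula."""
    n = len(y)
    resid = np.empty(n)
    for i in range(n):
        # Nearest neighbors in x on the same side of the cutoff (here x = 0)
        same_side = np.flatnonzero((x >= 0) == (x[i] >= 0))
        same_side = same_side[same_side != i]
        nn = same_side[np.argsort(np.abs(x[same_side] - x[i]))[:J]]
        resid[i] = np.sqrt(J / (J + 1)) * (y[i] - y[nn].mean())
    var = 0.0
    for g in np.unique(cluster):
        idx = cluster == g
        s = np.dot(w[idx], resid[idx])   # within-cluster weighted residual sum
        var += s * s                     # clusters are independent: no cross terms
    return var

# Toy check: independent unit-variance outcomes, singleton clusters,
# equal weights 1/n -> the estimate should be close to 1/n
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = rng.normal(0, 1.0, 400)
w = np.full(400, 1 / 400)
v_hat = cnn_variance(x, y, w, np.arange(400))
```

Because the residual at each point is anchored to its nearest neighbors rather than to a global fit, this style of estimator stays accurate even when the conditional expectation function is strongly curved.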
Interpretation
- The results provide a unified theoretical framework for RD analysis with clustered data, generalizing previous work that considered only limited forms of clustering.
- The four stylized asymptotic frameworks connect the high-level conditions to empirically relevant settings, guiding practitioners on when different approaches to inference may be appropriate.
- The novel CNN standard error offers a flexible and consistent way to conduct inference, accommodating a wide range of clustering patterns encountered in applied work.
Limitations & Uncertainties
- The high-level conditions, while general, may still not cover all possible clustering structures, such as settings with a fixed number of large clusters.
- The theoretical analysis focuses on the local linear RD estimator; extensions to other nonparametric variants (e.g., higher-order polynomials, optimized RD estimators) are not explicitly considered.
- The empirical illustrations use pre-existing datasets; further validation on additional applications would strengthen the practical relevance of the findings.
What Comes Next
- Exploring the performance of the proposed CNN standard error in a broader class of conditional inference problems under misspecification, beyond the RD setting.
- Investigating the sensitivity of the asymptotic results to violations of the smoothness assumptions on the conditional expectation function.
- Developing data-driven methods for selecting the tuning parameters (e.g., number of nearest neighbors) in the CNN standard error procedure.