Skip to contents

This helper replicates the constrained Poisson-spline procedure from the research code (Rosenman et al., 2025) but in a package-friendly, testable format. It is not exported - downstream functions access it internally via .density_estimation().

Usage

.density_estimation(
  hist_df,
  manip_region,
  cutoff,
  knot_options = 3:15,
  mod_types = c("smooth", "spike_integer"),
  lambda = 100,
  eps = 1e-06,
  folds = NULL,
  num_folds = 5L,
  make_plot = FALSE,
  seed = NULL,
  solver = .get_rdpartial_opt("solver"),
  cvx_opts = .get_rdpartial_opt("cvx_opts")
)

Arguments

hist_df

data.frame with columns x (numeric support points) and freq (integer counts).

manip_region

Numeric length-2 vector giving the suspected manipulation interval (lower, upper) inclusive. Must contain the cutoff on its upper bound.

cutoff

Numeric scalar - RDD threshold.

knot_options

Integer vector of candidate spline knot counts; default 3:15.

mod_types

Character vector specifying which model structures to cross-validate over. Allowed values are:

  • "smooth" - plain spline;

  • "spike_integer" - spline + indicator for integer spikes;

  • "spike_half" - spline + integer + half-integer spikes.

lambda

Penalty weight on total predicted mass inside the manipulation region (higher ⇒ stricter mass match). Default 100 (paper default).

eps

Numeric slack for the CVXR continuity constraints (default 1e-6).

folds

Optional integer vector assigning each donut bin to a CV fold. Length must equal nrow(hist_df). When NULL, a random num_folds-way split is generated.

num_folds

Integer number of folds if folds is NULL (default 5).

make_plot

Logical; if TRUE return a ggplot2 object visualising the final fit.

seed

Optional integer for reproducible fold assignment.

solver

CVXR solver to use. Defaults to .get_rdpartial_opt("solver").

cvx_opts

List of solver options passed to CVXR::solve(). Defaults to .get_rdpartial_opt("cvx_opts").

Value

A list with components:

  • counts - data.frame(x, n_true) within manip_region.

  • avg_cv_sse - cross-validated SSE (per bin) for the chosen model.

  • avg_sse - in-sample SSE outside manip_region.

  • knots - optimal knot count.

  • model - optimal model type (one of mod_types).

  • plot - ggplot object when make_plot = TRUE.

Details

The algorithm treats the running-variable histogram as the realisation of a Poisson process. A flexible B-spline (possibly augmented with indicator spikes at integer or half-integer values) is fit to the donut region - bins outside the user-supplied manipulation interval. Mass preservation and continuity constraints are imposed through CVXR, and the optimal number of knots / spike structure is chosen by $K$-fold cross-validation. The final step predicts the expected counts in the manipulation interval, which serve as the "true" (non-manipulated) totals fed into the partial-bounds estimators. Bins with zero counts receive a tiny offset before logging so that the inequality constraints remain finite.