Estimate the number of non-manipulated observations to the right of the cutoff
dot-density_estimation.Rd
This helper replicates the constrained Poisson-spline procedure from the
research code (Rosenman et al., 2025) but in a package-friendly, testable
format. It is not exported - downstream functions access it internally
via .density_estimation()
.
Usage
.density_estimation(
hist_df,
manip_region,
cutoff,
knot_options = 3:15,
mod_types = c("smooth", "spike_integer"),
lambda = 100,
eps = 1e-06,
folds = NULL,
num_folds = 5L,
make_plot = FALSE,
seed = NULL,
solver = .get_rdpartial_opt("solver"),
cvx_opts = .get_rdpartial_opt("cvx_opts")
)
Arguments
- hist_df
data.frame
with columnsx
(numeric support points) andfreq
(integer counts).- manip_region
Numeric length-2 vector giving the suspected manipulation interval (lower, upper) inclusive. Must contain the cutoff on its upper bound.
- cutoff
Numeric scalar - RDD threshold.
- knot_options
Integer vector of candidate spline knot counts; default
3:15
.- mod_types
Character vector specifying which model structures to cross-validate over. Allowed values are:
"smooth" - plain spline;
"spike_integer" - spline + indicator for integer spikes;
"spike_half" - spline + integer + half-integer spikes.
- lambda
Penalty weight on total predicted mass inside the manipulation region (higher ⇒ stricter mass match). Default
100
(paper default).- eps
Numeric slack for the CVXR continuity constraints (default
1e-6
).- folds
Optional integer vector assigning each donut bin to a CV fold. Length must equal
nrow(hist_df)
. WhenNULL
, a randomnum_folds
-way split is generated.- num_folds
Integer number of folds if
folds
isNULL
(default5
).- make_plot
Logical; if
TRUE
return aggplot2
object visualising the final fit.- seed
Optional integer for reproducible fold assignment.
- solver
CVXR solver to use. Defaults to
.get_rdpartial_opt("solver")
.- cvx_opts
List of solver options passed to
CVXR::solve()
. Defaults to.get_rdpartial_opt("cvx_opts")
.
Value
A list with components:
counts
-data.frame(x, n_true)
withinmanip_region
.avg_cv_sse
- cross-validated SSE (per bin) for the chosen model.avg_sse
- in-sample SSE outsidemanip_region
.knots
- optimal knot count.model
- optimal model type (one ofmod_types
).plot
-ggplot
object whenmake_plot = TRUE
.
Details
The algorithm treats the running-variable histogram as the realisation of a
Poisson process. A flexible B-spline (possibly augmented with indicator
spikes at integer or half-integer values) is fit to the donut region -
bins outside the user-supplied manipulation interval. Mass preservation and
continuity constraints are imposed through CVXR
, and the optimal number of
knots / spike structure is chosen by $K$-fold cross-validation. The final
step predicts the expected counts in the manipulation interval, which serve
as the "true" (non-manipulated) totals fed into the partial-bounds
estimators. Bins with zero counts receive a tiny offset before logging so
that the inequality constraints remain finite.