Quantitative Inference Portfolio
A controlled benchmark of Kalman, particle, and deterministic flow filters evaluated on calibration, stability, and computational cost.
- NEES exposed overconfidence in filters that appeared acceptable on point error alone.
- Joseph-form covariance updates improved robustness but did not eliminate nonlinear divergence mechanisms.
- Flow-based proposals improved robustness in hard regimes with measurable runtime overhead.
- High-dimensional behavior determined practical feasibility more than single-scenario leaderboard performance.
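The Joseph-form update mentioned above can be sketched as follows. This is an illustrative NumPy snippet, not the benchmark's own code; the matrices are a toy two-state, one-measurement example:

```python
import numpy as np

def joseph_update(P, K, H, R):
    """Joseph-form covariance update: stays symmetric and positive
    semi-definite even when the gain K is slightly suboptimal."""
    I = np.eye(P.shape[0])
    A = I - K @ H
    return A @ P @ A.T + K @ R @ K.T

# Toy 2-state / 1-measurement example.
P = np.diag([2.0, 1.0])          # prior covariance
H = np.array([[1.0, 0.0]])       # measurement model
R = np.array([[0.5]])            # measurement noise
S = H @ P @ H.T + R              # innovation covariance
K = P @ H.T @ np.linalg.inv(S)   # Kalman gain

P_post = joseph_update(P, K, H, R)
assert np.allclose(P_post, P_post.T)  # symmetric by construction
```

The naive `(I - K H) P` form is cheaper but can drift away from symmetry under rounding, which is the robustness gap the finding refers to.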
The table below combines accuracy, calibration, and systems cost; RMSE alone is not sufficient for decision-critical inference.
| Method | RMSE | NEES (target ~ 1) | Runtime (s) | Memory (MB) | Verdict |
|---|---|---|---|---|---|
| EKF | 1.660 | 40.034 | 0.990 | 450.5 | Fast but severely miscalibrated in nonlinear settings. |
| UKF | 1.316 | 6.398 | 1.032 | 452.8 | Better than EKF, still overconfident under stress. |
| Bootstrap PF | 0.906 | 1.253 | 1.912 | 494.0 | Best calibrated among NEES-logged methods with strong accuracy. |
| PF-PF (LEDH) | 2.492 | n/a | 10.857 | n/a | Stable proposals, high runtime cost in current configuration. |
| LEDH | 3.543 | n/a | 5.197 | n/a | Deterministic flow behavior with moderate compute overhead. |
| Kernel-PFF (matrix) | 3.322 | n/a | 14.642 | n/a | High-dimensional robustness signal, but expensive in runtime. |
Interpretation: Bootstrap PF gave the strongest combined signal on accuracy and calibration where NEES was logged, while deterministic flows showed useful stress robustness at significantly higher compute cost. For downstream decision systems, calibration quality is as important as error magnitude.
NEES = (x − x̂)ᵀ P⁻¹ (x − x̂)
A filter with low RMSE but inflated NEES is statistically overconfident and unsafe for downstream decision-making.
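A minimal sketch of the NEES check, assuming NumPy/SciPy (not taken from the benchmark code). Dividing by the state dimension matches the table's target of ~1, and a chi-square interval on the time-averaged value flags miscalibration:

```python
import numpy as np
from scipy.stats import chi2

def nees(x_true, x_est, P):
    """Normalized estimation error squared for one time step."""
    e = x_true - x_est
    return float(e @ np.linalg.solve(P, e))

# A calibrated filter: errors actually drawn from N(0, P).
rng = np.random.default_rng(0)
d, T = 2, 5000
P = np.diag([1.0, 4.0])
errs = rng.multivariate_normal(np.zeros(d), P, size=T)
avg = np.mean([nees(e, np.zeros(d), P) for e in errs]) / d  # normalize by dim

# 95% chi-square bounds on the time-averaged normalized NEES.
lo, hi = chi2.ppf([0.025, 0.975], df=T * d) / (T * d)
print(avg, lo, hi)
```

An overconfident filter reports a `P` that is too small, pushing the average well above the upper bound even when RMSE looks fine.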
As dimensionality increases, method choice is governed by an accuracy-efficiency frontier, not a single scalar metric.
| Method | Time Complexity (approx.) | Memory Complexity (approx.) |
|---|---|---|
| KF / EKF / UKF | O(d^3) | O(d^2) |
| Bootstrap PF | O(Nd) | O(Nd) |
| PF-PF / LEDH | O(Nd^2) | O(Nd) |
| Kernel-PFF (matrix) | O(N^2d) | O(N^2) |
Runtime and memory were explicitly profiled per method in the benchmark pipeline.
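The cubic term in the Gaussian-filter row can be observed directly with a timing sketch like the one below (illustrative only; the benchmark's own profiling harness is in the repository). The dense d×d products and inverse dominate the update:

```python
import time
import numpy as np

def kf_update_cost(d, reps=3):
    """Average wall-clock time of one dense Kalman-style covariance
    update at state dimension d; dominated by ~O(d^3) matrix ops."""
    rng = np.random.default_rng(0)
    P = np.eye(d)
    H = rng.standard_normal((d, d))
    R = np.eye(d)
    t0 = time.perf_counter()
    for _ in range(reps):
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        P_post = (np.eye(d) - K @ H) @ P  # discarded; timing only
    return (time.perf_counter() - t0) / reps

for d in (50, 100, 200):
    print(d, f"{kf_update_cost(d):.4f}s")
```

Doubling d should roughly multiply the per-update cost by 8 once BLAS overheads are amortized, which is what makes particle-based O(Nd) methods competitive at high dimension.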
This benchmark isolates three axes; each experiment evaluates one of:
- KF stability and covariance-update behavior under controlled assumptions.
- Approximate Gaussian filters (EKF, UKF) tested under range-bearing nonlinearity and calibration pressure.
- PF-PF and flow-based methods evaluated under stronger degeneracy and transport constraints.
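For context on the particle-based axis, one predict-weight-resample cycle of a bootstrap PF can be sketched on a toy 1D linear-Gaussian model (not the benchmark's range-bearing setup; all names and parameters here are illustrative):

```python
import numpy as np

def bootstrap_pf_step(particles, y, rng, q=0.1, r=0.5):
    """One bootstrap PF cycle for x_t = x_{t-1} + N(0, q),
    y_t = x_t + N(0, r)."""
    # Predict: propagate particles through the transition prior.
    particles = particles + rng.normal(0.0, np.sqrt(q), particles.shape)
    # Weight: Gaussian measurement log-likelihood, stabilized.
    logw = -0.5 * (y - particles) ** 2 / r
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Resample: multinomial draw back to equal weights.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

rng = np.random.default_rng(1)
parts = rng.normal(0.0, 1.0, 500)
for y in (0.2, 0.3, 0.25):
    parts = bootstrap_pf_step(parts, y, rng)
print(parts.mean())  # posterior mean pulled toward the observations
```

Because the proposal is just the transition prior, sparse or sharp observations concentrate the weights, which is exactly the degeneracy pressure the third axis stresses.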
Observed failure modes:
- First-order linearization produced unstable covariance behavior and inflated NEES when measurement geometry became strongly nonlinear.
- Weight collapse reduced effective sample size and increased variance in posterior approximation under sparse or hard observations.
- EDH/LEDH and kernelized transport improved certain stress cases but incurred substantial runtime overhead relative to KF/PF baselines.
- Calibration failure occurred even when point-error metrics appeared acceptable, reinforcing NEES as a non-optional diagnostic.
No new experiments were added in this portfolio redesign. The page uses existing outputs from the benchmark repository.
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
bash scripts/run_part1.sh
```
Source repository: github.com/meamresh/MLCOE_Q2_PF
Filtering maps naturally to latent-state inference, where hidden factors evolve sequentially under uncertainty.
The benchmark emphasizes online updates, calibration quality, and failure awareness rather than static offline fit.
NEES provides a direct signal for uncertainty reliability, critical when downstream policies consume posterior covariance.
Scaling diagnostics connect algorithmic complexity to practical deployment constraints in large latent-state systems.