2026-05-06
Status: research note; not peer reviewed. The conformal validity claims are standard; the experiments study how score choice changes efficiency and local behavior.
Conformal prediction makes a clean promise. If the calibration and test examples are exchangeable, then a split conformal prediction set has finite-sample marginal coverage. The model used to build the score can be wrong. The likelihood can be misspecified. The geometry can be only approximate.
That is the part I do not want to disturb. The question in this note is different:
If conformal validity is model-free, where can model geometry matter?
My answer is:
Geometry enters through the nonconformity score. It does not prove coverage. It changes efficiency.
So the slogan is
\boxed{\text{exchangeability gives validity; score geometry affects efficiency.}}
This note is a small experiment around that slogan. The goal is not to introduce a new conformal method, but to separate two things that are easy to mix together: the validity mechanism and the shape induced by the score. Conformal calibration gives the first. The score gives the second.
Let
Z_i=(X_i,Y_i), \qquad i=1,\ldots,n+1,
be exchangeable observations, and let
S:\mathcal X\times\mathcal Y\to\mathbb R
be any measurable nonconformity score. On a calibration set of size n_{\mathrm{cal}}, define
R_i=S(X_i,Y_i).
For target miscoverage \alpha, use the conformal quantile
\widehat q = R_{\left(\lceil (n_{\mathrm{cal}}+1)(1-\alpha)\rceil\right)}.
The split conformal set is
C_\alpha(x)=\{y:S(x,y)\le \widehat q\}.
Then
\mathbb P\{Y_{n+1}\in C_\alpha(X_{n+1})\}\ge 1-\alpha.
The important word is any. The score can come from a random forest, a kernel method, a neural network, a quantile model, or a density model. If the calibration and test scores are exchangeable, the marginal guarantee remains.
This is why I do not view information geometry as a new validity proof for conformal prediction. The validity proof is already there. The role of geometry is narrower: a geometric score may rank candidate labels better. Better ranking can mean shorter intervals, smaller sets, or better conditional behavior. But it is not what makes the method valid.
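To fix ideas, here is a minimal sketch of the calibration step. The helper name is mine, not from ig_conformal.py, and the scores are assumed precomputed.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Finite-sample conformal quantile: the ceil((n_cal + 1)(1 - alpha))-th order statistic."""
    n_cal = len(cal_scores)
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    if k > n_cal:
        return np.inf  # n_cal too small for this alpha: the set is the whole label space
    return np.sort(cal_scores)[k - 1]  # k-th smallest score, 1-indexed
```

Everything model-specific is hidden in cal_scores; the quantile step never looks at the model.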
The simplest regression score is
S_{\mathrm{abs}}(x,y)=|y-\widehat\mu(x)|.
It creates a global residual interval,
C_\alpha(x) = [\widehat\mu(x)-\widehat q,\widehat\mu(x)+\widehat q].
This is a reasonable geometry when the conditional noise is roughly symmetric and has the same scale everywhere. When the noise scale changes with x, a global residual threshold tends to be too wide in easy regions and too narrow in hard regions. Marginal coverage can still be right, because the errors average out.
A local-scale score changes the ruler:
S_{\mathrm{loc}}(x,y) = \frac{|y-\widehat\mu(x)|}{\widehat\sigma(x)}.
The resulting interval is
C_\alpha(x) = [ \widehat\mu(x)-\widehat q\widehat\sigma(x), \widehat\mu(x)+\widehat q\widehat\sigma(x) ].
This is a one-dimensional Mahalanobis idea. The local ruler is
g(x)\approx \frac{1}{\widehat\sigma^2(x)}.
In a Gaussian conditional model,
Y\mid X=x\sim N(\mu(x),\sigma^2(x)),
the negative log-likelihood contains
\frac{(y-\mu(x))^2}{2\sigma^2(x)}.
So local normalization measures residual size in the local uncertainty units of the model. The conformal step does not care whether the Gaussian story is true. It only uses the resulting ranks. But the efficiency can care a lot.
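A sketch of the two scores side by side, assuming placeholder fitted functions mu_hat and sigma_hat (not names from ig_conformal.py) and conformal_quantile from the sketch above:

```python
import numpy as np

def residual_intervals(mu_hat, sigma_hat, X_cal, y_cal, X_test, alpha=0.1):
    # mu_hat, sigma_hat: callables returning pointwise mean and scale estimates
    r = np.abs(y_cal - mu_hat(X_cal))
    q_abs = conformal_quantile(r, alpha)                     # S_abs threshold
    q_loc = conformal_quantile(r / sigma_hat(X_cal), alpha)  # S_loc threshold
    mu, s = mu_hat(X_test), sigma_hat(X_test)
    abs_iv = (mu - q_abs, mu + q_abs)            # constant width everywhere
    loc_iv = (mu - q_loc * s, mu + q_loc * s)    # width tracks sigma_hat(x)
    return abs_iv, loc_iv
```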
There are other geometries. Conformalized quantile regression uses lower and upper quantile models,
\widehat q_\ell(x),\qquad \widehat q_u(x),
with score
S_{\mathrm{CQR}}(x,y) = \max\{\widehat q_\ell(x)-y,\;y-\widehat q_u(x)\}.
The resulting interval is no longer forced to be symmetric around a mean. For multimodal data, even quantile intervals can be the wrong shape: if two separated labels are plausible and the middle is implausible, a single interval wastes volume. A density score,
S_{\mathrm{dens}}(x,y)=-\log \widehat p(y\mid x),
instead gives the set
C_\alpha(x)=\{y:\widehat p(y\mid x)\ge e^{-\widehat q}\},
which can be disconnected. That is the right geometry for multimodal conditional laws.
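In code the CQR score is a one-liner. A sketch, with ql and qu as placeholder fitted quantile values; thresholding the score at \widehat q gives the interval [\widehat q_\ell(x)-\widehat q,\ \widehat q_u(x)+\widehat q]:

```python
import numpy as np

def score_cqr(ql, qu, y):
    # positive when y falls outside [ql, qu], negative when strictly inside
    return np.maximum(ql - y, y - qu)

def interval_cqr(ql, qu, qhat):
    # the conformal step widens (or shrinks, if qhat < 0) the band by qhat on each side
    return ql - qhat, qu + qhat
```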
A local metric can help only if it is estimated stably. If \widehat\sigma(x) is too small on some calibration points, then
\frac{|y-\widehat\mu(x)|}{\widehat\sigma(x)}
can become huge. The conformal quantile then becomes huge, and the final intervals can explode.
To make this failure mode visible, I use stabilized local scales,
\widetilde\sigma(x) = \operatorname{clip}(\widehat\sigma(x),s_{\min},s_{\max})+\lambda,
and a local/global blend,
\widetilde\sigma_\gamma(x) = (1-\gamma)\widetilde\sigma_{\mathrm{local}}(x) + \gamma\widetilde\sigma_{\mathrm{global}}.
Here \gamma=0 is fully local geometry, while \gamma=1 is fully global residual geometry. The practical question is not whether local geometry is elegant. It is how much local geometry we can trust.
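A sketch of the stabilization, with the clipping bounds and the floor as tunable assumptions rather than the exact constants used in the benchmark:

```python
import numpy as np

def stabilized_scale(sigma_hat, s_min=0.05, s_max=5.0, lam=1e-3):
    # clip raw local-scale estimates, then add a small additive floor
    return np.clip(sigma_hat, s_min, s_max) + lam

def blended_scale(sigma_local, sigma_global, gamma):
    # gamma = 0: fully local ruler; gamma = 1: fully global ruler
    return (1.0 - gamma) * sigma_local + gamma * sigma_global
```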
I use synthetic regression data with mean
\mu(x)=\sin(2\pi x_1)+\frac12 x_1.
The settings are deliberately simple; a generator sketch follows the table.
| Setting | Data-generating idea | What it tests |
|---|---|---|
| Homoskedastic | Y=\mu(X)+0.25\varepsilon, \varepsilon\sim N(0,1) | no local scale structure |
| Heteroskedastic | Y=\mu(X)+\sigma(X)\varepsilon, \sigma(x)=0.10+0.80|x_1| | local scale geometry |
| Heavy-tailed | \varepsilon\sim t_3/\sqrt3 | robustness |
| Skewed | \varepsilon\sim \mathrm{Exponential}(1)-1 | asymmetric errors |
| Multimodal | Y=\mu(X)+0.8B+\sigma(X)\varepsilon, B\in\{-1,+1\} | disconnected conditional structure |
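As one concrete example of the generators in the table, a sketch of the heteroskedastic setting; the covariate law here is an assumption, and ig_conformal.py may differ in detail:

```python
import numpy as np

def make_heteroskedastic(n, d=5, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))            # assumed covariate law
    mu = np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 0]   # only x_1 is active
    sigma = 0.10 + 0.80 * np.abs(X[:, 0])              # local noise scale
    return X, mu + sigma * rng.standard_normal(n)
```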
The experiment is ordinary split conformal: split the data into train, calibration, and test sets; fit a model on the training set; build one or more scores; compute the conformal quantile on calibration scores; and evaluate coverage and set length on the test set.
The score changes. The calibration logic does not.
The code lives in ig_conformal.py. I ran the full benchmark separately and report the saved summaries below. The benchmark uses four model families: random forests, kernel ridge regression, MLPs, and GBRT-based conformalized quantile regression.
The results match the small claim. Coverage stays near 90\% for most methods. That is the conformal part. The average length changes a lot. That is the score-geometry part.
The heteroskedastic setting is the clearest positive case for local geometry.
| model | baseline length | geometric score | geometric length | reduction | baseline conditional error | geometric conditional error |
|---|---|---|---|---|---|---|
| GBRT-CQR | 1.904 | conformalized quantile regression | 1.697 | 10.9% | 0.088 | 0.026 |
| KRR | 2.579 | blended normalized residual, \gamma=0.75 | 2.683 | -4.0% | 0.080 | 0.069 |
| MLP | 1.988 | stabilized normalized residual | 1.784 | 10.3% | 0.096 | 0.038 |
| RF | 1.950 | blended normalized residual, \gamma=0.25 | 1.676 | 14.1% | 0.086 | 0.047 |
For random forests, the average length drops from about 1.95 to 1.68. For MLPs, it drops from about 1.99 to 1.78. For GBRT-CQR, it drops from about 1.90 to 1.70. The conditional coverage diagnostics also improve.
Kernel ridge regression is the warning. In this run, even the best blended geometry is slightly longer than the global residual baseline. In more aggressive local-normalization variants, the intervals can become enormous because the estimated scale is unstable. That is not a conformal failure. Coverage is still protected. It is a score failure.
The practical lesson is:
use only as much geometry as you can estimate stably.
The homoskedastic case is a useful sanity check. The simple residual score is hard to beat; for random forests, the absolute residual interval has average length about 0.856, and the oracle local normalization gives essentially the same number. Local geometry has nothing useful to learn.
The heavy-tailed and skewed cases are mixed but informative. Blended local-scale scores give modest gains for RF and MLP in the heavy-tailed setting and often improve conditional balance. In the skewed setting, RF and MLP blended scores work well, while CQR improves conditional balance but is not uniformly shorter in this particular run. More flexible geometry is not automatically better. It helps only when the fitted score ranks labels well.
The multimodal setting is the cleanest example. Interval methods have the wrong shape: they must cover the space between modes even when the middle is not very plausible.
The conditional density score uses
S(x,y)=-\log \widehat p(y\mid x).
I fit a simple two-mode conditional density model,
\widehat p(y\mid x) = \widehat\pi(x)\mathcal N(y;\widehat\mu_+(x),\widehat\sigma_+^2) + (1-\widehat\pi(x))\mathcal N(y;\widehat\mu_-(x),\widehat\sigma_-^2).
This is not meant to be a state-of-the-art density estimator. It is just expressive enough to test the geometry. The density-level-set conformal method gives average length about 1.82 with coverage about 0.904.
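A sketch of the mixture score, with the fitted quantities as placeholder arguments:

```python
import numpy as np
from scipy.stats import norm

def density_score(y, pi, mu_pos, mu_neg, s_pos, s_neg):
    # S_dens(x, y) = -log p_hat(y | x) for the two-component mixture
    p = pi * norm.pdf(y, mu_pos, s_pos) + (1.0 - pi) * norm.pdf(y, mu_neg, s_neg)
    return -np.log(np.maximum(p, 1e-300))  # guard against log(0) in the far tails
```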
| baseline family | baseline interval length | density-set length | reduction |
|---|---|---|---|
| GBRT-CQR | 2.345 | 1.820 | 22.4% |
| KRR | 3.803 | 1.820 | 52.1% |
| MLP | 2.691 | 1.820 | 32.4% |
| RF | 2.457 | 1.820 | 25.9% |
This is the main idea in its most visible form:
once the score geometry matches the conditional law, conformal sets become much more efficient.
At first, the density method looked less locally balanced when I binned by \sigma(x). The mean conditional error was about 0.0757. But \sigma(x) is not the right diagnostic for a mixture score. The score models mode structure.
So I also bin by mixture-specific quantities: the estimated upper-mode probability \widehat\pi(x), its entropy H(\widehat\pi(x)), the mode margin |\widehat\pi(x)-0.5|, and the mode separation |\widehat\mu_+(x)-\widehat\mu_-(x)|.
| diagnostic | mean conditional error |
|---|---|
| \sigma(x) bins | 0.0757 |
| \widehat\pi(x) bins | 0.0232 |
| mode entropy bins | 0.0240 |
| mode margin bins | 0.0240 |
| mode separation bins | 0.0248 |
The earlier diagnostic was partly asking the wrong question. If the score models local scale, bin by local scale. If the score models mixture uncertainty, bin by mixture uncertainty.
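The diagnostic itself is simple: bin the test points by the chosen quantity and average the per-bin deviation of empirical coverage from 1-\alpha. A sketch of my assumed definition of "mean conditional error"; the exact formula in ig_conformal.py may differ:

```python
import numpy as np

def binned_coverage_error(feature, covered, alpha=0.1, n_bins=10):
    # covered: boolean array, True where Y_test landed inside the conformal set
    edges = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_bins - 1)
    errs = [abs(covered[bins == b].mean() - (1.0 - alpha))
            for b in range(n_bins) if np.any(bins == b)]
    return float(np.mean(errs))
```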
This experiment is not a theorem about optimal conformal prediction. The claim is smaller:
conformal calibration gives marginal validity, while the score determines the geometry of the set.
That leads to the practical question:
does the score geometry match the conditional law of Y\mid X?
When the answer is yes, conformal sets can be much smaller. When the answer is no, coverage can remain valid while efficiency degrades.
A simple taxonomy is:
\boxed{\text{flat noise} \Rightarrow \text{residual score}}
\boxed{\text{heteroskedasticity} \Rightarrow \text{local scale score}}
\boxed{\text{varying quantiles} \Rightarrow \text{CQR}}
\boxed{\text{multimodality} \Rightarrow \text{density-level-set score}}
The conformal layer is the safety net. It turns model-dependent scores into marginally valid prediction sets. But it does not make all scores equally good.
I initially tried Student-t and Huber versions of normalized residual scores. In scalar symmetric regression, these are mostly rank-equivalent to normalized absolute residuals. For example,
\log\left(1+\frac{r^2}{\nu}\right)
is monotone in |r|, and the Huber loss is also monotone in |r|. Since conformal prediction is rank-based, these transformations do not change the prediction set if they use the same scale. They may be useful diagnostics, but they do not create a new scalar interval geometry.
For density-level-set conformal prediction in one-dimensional regression, I approximate
|C_\alpha(x)|
on a grid. The accepted grid points are those satisfying
-\log \widehat p(y\mid x)\le \widehat q.
If the density has two separated modes, the accepted grid points can form two disconnected intervals. The reported length is the total grid length of the accepted region.
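A sketch of the grid approximation, assuming a uniform grid:

```python
import numpy as np

def density_set_length(y_grid, log_p_grid, qhat):
    # accept grid points with -log p_hat(y | x) <= qhat;
    # total length = (number accepted) * grid spacing
    accept = (-log_p_grid) <= qhat
    dy = y_grid[1] - y_grid[0]  # uniform spacing assumed
    return float(accept.sum()) * dy
```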
The same idea has a classification analogue. For a neural classifier, ordinary conformal scores use softmax probabilities,
S(x,y)=1-\widehat p_\theta(y\mid x),
or cumulative-probability rankings. A more geometric score could use the penultimate-layer representation \phi_\theta(x). For class y, fit a class center m_y and covariance \Sigma_y, then use
S(x,y) = (\phi_\theta(x)-m_y)^\top \Sigma_y^{-1} (\phi_\theta(x)-m_y).
This is the classification analogue of local metric scores in regression. The network supplies the geometry. Conformal calibration supplies the validity.
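A sketch of the feature-space score; phi, the class centers, and the inverse covariances are placeholders for quantities fitted on the training split:

```python
import numpy as np

def mahalanobis_scores(phi, centers, cov_invs):
    # phi: (n, d) penultimate-layer features;
    # centers: list of (d,) class means; cov_invs: list of (d, d) inverse covariances
    cols = []
    for m_y, S_inv in zip(centers, cov_invs):
        diff = phi - m_y                                  # (n, d)
        cols.append(np.einsum("nd,de,ne->n", diff, S_inv, diff))
    return np.stack(cols, axis=1)                         # (n, n_classes)
```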
All validity statements here are ordinary split conformal validity statements. The new part is empirical and conceptual: different score geometries induce different set shapes and efficiencies.
A formal theorem would need to control excess set size, for example
\mathbb E|C_\alpha(X)|-\mathbb E|C_\alpha^*(X)|,
in terms of score-ranking error, density-estimation error, or metric-estimation error. That is not proved here.
The full benchmark can be regenerated from ig_conformal.py:
```python
import ig_conformal as ig  # benchmark module; see ig_conformal.py

# run every setting x model-family combination over three seeds
raw, summary = ig.run_grid(
settings=[
"homoskedastic",
"heteroskedastic",
"heavy_tail",
"skewed",
"multimodal",
],
model_families=["rf", "krr", "mlp", "cqr_gbrt"],
n=6000,
d=5,
alpha=0.1,
seeds=[0, 1, 2],
save_csv=True,
output_prefix="ig_conformal_results",
)
# collapse the per-seed results into the summaries reported above
compact = ig.summarize_geometry_effect(summary)
compact.to_csv("ig_conformal_geometry_effect.csv", index=False)
multimodal_diag = ig.summarize_multimodal_density_diagnostics(summary)
multimodal_diag.to_csv(
"ig_conformal_multimodal_density_diagnostics.csv",
index=False,
)
```

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.
Glenn Shafer and Vladimir Vovk. “A Tutorial on Conformal Prediction.” Journal of Machine Learning Research 9, 371–421, 2008.
Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. “Distribution-Free Predictive Inference for Regression.” Journal of the American Statistical Association 113(523), 1094–1111, 2018.
Yaniv Romano, Evan Patterson, and Emmanuel Candès. “Conformalized Quantile Regression.” NeurIPS, 2019.
Anastasios Angelopoulos and Stephen Bates. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” arXiv:2107.07511, 2021.
Rob Hyndman. “Computing and Graphing Highest Density Regions.” The American Statistician 50(2), 120–126, 1996.
Rafael Izbicki, Gilson Shimizu, and Rafael Stern. “Flexible Distribution-Free Conditional Predictive Bands using Density Estimators.” AISTATS, 2020.
Shun-ichi Amari. Information Geometry and Its Applications. Springer, 2016.
@misc{miryusupov2026conformalgeometry,
author = {Miryusupov, Shohruh},
title = {Conformal Scores and the Geometry of Efficiency},
year = {2026},
howpublished = {Research note},
url = {https://www.miryusupov.com/blog/posts/conformal-scores-geometry-efficiency/}
}