When the power function is not an error bar

2026-05-14

A recurring trap in kernel methods is that a computable geometric quantity can look like an uncertainty estimate.

The power function is such a quantity. It is mathematically meaningful: it measures how well the design points constrain evaluation at a new location, relative to a chosen kernel. But the usual pointwise error bound has a hidden constant:

|f(x) - s_X f(x)| \leq P_X(x)\,\|f\|_{\mathcal H}.

The computable part is P_X(x). The difficult part is \|f\|_{\mathcal H}.

In classical approximation theory, that is natural: one assumes the target belongs to the native space with a controlled norm. In machine learning, the target is unknown, observations are noisy, the kernel is a modeling choice, and the RKHS norm may be enormous, infinite, or simply unconnected to the prediction problem.

The point of this note is narrow. The power function is useful as a geometry-of-information diagnostic, but it should not be read as a pointwise ML error bar unless the missing norm factor has a meaningful and calibrated scale. The four diagnostics below show why.

The formal statement

Let X = \{x_1, \ldots, x_n\} be interpolation sites and let k be a positive definite kernel with RKHS \mathcal H. The kernel interpolant is

s_X f(x) = \sum_{i=1}^n \alpha_i k(x, x_i), \qquad K_X \alpha = f_X,

where (K_X)_{ij} = k(x_i, x_j) and f_X = (f(x_1), \ldots, f(x_n))^\top. The power function is

P_X^2(x) = k(x,x) - k_X(x)^\top K_X^{-1} k_X(x),

where k_X(x) = (k(x,x_1), \ldots, k(x,x_n))^\top.

It depends on the kernel, the design, and the evaluation point. It does not depend on the observed target values.
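The formula can be evaluated directly. The following is a minimal NumPy sketch, assuming a one-dimensional Gaussian kernel with a hypothetical length scale of 0.2, eight equispaced sites on [0, 1], and a small diagonal jitter added only for numerical stability. None of these choices come from the note itself. The function never receives target values, which is exactly the independence stated above.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.2):
    """Gaussian kernel matrix k(a_i, b_j) for 1-D inputs (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def power_function(x_eval, x_train, length_scale=0.2, jitter=1e-10):
    """P_X(x) = sqrt(k(x,x) - k_X(x)^T K_X^{-1} k_X(x)).

    The jitter is a numerical safeguard for this sketch, not part of the
    definition. Note that no target values enter the computation.
    """
    K = gaussian_kernel(x_train, x_train, length_scale)
    K = K + jitter * np.eye(len(x_train))
    k_x = gaussian_kernel(x_train, x_eval, length_scale)   # shape (n, m)
    quad = np.sum(k_x * np.linalg.solve(K, k_x), axis=0)   # k_X^T K^{-1} k_X
    p2 = 1.0 - quad                # k(x, x) = 1 for the Gaussian kernel
    return np.sqrt(np.clip(p2, 0.0, None))

x_train = np.linspace(0.0, 1.0, 8)     # hypothetical design
x_eval = np.linspace(0.0, 1.0, 201)
P = power_function(x_eval, x_train)
print(f"max power on the grid: {P.max():.3e}")
```

The power function is close to zero at the design sites and largest where the sites constrain evaluation least.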

The standard deterministic estimate is

|f(x) - s_X f(x)| \leq P_X(x)\,\|f\|_{\mathcal H}.

That is a useful theorem. But the theorem is not the same as the applied claim that P_X(x) alone is an error bar. The cleaner separation is

P_X(x) = \text{design geometry under the kernel}, \qquad \|f\|_{\mathcal H} = \text{unknown target complexity under the kernel}.

The second factor is usually where the statistical difficulty lives.
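For completeness, the bound follows in two steps from the reproducing property and the Cauchy-Schwarz inequality. The sketch below uses the weight vector u(x) = K_X^{-1} k_X(x), a symbol introduced here for convenience rather than taken from the note.

```latex
\begin{aligned}
s_X f(x) &= k_X(x)^\top K_X^{-1} f_X = \sum_{i=1}^n u_i(x)\, f(x_i),
  \qquad u(x) = K_X^{-1} k_X(x), \\
f(x) - s_X f(x) &= \Big\langle f,\; k(\cdot,x) - \sum_{i=1}^n u_i(x)\, k(\cdot,x_i) \Big\rangle_{\mathcal H}, \\
|f(x) - s_X f(x)| &\leq \|f\|_{\mathcal H}\,
  \Big\| k(\cdot,x) - \sum_{i=1}^n u_i(x)\, k(\cdot,x_i) \Big\|_{\mathcal H}, \\
\Big\| k(\cdot,x) - \sum_{i=1}^n u_i(x)\, k(\cdot,x_i) \Big\|_{\mathcal H}^2
  &= k(x,x) - 2\,u(x)^\top k_X(x) + u(x)^\top K_X\, u(x) \\
  &= k(x,x) - k_X(x)^\top K_X^{-1} k_X(x) = P_X^2(x).
\end{aligned}
```

All dependence on the target sits in \|f\|_{\mathcal H}. The other factor is the power function and involves only the kernel and the sites.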

Same power function, different errors

Fix the training sites and the kernel. The power function is now fixed. Change only the target.

Figure 1: Same design and same kernel. The power function is unchanged, but the actual interpolation error depends strongly on the target.

The design has not changed. The kernel has not changed. The power function has not changed. Only the target has changed, and that is enough to rule out interpreting P_X(x) alone as a pointwise error estimate.

At most, P_X(x) is the computable geometric factor in a bound whose missing factor is the target norm.
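The experiment is easy to reproduce in miniature. The sketch below assumes a Gaussian kernel with a hypothetical length scale of 0.15, ten equispaced sites on [0, 1], and two illustrative targets chosen here (a single-period sine and a sine at 7.3 cycles). These are not the settings behind Figure 1 or Table 1, so the numbers will differ.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.15):
    """Gaussian kernel matrix for 1-D inputs (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x_train = np.linspace(0.0, 1.0, 10)     # fixed design (hypothetical)
x_eval = np.linspace(0.0, 1.0, 401)

K = gaussian_kernel(x_train, x_train)
k_x = gaussian_kernel(x_train, x_eval)  # shape (n, m)

# Power function: computed once, with no reference to any target.
p2 = 1.0 - np.sum(k_x * np.linalg.solve(K, k_x), axis=0)
P = np.sqrt(np.clip(p2, 0.0, None))

# Illustrative targets, not the ones used in the note.
targets = {
    "smooth": lambda x: np.sin(2 * np.pi * x),
    "oscillatory": lambda x: np.sin(2 * np.pi * 7.3 * x),
}

max_error = {}
for name, f in targets.items():
    alpha = np.linalg.solve(K, f(x_train))   # solve K_X alpha = f_X
    s = k_x.T @ alpha                        # interpolant on the grid
    max_error[name] = float(np.max(np.abs(f(x_eval) - s)))

print(f"max power (shared by both targets): {P.max():.3e}")
for name, err in max_error.items():
    print(f"{name:12s} max error: {err:.3e}")
```

Both targets share the single array P, while the maximum errors differ because only the right-hand side of the linear solve changes.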

Table 1 gives the numerical version. Since design and kernel are fixed, max power is identical for every target. The fit norm column is the computable interpolation norm \|s_X\|_{\mathcal H}, the bound scale is \max_x P_X(x)\,\|s_X\|_{\mathcal H}, and the vacuity ratio is the bound scale divided by the output range of the target (see the Vacuity ratios section). The bound scale is a diagnostic, not the theorem’s true bound with the unknown \|f\|_{\mathcal H}, so it can be smaller than the actual error, as it is for the kink and jump targets.

Table 1: Same power function, different target errors. With the same design and kernel, max power is unchanged, while actual interpolation error and fitted norm vary strongly across targets.
| target | max power | max error | fit norm | bound scale | vacuity ratio |
|---|---|---|---|---|---|
| smooth | 0.000301 | 4.82e-05 | 1.8 | 0.000541 | 0.00027 |
| oscillatory | 0.000301 | 3.78 | 4.95e+04 | 14.9 | 7.46 |
| kink | 0.000301 | 0.032 | 1.57 | 0.000474 | 0.000948 |
| jump | 0.000301 | 0.494 | 152 | 0.0458 | 0.0458 |

The hidden norm can become the whole problem

Suppose we accept the chosen kernel and ask how expensive different targets are under it. A finite-dimensional proxy for the RKHS norm is

\|f\|_{\mathcal H,m}^2 = f(Z)^\top K_Z^{-1} f(Z),

where Z is a fine grid. This is the RKHS norm of the minimum-norm interpolant through the values of f on Z.

It is not the true continuous RKHS norm of f, but it is a useful diagnostic. If this proxy grows rapidly as the grid is refined, the native-space assumption is not giving a stable scale for the target.
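A minimal sketch of this proxy follows. It assumes a Gaussian kernel with a hypothetical length scale of 0.2, illustrative targets and grid sizes chosen here, and a diagonal jitter of 1e-8. The jitter matters: Gaussian kernel matrices on fine grids are numerically singular, so the reported values depend on it and should not be expected to match Table 2.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.2):
    """Gaussian kernel matrix for 1-D inputs (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def grid_norm_proxy(f, m, length_scale=0.2, jitter=1e-8):
    """sqrt(f(Z)^T (K_Z + jitter*I)^{-1} f(Z)) on m equispaced points in [0, 1].

    The jitter is an assumption of this sketch; without it the solve is
    not numerically meaningful for a Gaussian kernel on a fine grid.
    """
    z = np.linspace(0.0, 1.0, m)
    K = gaussian_kernel(z, z, length_scale) + jitter * np.eye(m)
    fz = f(z)
    return float(np.sqrt(fz @ np.linalg.solve(K, fz)))

# Illustrative targets, not the ones used in the note.
targets = {
    "smooth": lambda x: np.sin(2 * np.pi * x),
    "kink": lambda x: np.abs(x - 0.5),
    "jump": lambda x: (x > 0.5).astype(float),
}
grid_sizes = [17, 33, 65]   # hypothetical refinement schedule

proxy = {
    name: [grid_norm_proxy(f, m) for m in grid_sizes]
    for name, f in targets.items()
}
for name, values in proxy.items():
    print(name, ["%.3g" % v for v in values])
```

The quantity to watch is the trend across grid sizes rather than any single value. The jitter caps how large the proxy can become, so a rough target shows up as a value that is orders of magnitude above the smooth one.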

Figure 2: Finite-dimensional native-space norm proxy as the grid is refined. The same smooth Gaussian kernel makes oscillatory or nonsmooth targets expensive.

Table 2: Finite-grid RKHS norm proxy versus grid resolution. The smooth target remains stable, while oscillatory, kinked, and discontinuous targets become increasingly expensive under the same Gaussian kernel.

| grid size | jump | kink | oscillatory | smooth |
|---|---|---|---|---|
| 30 | 3452.72 | 51.67 | 25828.36 | 1.78 |
| 50 | 5057.70 | 96.28 | 31171.90 | 1.81 |
| 80 | 6609.07 | 133.37 | 35970.28 | 1.82 |
| 120 | 8166.49 | 166.97 | 40459.34 | 1.83 |
| 180 | 10020.30 | 204.79 | 45304.34 | 1.83 |

For compatible smooth targets, the scale stays moderate. For rough, oscillatory, or discontinuous targets, the native-space cost can become very large under the same Gaussian kernel. The theorem is not failing. It is asking for a target-complexity constant that the machine-learning problem usually does not provide.

A noisy ML-style example

In practice, we do not observe a noiseless function. We observe data. A tempting substitution is to replace the unknown \|f\|_{\mathcal H} with the fitted interpolation norm

\|s_X\|_{\mathcal H}^2 = y^\top K_X^{-1} y.

This quantity is computable, but it is not the unknown target norm. It is a data-dependent interpolation norm, and with noisy observations it can mostly measure the cost of fitting noise.
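The effect of noise on the fitted norm can be seen in a few lines. The sketch assumes a Gaussian kernel with a hypothetical length scale of 0.15, twenty equispaced sites, Gaussian noise with standard deviation 0.1, a fixed random seed, and a jitter of 1e-8 for numerical stability. All of these are illustrative choices, not the configuration behind Figure 3.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.15):
    """Gaussian kernel matrix for 1-D inputs (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def fitted_norm(x, y, length_scale=0.15, jitter=1e-8):
    """sqrt(y^T (K_X + jitter*I)^{-1} y); jitter is a numerical safeguard."""
    K = gaussian_kernel(x, x, length_scale) + jitter * np.eye(len(x))
    return float(np.sqrt(y @ np.linalg.solve(K, y)))

rng = np.random.default_rng(0)               # fixed seed for repeatability
x_train = np.linspace(0.0, 1.0, 20)
f_clean = np.sin(2 * np.pi * x_train)        # illustrative target
y_noisy = f_clean + 0.1 * rng.standard_normal(x_train.size)

clean_norm = fitted_norm(x_train, f_clean)
noisy_norm = fitted_norm(x_train, y_noisy)
print(f"fitted norm, clean values: {clean_norm:.3g}")
print(f"fitted norm, noisy values: {noisy_norm:.3g}")
```

The noise has components along directions where the kernel matrix has tiny eigenvalues, and inverting the matrix amplifies exactly those components.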

Figure 3: Interpolating noisy observations can make the fitted RKHS norm unstable. The computable fitted norm is not the unknown target norm.

If ridge regularization is introduced, the fitted norm can be stabilized. But then the original interpolation theorem is no longer being used as stated. We have moved from a deterministic interpolation bound to a different regularized statistical procedure. That may be entirely reasonable; it is just not the same certificate.
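A short sketch of the stabilizing effect, assuming the common kernel ridge convention \alpha = (K_X + \lambda I)^{-1} y with fitted norm \sqrt{\alpha^\top K_X \alpha}. The kernel, design, noise level, seed, and the values of \lambda are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, length_scale=0.15):
    """Gaussian kernel matrix for 1-D inputs (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def ridge_fit_norm(K, y, lam):
    """RKHS norm of the kernel ridge fit under the convention
    alpha = (K + lam*I)^{-1} y and ||s||_H^2 = alpha^T K alpha."""
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return float(np.sqrt(max(alpha @ K @ alpha, 0.0)))

rng = np.random.default_rng(0)               # fixed seed for repeatability
x_train = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(x_train.size)

K = gaussian_kernel(x_train, x_train)
lams = [1e-8, 1e-4, 1e-2, 1.0]               # hypothetical ridge values
norms = [ridge_fit_norm(K, y, lam) for lam in lams]
for lam, nrm in zip(lams, norms):
    print(f"lambda = {lam:g}: fitted norm = {nrm:.3g}")
```

The fitted norm falls as \lambda grows, but the fit no longer interpolates the data, so the interpolation bound does not apply to it as stated.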

Vacuity ratios

A bound is useful only if its scale is meaningful relative to the prediction problem. Define

V = \frac{\max_x P_X(x) \cdot C}{\operatorname{range}(y)},

where C is a candidate norm scale. When V \gg 1, the bound may be formally true but no longer informative at the scale of the prediction problem. Even V \approx 1 means the bound scale is already as wide as the entire output range.
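The ratio is simple arithmetic once the three ingredients are available. The sketch below plugs in the rounded max power, fit norm, and output range values reported in Table 3. The function name is a hypothetical helper, and because the inputs are rounded, the last digit can differ from the tabulated ratios.

```python
def vacuity_ratio(max_power, norm_scale, output_range):
    """V = max_x P_X(x) * C / range(y) for a candidate norm scale C."""
    return max_power * norm_scale / output_range

MAX_POWER = 0.000301   # rounded value from Table 3

# (fit norm, output range) per target, rounded values from Table 3.
rows = {
    "smooth": (1.8, 2.0),
    "oscillatory": (4.95e4, 2.0),
    "kink": (1.57, 0.499),
    "jump": (152.0, 1.0),
}

ratios = {
    name: vacuity_ratio(MAX_POWER, norm, rng_y)
    for name, (norm, rng_y) in rows.items()
}
for name, v in ratios.items():
    print(f"{name:12s} V = {v:.3g}")
```

Dividing by the output range makes the ratio dimensionless, so targets with different scales can be compared in one column.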

Table 3: Vacuity ratios for candidate norm scales. The ratios compare the scale of the power-function bound with the output range when using either the fitted interpolation norm or the finite-grid norm proxy.
| target | max power | output range | fit norm | grid norm proxy | V, fit norm | V, grid norm |
|---|---|---|---|---|---|---|
| smooth | 0.000301 | 2 | 1.8 | 1.83 | 0.00027 | 0.000276 |
| oscillatory | 0.000301 | 2 | 4.95e+04 | 4.05e+04 | 7.46 | 6.09 |
| kink | 0.000301 | 0.499 | 1.57 | 167 | 0.000948 | 0.101 |
| jump | 0.000301 | 1 | 152 | 8.17e+03 | 0.0458 | 2.46 |

The norm term does not make the bound wrong. It makes it vacuous when the hidden complexity factor is too large, too unstable, or unavailable.

What the power function does and does not see

The power function sees the kernel, the design points, geometric coverage under the kernel metric, and where interpolation is weakly constrained by the data locations.

It does not see the unknown target complexity, kernel misspecification, observation noise, distribution shift, model-selection error, hyperparameter-selection error, whether the fitted norm is signal or noise, or whether the target belongs to the native space at all.

That is why the power function is useful as a design diagnostic but fragile as a pointwise uncertainty statement.

The misleading substitution

The tempting applied move is

|f(x) - s_X f(x)| \leq P_X(x)\,\|f\|_{\mathcal H} \qquad\leadsto\qquad \text{error at } x \approx P_X(x).

This drops the factor that depends on the target.

A slightly less naive version replaces \|f\|_{\mathcal H} with \|s_X\|_{\mathcal H}. But that is not a theorem either. It is a heuristic that can be dominated by noise, conditioning, and hyperparameter choices.

The formal theorem controls approximation error for functions in the RKHS. The ML problem is to learn an unknown target from finite, noisy data under model uncertainty. These are not the same problem, and the power function carries the geometry of the first without the statistics of the second.

Takeaway

The power function is a valid geometric quantity. It tells us something real about the design and the kernel.

But the pointwise error bound is a product of design geometry and unknown target complexity:

\text{pointwise error} \leq \text{design geometry} \times \text{unknown target complexity}.

In applications, the first factor is computable and visually appealing. The second factor is usually unknown, unstable, or incompatible with the modeling assumptions.

The useful interpretation is therefore simple: the power function is a geometry-of-information diagnostic. Without meaningful and calibrated control of \|f\|_{\mathcal H}, it is not a pointwise prediction-error estimate.

Suggested citation

@misc{miryusupov2026powerfunction,
  author       = {Miryusupov, Shohruh},
  title        = {When the Power Function Is Not an Error Bar},
  year         = {2026},
  howpublished = {Research note},
  url          = {https://www.miryusupov.com/blog/posts/power_function_not_error_bar/index.html}
}