Prokhorov continuity and machine-learning robustness

2026-05-24

Many machine-learning failures are not caused by obvious outliers. A corrupted point may not be far away in Euclidean distance. A mislabeled example may look perfectly typical. A rare sensor glitch may be indistinguishable from a legitimate observation.

This pushes toward a robustness question that is more distributional than geometric. If P is the clean data-generating distribution, Q is a corrupted version of it, and A(P) is the model learned from P, then what we want is roughly:

P \approx Q \quad \Longrightarrow \quad A(P) \approx A(Q).

The scope of this note is narrow. I want to argue that a learning rule should not amplify a vanishing amount of hidden distributional contamination into an arbitrarily large model movement. This is not a claim that Prokhorov distance is always the right metric. It is a claim about alignment: if the practical uncertainty looks like small-mass contamination, then a robust learner should be stable under small-mass contamination.

Prokhorov continuity

Let \mathcal Z be the data space, for example \mathcal Z=\mathcal X\times\mathcal Y, and let \mathcal P(\mathcal Z) be the space of probability measures on it. A learning method can be viewed abstractly as a map

A:\mathcal P(\mathcal Z)\to\mathcal H,

where \mathcal H might be a parameter space, an RKHS, a set of classifiers, a set of cluster centers, a space of predictive distributions, or whatever object the method produces.

We say that A is Prokhorov-continuous at P if

\rho_{\mathrm{Pr}}(Q_n,P)\to0 \quad\Rightarrow\quad d_{\mathcal H}\{A(Q_n),A(P)\}\to0.

In words: when another data distribution is close to the clean distribution in Prokhorov distance, the fitted object should also be close.

This is closely related to Hampel’s qualitative robustness. Robust statistical procedures are tied to continuity of the statistical functional under weak convergence, and the Prokhorov metric is one standard way to metrize weak convergence on Polish spaces.

A useful contamination model is

Q_\varepsilon=(1-\varepsilon)P+\varepsilon R,

where R is arbitrary contamination. The contaminating distribution may put mass very far away. Nevertheless,

\rho_{\mathrm{Pr}}(P,Q_\varepsilon)\leq\varepsilon.

So Q_\varepsilon\to P in Prokhorov distance as \varepsilon\to0, even when the corrupted values are huge. The tension is simple: Prokhorov sees small mass, but squared-loss methods may see huge magnitude.

Why this matters in ML

In clean mathematical models, bad data often look like visible outliers. In real ML systems, bad data can be much more subtle: missing observations in a stream, rare corrupted sensor readings, label noise near the decision boundary, duplicated or stale records, covariate shift affecting only a small subpopulation, or adversarially chosen low-mass errors.

Pairwise-distance outlier detection asks whether you can identify the bad points geometrically. Prokhorov continuity asks a different question: even if you fail to identify them, can they strongly damage the model? That second question is often more important.

The same issue appears in computer vision. A trained ResNet, OCR system, or document model receiving images with glare, occlusion, motion blur, missing text, or sensor corruption does not always need to recover the clean label. Sometimes the clean label is no longer identifiable.

In that case, robustness should not mean blind invariance. If the evidence needed for prediction has disappeared, the model should not confidently preserve its old answer. A better behavior is to treat the corrupted region as unreliable evidence: lower confidence, abstain, or defer to a human.

So robustness has two sides. A model should be insensitive to irrelevant corruption, but sensitive enough to recognize when prediction-relevant evidence has been damaged. In this sense, robustness is controlled sensitivity, not just invariance.

Prokhorov continuity gives one distributional language for this requirement. A small amount of corrupted visual mass should not cause an arbitrary confident jump in the output distribution. If the prediction changes, it should change in a controlled way: toward uncertainty, abstention, or a justified alternative prediction.

The point is not that Prokhorov distance is the final metric for images. The point is that the same question appears again:

\text{Is the prediction map continuous under the perturbations the system should tolerate?}

Dangerous patterns

The following methods are not Prokhorov-continuous in full generality. The qualification matters: with bounded data, clipping, robust losses, compactness, fixed regularization, or stronger topologies, some of these methods can behave well.

Method	Why Prokhorov continuity can fail
Mean and moment estimators	Weak or Prokhorov convergence does not control unbounded moments.
OLS linear regression	Coefficients depend on first and second moments; rare high-leverage points can dominate.
Ridge regression with squared loss	Regularization helps, but squared loss remains tail-sensitive when inputs or labels are unbounded.
Kernel ridge regression	Squared loss makes the RKHS solution sensitive to rare huge labels.
Gaussian-process regression with Gaussian likelihood	Posterior mean behaves like regularized least-squares smoothing.
PCA and covariance methods	Covariance is not controlled by Prokhorov convergence.
k-means	A tiny far-away mass can steal a centroid; for k=1, k-means is the mean.
Gaussian mixture MLE	Likelihoods can have degeneracies, local optima, and sensitivity to tiny components.
Hard-margin SVM	A tiny contradictory label mass can destroy separability.
SVM with vanishing regularization	Fixed-regularization SVMs can be robust, but robustness may be lost when \lambda_n\to0.
Unregularized ERM with unbounded loss	Small mass can dominate the risk.
Decision trees and greedy splitters	Hard split choices can jump when nearly tied split candidates swap order.
Nearest-neighbor rules under strong output metrics	The identity of the nearest neighbor may change abruptly near decision boundaries.

The common thread is tails, moments, hard thresholds, and unstable argmins. These are the places where a tiny amount of distributional mass can have a large downstream effect.

A minimal counterexample

Let

P=\delta_0

and contaminate it by

Q_\varepsilon=(1-\varepsilon)\delta_0+\varepsilon\delta_{M_\varepsilon}, \qquad M_\varepsilon=\frac{1}{\varepsilon^2}.

Then

\rho_{\mathrm{Pr}}(P,Q_\varepsilon)\leq\varepsilon\to0.

But the mean under Q_\varepsilon is

\mathbb E_{Q_\varepsilon}[X] = \varepsilon\cdot\frac{1}{\varepsilon^2} = \frac{1}{\varepsilon} \to\infty.

So the mean is not Prokhorov-continuous. This example is not only about the mean. It is a template: any method that secretly depends on uncontrolled first or second moments can inherit the same pathology.
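The arithmetic can be checked in a few lines. This is a minimal sketch in exact rational arithmetic; the helper name `contaminated_mean` is introduced here for illustration only.

```python
from fractions import Fraction


def contaminated_mean(eps: Fraction) -> Fraction:
    """Mean of Q_eps = (1 - eps) * delta_0 + eps * delta_{M_eps}, with M_eps = 1 / eps^2."""
    far_point = 1 / eps**2
    return (1 - eps) * 0 + eps * far_point


for k in (1, 2, 3, 4):
    eps = Fraction(1, 10**k)
    # eps is also an upper bound on the Prokhorov distance between P and Q_eps.
    print(f"eps = 1e-{k}   mean under Q_eps = {contaminated_mean(eps)}")
```

The contamination mass shrinks by a factor of ten at each step while the mean grows by the same factor.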

A small numerical diagnostic

The empirical version of the construction is simple. For sample size n, define the clean sample as all-zero labels, and define the contaminated sample as having one label equal to n^2 while all other labels remain zero. The contamination mass is

\varepsilon_n=\frac1n\to0.

The empirical contaminated distribution is close to the clean one in Prokhorov distance. For a bounded RBF kernel applied to the label marginal, the MMD between the clean and contaminated label distributions is also of order 1/n. Meanwhile the mean, KRR prediction, and GP posterior mean all grow with n.

[Figure: Small distributional contamination, large model movement. The contamination mass and bounded-kernel MMD shrink with n, while the mean, KRR prediction, and GP posterior mean grow.]

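The formulas behind the plot, which correspond to taking \lambda=1 in the KRR formula and \tau^2=\sigma^2=1 in the GP formula of the appendix, are: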

\text{mean}=\frac{n^2}{n}=n, \qquad \text{KRR prediction}=\frac{n}{2}, \qquad \text{GP prediction}=\frac{n^2}{1+n}\sim n,

while

\text{RBF-MMD}\approx\frac{\sqrt2}{n}.

This experiment is intentionally simple. It is not meant to be a realistic benchmark. It is a diagnostic counterexample showing that a learner's output can move farther and farther from the clean fit even as the contamination mass goes to zero.
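The diagnostic can be reproduced with the closed forms from the appendix. The sketch below assumes \lambda = 1, \tau^2 = \sigma^2 = 1 and an RBF bandwidth of 1. These values are assumptions chosen to match the formulas above, and the function name `diagnostic` is hypothetical.

```python
import math


def diagnostic(n, lam=1.0, tau2=1.0, sigma2=1.0, bandwidth=1.0):
    """Clean sample: n zero labels. Contaminated sample: n - 1 zeros and one label n^2."""
    bad_label = float(n) ** 2
    label_sum = bad_label                 # every other label is zero
    eps = 1.0 / n                         # contamination mass
    mean = label_sum / n
    krr = mean / (1.0 + lam)              # a_Q = E_Q[Y] / (1 + lambda)
    gp = tau2 * label_sum / (sigma2 + n * tau2)
    # MMD between delta_0 and (1 - eps) delta_0 + eps delta_{n^2}:
    # eps * sqrt(k(0,0) + k(M,M) - 2 k(0,M)), with k(x,x) = 1 for an RBF kernel.
    k_cross = math.exp(-bad_label**2 / (2.0 * bandwidth**2))
    mmd = eps * math.sqrt(2.0 - 2.0 * k_cross)
    return {"eps": eps, "mean": mean, "krr": krr, "gp": gp, "mmd": mmd}


for n in (10, 100, 1000):
    row = diagnostic(n)
    print(n, {name: round(value, 6) for name, value in row.items()})
```

The first and last columns shrink like 1/n while the three fitted quantities grow like n.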

MMD as diagnostic, not cure

MMD should be treated carefully. It is not itself a learning method. It is a discrepancy between distributions:

\operatorname{MMD}_k(P,Q) = \|\mu_P-\mu_Q\|_{\mathcal H}.

For many bounded continuous kernels, MMD is continuous under weak or Prokhorov convergence. For some kernels and spaces, MMD even metrizes weak convergence. So the warning is not that MMD is Prokhorov-discontinuous.

The subtler warning is that MMD can correctly report weak distributional closeness while a downstream learner is discontinuous in that weak topology.

Using the earlier example,

P=\delta_0, \qquad Q_\varepsilon=(1-\varepsilon)\delta_0+\varepsilon\delta_{1/\varepsilon^2},

we have, for a bounded kernel with k(x,x)\leq K,

\operatorname{MMD}_k(P,Q_\varepsilon) \leq 2\varepsilon\sqrt K \to0.

But

\mathbb E_{Q_\varepsilon}[X]\to\infty.

This is not a contradiction. It means the metric and the downstream task are using different notions of closeness. Small MMD does not imply small downstream ML damage. The missing ingredient is continuity of the downstream learning map itself.
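For finitely supported measures the MMD is a finite double sum, so the gap between metric and learner can be computed directly. The following is a sketch under the assumption of an RBF kernel with bandwidth 1 (so K = 1); the helpers `rbf` and `mmd` are illustrative names, not a library API.

```python
import math


def rbf(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth**2))


def mmd(p, q, kernel=rbf):
    """MMD between two finitely supported measures given as {point: mass} dicts."""
    diff = dict(p)                               # signed measure p - q
    for x, w in q.items():
        diff[x] = diff.get(x, 0.0) - w
    squared = sum(wx * wy * kernel(x, y)
                  for x, wx in diff.items() for y, wy in diff.items())
    return math.sqrt(max(squared, 0.0))          # clip tiny negative rounding error


for eps in (0.1, 0.01, 0.001):
    p = {0.0: 1.0}
    q = {0.0: 1.0 - eps, 1.0 / eps**2: eps}
    mean_q = sum(x * w for x, w in q.items())
    print(f"eps={eps}  MMD={mmd(p, q):.5f}  bound={2 * eps:.5f}  mean={mean_q:.1f}")
```

The MMD column stays below the bound 2\varepsilon\sqrt K and shrinks with \varepsilon, while the mean column grows.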

SVMs as a contrast

SVMs are subtle here and should not be put unconditionally on the bad list.

Fixed-regularization soft-margin SVMs with bounded kernels and suitable Lipschitz losses can be qualitatively robust. In that setting, the SVM can be represented as a functional on probability measures and continuity under weak convergence can be proven.

But this robustness is not automatic. It can fail for hard-margin SVMs, vanishing regularization \lambda_n\to0, unstable hyperparameter selection, unsuitable unbounded losses, or settings where the output metric is too strong.

The right lesson is that regularization and loss choice can create Prokhorov stability. SVMs are useful here as a contrast case rather than a blanket example of failure.
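The effect of the loss can be seen in a toy version of the problem. The sketch below is not an SVM: it fits a single constant a under Q_\varepsilon=(1-\varepsilon)\delta_0+\varepsilon\delta_{1/\varepsilon^2} with a fixed penalty \lambda a^2, once with squared loss and once with the Lipschitz absolute loss. The setup and all function names are my own illustration, not taken from the cited SVM results.

```python
def regularized_risk(a, eps, lam, loss):
    """Risk of the constant a under Q_eps, plus the fixed penalty lam * a^2."""
    far = 1.0 / eps**2
    return (1.0 - eps) * loss(0.0 - a) + eps * loss(far - a) + lam * a * a


def argmin_convex(f, lo, hi, iters=300):
    """Ternary search; valid here because both risks are strictly convex in a."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def fitted_constant(eps, lam, loss):
    far = 1.0 / eps**2
    return argmin_convex(lambda a: regularized_risk(a, eps, lam, loss), -1.0, far + 1.0)


def squared(r):
    return r * r


for eps in (0.1, 0.01, 0.001):
    a_sq = fitted_constant(eps, 1.0, squared)
    a_abs = fitted_constant(eps, 1.0, abs)
    print(f"eps={eps}  squared loss: {a_sq:.3f}  absolute loss: {a_abs:.3e}")
```

With squared loss the fit follows 1/[(1+\lambda)\varepsilon]. With absolute loss and \varepsilon<1/2 the minimizer stays at 0, because moving toward the far point gains at most \varepsilon per unit and loses at least 1-\varepsilon.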

Takeaway

Prokhorov continuity is a sanity check, not a guarantee. It asks whether a learning method respects the scale at which the data distribution is reliable.

If a small amount of hidden contamination can move the fitted model arbitrarily far, the method may be inappropriate for noisy data streams, sensor errors, hidden label noise, and high-dimensional pipelines where outliers are hard to detect. The useful pair of questions is not only “are the distributions close?” but also “is the learner continuous under that notion of closeness?” Many standard methods need bounded data, clipping, robust losses, or stable regularization before they are safe to use without thinking about this.

Appendix

This appendix collects minimal examples showing that learning maps are not Prokhorov-continuous in general.

Contamination lemma. For Q_\varepsilon = (1-\varepsilon)P + \varepsilon R and any Borel set B:

Q_\varepsilon(B) \leq P(B) + \varepsilon \leq P(B^\varepsilon) + \varepsilon, \quad P(B) \leq Q_\varepsilon(B^\varepsilon) + \varepsilon.

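Hence \rho_{\mathrm{Pr}}(P, Q_\varepsilon) \leq \varepsilon for every \varepsilon\in[0,1] and every contaminating distribution R. The bound depends only on the contamination mass, not on where R puts it.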
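For measures with finite support on the real line, the two inequalities can be checked by brute force over subsets of the joint support, which is enough because mass outside the support never helps the left-hand side. This is a sketch using closed neighbourhoods: a True result implies \rho_{\mathrm{Pr}}\leq\alpha and a False result implies \rho_{\mathrm{Pr}}\geq\alpha. The helper names are hypothetical.

```python
from itertools import combinations


def contaminate(p, r, eps):
    """Return (1 - eps) * p + eps * r for measures given as {point: mass} dicts."""
    q = {x: (1.0 - eps) * w for x, w in p.items()}
    for x, w in r.items():
        q[x] = q.get(x, 0.0) + eps * w
    return q


def neighbourhood_mass(measure, subset, alpha):
    """Mass that `measure` puts within distance alpha of `subset`."""
    return sum(w for x, w in measure.items()
               if any(abs(x - b) <= alpha for b in subset))


def prokhorov_at_most(p, q, alpha, tol=1e-12):
    """Check mu(B) <= nu(B^alpha) + alpha in both directions, for every B in the support."""
    support = sorted(set(p) | set(q))
    for size in range(1, len(support) + 1):
        for subset in combinations(support, size):
            for mu, nu in ((p, q), (q, p)):
                mass = sum(mu.get(x, 0.0) for x in subset)
                if mass > neighbourhood_mass(nu, subset, alpha) + alpha + tol:
                    return False
    return True


clean = {-1.0: 0.5, 1.0: 0.5}
eps = 0.05
for r in ({1e6: 1.0}, {-50.0: 0.3, 0.5: 0.7}):
    q = contaminate(clean, r, eps)
    print(r, prokhorov_at_most(clean, q, eps))
```

The check passes at \alpha=\varepsilon no matter how far away the contaminating mass sits, which is the content of the lemma.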

Kernel ridge regression. Take X = 0 always and k(0,0) = 1. Population KRR solves

\min_a \;\mathbb{E}[(Y-a)^2] + \lambda a^2, \qquad a_Q = \frac{\mathbb{E}_Q[Y]}{1+\lambda}.

With P = \delta_{(0,0)} and Q_\varepsilon = (1-\varepsilon)\delta_{(0,0)} + \varepsilon\delta_{(0,1/\varepsilon^2)}, we get \rho_{\mathrm{Pr}}(P,Q_\varepsilon) \leq \varepsilon \to 0 but a_{Q_\varepsilon} = 1/[(1+\lambda)\varepsilon] \to \infty while a_P = 0.

Gaussian-process regression. With repeated input x_1 = \cdots = x_n = 0, zero-mean GP prior with k(0,0) = \tau^2, and Gaussian noise \sigma^2, the posterior mean is

m(0) = \frac{\tau^2}{\sigma^2 + n\tau^2} \sum_{i=1}^n y_i.

With one corrupted label y_n = n^2 and the rest zero, the empirical measure moves by at most 1/n in Prokhorov distance, but m_{\text{bad}}(0) = \tau^2 n^2/(\sigma^2 + n\tau^2) \sim n. The issue is not the GP prior; it is the Gaussian likelihood plus unbounded labels.
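Both closed forms above can be compared against a direct linear solve on the degenerate Gram matrix. This is a sketch assuming NumPy and, for the empirical KRR, the normalization (1/n)\sum_i (y_i-f(x_i))^2+\lambda\|f\|^2, which is my assumption about how the population objective is discretized.

```python
import numpy as np


def krr_at_zero(labels, lam=1.0):
    """Empirical KRR prediction at x = 0 when every input is 0 and k(0, 0) = 1."""
    y = np.asarray(labels, dtype=float)
    n = y.size
    gram = np.ones((n, n))
    alpha = np.linalg.solve(gram + n * lam * np.eye(n), y)
    return float(gram[0] @ alpha)          # f(0) = sum_i alpha_i * k(0, 0)


def gp_mean_at_zero(labels, tau2=1.0, sigma2=1.0):
    """GP posterior mean at x = 0: k_*^T (K + sigma^2 I)^{-1} y with K = tau^2 * ones."""
    y = np.asarray(labels, dtype=float)
    n = y.size
    gram = tau2 * np.ones((n, n))
    weights = np.linalg.solve(gram + sigma2 * np.eye(n), y)
    return float(gram[0] @ weights)


for n in (10, 100, 400):
    labels = np.zeros(n)
    labels[-1] = n**2                      # one corrupted label, mass 1/n
    print(n, krr_at_zero(labels), n / 2, gp_mean_at_zero(labels), n**2 / (1 + n))
```

Each row prints the solved value next to its closed form, n/(1+\lambda) for KRR and \tau^2 n^2/(\sigma^2+n\tau^2) for the GP, here with all hyperparameters set to 1.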

k-means. For k=1, k-means is the mean. For k=2 with

P = \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_1 \quad \text{and} \quad Q_\varepsilon = (1-\varepsilon)P + \varepsilon\delta_{1/\varepsilon^2},

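keeping both centers near \pm 1 costs roughly \varepsilon M_\varepsilon^2 = 1/\varepsilon^3 \to \infty with M_\varepsilon = 1/\varepsilon^2, whereas placing one center on the far-away mass and the other at 0 costs only 1-\varepsilon. So for small \varepsilon one center moves to the far-away mass.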
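The cost comparison is easy to evaluate for the discrete measure. The sketch below compares two candidate configurations only and does not run a k-means solver; `kmeans_cost` is an illustrative helper.

```python
def kmeans_cost(measure, centers):
    """Expected squared distance to the nearest center under a {point: mass} measure."""
    return sum(w * min((x - c) ** 2 for c in centers) for x, w in measure.items())


for eps in (0.1, 0.01, 0.001):
    far = 1.0 / eps**2
    q = {-1.0: (1.0 - eps) / 2.0, 1.0: (1.0 - eps) / 2.0, far: eps}
    keep = kmeans_cost(q, (-1.0, 1.0))     # centers stay on the clean clusters
    move = kmeans_cost(q, (0.0, far))      # one center is stolen by the far mass
    print(f"eps={eps}  keep={keep:.3e}  move={move:.3f}")
```

Since the second configuration always costs less than 1, the minimizer cannot keep both centers near \pm 1 once \varepsilon is small.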

PCA. With P = \frac{1}{2}\delta_{(-1,0)} + \frac{1}{2}\delta_{(1,0)} (leading direction horizontal) and Q_\varepsilon = (1-\varepsilon)P + \varepsilon\delta_{(0,1/\varepsilon)}, the vertical variance becomes 1/\varepsilon - 1 \to \infty, and the top PCA direction flips from horizontal to vertical.
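The flip can be verified by computing the weighted covariance and its leading eigenvector. This is a minimal sketch assuming NumPy; `top_direction` is a name introduced for this example.

```python
import numpy as np


def top_direction(points, weights):
    """Leading eigenvector and covariance of a finitely supported measure in the plane."""
    pts = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = w @ pts
    centered = pts - mean
    cov = (centered * w[:, None]).T @ centered
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1], cov


clean_dir, _ = top_direction([(-1, 0), (1, 0)], [0.5, 0.5])
print("clean:", np.abs(clean_dir))

for eps in (0.1, 0.01):
    pts = [(-1, 0), (1, 0), (0, 1.0 / eps)]
    wts = [(1 - eps) / 2, (1 - eps) / 2, eps]
    direction, cov = top_direction(pts, wts)
    print(f"eps={eps}  vertical variance={cov[1, 1]:.2f}  direction={np.abs(direction)}")
```

The sign of an eigenvector is arbitrary, so the code prints absolute values of the components.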

Hard-margin SVM. Writing points as (x,y), P = \frac{1}{2}\delta_{(-1,-1)} + \frac{1}{2}\delta_{(1,+1)} is perfectly separable. Under Q_\varepsilon = (1-\varepsilon)P + \varepsilon\delta_{(1,-1)}, the input x=1 carries both labels, so an arbitrarily small Prokhorov perturbation makes the hard-margin problem infeasible.

MMD calculation. For bounded k(x,x) \leq K and Q_\varepsilon = (1-\varepsilon)\delta_0 + \varepsilon\delta_{1/\varepsilon^2}:

\operatorname{MMD}_k(P, Q_\varepsilon) = \varepsilon\|k(1/\varepsilon^2, \cdot) - k(0,\cdot)\|_{\mathcal{H}} \leq 2\varepsilon\sqrt{K} \to 0.

Meanwhile \mathbb{E}_{Q_\varepsilon}[X] = 1/\varepsilon \to \infty. The discrepancy is weak; the learner is tail-sensitive.

References

F. R. Hampel. “A General Qualitative Definition of Robustness.” The Annals of Mathematical Statistics, 1971. https://doi.org/10.1214/aoms/1177693054

R. Hable and A. Christmann. “Qualitative Robustness of Support Vector Machines.” Journal of Multivariate Analysis, 2011. https://doi.org/10.1016/j.jmva.2011.02.008

C.-J. Simon-Gabriel, A. Barp, B. Schölkopf, and L. Mackey. “Metrizing Weak Convergence with Maximum Mean Discrepancies.” Journal of Machine Learning Research, 2023. https://www.jmlr.org/papers/v24/21-0599.html

M. Debruyne, M. Hubert, and J. A. K. Suykens. “Robustness of Reweighted Least Squares Kernel Based Regression.” Journal of Multivariate Analysis, 2010. https://doi.org/10.1016/j.jmva.2009.09.004

K. De Brabanter and J. Vandewalle. “Robustness by Reweighting for Kernel Estimators: An Overview.” Statistical Science, 2021. https://doi.org/10.1214/20-STS816

Suggested citation

@misc{miryusupov2026prokhorovcontinuity,
  author       = {Miryusupov, Shohruh},
  title        = {Prokhorov Continuity and Machine-Learning Robustness},
  year         = {2026},
  howpublished = {Research note},
  url          = {https://www.miryusupov.com/blog/posts/prokhorov_continuity_ml/index.html}
}