Discrepancy is not risk

2026-05-28

A distribution discrepancy answers one question. A classifier risk answers another.

This distinction is easy to lose because the same words keep appearing: distance, separation, alignment, mismatch. MMD, Wasserstein distance, Euclidean distance, Mahalanobis distance, and learned embedding distances all sound as if they measure whether two classes or two domains are easy to tell apart.

Sometimes they do. Often they do not.

A two-sample discrepancy compresses a comparison into one number,

D(P,Q).

If this number is large, the distributions are far apart in the chosen geometry. If it is small, they are close in that geometry. That is the right answer when the question is a two-sample question.

Classification asks for something more specific. A new observation z comes from one of two populations, say \pi_1 or \pi_2, and a particular rule \widehat g will be used. The relevant quantity is not just whether \pi_1\ne\pi_2. It is the error of that rule:

P\{\widehat g(z)=2\mid z\in\pi_1\}.

That is a different object. A discrepancy compares distributions. A risk evaluates a decision rule.

The examples below are deliberately small. They are not arguments against MMD, Wasserstein distance, or other discrepancies, and they are not arguments in favor of the T-rules or D-rules defined in the next section. They isolate a single failure mode: reading a geometry-dependent distribution comparison as if it were the risk of the classifier actually being deployed.

Three objects, three questions

Let

X_1,\ldots,X_{n_1}\sim \pi_1, \qquad Y_1,\ldots,Y_{n_2}\sim \pi_2,

with means \mu_1,\mu_2 and, in the classical setup, common covariance \Sigma. A new observation z must be assigned to one of the two populations.

The T-rule compares corrected Euclidean distances to the two sample centers:

\alpha_1\|z-\bar X\|^2 \quad\text{and}\quad \alpha_2\|z-\bar Y\|^2, \qquad \alpha_i=\frac{n_i}{n_i+1}.

For a true \pi_1 observation, its false assignment probability is

P_T(2\mid 1) = P\left\{ \alpha_1\|z-\bar X\|^2- \alpha_2\|z-\bar Y\|^2>0 \mid z\in\pi_1 \right\}.
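As a minimal sketch of the rule just described, the function below compares the two corrected Euclidean distances and assigns z to population 2 when the statistic is positive. The helper names (`mean_vector`, `sq_dist`, `t_rule`), the toy samples, and the choice to send ties to population 1 are illustrative assumptions, not part of the original definition.

```python
def mean_vector(sample):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(sample)
    return [sum(col) / n for col in zip(*sample)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def t_rule(z, xs, ys):
    """Return 1 or 2 according to the corrected Euclidean center comparison."""
    alpha1 = len(xs) / (len(xs) + 1)  # alpha_i = n_i / (n_i + 1)
    alpha2 = len(ys) / (len(ys) + 1)
    stat = alpha1 * sq_dist(z, mean_vector(xs)) - alpha2 * sq_dist(z, mean_vector(ys))
    # Assumption: a tie (stat == 0) is sent to population 1.
    return 2 if stat > 0 else 1


# Toy samples, chosen only to exercise the rule.
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ys = [[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]
label_near_x = t_rule([0.5, 0.5], xs, ys)
label_near_y = t_rule([4.5, 4.0], xs, ys)
print(label_near_x, label_near_y)
```

The false assignment probability P_T(2\mid 1) is then the probability that `t_rule` returns 2 when z is drawn from \pi_1.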

The D-rule uses covariance-scaled geometry instead. In the population version, the relevant mean separation is

\Delta^2=(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2).

MMD asks a different question. For a reproducing-kernel Hilbert space \mathcal H with kernel k,

\operatorname{MMD}(P,Q) = \sup_{\|f\|_{\mathcal H}\le 1} \left|E_P f(X)-E_Q f(Y)\right|.

So the table below is not a list of interchangeable distances. It is a list of different questions.

object	question
T-rule	What is the error of a Euclidean center rule?
D-rule	What is the error of a covariance-scaled center rule?
MMD	Are two distributions different in the chosen RKHS geometry?

The rest of the note is about what goes wrong when one answer is read as another.
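For a concrete handle on the MMD row, the squared MMD can be written with kernel expectations as E k(X,X') + E k(Y,Y') - 2E k(X,Y), which is the standard form used in the kernel two-sample literature. The sketch below is the plug-in (biased, V-statistic) estimate of that expression for scalar samples. The function names and the toy data are assumptions made for illustration.

```python
import math


def linear_kernel(x, y):
    return x * y


def rbf_kernel(x, y, tau2=1.0):
    """Gaussian RBF kernel with squared bandwidth tau2."""
    return math.exp(-((x - y) ** 2) / (2.0 * tau2))


def mmd2_plugin(xs, ys, kernel):
    """Plug-in estimate: mean k(x,x') + mean k(y,y') - 2 * mean k(x,y)."""
    m, n = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy


xs = [0.0, 2.0]  # sample mean 1
ys = [1.0, 3.0]  # sample mean 2
mmd2_lin = mmd2_plugin(xs, ys, linear_kernel)
mmd2_rbf = mmd2_plugin(xs, ys, rbf_kernel)
print(mmd2_lin, mmd2_rbf)
```

With the linear kernel the plug-in estimate is exactly the squared difference of sample means. Nothing in the computation refers to a classifier, which is the point of the table.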

Why T and D are useful reference points

T-type and D-type rules are useful here because they make a basic high-dimensional tension visible.

In low dimension, the natural instinct is

\text{estimate covariance} \rightarrow \text{invert it} \rightarrow \text{use Mahalanobis geometry}.

That is the D philosophy. It can be very good when the covariance estimate is reliable, because it uses scale and correlation structure.

In high dimension, the covariance matrix is itself a noisy object. When p/n is not small, its eigenvalues no longer behave like classical fixed-p quantities. If p>n, the sample covariance is singular. Even when p<n, inversion can amplify spectral noise.
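The singularity claim for p>n can be checked directly. The sketch below builds a sample covariance from n=5 observations in p=8 dimensions and computes its exact rank with rational arithmetic. The integer toy data, the seed, and the helper names are assumptions. The bound rank \le n-1 holds for any data, because the centered rows sum to zero.

```python
import random
from fractions import Fraction


def sample_covariance(rows):
    """Unbiased sample covariance matrix of the rows (exact with Fractions)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def rank(matrix):
    """Exact rank by Gaussian elimination; entries are expected to be Fractions."""
    m = [row[:] for row in matrix]
    rk, n_rows, n_cols = 0, len(m), len(m[0])
    for col in range(n_cols):
        pivot = next((r for r in range(rk, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, n_rows):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


rng = random.Random(0)
n, p = 5, 8
data = [[Fraction(rng.randint(-5, 5)) for _ in range(p)] for _ in range(n)]
S = sample_covariance(data)
rank_S = rank(S)
print(f"p = {p}, n = {n}, rank of sample covariance = {rank_S}")
```

A p-by-p matrix of rank at most n-1 < p has no inverse, so a Mahalanobis rule needs regularization or a pseudo-inverse before it can even be evaluated.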

The T philosophy avoids covariance inversion. It uses Euclidean or trace information instead of the full inverse covariance. This sacrifices affine invariance, but it can be stable when the D philosophy is fragile.

regime	D-type / Mahalanobis logic	T-type / trace logic
p small, n large	often excellent	may ignore useful covariance structure
p/n non-negligible	affected by random-matrix noise	often more stable
p>n	needs regularization or becomes undefined	still directly defined
signal in low-variance correlated direction	can be much better	can miss the right geometry
covariance estimate unreliable	can overfit badly	may be safer

This is the statistical version of a practical ML question:

Should I trust the learned geometry, or use a cruder but stabler one?

A deep embedding may have dimension 768, 2048, or larger, while the labeled sample per class is modest. A full covariance-based rule can be fragile. A Euclidean prototype rule can be biased but stable. A kernel discrepancy such as MMD can be meaningful as a distribution test while still saying little about the finite-sample risk of the classifier that will actually be deployed.

The operative question is

\text{What is the distribution of the decision statistic, and what error does it imply?}

More plainly: what error does the actual rule make?

In the first few diagnostics I use population, known-parameter analogues of the T/D rules. That removes estimation noise, so the distinction between discrepancy and decision risk is not hidden by finite-sample terms.

Diagnostic 1: the same linear MMD can mean different errors

Start with the one-dimensional Gaussian pair

\pi_1=N(0,\sigma^2), \qquad \pi_2=N(1,\sigma^2).

With the linear kernel

k(x,y)=xy,

the squared MMD reduces to the squared mean difference:

\operatorname{MMD}^2(\pi_1,\pi_2)=|0-1|^2=1.

So the discrepancy is identical for every \sigma.

The classification error is not. With known means and common variance, the T-rule and D-rule have the same one-dimensional boundary,

z=\frac12.

Therefore

P\{\widehat g(z)=2\mid z\in\pi_1\} = P\{Z>1/2\}, \qquad Z\sim N(0,\sigma^2),

and hence

P\{\widehat g(z)=2\mid z\in\pi_1\} =1-\Phi\left(\frac{1}{2\sigma}\right).

A one-unit mean shift is enormous when \sigma=0.1. It is almost invisible when \sigma=100.

Table 1

\sigma	linear MMD^2	known T/D error P(2\mid 1)
0.1	1.000	2.87e-07
1.0	1.000	0.30853754
10.0	1.000	0.48006119
100.0	1.000	0.49800530

The MMD value did not change. The classification meaning did. The discrepancy is correct in its chosen geometry; it is just not a risk calculation.
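The numbers in Table 1 can be reproduced in a few lines. This is a sketch using only the formulas above; the helper names are mine, and \Phi is computed from the complementary error function.

```python
import math


def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def linear_mmd2(mu1, mu2):
    """Squared MMD under the linear kernel k(x, y) = x * y."""
    return (mu1 - mu2) ** 2


def known_rule_error(sigma):
    """P(2 | 1) for the boundary z = 1/2: 1 - Phi(1 / (2 sigma)) = Phi(-1 / (2 sigma))."""
    return phi(-0.5 / sigma)


rows = [(s, linear_mmd2(0.0, 1.0), known_rule_error(s)) for s in (0.1, 1.0, 10.0, 100.0)]
for sigma, mmd2, err in rows:
    print(f"sigma={sigma:6.1f}  linear MMD^2={mmd2:.3f}  P(2|1)={err:.8g}")
```

The second column never moves because `linear_mmd2` does not take \sigma as an argument at all; the third column moves because the error formula depends on the mean shift only through its ratio to \sigma.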

Diagnostic 2: RBF-MMD can shrink while error stays fixed

The previous example used the linear kernel because the algebra is transparent. A high-dimensional version appears with the Gaussian RBF kernel.

Let

\pi_1=N_p(0,I_p), \qquad \pi_2=N_p(2e_1,I_p),

where only the first coordinate separates the classes.

The Bayes rule, known-mean T-rule, and known-covariance D-rule all have the same error:

P(2\mid 1)=\Phi(-1)\approx 0.1587.

This error does not depend on p. The signal remains two standard deviations apart in the first coordinate.

Now use the RBF kernel

k(x,y)=\exp\left(-\frac{\|x-y\|^2}{2\tau^2}\right), \qquad \tau^2=2p.

For equal-covariance Gaussians with mean difference 2e_1,

\operatorname{MMD}^2 = 2\left(1+\frac{1}{p}\right)^{-p/2} \left[1-\exp\left(-\frac{1}{p+1}\right)\right].

The classification error is fixed. The RBF-MMD value decays at order 1/p: for large p it behaves like 2e^{-1/2}/p \approx 1.21/p.

Table 2

p	RBF-MMD^2, \tau^2=2p	known-rule error
2	0.3780	0.1587
5	0.1946	0.1587
10	0.1079	0.1587
20	0.0571	0.1587
50	0.0237	0.1587
100	0.0120	0.1587
200	0.0060	0.1587
500	0.0024	0.1587
1000	0.0012	0.1587

Nothing mysterious is happening. The classifier uses the coordinate that carries the signal. The kernel discrepancy, with this bandwidth, spreads the comparison across the full ambient distance.

So a small MMD value does not imply high classification error. Nor does a large MMD value identify the error of the classifier being used. It only reports separation in the chosen kernel geometry.

In this example, that geometry is not aligned with the discriminative coordinate.
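A short sketch reproduces Table 2 from the closed form above and also prints p times MMD^2, which makes the 1/p decay visible. The function names are illustrative. The limiting constant 2e^{-1/2} follows from (1+1/p)^{-p/2} \to e^{-1/2} and 1-\exp(-1/(p+1)) \approx 1/p.

```python
import math


def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def rbf_mmd2(p):
    """Closed-form MMD^2 from the text: mean shift 2*e_1, identity covariance, tau^2 = 2p."""
    return 2.0 * (1.0 + 1.0 / p) ** (-p / 2.0) * (1.0 - math.exp(-1.0 / (p + 1.0)))


known_rule_error = phi(-1.0)           # does not depend on p
limit_constant = 2.0 * math.exp(-0.5)  # p * MMD^2 approaches this as p grows

table = [(p, rbf_mmd2(p), p * rbf_mmd2(p)) for p in (2, 5, 10, 20, 50, 100, 200, 500, 1000)]
for p, mmd2, scaled in table:
    print(f"p={p:5d}  MMD^2={mmd2:.4f}  p*MMD^2={scaled:.4f}  error={known_rule_error:.4f}")
```

The rescaled column settles near 1.21 while the error column stays at 0.1587, so the shrinkage is a property of the bandwidth choice \tau^2=2p and not of the classification problem.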

Diagnostic 3: MMD can see a difference that mean rules cannot use

Now reverse the problem. Let

\pi_1=N_p(0,I_p), \qquad \pi_2=N_p(0,\operatorname{diag}(9,1,\ldots,1)).

The means are equal:

\mu_1=\mu_2=0.

A T-rule has no mean direction to use, and neither does the usual common-covariance D-rule when it is used only for mean separation. At the population level, the center comparison carries no class information, so the rule does no better than a coin flip:

P(2\mid 1)=\frac12.

But the distributions are not equal. The first coordinate has variance 1 under \pi_1 and variance 9 under \pi_2. An RBF-MMD can detect this.

Table 3

method/object	what it sees	value
T/D mean rule	no mean separation	0.5000 error
RBF-MMD	variance difference	0.0299 MMD^2
Bayes quadratic rule	variance separation	0.1159 false 2\mid 1 error

A quadratic Bayes rule can exploit the variance difference. In the one-coordinate variance example, the false assignment probability under \pi_1 is

P(2\mid 1) = P\left\{|Z|> \sqrt{\frac{2\log 3}{1-1/9}} \right\}, \qquad Z\sim N(0,1),

which is approximately 0.1159.
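Both non-trivial entries of Table 3 can be checked with closed forms. For zero-mean Gaussians with diagonal covariances, each RBF kernel expectation factorizes over coordinates as \prod_j (1+s_j/\tau^2)^{-1/2}, where s_j is the variance of the coordinate difference. Table 3 does not state p or the bandwidth; the sketch below assumes p=10 with the Diagnostic 2 choice \tau^2=2p, which happens to reproduce the reported 0.0299. Treat that setting as my assumption, not as the documented one.

```python
import math


def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def bayes_false_2_given_1(var2=9.0):
    """P(|Z| > t) with t^2 = log(var2) / (1 - 1/var2), Z ~ N(0, 1)."""
    t = math.sqrt(math.log(var2) / (1.0 - 1.0 / var2))
    return 2.0 * phi(-t)


def expected_rbf(diff_variances, tau2):
    """E k(U, V) for independent zero-mean Gaussians; diff_variances are Var(U_j - V_j)."""
    prod = 1.0
    for s in diff_variances:
        prod *= (1.0 + s / tau2) ** -0.5
    return prod


def rbf_mmd2_variance_shift(p, tau2, var2=9.0):
    v1 = [1.0] * p                   # variances under pi_1
    v2 = [var2] + [1.0] * (p - 1)    # variances under pi_2
    kxx = expected_rbf([a + a for a in v1], tau2)
    kyy = expected_rbf([b + b for b in v2], tau2)
    kxy = expected_rbf([a + b for a, b in zip(v1, v2)], tau2)
    return kxx + kyy - 2.0 * kxy


p = 10  # assumption: not stated in Table 3
bayes_err = bayes_false_2_given_1()
mmd2 = rbf_mmd2_variance_shift(p, tau2=2.0 * p)
print(f"Bayes quadratic P(2|1) = {bayes_err:.4f}, RBF-MMD^2 = {mmd2:.4f}")
```

The threshold inside `bayes_false_2_given_1` equals \sqrt{2\log 3/(1-1/9)} when `var2` is 9, since \log 9 = 2\log 3.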

So there are three correct answers:

object	answer
MMD	the distributions differ
T/D mean rule	there is no usable mean direction under this rule
Bayes quadratic rule	the variance difference is usable if the classifier may use it

None of these answers is wrong. Trouble starts when one is used as a substitute for the others.

Diagnostic 4: the wrong geometry can make a good signal look useless

Now keep common covariance and mean separation, but change the geometry.

Let

\pi_1=N_2(-\delta/2,\Sigma), \qquad \pi_2=N_2(\delta/2,\Sigma),

with

\delta=(1,1), \qquad \Sigma=\operatorname{diag}(0.1^2,10^2).

There is one stable coordinate and one noisy coordinate. The Euclidean T-rule uses the direction \delta=(1,1), giving the noisy coordinate the same coefficient as the stable coordinate.

The population T-rule error is

P_T(2\mid 1) = \Phi\left( -\frac{\|\delta\|^2}{2\sqrt{\delta^\top\Sigma\delta}} \right).

The D-rule uses covariance-scaled geometry. Its error is

P_D(2\mid 1) = \Phi\left( -\frac12 \sqrt{\delta^\top\Sigma^{-1}\delta} \right).

Table 4

rule	signal scale	variance scale	error P(2\mid 1)
T-rule / Euclidean	\|\delta\|^2=2.00	\delta^\top\Sigma\delta=100.01	0.46017415
D-rule / Mahalanobis	\Delta^2=100.01	covariance-scaled	2.86e-07

The populations, means, and covariance have not changed. Only the decision geometry has.

This is the T/D lesson in miniature: classifier risk depends on the decision geometry, not only on whether the distributions differ.
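Table 4 follows from the two error formulas with a diagonal covariance, so it can be recomputed without any matrix library. This is a sketch; the variable names are mine.

```python
import math


def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


delta = (1.0, 1.0)
sigma_diag = (0.1 ** 2, 10.0 ** 2)  # diagonal entries of Sigma

signal = sum(d * d for d in delta)                                # ||delta||^2
euclid_var = sum(d * d * s for d, s in zip(delta, sigma_diag))    # delta' Sigma delta
mahalanobis = sum(d * d / s for d, s in zip(delta, sigma_diag))   # delta' Sigma^{-1} delta

t_error = phi(-signal / (2.0 * math.sqrt(euclid_var)))
d_error = phi(-0.5 * math.sqrt(mahalanobis))

print(f"T-rule: signal={signal:.2f}, variance={euclid_var:.2f}, error={t_error:.8f}")
print(f"D-rule: Delta^2={mahalanobis:.2f}, error={d_error:.3g}")
```

For this particular \delta and \Sigma the two quadratic forms coincide numerically (0.01+100 versus 100+0.01). The errors differ because the T-rule divides a signal of 2 by that variance scale, while the D-rule uses the same number as its signal.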

Diagnostic 5: a shortcut swap in a vision representation

A computer-vision version is a shortcut-feature example.

Imagine a binary image classification problem. The true label is the object category, but the source training set contains a shortcut: background color, texture, lighting, camera type, or acquisition artifact. A model can classify the source domain perfectly using that shortcut.

Let the learned representation be one scalar,

h(x)\in\{-1,+1\},

where h records the shortcut rather than the object. In the source domain,

Y=+1 \Rightarrow h=+1, \qquad Y=-1 \Rightarrow h=-1.

In the target domain, the shortcut relation is reversed:

Y=+1 \Rightarrow h=-1, \qquad Y=-1 \Rightarrow h=+1.

Assume the class prior is balanced in both domains:

P(Y=+1)=P(Y=-1)=\frac12.

Then the marginal feature distribution is identical in the two domains:

P_S(h=+1)=P_T(h=+1)=\frac12, \qquad P_S(h=-1)=P_T(h=-1)=\frac12.

Therefore the MMD between the unlabeled source and target feature distributions is exactly zero, for any kernel, because the two marginal distributions are the same.

But the source-trained classifier

\widehat g(h)=\operatorname{sign}(h)

has source error 0 and target error 1.

Table 5

quantity	value	interpretation
source marginal P_S(h)	uniform on \{-1,+1\}	balanced shortcut feature
target marginal P_T(h)	uniform on \{-1,+1\}	same unlabeled feature distribution
\operatorname{MMD}^2(P_S(h),P_T(h))	0	perfect marginal alignment
source error of \operatorname{sign}(h)	0	shortcut works in source
target error of \operatorname{sign}(h)	1	shortcut reverses in target

The class-conditional discrepancy tells a different story. With the linear kernel on h,

\operatorname{MMD}^2 \left(P_S(h\mid Y=+1),P_T(h\mid Y=+1)\right) = (1-(-1))^2=4.

So MMD is not blind here. Applied to the marginals, it correctly reports that they are identical. It misses the shortcut reversal only because the reversal lives in the label-conditional distributions, which a marginal comparison never sees.

In computer-vision language: matching the global distribution of features, textures, backgrounds, or styles does not by itself guarantee that the object-label decision boundary is aligned. A domain discrepancy can improve while the target classifier is still using the wrong visual cue.
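Because the example is finite, every entry of Table 5 can be enumerated exactly. The sketch below encodes the two joint laws over (label, feature) as dictionaries; the data layout and function names are my own choices.

```python
# Joint laws over (label y, feature h), following the balanced-prior setup.
source = {(+1, +1): 0.5, (-1, -1): 0.5}
target = {(+1, -1): 0.5, (-1, +1): 0.5}


def marginal_h(joint):
    """Marginal distribution of the feature h."""
    out = {-1: 0.0, +1: 0.0}
    for (_, h), prob in joint.items():
        out[h] += prob
    return out


def error_of_sign_rule(joint):
    """Error of g(h) = sign(h); h is already +1 or -1, so sign(h) = h."""
    return sum(prob for (y, h), prob in joint.items() if h != y)


def conditional_mean_h(joint, y):
    """E[h | Y = y]."""
    mass = sum(prob for (yy, _), prob in joint.items() if yy == y)
    return sum(h * prob for (yy, h), prob in joint.items() if yy == y) / mass


def linear_mmd2(mean_a, mean_b):
    """Squared MMD under the linear kernel: squared mean difference."""
    return (mean_a - mean_b) ** 2


same_marginals = marginal_h(source) == marginal_h(target)
source_error = error_of_sign_rule(source)
target_error = error_of_sign_rule(target)
conditional_mmd2 = linear_mmd2(conditional_mean_h(source, +1), conditional_mean_h(target, +1))
print(same_marginals, source_error, target_error, conditional_mmd2)
```

Identical marginals force the marginal MMD to zero under every kernel, so the only quantity in this script that registers the reversal is the one that conditions on the label.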

Technical addendum: Gaussian-looking risk formulas can still be wrong

A related but more technical warning appears after the decision rule has already been chosen. Even then, the risk formula may depend on distributional features that are invisible to a first-and-second-moment summary. This section is less about MMD and more about the danger of replacing a decision statistic by an overly Gaussian approximation.

Gaussian formulas are tempting because they are clean. But high-dimensional T-rule approximations for general populations can contain higher-moment terms. Third and fourth moments can enter the variance of the decision statistic.

For learned representations, this is not a corner case. They can be skewed, sparse, heavy-tailed, clipped, quantized, or produced by nonlinear activations.

Consider the simplified common-covariance model

\pi_1: z=u, \qquad \pi_2: z=u+\delta, \qquad \Sigma=I_p,

where the coordinates of u are independent and standardized:

E u_j=0, \qquad E u_j^2=1, \qquad E u_j^3=\theta, \qquad E u_j^4=\gamma.

For equal sample sizes n_1=n_2=n, write

\alpha=\frac{n}{n+1}.

A representative large-dimensional T-rule approximation has the form

P_T(2\mid 1) \approx \Phi\left(-\frac{\alpha\|\delta\|^2}{B_p}\right),

where, in the diagonal identity-covariance case, the variance scale has terms of the schematic form

B_p^2 \approx \left[\beta_0+\frac{2\alpha^2\gamma}{n^3}\right]p +\frac{4\theta}{n^2}\mathbf 1^\top\delta +4\alpha\|\delta\|^2.

The exact constants, including the leading term \beta_0, depend on the asymptotic setup and normalization. The structural point is what matters here:

\text{classification error can depend on } \gamma \text{ and } \theta.

Two representations can have the same mean and covariance while producing different T-rule errors because their coordinate distributions have different skewness or tail behavior.

Table 6

coordinate law	fourth moment \gamma	T-rule approximation
Rademacher-like, light-tailed	1	0.1496
Gaussian	3	0.1500
moderately heavy-tailed	9	0.1513
spiky finite-moment	100	0.1683
very spiky finite-moment	500	0.2221

The Gaussian line reports approximately 0.1500. With the same mean and covariance but a very spiky finite-moment representation, the approximation is closer to 0.2221.

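Skewness can also enter. If \mathbf 1^\top\delta\ne0, the third-moment term changes the variance of the T-rule statistic. The two moments cannot be varied independently: every standardized law satisfies \gamma\ge\theta^2+1, so |\theta|=4 requires \gamma\ge17.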

Table 7

skewness \theta	T-rule approximation
-4	0.1456
-2	0.1485
0	0.1513
2	0.1540
4	0.1566

The ML warning is small but important: whitening fixes first and second moments. It does not make a representation Gaussian. It also does not make a Gaussian error formula automatically correct.
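To see that matching the first two moments leaves the fourth moment free, consider a hypothetical three-point coordinate law: u=\pm q^{-1/2} with probability q/2 each, and u=0 otherwise. It has mean 0 and variance 1 for every q, while its fourth moment is 1/q. This law is an illustration of my own, not necessarily the one behind Table 6. It does show that \gamma=100 and \gamma=500 are attainable by already-standardized coordinates.

```python
def three_point_moments(q):
    """First four moments of u = +/- q**-0.5 w.p. q/2 each, and u = 0 w.p. 1 - q."""
    a = q ** -0.5
    support = [(-a, q / 2.0), (0.0, 1.0 - q), (a, q / 2.0)]
    return [sum(prob * x ** k for x, prob in support) for k in (1, 2, 3, 4)]


results = {}
for q in (1.0, 1.0 / 3.0, 0.01, 0.002):
    mean, var, third, fourth = three_point_moments(q)
    results[q] = (mean, var, third, fourth)
    print(f"q={q:.4f}  mean={mean:.3f}  variance={var:.3f}  "
          f"theta={third:.3f}  gamma={fourth:.1f}")
```

With q=1 the law is Rademacher (\gamma=1), and q=1/3 matches the Gaussian fourth moment without being Gaussian. All four laws are already standardized, so whitening would leave every one of them, and every one of these \gamma values, unchanged.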

Summary of the diagnostics

diagnostic	discrepancy says	classifier risk says	lesson
same linear MMD, different \sigma	unchanged	changes from near zero to near random	scale matters
RBF-MMD in high p	shrinks	fixed	ambient geometry can dilute signal
variance-only difference	detects distribution difference	mean rule fails	distribution difference need not be usable by a given rule
T versus D geometry	same populations	different errors	rule geometry matters
shortcut swap	marginal MMD is zero	target error is maximal	unlabeled alignment can miss label-conditional failure
non-Gaussian moments	first two moments agree	approximate risk changes	higher moments can affect decision statistics

What this says about ML practice

The practical failure is not MMD-specific. It is the habit of replacing downstream error by a proxy geometry.

A representation method may reduce MMD between domains. A contrastive objective may separate pairs. A retrieval metric may put some objects close. A Wasserstein loss may reduce transport cost. Each statement can be true while the task error moves in the wrong direction.

For example, reducing domain MMD in a vision representation does not by itself show that the classifier’s margin distribution, class-conditional overlap, or false-positive rate improved. It only shows that the chosen source and target distributions became closer in the chosen RKHS geometry.

The same warning applies to representation monitoring. A shift score in embedding space can be a useful alarm, but it is not automatically an accuracy-drop estimate. To make it operational, one should check whether the score is aligned with the deployed model’s margin distribution, class-conditional errors, false-positive rate, or task-specific loss. Without that link, the score remains a diagnostic of representation geometry, not a statement about downstream risk.

The missing question is always

P\{\widehat g(z)\ne y\}.

Or, in asymmetric problems,

P\{\widehat g(z)=2\mid z\in\pi_1\}.

A discrepancy is not wrong because it differs from classification error. It becomes misleading only when it is interpreted as classification error without proving the link.

A useful workflow is

\text{population model} \rightarrow \text{decision rule} \rightarrow \text{distribution of the decision statistic} \rightarrow \text{misclassification probability}.

A weaker workflow is

\text{population model} \rightarrow \text{distance between distributions} \rightarrow \text{informal claim about classification}.

The second workflow is where the mistake enters.
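The first workflow can be run end to end by simulation when no closed form is available. As a sketch, the code below takes the Diagnostic 4 population, draws z from \pi_1, evaluates the decision statistic of a known-parameter linear rule, and counts how often it lands on the wrong side. The function name, seed, and number of draws are assumptions. With means at \pm\delta/2, the T-rule assigns to \pi_2 when \delta^\top z>0 and the D-rule when (\Sigma^{-1}\delta)^\top z>0.

```python
import math
import random


def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


DELTA = (1.0, 1.0)
SD = (0.1, 10.0)  # square roots of the diagonal of Sigma


def simulate_error(direction, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(2 | 1) for the rule 'assign 2 if direction . z > 0'."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_draws):
        # population model: z ~ pi_1 = N(-delta/2, Sigma)
        z = [-d / 2.0 + s * rng.gauss(0.0, 1.0) for d, s in zip(DELTA, SD)]
        # decision statistic
        stat = sum(w * zi for w, zi in zip(direction, z))
        if stat > 0:
            errors += 1
    return errors / n_draws


t_direction = DELTA                                       # Euclidean geometry
d_direction = tuple(d / s ** 2 for d, s in zip(DELTA, SD))  # Sigma^{-1} delta

t_hat = simulate_error(t_direction)
d_hat = simulate_error(d_direction)
t_exact = phi(-2.0 / (2.0 * math.sqrt(100.01)))
print(f"T-rule: simulated {t_hat:.4f}, formula {t_exact:.4f}")
print(f"D-rule: simulated {d_hat:.4f}")
```

No distance between \pi_1 and \pi_2 is computed anywhere in this script. The only inputs are the population model and the rule, which is what the first workflow asks for.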

Takeaway

None of these distances is the villain. MMD, Wasserstein distance, Euclidean distance, and Mahalanobis distance are all useful when read as answers to the questions they actually ask.

The danger is the interpretation placed on them.

A distribution discrepancy answers a distribution question. A classifier error answers a decision question. They can agree when the geometry, model, and classifier are aligned. They can separate when the discrepancy listens to the wrong coordinate, the wrong scale, the wrong covariance, the wrong finite-sample object, the wrong marginal, or the wrong higher moments.

The diagnostic question is:

Does this discrepancy control the error of the rule I will actually use?

If yes, the discrepancy has operational meaning. If no, it is still useful as a geometric diagnostic, but it should not be reported as evidence of lower risk. A small discrepancy can be false comfort; a large discrepancy can be irrelevant alarm.

References

Bai, Z. D., and Saranadasa, H. “Effect of High Dimension: By an Example of a Two Sample Problem.” Statistica Sinica 6, 1996.

Saranadasa, H. “Asymptotic Expansion of the Misclassification Probabilities of D- and A-Criteria for Discrimination from Two High Dimensional Populations Using the Theory of Large Dimensional Random Matrices.” Journal of Multivariate Analysis 46(1), 1993.

Yao, J., Zheng, S., and Bai, Z. Large Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge University Press, 2015.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. “A Kernel Two-Sample Test.” Journal of Machine Learning Research 13, 2012.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. “A Theory of Learning from Different Domains.” Machine Learning 79, 2010.

Long, M., Cao, Y., Wang, J., and Jordan, M. I. “Learning Transferable Features with Deep Adaptation Networks.” Proceedings of ICML, 2015.

Suggested citation

@misc{miryusupov2026discrepancyrisk,
  author       = {Miryusupov, Shohruh},
  title        = {Discrepancy Is Not Risk},
  year         = {2026},
  howpublished = {Research note},
  url          = {https://www.miryusupov.com/blog/posts/discrepancy_is_not_risk/}
}