2026-05-16
Wasserstein distance has a clean promise: it compares probability distributions by asking how much mass must move, and how far. If the sample-space geometry matches the geometry of the problem, that is exactly the right question. A meter is a meter. A pixel is a pixel.
But when Wasserstein distance is used as a statistical loss, there is a small trap. Absolute displacement is not the same as scale-relative discrepancy. Raw Wasserstein distance detects location shifts; what it does not automatically detect is whether a shift is large or small relative to the uncertainty or intrinsic scale of the distributions being compared.
Three diagnostics follow. First, for equal-variance Gaussians, raw Euclidean W_2 sees the mean shift but not its size in standard-deviation units. Second, Fisher–Rao geometry gives a contrasting scale-relative notion of distance. Third, a scaled Wasserstein ground cost can recover scale-relative behavior, but only after the geometry is changed.
Optimal transport is defined after choosing a cost for moving mass from one point to another. In the usual Wasserstein distance, this cost comes from a ground metric on the sample space; see Peyré and Cuturi (2019). This is not an implementation detail — it is the geometry of the loss.
If the data are measured in meters, the loss sees meters. If they are represented by coordinates, labels, features, embeddings, or point clouds, the loss sees whatever geometry has been assigned to those objects. So the practical question is not only whether you are using Wasserstein distance, but also what geometry you gave it.
In some problems, absolute displacement is exactly the right notion of error: moving mass by one physical unit is what matters. In other problems, the relevant question is scale-relative: how large is this displacement relative to the uncertainty or intrinsic scale of the distribution? Raw Euclidean Wasserstein answers the first question. It does not automatically answer the second.
Consider the univariate Gaussian family N(\mu,\sigma^2). For two Gaussians with the same variance, the squared 2-Wasserstein distance reduces to the squared difference between means. This is a special case of the Gaussian Wasserstein formula; see Dowson and Landau (1982):
W_2^2(P,Q)=|\mu_1-\mu_2|^2, \qquad W_2(P,Q)=|\mu_1-\mu_2|.
Now compare these two pairs:
N(0,0.1^2) \ \text{vs.}\ N(1,0.1^2), \qquad N(0,100^2) \ \text{vs.}\ N(1,100^2).
In both cases the means differ by one, so W_2=1. But the scale-relative meaning is very different:
\frac{|\Delta\mu|}{\sigma} = \frac{1}{0.1} = 10, \qquad \frac{|\Delta\mu|}{\sigma} = \frac{1}{100} = 0.01.
The Wasserstein value is the same. The statistical interpretation is not. This does not mean Wasserstein distance is blind to variance in general. It means that in this equal-variance location comparison, the same absolute displacement gets the same cost regardless of whether that displacement is statistically large or small.
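The equal-variance claim can be checked numerically. The sketch below is an illustration, not part of the original note. It relies on the one-dimensional identity that W_2 is the L^2 distance between quantile functions, approximated here with a midpoint rule. The helper name `w2_gaussian_1d` and the grid size `n` are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def w2_gaussian_1d(mu1, sigma1, mu2, sigma2, n=10_000):
    """Approximate W_2 between two univariate Gaussians.

    In one dimension, W_2^2 is the integral over u in (0, 1) of the squared
    difference between the two quantile functions. A midpoint rule on n
    cells approximates that integral.
    """
    p = NormalDist(mu1, sigma1)
    q = NormalDist(mu2, sigma2)
    total = 0.0
    for i in range(n):
        u = (i + 0.5) / n  # midpoint of the i-th cell, never exactly 0 or 1
        total += (p.inv_cdf(u) - q.inv_cdf(u)) ** 2
    return sqrt(total / n)


narrow = w2_gaussian_1d(0.0, 0.1, 1.0, 0.1)      # shift of 10 standard deviations
wide = w2_gaussian_1d(0.0, 100.0, 1.0, 100.0)    # shift of 0.01 standard deviations
unequal = w2_gaussian_1d(0.0, 1.0, 0.0, 2.0)     # same mean, different spread

print(f"narrow pair : W_2 = {narrow:.4f}")
print(f"wide pair   : W_2 = {wide:.4f}")
print(f"unequal pair: W_2 = {unequal:.4f}")
```

Both equal-variance pairs should come out at essentially 1. The unequal pair should come out close to |\sigma_1-\sigma_2| = 1, consistent with the univariate Gaussian formula W_2^2 = (\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2, which is why the blindness is specific to the equal-variance location comparison.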
The Fisher–Rao metric gives a different geometry on a statistical model. It is not a universal replacement for Wasserstein distance, but it is useful here because it encodes the intrinsic statistical scale of the Gaussian family.
For N(\mu,\sigma^2), the Fisher–Rao line element is
ds^2 = \frac{d\mu^2+2\,d\sigma^2}{\sigma^2}.
For a small location perturbation with \sigma fixed,
ds \approx \frac{|d\mu|}{\sigma}.
So a one-unit displacement is large when \sigma=0.1 and small when \sigma=100.
Wasserstein distance measures displacement in the external sample-space geometry. Fisher–Rao distance measures displacement in the internal statistical geometry of the model. They are not contradicting each other — they are measuring different things.
For equal-variance univariate Gaussians, the exact Fisher–Rao distance is
d_{\mathrm{FR}} \left( N(\mu_1,\sigma^2), N(\mu_2,\sigma^2) \right) = \sqrt{2}\, \operatorname{arcosh} \left( 1+ \frac{(\mu_1-\mu_2)^2}{4\sigma^2} \right),
which agrees locally with |\mu_1-\mu_2|/\sigma when the displacement is small relative to \sigma; see Pinele, Strapasson, and Costa (2020). The \operatorname{arcosh} term is geometric: after the change of variables
x=\frac{\mu}{\sqrt{2}}, \qquad y=\sigma,
the Fisher–Rao line element becomes
ds^2 = 2\,\frac{dx^2+dy^2}{y^2}.
So the univariate Gaussian Fisher–Rao geometry is, up to a constant factor, the Poincaré upper half-plane geometry.
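For readers who want the intermediate steps, the substitution and the resulting distance can be written out as follows. This is a routine calculation added for completeness. It uses the line element and change of variables stated above, together with the standard distance formula for the upper half-plane with metric (dx^2+dy^2)/y^2.

```latex
\begin{aligned}
\mu = \sqrt{2}\,x,\quad \sigma = y
&\;\Longrightarrow\; d\mu^2 = 2\,dx^2,\quad d\sigma^2 = dy^2,\\
ds^2 = \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2}
&= \frac{2\,dx^2 + 2\,dy^2}{y^2}
 = 2\,\frac{dx^2 + dy^2}{y^2},\\
d_{\mathbb{H}}\bigl((x_1,y_1),(x_2,y_2)\bigr)
&= \operatorname{arcosh}\!\left(1 + \frac{(x_1-x_2)^2 + (y_1-y_2)^2}{2\,y_1 y_2}\right),\\
y_1 = y_2 = \sigma,\quad (x_1-x_2)^2 = \tfrac{1}{2}(\mu_1-\mu_2)^2
&\;\Longrightarrow\;
d_{\mathrm{FR}} = \sqrt{2}\,\operatorname{arcosh}\!\left(1 + \frac{(\mu_1-\mu_2)^2}{4\sigma^2}\right).
\end{aligned}
```

The factor \sqrt{2} in the distance comes from the constant factor 2 in front of the half-plane line element.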
Let \Delta\mu=1.
| Pair | \sigma | W_2 | Fisher–Rao, local (\Delta\mu/\sigma) | Fisher–Rao, exact |
|---|---|---|---|---|
| N(0,0.1^2) vs N(1,0.1^2) | 0.1 | 1 | 10 | 5.587 |
| N(0,100^2) vs N(1,100^2) | 100 | 1 | 0.01 | 0.01 |
The Wasserstein value does not move. The Fisher–Rao diagnostic changes dramatically. For \sigma=0.1, the local value of 10 overstates the exact distance of 5.587, because a shift of ten standard deviations is far outside the small-displacement regime where the local approximation holds. The qualitative conclusion is the same either way: the narrow Gaussian shift is much larger than the wide Gaussian shift in scale-relative terms.
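The table entries can be reproduced with a few lines of Python. This is a minimal sketch that evaluates the exact and local Fisher–Rao expressions given above for \Delta\mu=1. The function names are chosen for this example only.

```python
from math import acosh, sqrt


def fisher_rao_exact(mu1, mu2, sigma):
    """Exact Fisher-Rao distance between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return sqrt(2.0) * acosh(1.0 + (mu1 - mu2) ** 2 / (4.0 * sigma ** 2))


def fisher_rao_local(mu1, mu2, sigma):
    """First-order (small-shift) value |mu1 - mu2| / sigma."""
    return abs(mu1 - mu2) / sigma


for sigma in (0.1, 100.0):
    w2 = abs(0.0 - 1.0)  # raw Euclidean W_2 for equal variances
    local = fisher_rao_local(0.0, 1.0, sigma)
    exact = fisher_rao_exact(0.0, 1.0, sigma)
    print(f"sigma={sigma:g}  W_2={w2:g}  FR local={local:g}  FR exact={exact:.4g}")
```

The gap between the local and exact values at \sigma=0.1 reflects the logarithmic growth of arcosh for large arguments. At \sigma=100 the two agree to several digits.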
Wasserstein distance can be made scale-aware. But scale awareness does not appear for free — it has to enter through the ground geometry.
In the Gaussian example, suppose displacement is measured not by the raw Euclidean cost |x-y| but by the standardized cost
\frac{|x-y|}{\sigma}.
Equivalently, transport the rescaled variable
z=\frac{x}{\sigma}.
Then the Wasserstein distance between N(0,\sigma^2) and N(1,\sigma^2) in standardized coordinates is 1/\sigma. That is 10 for \sigma=0.1 and 0.01 for \sigma=100.
This recovers the scale-relative quantity. But it is not the raw Wasserstein distance in the original coordinates. It is Wasserstein distance after changing the geometry. In this equal-variance example, the rescaling is unambiguous; if the two distributions have different scales, the choice of normalization becomes part of the modeling problem.
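A short sketch makes both halves of this point concrete. It assumes the closed-form univariate Gaussian W_2 (the one-dimensional case of the Gaussian formula cited earlier) and uses the fact that dividing the ground cost by a constant divides W_2 by the same constant. The three candidate scales in the unequal-variance case, including the "pooled" root-mean-square scale, are hypothetical choices made for illustration, not recommendations from the source.

```python
from math import sqrt


def w2_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form W_2 between univariate Gaussians, Euclidean ground cost."""
    return sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


def w2_scaled(mu1, sigma1, mu2, sigma2, scale):
    """W_2 under the ground cost |x - y| / scale.

    Dividing the cost by a constant is the same as transporting z = x / scale,
    so the distance is the raw W_2 divided by that constant.
    """
    return w2_gaussian(mu1, sigma1, mu2, sigma2) / scale


# Equal variances: the normalization is unambiguous.
for sigma in (0.1, 100.0):
    raw = w2_gaussian(0.0, sigma, 1.0, sigma)
    std = w2_scaled(0.0, sigma, 1.0, sigma, scale=sigma)
    print(f"sigma={sigma:g}  raw W_2={raw:g}  standardized W_2={std:g}")

# Unequal variances: each candidate scale gives a different answer.
s1, s2 = 0.5, 2.0
candidates = {
    "sigma_1": s1,
    "sigma_2": s2,
    "pooled": sqrt((s1 ** 2 + s2 ** 2) / 2.0),  # hypothetical choice
}
scaled = {name: w2_scaled(0.0, s1, 1.0, s2, c) for name, c in candidates.items()}
for name, value in scaled.items():
    print(f"scale={name}: W_2={value:.4f}")
```

In the unequal-variance case nothing in the transport problem itself selects one of these scales, which is the sense in which normalization becomes a modeling decision.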
The ground cost is part of the statistical meaning of the loss.
Suppose a model outputs a predictive distribution. Then the scale parameter is not decoration: a wide predictive distribution signals uncertainty, and a narrow one signals confidence. Now suppose the predicted mean is wrong by one unit. Should the loss treat that error the same in both cases?
For the equal-variance Gaussian comparison above, raw Euclidean Wasserstein says yes for the location term. A Fisher-type geometry says no. A likelihood-based loss would also treat the same residual differently depending on \sigma, because a one-unit error means something different under a sharp predictive distribution than under a diffuse one.
There is also a distinction between comparing two distributions and comparing a distribution to a point observation. If P=N(\mu,\sigma^2) and Q=\delta_y, then
W_2^2(P,Q) = (\mu-y)^2+\sigma^2.
Wasserstein does see predictive spread here. But it sees spread as transport cost to collapse a distribution onto a point, which is different from treating \sigma as calibrated uncertainty in the likelihood or Fisher-geometric sense.
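The difference between the two treatments of \sigma can be seen by holding the residual at one unit and varying the predictive spread. The sketch below compares the squared W_2 to a point mass with the Gaussian negative log-likelihood, used here as one example of a likelihood-based loss. The function names and the three \sigma values are assumptions made for this illustration.

```python
from math import log, pi


def w2_squared_to_point(mu, sigma, y):
    """Squared W_2 between N(mu, sigma^2) and a point mass at y."""
    return (mu - y) ** 2 + sigma ** 2


def gaussian_nll(mu, sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * log(2.0 * pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)


y = 1.0  # observation; the predicted mean 0.0 is off by one unit
for sigma in (0.1, 1.0, 100.0):
    w2sq = w2_squared_to_point(0.0, sigma, y)
    nll = gaussian_nll(0.0, sigma, y)
    print(f"sigma={sigma:g}  W_2^2={w2sq:.4f}  NLL={nll:.4f}")
```

With the residual fixed, the squared Wasserstein cost increases monotonically in \sigma, so the narrowest prediction always scores best. The negative log-likelihood is minimized when \sigma equals the size of the residual, so among these three values it prefers \sigma=1 and penalizes both overconfidence and excessive spread.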
The practical question is whether the Wasserstein loss measures the kind of error the task actually cares about. For some tasks, absolute displacement is the right unit: one meter, one dollar, or one pixel may have the same cost regardless of predictive uncertainty. For other tasks, the relevant error is relative to the distribution’s scale. In those cases, raw Wasserstein loss can measure the wrong discrepancy.
The same issue appears on unscaled data. If one coordinate is in dollars and another in percentages, or one feature has variance 10^{-2} and another has variance 10^4, a Wasserstein loss built on the raw Euclidean ground cost inherits those units. It may emphasize coordinates with larger numerical scale even when they are not more meaningful.
Practitioners already know this problem from feature scaling: a Euclidean cost on raw coordinates inherits raw units, and a Wasserstein loss inherits them in turn because it is built on that ground cost.
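A two-feature sketch shows the effect. It uses the fact that, under a Euclidean ground cost, W_2 between a distribution and its translate equals the Euclidean norm of the shift. The two features, their standard deviations (0.1 and 100, matching the variances mentioned above), and the helper name `translation_w2` are hypothetical and chosen only for illustration.

```python
from math import sqrt


def translation_w2(shift, scales=None):
    """W_2 between a distribution and its translate by `shift`.

    For a pure translation, W_2 under the Euclidean ground cost equals the
    Euclidean norm of the shift. If `scales` is given, each coordinate is
    divided by its scale first, which is a diagonal rescaling of the cost.
    """
    if scales is None:
        scales = [1.0] * len(shift)
    return sqrt(sum((d / s) ** 2 for d, s in zip(shift, scales)))


# Hypothetical per-feature standard deviations (variances 1e-2 and 1e4).
stds = [0.1, 100.0]

# Each shift moves one feature by exactly one of its own standard deviations.
shifts = {
    "small-scale feature": [0.1, 0.0],
    "large-scale feature": [0.0, 100.0],
}

for name, shift in shifts.items():
    raw = translation_w2(shift)
    standardized = translation_w2(shift, stds)
    print(f"{name}: raw W_2 = {raw:g}, standardized W_2 = {standardized:g}")
```

Both shifts are one standard deviation in their own feature, yet the raw cost differs by a factor of 1000 between them. After per-coordinate standardization the two costs coincide.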
In raw coordinates,
W_2 = |\Delta\mu|.
In standardized coordinates,
W_2 = \frac{|\Delta\mu|}{\sigma}.
Both are Wasserstein distances. They differ because the geometry differs.
Some settings where the ground geometry choice is non-trivial:
In each case, applying a Wasserstein-type loss is also choosing a ground geometry. That choice is part of the model.
There is no universal fix — the right geometry depends on the problem. Some options:
A useful diagnostic question is:
If I multiply one coordinate by 100, should the loss care 100 times more about errors in that coordinate?
If the answer is no, the raw ground geometry may not be the right geometry.
The issue is not Wasserstein distance itself. It is how the number it returns is interpreted.
For equal-variance Gaussians under the usual Euclidean ground metric, a unit mean shift has W_2=1 regardless of whether the standard deviation is 0.1 or 100. That is not a defect — it is what the geometry says. Raw Wasserstein reports that both pairs require one unit of transport in the chosen sample-space geometry.
Whether that one unit is statistically large or small is a separate question. Answering it requires either a different geometry, a rescaled ground cost, or an additional statistical loss.
The practical question is simple:
\text{Should the loss measure absolute displacement, or scale-relative discrepancy?}
When the answer is absolute displacement, Wasserstein distance may be exactly the right tool. When the answer is scale-relative discrepancy, raw Wasserstein loss may have the wrong scale.
Shun-ichi Amari. Information Geometry and Its Applications. Springer, 2016.
D. C. Dowson and B. V. Landau. “The Fréchet distance between multivariate normal distributions.” Journal of Multivariate Analysis 12(3), 1982.
Yoni Dukler, Wuchen Li, Alex Lin, and Guido Montúfar. “Wasserstein of Wasserstein loss for learning generative models.” Proceedings of the 36th International Conference on Machine Learning, 2019.
Haoqiang Fan, Hao Su, and Leonidas J. Guibas. “A point set generation network for 3D object reconstruction from a single image.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, and Tomaso Poggio. “Learning with a Wasserstein loss.” Advances in Neural Information Processing Systems, 2015.
Aude Genevay, Gabriel Peyré, and Marco Cuturi. “Learning generative models with Sinkhorn divergences.” Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.
Gabriel Peyré and Marco Cuturi. Computational Optimal Transport. Foundations and Trends in Machine Learning 11(5–6), 2019.
Julia Pinele, João E. Strapasson, and Sueli I. R. Costa. “The Fisher–Rao distance between multivariate normal distributions: Special cases, bounds and applications.” Entropy 22(4), 2020.
@misc{miryusupov2026wassersteingeometry,
author = {Miryusupov, Shohruh},
title = {When Wasserstein Loss Uses the Wrong Geometry},
year = {2026},
howpublished = {Research note},
url = {https://www.miryusupov.com/blog/posts/wasserstein-wrong-geometry/index.html}
}