Curvature, Volume, and the Boundary

A personal motivation sits behind it. My friend and former professor Andrew P. Mullhaupt opened my eyes to a simple but easily forgotten point: bias is not always a disease. Used carelessly, bias distorts. Used deliberately, with an understanding of the geometry, it can be a remedy.

Firth’s adjustment was introduced as a frequentist bias-reduction device: modify the likelihood score so that the leading O(n^{-1}) bias of the maximum likelihood estimator disappears (Firth 1993). Jeffreys’ prior, on the other hand, is a geometric volume element: if Fisher information is the metric on a statistical model, then |i(\beta)|^{1/2}\,d\beta is the natural volume form on that model. In canonical-link exponential-family GLMs, these two ideas meet:

Kosmidis and Firth’s 2009 and 2021 papers can be read as two developments of this same geometric picture (Kosmidis and Firth 2009; Kosmidis and Firth 2021). The 2009 paper extends bias reduction to exponential-family nonlinear models, where the model manifold bends because the predictor itself is nonlinear. The 2021 paper studies binomial-response GLMs, where the important geometric feature is not just curvature but the boundary: fitted probabilities can run to 0 or 1, Fisher volume collapses, and ordinary maximum likelihood can fail to exist.

There is also a philosophical point running through the note. Classical statistics often treats bias as something to be removed. Firth’s approach is subtler. It starts from bias reduction, but the adjustment can also be read as a deliberate geometric regularization: the estimator is nudged away from regions where maximum likelihood behaves badly. In separated binomial GLMs, that nudge is not a cosmetic correction; it restores existence. The lesson is not that bias is good in itself, but that insisting on unpenalized likelihood can be worse than accepting a small, principled shrinkage bias that keeps the estimator inside the finite part of the model.

The note has four parts. Part I gives the geometric intuition. Part II reads the Kosmidis–Firth papers through that lens. Part III gives a small synthetic fraud example. Part IV gives the technical appendix.

Part I: Intuition

The space of distributions as a landscape

Fix a sample space. Think of every probability distribution on it as a point in a large abstract space \mathcal P. For Bernoulli data with three observations, \mathcal P is a simplex. For richer sample spaces, it may be infinite-dimensional. The picture we need is simple: one distribution, one point.

is a p-dimensional surface inside \mathcal P. As \beta varies, f(\,\cdot\,;\beta) traces out the model manifold.

Fisher information is the ruler

To do geometry on \mathcal M, we need a local notion of distance. The Kullback–Leibler divergence supplies one. For nearby parameters \beta and \beta+d\beta,

\mathrm{KL}\{f_\beta\,\|\,f_{\beta+d\beta}\} = \frac12 d\beta^\top i(\beta)d\beta + o(\|d\beta\|^2),

where i(\beta) is the Fisher information matrix. Thus Fisher information is the metric tensor: it tells us the squared length of an infinitesimal step on the model.

The MLE as projection

which is usually not itself on the model surface \mathcal M. Maximum likelihood chooses the model point closest to \hat F_n in the KL sense:

up to terms not depending on \beta. Geometrically, it is useful to imagine dropping a perpendicular from the empirical distribution to the model manifold.

Why curvature creates bias

Now take a curve in the plane, say a parabola. Put a symmetric cloud of noisy points around a target point P_0 on that curve. Project each noisy point perpendicularly onto the curve. If the curve were a straight line, the projected points would remain symmetric around P_0. But if the curve bends, the projections are no longer symmetric along the curve: the average projected point drifts.

That drift is the geometric intuition behind the O(n^{-1}) bias of the MLE. Cox and Snell’s expansion writes that drift in coordinates (Cox and Snell 1968). In McCullagh-style notation,

b^s(\beta) = \kappa^{s,a}\kappa^{b,c} \left\{\frac12\kappa_{a,b,c}+\kappa_{ab,c}\right\} + O(n^{-2}),

where \kappa_{r,s}=E(U_rU_s) is Fisher information, \kappa_{r,s,t}=E(U_rU_sU_t), \kappa_{rs,t}=E(\partial_{rs}\ell\,U_t), raised indices denote the inverse of the information matrix, and repeated indices are summed.

Efron’s statistical curvature gives one precise way to measure how much a statistical model bends inside a surrounding exponential family (Efron 1975). A model that is an affine subspace in natural-parameter space has zero Efron curvature. That does not mean that every coordinate representation of its MLE is exactly unbiased. For example, the Bernoulli mean MLE \bar Y is unbiased for \pi, but \operatorname{logit}(\bar Y) is biased for \operatorname{logit}(\pi). The curvature statement is about the family as a geometric object; finite-sample bias can also be introduced by nonlinear coordinates and boundary effects.

Firth’s adjustment undoes the drift

Firth’s idea is direct. If the MLE drifts by b(\beta), modify the score in the opposite direction:

\tilde U_s(\beta) = U_s(\beta) + A_s(\beta), \qquad A_s(\beta) = -\kappa_{s,a}(\beta)b^a(\beta).

The solution of \tilde U(\tilde\beta)=0 has bias of order O(n^{-2}) under standard regularity conditions (Firth 1993). In the projection picture, Firth’s estimator corrects the first-order displacement caused by the geometry of the projection.

Jeffreys’ prior is the Fisher volume element

is the natural volume element on the manifold. It is the analogue of \sqrt{|g|}\,dx in Riemannian geometry and is invariant under smooth reparametrization.

The canonical-link coincidence

For canonical-link exponential-family GLMs, the gradient of the log-volume equals Firth’s adjustment:

Part II: The Kosmidis–Firth program

2009: nonlinear predictors and extra curvature

With a canonical link, the model is an affine subspace of the saturated natural-parameter space. With a non-canonical link, or with a nonlinear predictor, the embedding of \beta into the surrounding exponential family bends.

Kosmidis and Firth study exponential-family nonlinear models, where \eta_i(\beta) need not be linear (Kosmidis and Firth 2009). Let

and let W be the usual diagonal matrix of working weights. The weighted hat matrix is

where \xi_i combines the leverage term from the diagonal of H with additional terms involving the second derivatives of \eta_i(\beta). When \eta_i is linear in \beta, the predictor-curvature term vanishes and the expression reduces to the familiar GLM adjustment.

The computational point is important: the method modifies Fisher scoring with a term built from the same objects already used in GLM fitting.

\beta^{(k+1)} = \beta^{(k)} + i\{\beta^{(k)}\}^{-1} \left[U\{\beta^{(k)}\}+A\{\beta^{(k)}\}\right].

The geometric reading is: the 2009 paper provides the bookkeeping needed when the model manifold bends because the predictor map itself bends.

2021: binomial GLMs, finiteness, and the boundary

i(\beta)=X^\top W(\beta)X, \qquad w_i(\beta) = m_i\frac{(d\pi_i/d\eta_i)^2}{\pi_i(1-\pi_i)}.

The model has a boundary. As \pi_i\to 0 or \pi_i\to 1, the Bernoulli distribution degenerates to a point mass. For standard links, the corresponding weights collapse, so the Fisher volume collapses:

Under complete or quasi-complete separation, ordinary maximum likelihood tries to move toward exactly that boundary. Along a separating ray, the log-likelihood approaches a finite ceiling but does not attain it at a finite parameter value. Thus the ordinary MLE does not exist.

Kosmidis and Firth’s 2021 result shows that, for standard binomial links such as logit, probit, complementary log–log, log–log, and cauchit, the Jeffreys-penalized estimator

\hat\beta_J = \arg\max_\beta \left\{ \ell(\beta)+\frac12\log\det X^\top W(\beta)X \right\}

The proof idea is geometric. Separation lets \ell(\beta) plateau as \|\beta\|\to\infty. But the same motion drives |i(\beta)|^{1/2} to zero. Hence

and the penalized objective turns downward. The Jeffreys term acts like an infinite wall at the boundary of the binomial manifold.

Part III: A small synthetic fraud example

A useful applied example is rare-event fraud classification. Imagine a binary feature

In a small training set, this rare flag might occur only among fraudulent transactions. That does not mean the flag is truly deterministic. It only means that, in this training sample, the data are quasi-completely separated.

\Pr(Y=1\mid x,z) = \operatorname{logit}^{-1}(\beta_0+\beta_1x_1+\beta_2x_2+\gamma z).

If all observations with z=1 have Y=1, the ordinary likelihood is increased by sending \gamma\to+\infty. The MLE does not exist. The Firth/Kosmidis estimator keeps \gamma finite because, along the same direction,

The information in the rare-flag direction disappears at the boundary, so the Jeffreys term penalizes the escape.

To make the example concrete, I generate a small synthetic fraud dataset, define the logistic likelihood and Fisher information, and compare the ordinary Newton path with the Firth/Jeffreys path. The setup code is hidden in the rendered post; the important object is the coefficient path and the Fisher-volume barrier.

Ordinary logistic escapes; Firth stays finite

The ordinary Newton iterations are shown only to display the direction of escape. Under quasi-complete separation they should not be interpreted as converging to an MLE. They move toward the direction where the likelihood has its supremum, not toward a finite maximizer.

Visualize the boundary, the Fisher-volume barrier, and the effect of the flag

The last two panels separate the baseline surface for unflagged transactions from the shifted surface for flagged transactions. This makes the role of the rare flag visually explicit: under the Firth fit, the flag raises fraud probability substantially, but by a finite amount.

A rare perfect predictor in a fraud model. Ordinary logistic regression sends the rare-flag coefficient toward infinity. The Jeffreys/Firth objective stays finite because the Fisher-volume term collapses at the boundary. The last two panels show the baseline fraud surface for unflagged transactions and the upward-shifted surface for flagged transactions.

Each point is a transaction. Blue points are legitimate transactions (Y=0,z=0), orange points are fraudulent but unflagged transactions (Y=1,z=0), and green stars are fraudulent transactions with the rare flag (Y=1,z=1). The training sample contains no legitimate flagged transaction (Y=0,z=1), so the rare flag perfectly separates part of the data.

Panel C is the geometric story in one picture. The likelihood likes large \gamma: it wants to declare the rare flag deterministically fraudulent. But the Fisher-volume term goes to -\infty, because the model is approaching a boundary point where the flagged Bernoulli distributions become point masses. The penalized objective turns over and has a finite maximum.

Panels D and E show the fitted Firth probabilities under two hypothetical settings of the rare flag. Panel D sets z=0 everywhere, so it shows the baseline fraud surface for unflagged transactions. Panel E sets z=1 everywhere, so it shows how transactions at those same (x_1,x_2) locations would be scored if the rare flag were present. The rare flag shifts the surface upward by the finite Firth estimate \hat\gamma: it is strong evidence of fraud, but not an infinite rule.

Firth does not merely find a separating line. Under separation, the ordinary logistic separating direction corresponds to an infinite coefficient and no finite MLE. Firth instead gives a finite probabilistic decision boundary: the rare flag shifts the log-odds by a large but finite amount, so the model treats it as strong evidence rather than a deterministic rule.

That is the practical face of the 2021 theorem: the estimator is not merely numerically stabilized; it is prevented from leaving the finite part of the statistical manifold.

Part IV: Technical appendix

Jeffreys’ score equals Firth’s adjustment for canonical GLMs

f(y_i;\theta_i,\phi) = \exp\left\{ \frac{y_i\theta_i-b(\theta_i)}{\phi}+c(y_i,\phi) \right\}.

\ell(\beta) = \frac1\phi\sum_i\{y_i x_i^\top\beta-b(x_i^\top\beta)\}+\text{const},

\kappa_{r,s,t} = E(U_rU_sU_t) = \frac1\phi\sum_i b'''(\theta_i)x_{ir}x_{is}x_{it}.

\partial_t i_{rs}(\beta) = \frac1\phi\sum_i b'''(\theta_i)x_{it}x_{ir}x_{is} = \kappa_{t,r,s}(\beta). \tag{1}

This is the canonical-link identity: the derivative of the metric is the third cumulant of the score.

For canonical GLMs, the relevant Cox–Snell term reduces, under the sign convention used here, to

\partial_s P(\beta) = \frac12\mathrm{tr}\{i(\beta)^{-1}\partial_s i(\beta)\} = \frac12 i^{cd}\partial_s i_{cd}.

\boxed{ \partial_s\left\{\frac12\log\det i(\beta)\right\} = A_s(\beta) = -i_{sa}(\beta)b^a(\beta). }

Adding the Jeffreys score is exactly Firth’s first-order bias correction in canonical-link GLMs.

Intercept-only Bernoulli check

b^\beta = -\frac{\kappa_{\beta,\beta,\beta}}{2i(\beta)^2} = \frac{2\pi-1}{2n\pi(1-\pi)}.

The same result follows from the delta method applied to \hat\beta=\operatorname{logit}(\bar Y):

E(\hat\beta-\beta) \approx \frac12 g''(\pi)\mathrm{Var}(\bar Y) = \frac{2\pi-1}{2n\pi(1-\pi)}.

Efron’s statistical curvature: logistic versus probit

For a one-parameter curved exponential family embedded in an n-dimensional Bernoulli natural-parameter space, write

and let \dot\theta=\partial\theta/\partial\beta and \ddot\theta=\partial^2\theta/\partial\beta^2. With

\gamma^2(\beta) = \frac{ (\ddot\theta^\top W\ddot\theta)(\dot\theta^\top W\dot\theta) -(\dot\theta^\top W\ddot\theta)^2 }{(\dot\theta^\top W\dot\theta)^3}.

\pi_i=\Phi(\beta x_i), \qquad \theta_i=\log\frac{\Phi(\beta x_i)}{1-\Phi(\beta x_i)}.

Now \theta_i(\beta) is nonlinear. In general \ddot\theta is not proportional to \dot\theta, and the curvature is positive. The probit model is a curved submodel of the Bernoulli natural-parameter space.

The takeaway is that different pathologies in likelihood estimation have different geometric signatures:

References

D. R. Cox and E. J. Snell. “A general definition of residuals.” Journal of the Royal Statistical Society: Series B 30(2), 1968.

Bradley Efron. “Defining the curvature of a statistical problem, with applications to second order efficiency.” The Annals of Statistics 3(6), 1975.

David Firth. “Bias reduction of maximum likelihood estimates.” Biometrika 80(1), 1993.

Harold Jeffreys. Theory of Probability. Oxford University Press, 3rd edition, 1961.

Ioannis Kosmidis and David Firth. “Bias reduction in exponential family nonlinear models.” Biometrika 96(4), 2009.

Ioannis Kosmidis and David Firth. “Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models.” Biometrika 108(1), 2021.

Preface