robustcov is an alpha-stage scientific Python package for robust covariance estimation, heavy-tail scatter estimation, and interpretable robust-distance anomaly diagnostics.
It is designed for workflows where ordinary covariance estimates become unstable: contaminated samples, heavy-tailed data, small-sample regimes, high-dimensional scatter estimation, and Mahalanobis-style anomaly screening.
The package combines a Python API with C++/pybind11 kernels for selected compute-heavy routines. The focus is practical robust scatter estimation, diagnostic reporting, benchmark galleries, and application-oriented examples rather than a full probabilistic modeling framework.
    import numpy as np
    import robustcov as rc

    rng = np.random.default_rng(0)

    # Heavy-tailed data with injected outliers
    X = rng.standard_t(df=3, size=(400, 5))
    X[:30] += 8.0

    est = rc.FastMCD(quality="balanced", random_state=42).fit(X)
    print(est.location_)
    print(est.covariance_)
    print(est.radial_kurtosis_)

    det = rc.RobustOutlierDetector(
        estimator=est,
        contamination=0.075,
    ).fit(X)
    print(det.labels_)
Motivation
Classical covariance is highly sensitive to outliers and heavy tails. A small number of extreme observations can inflate covariance estimates, rotate principal directions, distort Mahalanobis distances, and hide the very anomalies one wants to detect.
This is especially visible in settings such as:
fraud screening;
sensor anomaly detection;
portfolio stress diagnostics;
biomedical feature screening;
network traffic monitoring;
image or text embedding outlier detection;
small-sample, high-dimensional scientific data.
robustcov provides robust covariance and scatter estimators that try to separate central structure from contamination or diffuse heavy tails. The resulting robust distances can then be used for diagnostics, anomaly scores, plots, and benchmark comparisons.
What the package does
The package currently focuses on four related tasks:
robust covariance estimation under contamination;
heavy-tail scatter estimation for small-sample or high-dimensional data;
robust-distance anomaly detection and diagnostic reporting;
reproducible benchmark and use-case galleries.
Main public APIs include:
FastMCD;
MinCovDet;
RegularizedTyler;
StudentTScatter;
RegularizedCauchy;
KLRegularizedTyler;
WieselTyler;
HellingerRegularizedTyler;
AutoRobustScatter;
RobustOutlierDetector;
AutoRobustAnomalyDetector;
ClusterRobustOutlierDetector;
RobustMedianImputer;
plotting and diagnostic helpers.
The package is not intended to be a replacement for scikit-learn or SciPy. It is narrower: robust covariance, robust scatter, robust distances, and interpretable anomaly diagnostics.
Robust distances
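Most workflows are built around robust squared Mahalanobis distances. Given observations $x_1, \dots, x_n \in \mathbb{R}^p$, a robust location estimate $\hat{\mu}$, and a robust scatter estimate $\hat{\Sigma}$, the squared robust distance of observation $i$ is
$$d_i^2 = (x_i - \hat{\mu})^\top \hat{\Sigma}^{-1} (x_i - \hat{\mu}).$$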
Large robust distances indicate observations that are far from the estimated central elliptical structure.
This idea is simple, but the quality of the result depends heavily on the covariance estimate. If the covariance is fitted on all observations, outliers included, the distances themselves are distorted and the outliers can mask one another. robustcov focuses on estimators that are less sensitive to this failure mode.
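The masking effect can be seen in a small NumPy sketch. It does not use robustcov: `squared_mahalanobis` is a hypothetical helper written for this illustration, and the "clean fit" simply uses the known clean rows as a stand-in for what a robust estimator tries to recover.

```python
import numpy as np


def squared_mahalanobis(X, location, scatter):
    """Squared Mahalanobis distance of each row of X from `location`."""
    centered = X - location
    # Solve scatter @ Z = centered.T instead of forming an explicit inverse.
    solved = np.linalg.solve(scatter, centered.T)
    return np.einsum("ij,ji->i", centered, solved)


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[:15] += 6.0  # injected outliers in the first 15 rows

# Classical estimates are fitted on all rows, outliers included.
d2_classical = squared_mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))

# Estimates from the clean rows only, standing in for a robust fit.
clean = X[15:]
d2_clean_fit = squared_mahalanobis(
    X, clean.mean(axis=0), np.cov(clean, rowvar=False)
)

print("outlier median d^2, classical fit:", np.median(d2_classical[:15]))
print("outlier median d^2, clean fit:    ", np.median(d2_clean_fit[:15]))
```

The outliers look far less extreme under the classical fit, because they have inflated the very covariance that is used to measure them.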
FastMCD for classical contamination
FastMCD is the main estimator for a classical contamination setting: most observations form a central cloud, while a minority are separated outliers.
It is most appropriate when the outliers are at least partly separable from the central data. It is less appropriate when the data are mostly heavy-tailed but not clearly split into clean and contaminated subsets.
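The subset-selection idea behind MCD-type estimators can be illustrated with a bare concentration step (C-step). This is a simplified sketch in plain NumPy, not robustcov's FastMCD implementation: the helper `c_step`, the subset size `h = 150`, and the single random starting subset are assumptions made for the example, and a full algorithm would also use multiple starts and a consistency correction.

```python
import numpy as np


def c_step(X, subset, h):
    """One concentration step: refit on `subset`, keep the h closest points."""
    location = X[subset].mean(axis=0)
    scatter = np.cov(X[subset], rowvar=False)
    centered = X - location
    d2 = np.einsum("ij,ji->i", centered, np.linalg.solve(scatter, centered.T))
    return np.sort(np.argsort(d2)[:h])


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X[:20] += 10.0  # separated outliers in the first 20 rows

h = 150  # subset size (assumed here; more than half the sample)
subset = np.sort(rng.choice(len(X), size=h, replace=False))
for _ in range(20):
    new_subset = c_step(X, subset, h)
    if np.array_equal(new_subset, subset):
        break  # the subset no longer changes
    subset = new_subset

print("outliers left in the final subset:", int(np.sum(subset < 20)))
```

Because the outliers here are well separated, the retained subset drifts toward the central cloud within a few steps. With diffuse heavy tails there is no such clean split, which is why hard subset selection is a weaker fit in that regime.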
Heavy-tail scatter estimators
For diffuse heavy tails, small samples, or high-dimensional regimes, hard subset selection may not be the right tool. robustcov includes several iteratively reweighted scatter estimators.
RegularizedCauchy applies strong radial downweighting with shrinkage. It is intended for very heavy-tailed samples and small-sample regimes.
StudentTScatter uses Student-t style weights. Smaller degrees of freedom correspond to heavier tails and more aggressive downweighting.
RegularizedTyler estimates robust shape. Tyler-style estimators are scale-free unless a scale correction is applied, so they are often best interpreted through robust shape and robust distances.
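The scale-free behavior of Tyler-style estimators can be checked numerically. The sketch below is a generic fixed-point iteration with shrinkage toward the identity, which is one common form of regularization. It is not robustcov's `RegularizedTyler`: the function name, the `alpha` parameter, and the trace normalization are assumptions, and the data are taken to be centered already.

```python
import numpy as np


def regularized_tyler_shape(X, alpha=0.0, n_iter=200, tol=1e-9):
    """Fixed-point iteration for a Tyler-type shape matrix with trace p.

    X is assumed to be centered. `alpha` in [0, 1) shrinks each update
    toward the identity (an assumed, commonly used regularization).
    """
    n, p = X.shape
    shape = np.eye(p)
    for _ in range(n_iter):
        # Squared distances under the current shape estimate.
        d2 = np.einsum("ij,ji->i", X, np.linalg.solve(shape, X.T))
        # Each row is weighted by p / d2, so only its direction matters.
        weighted = (X * (p / d2)[:, None]).T @ X / n
        updated = (1.0 - alpha) * weighted + alpha * np.eye(p)
        updated *= p / np.trace(updated)  # fix the scale: trace equals p
        done = np.linalg.norm(updated - shape) < tol
        shape = updated
        if done:
            break
    return shape


rng = np.random.default_rng(7)
X = rng.standard_t(df=3, size=(200, 4))
# Rescale every row by an arbitrary positive factor: directions are unchanged.
X_rescaled = X * rng.uniform(0.1, 10.0, size=(200, 1))

shape = regularized_tyler_shape(X, alpha=0.05)
shape_rescaled = regularized_tyler_shape(X_rescaled, alpha=0.05)
print(np.round(shape, 3))
print("max difference after rescaling rows:", np.abs(shape - shape_rescaled).max())
```

Since the result depends only on the direction of each observation, it carries shape information but no overall scale, which is why a separate scale correction is needed before reading the output as a covariance.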
Automatic estimator selection
AutoRobustScatter is a practical exploratory selector. It fits candidate estimators and chooses one using diagnostic or stability-based criteria.
    import numpy as np
    import robustcov as rc

    rng = np.random.default_rng(3)
    X = rng.standard_t(df=3, size=(300, 12))
    X[:20] += 5.0

    auto = rc.AutoRobustScatter(selection="diagnostic").fit(X)
    print(auto.best_estimator_name_)
    print(auto.summary())
This is not an oracle. It is meant as a helpful first pass when the user does not yet know whether the data are better described as classical contamination, diffuse heavy tails, or a small-sample high-dimensional problem.
Robust outlier detection
The robust covariance estimators can be wrapped into anomaly detectors.
The anomaly score is based on robust distance. This makes the results interpretable: a point is suspicious because it is far from the robustly estimated center relative to the robust covariance or scatter shape.
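As a rough illustration of how distances become labels, the sketch below flags the fraction of points with the largest squared distances. It is an assumed empirical rule written in plain NumPy. `flag_by_contamination` is a hypothetical helper, the distances are synthetic stand-ins, and the thresholding and label convention used by `RobustOutlierDetector` may differ.

```python
import numpy as np


def flag_by_contamination(d2, contamination):
    """Flag the `contamination` fraction of points with the largest distances."""
    threshold = np.quantile(d2, 1.0 - contamination)
    return d2 > threshold, threshold


rng = np.random.default_rng(2)
# Stand-in robust squared distances: a chi-square bulk plus 15 large values.
d2 = np.concatenate(
    [rng.chisquare(df=5, size=185), rng.uniform(60.0, 90.0, size=15)]
)

flags, threshold = flag_by_contamination(d2, contamination=0.075)
print("threshold:", threshold)
print("flagged:", int(flags.sum()))
```

An empirical quantile always flags roughly the requested fraction, whether or not the data contain real anomalies, so the flagged points are best read alongside the distance values themselves.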
Diagnostic reports
robustcov includes diagnostic summaries to help interpret fitted estimators.
    import numpy as np
    import robustcov as rc

    rng = np.random.default_rng(5)
    X = rng.standard_t(df=3, size=(400, 8))

    est = rc.RegularizedCauchy(alpha=0.10).fit(X)
    report = rc.diagnostic_report(est)
    print(report.summary())
Reports are intended to summarize quantities such as:
robust-distance behavior;
radial kurtosis;
condition number;
detected fraction;
support fraction when applicable;
tail diagnostics;
heuristic recommendations.
This is useful because robust covariance is rarely a one-number problem. The same estimator can behave differently depending on tail behavior, sample size, dimension, and contamination structure.
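A few of these quantities are easy to compute by hand, which can help when checking a report. The sketch below is a hypothetical NumPy helper, not the output format of `rc.diagnostic_report`. The `cutoff` argument is an assumption supplied by the caller, and package-specific quantities such as radial kurtosis are left out because their definitions are not given here.

```python
import numpy as np


def scatter_summary(scatter, d2, cutoff):
    """Collect a few scalar diagnostics for a fitted scatter matrix."""
    return {
        "condition_number": float(np.linalg.cond(scatter)),
        "median_d2": float(np.median(d2)),
        "max_d2": float(np.max(d2)),
        "detected_fraction": float(np.mean(d2 > cutoff)),
    }


scatter = np.diag([4.0, 1.0])
d2 = np.array([0.5, 1.0, 2.0, 12.0])
summary = scatter_summary(scatter, d2, cutoff=9.0)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A large condition number signals a nearly singular scatter estimate, in which case distances along the weakest directions become unreliable regardless of how robust the estimator is.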
Visual diagnostics
The package includes plotting helpers for robust distances, QQ plots, covariance heatmaps, benchmark curves, and anomaly panels.
    import numpy as np
    import robustcov as rc

    rng = np.random.default_rng(6)
    X = rng.normal(size=(500, 5))
    X[:30] += 5.0

    est = rc.FastMCD(quality="balanced", random_state=0).fit(X)

    rc.plot_robust_distance_profile(
        est,
        output_path="distance_profile.png",
        show=False,
    )
    rc.plot_mahalanobis_qq(
        est,
        output_path="qq.png",
        show=False,
    )
    rc.plot_covariance_heatmap(
        est.covariance_,
        title="FastMCD covariance",
        output_path="covariance.png",
        show=False,
    )
The plotting API is designed for reports and documentation, not only for notebooks. Most functions support saving figures through output_path.
Multimodal diagnostics
A single global robust covariance model is not always appropriate. If a data set contains several legitimate clusters or regimes, a global robust estimator may treat small valid modes as anomalies.
ClusterRobustOutlierDetector provides a cluster-then-local-robust-scatter workflow.
This is not a full robust mixture model. It is a practical diagnostic workflow for multimodal data where local robust distances are more meaningful than one global covariance estimate.
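The difference between global and local distances can be sketched in plain NumPy. The example assumes cluster labels are already available and, for brevity, uses the classical mean and covariance within each cluster as a stand-in for a local robust fit. It does not reproduce `ClusterRobustOutlierDetector`, and `squared_distances` is a hypothetical helper.

```python
import numpy as np


def squared_distances(X, reference):
    """Squared Mahalanobis distances of X, using mean/covariance of `reference`."""
    centered = X - reference.mean(axis=0)
    scatter = np.cov(reference, rowvar=False)
    return np.einsum("ij,ji->i", centered, np.linalg.solve(scatter, centered.T))


rng = np.random.default_rng(4)
big = rng.normal(size=(300, 2))
small = 0.5 * rng.normal(size=(40, 2)) + 8.0  # a small but legitimate mode
X = np.vstack([big, small])
labels = np.repeat([0, 1], [300, 40])  # cluster labels, assumed given

# One global model for all rows.
global_d2 = squared_distances(X, X)

# One local model per cluster.
local_d2 = np.empty(len(X))
for k in np.unique(labels):
    members = labels == k
    local_d2[members] = squared_distances(X[members], X[members])

print("small mode, median global d^2:", np.median(global_d2[labels == 1]))
print("small mode, median local d^2: ", np.median(local_d2[labels == 1]))
```

Points in the small mode look remote under the global model but ordinary relative to their own cluster, which is the situation a cluster-then-local workflow is meant to handle.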
Missing values and preprocessing
The package includes a robust median imputer for simple preprocessing pipelines.
The goal is not to replace full-featured preprocessing libraries. The helper exists so robust covariance examples can handle simple missing-value cases without leaving the package.
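Column-wise median imputation itself is short enough to sketch directly. The function below is a plain NumPy illustration of the idea, not the `RobustMedianImputer` API, whose constructor arguments and methods are not shown here.

```python
import numpy as np


def median_impute(X):
    """Replace NaNs in each column with that column's median of observed values."""
    X = np.array(X, dtype=float)  # copy, so the input is left untouched
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X


X = np.array(
    [
        [1.0, np.nan],
        [3.0, 4.0],
        [5.0, 6.0],
        [np.nan, 100.0],
    ]
)
print(median_impute(X))
```

The median is used rather than the mean so that an extreme observed value, such as the 100.0 above, does not drag the imputed entries toward it.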
OpenMP acceleration
If OpenMP is available at build time, the C++ backend can parallelize selected operations.
    import robustcov as rc

    print("OpenMP available:", rc.has_openmp())
    print("threads:", rc.get_num_threads())

    rc.set_num_threads(4)
For reproducible timing, avoid thread oversubscription from BLAS libraries by capping their thread counts before the numerical libraries are loaded.
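One way to do this from Python is to set the BLAS thread environment variables at the very top of the benchmark script. The variable names below are the standard ones read by OpenBLAS and MKL. The value of one thread per BLAS call is an assumption for the example, not a robustcov recommendation.

```python
import os

# Cap the BLAS thread pools. These variables are read when the libraries
# load, so they must be set before NumPy is imported.
# OMP_NUM_THREADS is deliberately left alone here, since it also limits
# OpenMP code in general.
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
for name in BLAS_THREAD_VARS:
    os.environ[name] = "1"

import numpy as np  # imported only after the environment is configured

print({name: os.environ[name] for name in BLAS_THREAD_VARS})
print("NumPy version:", np.__version__)
```

The same variables can be exported in the shell before launching Python, which avoids any dependence on import order inside the script.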
The benchmark documentation is intentionally conservative. robustcov is strongest when the signal is covariance-shaped, heavy-tailed, high-dimensional, or benefits from interpretable robust distances. It is not expected to win every anomaly-detection benchmark.
For fair local benchmarking, report:
hardware;
operating system;
Python version;
NumPy and SciPy versions;
compiler;
OpenMP availability;
BLAS thread settings;
sample size and dimension;
random seed.
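Most of the items above can be collected programmatically and stored next to the timing results. The sketch below uses only the standard library and NumPy. `benchmark_environment` is a hypothetical helper, and compiler and OpenMP details are left out because they depend on how the extension was built.

```python
import json
import os
import platform

import numpy as np


def benchmark_environment(n_samples, n_features, seed):
    """Collect reproducibility metadata for a benchmark run."""
    try:
        import scipy

        scipy_version = scipy.__version__
    except ImportError:
        scipy_version = None  # SciPy is optional for this report
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "os": platform.platform(),
        "python": platform.python_version(),
        "numpy": np.__version__,
        "scipy": scipy_version,
        "blas_threads": {
            name: os.environ.get(name)
            for name in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
        },
        "n_samples": n_samples,
        "n_features": n_features,
        "seed": seed,
    }


info = benchmark_environment(n_samples=400, n_features=5, seed=0)
print(json.dumps(info, indent=2))
```

Writing this dictionary to a JSON file alongside each result makes it easier to compare runs from different machines later.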
Use-case gallery
The examples directory includes practical scripts for different application patterns.
The external-data pages should be read as evidence and diagnostics, not as universal claims. Some data sets are good fits for robust covariance methods, some are competitive but slower, and some are included mainly to show limitations.
Algorithms
The package connects several robust-statistics ideas:
minimum covariance determinant estimation;
C-step refinement for robust subset selection;
robust Mahalanobis distances;
Tyler shape estimation;
regularized Tyler scatter;
Student-t M-estimation;
Cauchy-style radial downweighting;
shrinkage toward stable covariance targets;
empirical and diagnostic anomaly thresholds;
cluster-aware local robust scatter diagnostics.
A simplified view of the robust-distance workflow is:
raw data
↓
robust location and scatter estimation
↓
robust squared distances
↓
diagnostic thresholds or empirical contamination rule
↓
outlier labels, anomaly scores, plots, and reports
For heavy-tailed M-estimators, the core idea is iterative radial reweighting. Observations with large robust distance receive smaller weights when updating the scatter matrix.
For example, a Student-t style weight has the form
$$w_i(d_i^2) = \frac{\nu + p}{\nu + d_i^2},$$
where $\nu$ is the degrees-of-freedom parameter and $p$ is the dimension.
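The reweighting loop built on this weight can be sketched in a few lines of NumPy. This is the textbook iteratively reweighted update for a multivariate Student-t model, shown only to illustrate the mechanism. It is not the `StudentTScatter` implementation, and the starting values, the default `nu = 3`, and the stopping rule are assumptions.

```python
import numpy as np


def student_t_weights(d2, nu, p):
    """Student-t style weights: (nu + p) / (nu + d^2)."""
    return (nu + p) / (nu + d2)


def student_t_scatter(X, nu=3.0, n_iter=100, tol=1e-8):
    """Iteratively reweighted location and scatter with Student-t weights."""
    n, p = X.shape
    location = np.median(X, axis=0)
    scatter = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        centered = X - location
        d2 = np.einsum("ij,ji->i", centered, np.linalg.solve(scatter, centered.T))
        w = student_t_weights(d2, nu, p)  # far points get small weights
        new_location = (w[:, None] * X).sum(axis=0) / w.sum()
        centered = X - new_location
        new_scatter = (w[:, None] * centered).T @ centered / n
        converged = np.linalg.norm(new_scatter - scatter) < tol
        location, scatter = new_location, new_scatter
        if converged:
            break
    return location, scatter, w


rng = np.random.default_rng(8)
X = rng.standard_t(df=3, size=(400, 4))
X[:20] += 8.0  # injected outliers in the first 20 rows

location, scatter, w = student_t_scatter(X, nu=3.0)
print("mean weight, injected rows:", w[:20].mean())
print("mean weight, other rows:   ", w[20:].mean())
```

A point with $d_i^2 = p$ receives weight exactly 1, points closer to the center receive more, and distant points are progressively downweighted. Smaller $\nu$ makes that downweighting more aggressive.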
Tyler-style estimators use a shape equation of the form