RobustStats

Location estimators:

  • bisquareWM - Mean with weights given by the bisquare rho function.
  • huberWM - Mean with weights given by Huber's rho function.
  • trimean - Tukey's trimean, the average of the median and the midhinge.
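
As a rough illustration of the trimean definition above, here is a minimal Python sketch. It is not the package's Julia code: the function name and the choice of quartile convention (the "inclusive" method of Python's `statistics.quantiles`) are assumptions made for the example.

```python
import statistics


def trimean(data):
    """Tukey's trimean: the average of the median and the midhinge."""
    # Quartile conventions differ between libraries; the "inclusive" method
    # used here is an assumption, not necessarily what RobustStats uses.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    midhinge = (q1 + q3) / 2  # average of the first and third quartiles
    return (q2 + midhinge) / 2


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
spoiled = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # same data with one gross outlier

# The trimean is unchanged by the outlier, while the ordinary mean is not.
print(trimean(clean), statistics.mean(clean))
print(trimean(spoiled), statistics.mean(spoiled))
```

Because the trimean depends only on the quartiles and the median, a single extreme value in the tail does not move it.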

Dispersion estimators:

  • shorthrange - Length of the shortest closed interval containing at least half the data.
  • scaleQ - Normalized Rousseeuw & Croux Q statistic, based on the 25th percentile of all pairwise distances.
  • scaleS - Normalized Rousseeuw & Croux S statistic, based on the median, taken over all points, of each point's median distance to the other points.
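
The three definitions above can be sketched naively in Python. This is an illustration only, not the package's Julia code: the function names are hypothetical, the normalization constants are omitted (so these are "raw" values), and the tie-breaking conventions (low/high medians, and the order statistic k = C(h, 2) with h = floor(n/2) + 1 for Q) are assumptions based on the usual presentation of these statistics. All three run in O(n²) time or worse, which is fine for a demonstration but not for large data.

```python
import math
import statistics
from itertools import combinations


def shorth_range(data):
    """Length of the shortest closed interval holding at least half the data."""
    xs = sorted(data)
    n = len(xs)
    k = math.ceil(n / 2)  # number of points the interval must cover
    # Slide a window of k consecutive sorted points and keep the narrowest.
    return min(xs[i + k - 1] - xs[i] for i in range(n - k + 1))


def raw_s(data):
    """Unnormalized S: median over i of the median over j of |x_i - x_j|."""
    # Assumption: high median for the inner step, low median for the outer.
    inner = [statistics.median_high([abs(xi - xj) for xj in data]) for xi in data]
    return statistics.median_low(inner)


def raw_q(data):
    """Unnormalized Q: roughly the 25th percentile of all pairwise distances."""
    n = len(data)
    h = n // 2 + 1
    k = math.comb(h, 2)  # assumed order statistic, about a quarter of all pairs
    dists = sorted(abs(a - b) for a, b in combinations(data, 2))
    return dists[k - 1]


data = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # one gross outlier
print(shorth_range(data), raw_s(data), raw_q(data))
print(statistics.stdev(data))  # for contrast: dominated by the outlier
```

All three raw values stay on the scale of the bulk of the data, while the sample standard deviation is inflated by the single outlier.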

Utility functions:

  • _weightedhighmedian - Weighted median (breaks ties by rounding up). Used in scaleQ.
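
To make "breaks ties by rounding up" concrete, here is a minimal Python sketch of a weighted high median. It is a simple sort-based version written for illustration; the function name is hypothetical and the package's internal algorithm may well differ.

```python
def weighted_high_median(values, weights):
    """Smallest value whose cumulative weight exceeds half the total weight."""
    pairs = sorted(zip(values, weights))  # sort values, carrying their weights
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        # Strict inequality: when the cumulative weight lands exactly on
        # half, move on to the next (larger) value, i.e. round up.
        if running > half:
            return value
    raise ValueError("weights must be non-empty and positive")


# Equal weights, even count: the tie between 2 and 3 resolves upward to 3.
print(weighted_high_median([1, 2, 3, 4], [1, 1, 1, 1]))
# A heavy weight pulls the median toward its value.
print(weighted_high_median([10, 20, 30], [5, 1, 1]))
```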

Recommendations:

For location, consider bisquareWM with k=3.9*sigma if you can make a reasonable guess at the "Gaussian-like width" sigma (the dispersion estimators can supply one). Otherwise, trimean is a good second choice, though less efficient.
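
To show what a bisquare-weighted mean with k=3.9*sigma does, here is a minimal Python sketch using iterative reweighting. It is not the package's Julia code: the function name, the median starting point, the tolerance, and the iteration cap are all assumptions chosen for the example.

```python
import statistics


def bisquare_weighted_mean(data, k, tol=1e-9, max_iter=100):
    """Iteratively reweighted mean with bisquare (Tukey biweight) weights."""
    mu = statistics.median(data)  # robust starting point (an assumption)
    for _ in range(max_iter):
        weights = []
        for x in data:
            u = (x - mu) / k
            # Bisquare weight: (1 - u^2)^2 inside the cutoff, zero outside.
            weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
        total = sum(weights)
        if total == 0:
            return mu  # every point rejected; fall back to current estimate
        new_mu = sum(w * x for w, x in zip(weights, data)) / total
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


data = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]  # one gross outlier at 50
sigma_guess = 0.2                           # rough "Gaussian-like width"
estimate = bisquare_weighted_mean(data, k=3.9 * sigma_guess)
print(estimate, statistics.mean(data))
```

Points farther than k from the current estimate get zero weight, so the outlier at 50 is ignored entirely, while points near the center are weighted almost equally.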

For dispersion, scaleS is a good general choice, though scaleQ is very efficient for nearly Gaussian data. The MAD is the most robust, though less efficient. If scaleS doesn't work, shorthrange is a good second choice.

The first reference on scaleQ and scaleS (below) is a lengthy discussion of the tradeoffs among scaleQ, scaleS, the shortest half, and the median absolute deviation (MAD; see StatsBase.mad for a Julia implementation). All four have the maximum possible breakdown point, 50%. This means that replacing up to 50% of the data with unbounded bad values still leaves the statistic bounded. For Gaussian distributions, Q is more efficient than S, and S is more efficient than MAD. The influence of a single bad point, and the bias due to a fraction of bad points, are only slightly larger for Q or S than for MAD. Unlike MAD, the other three do not implicitly assume a symmetric distribution.
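
The breakdown-point claim can be checked with a small Python experiment. This is an illustrative sketch only: `raw_mad` is a hypothetical helper that computes the unnormalized MAD, not the StatsBase or RobustStats function.

```python
import statistics


def raw_mad(data):
    """Unnormalized median absolute deviation from the median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Replace 4 of the 9 points (just under half) with an enormous bad value.
contaminated = [1, 2, 3, 4, 5, 1e9, 1e9, 1e9, 1e9]

print(raw_mad(clean), statistics.stdev(clean))
print(raw_mad(contaminated), statistics.stdev(contaminated))
```

With just under half the data corrupted, the raw MAD stays on the scale of the good points, whereas the sample standard deviation grows with the size of the bad values.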

To choose between Q and S, the authors note that Q has higher statistical efficiency, but S is typically twice as fast to compute and has lower gross-error sensitivity. An interesting advantage of Q over the others is that its influence function is continuous. For a rough idea of the efficiency, the large-N limit of the standardized variance of each quantity is 2.722 for MAD, 1.714 for S, and 1.216 for Q, relative to 1.000 for the standard deviation (given Gaussian data). The paper also gives the ratios for Cauchy and exponential distributions; the efficiency advantage of Q is smaller for Cauchy than for the other distributions.
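
The standardized variances quoted above translate directly into relative efficiencies (the reciprocal of each value). A short Python calculation, using only the numbers given in the text:

```python
# Large-N standardized variances for Gaussian data, as quoted above.
standardized_variance = {"MAD": 2.722, "S": 1.714, "Q": 1.216, "std": 1.000}

# Relative efficiency is the reciprocal of the standardized variance.
efficiency = {name: 1.0 / v for name, v in standardized_variance.items()}

for name, eff in efficiency.items():
    # Since variance scales as 1/N, the standardized variance is also the
    # factor by which the sample must grow to match the standard deviation.
    factor = standardized_variance[name]
    print(f"{name}: efficiency {eff:.1%}, needs about {factor:.3f}x the samples")
```

So, for Gaussian data, MAD is roughly 37% efficient, S roughly 58%, and Q roughly 82%.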

References:

Created on April 16, 2015. Updated January 24, 2017 for Julia v0.5.

Joe Fowler, NIST Boulder Laboratories
