Benjamin Weiner

bjw - at- as.arizona.eduSteward Observatory

Least-squares fitting straight lines of the form ` y = A + Bx `
to a collection of data points is a long-standing statistical problem
with a surprising amount of disagreement on the best method. If you
are just beginning with least squares fitting, I recommend reading
a standard book such as Bevington's *Data Reduction and Error Analysis*
and then the chapter on model fitting in Press et al *Numerical Recipes*.
The Press et al. treatment takes you up to using chi-squared fitting
to fit a straight line
model to data points that have errors in both x and y, where
the errors are different from point to point (what statisticians
call "heteroscedastic").

However, Press et al. does not tell you
what to do if the scatter about the best-fit model is greater than
can be explained by the measurement errors. That is what this
page and the associated programs attempt to address. If there
is intrinsic scatter in the data, the model of a ideal straight
line distribution (with no thickness) is not a good description
of the data: you will get reduced chi-squared much greater
than 1. This makes error estimation hard and can bias your
best-fit values. The approach I have taken is to fit the data
data with a model distribution that has some width, essentially
` P(y) ~ exp (-(y-ypred)^2 / (2s^2) )`
where `ypred = A + Bx` is the ridgeline of the distribution
and `s` is the intrinsic scatter, the gaussian width about
the ridgeline.

In the course of a project on the evolution of the Tully-Fisher relation between galaxies' luminosities and internal kinematic velocities, I found that the results can be sensitive to fit method and to the amount of intrinsic scatter in the data (scatter beyond that attributed to the measurement errors). I wrote a few computer programs which attempt to do least-squares fits of linear relations in a justifiable way. This page is for distributing these programs so that others may use them.

Some derivation and testing of fitting algorithms are discussed in my Paper II on the evolution of the Tully-Fisher relation, astro-ph/0609091 (2006, ApJ, 653, 1049; for non astronomers, the reference is to the Astrophysical Journal) and users should read the Appendix of that paper, and the sections which discuss the properties and testing of fits, to understand the problems and algorithms. I would like to thank Greg Novak for a number of illuminating discussions on the subtleties of least-squares fitting.

Here I distribute three programs, `LSTSQ`, `LSTSQ_BOOTSTRAP`,
and `MLSFIT`. The `LSTSQ` programs are essentially front-ends
to Numerical Recipes chi-squared fitting routines, with an extension
(which can be very important) to handle fitting linear relations
that have significant intrinsic scatter. The `MLSFIT` program
is an implementation of what is basically a maximum-likelihood
method that is derived in the Appendix of Paper II. This turns out
to be very similar to the extended NR routine when everything is
Gaussian and some other conditions are met. However, `MLSFIT`
can be generalized to non-Gaussian errors or intrinsic scatter,
and can be used, modified and distributed freely since it is not
encumbered by the restrictive Numerical Recipes license.

`LSTSQ` can handle fitting linear relations ` y = A + Bx`
to x,y data with: (0) no errors; (1) errors on y; (2) errors on x and y;
or (3) errors on x and y and intrinsic scatter in y. The input file is a
text file with columns: (0) x, y; (1) x, y, y-error; or (2,3) x, x-error,
y, y-error. `LSTSQ_BOOTSTRAP` does the same fit but
then offers to do N bootstrap realizations (sampling with replacement)
and fits on your data set. This is useful for measuring errors
on the fitted parameters A and B, and the covariance between them.

`MLSFIT` takes as input a file with x, x-error, y, y-error
and does a brute-force grid search through the parameters A, B,
and the scatter S in Y. This is slower than the chi-squared minimum
finding fitting routines used in `LSTSQ`, of course, but
provides a good visual indication of what the fit is doing.
It can also be modified to work on non-Gaussian problems for
which chi-squared minimization does not directly apply.