## Fitting lines to data using least squares and intrinsic scatter

Benjamin Weiner

` bjw - at- as.arizona.edu `
Steward Observatory

Least-squares fitting straight lines of the form y = A + Bx to a collection of data points is a long-standing statistical problem with a surprising amount of disagreement on the best method. If you are just beginning with least squares fitting, I recommend reading a standard book such as Bevington's Data Reduction and Error Analysis and then the chapter on model fitting in Press et al Numerical Recipes. The Press et al. treatment takes you up to using chi-squared fitting to fit a straight line model to data points that have errors in both x and y, where the errors are different from point to point (what statisticians call "heteroscedastic").

However, Press et al. does not tell you what to do if the scatter about the best-fit model is greater than can be explained by the measurement errors. That is what this page and the associated programs attempt to address. If there is intrinsic scatter in the data, the model of a ideal straight line distribution (with no thickness) is not a good description of the data: you will get reduced chi-squared much greater than 1. This makes error estimation hard and can bias your best-fit values. The approach I have taken is to fit the data data with a model distribution that has some width, essentially P(y) ~ exp (-(y-ypred)^2 / (2s^2) ) where ypred = A + Bx is the ridgeline of the distribution and s is the intrinsic scatter, the gaussian width about the ridgeline.

In the course of a project on the evolution of the Tully-Fisher relation between galaxies' luminosities and internal kinematic velocities, I found that the results can be sensitive to fit method and to the amount of intrinsic scatter in the data (scatter beyond that attributed to the measurement errors). I wrote a few computer programs which attempt to do least-squares fits of linear relations in a justifiable way. This page is for distributing these programs so that others may use them.

Some derivation and testing of fitting algorithms are discussed in my Paper II on the evolution of the Tully-Fisher relation, astro-ph/0609091 (2006, ApJ, 653, 1049; for non astronomers, the reference is to the Astrophysical Journal) and users should read the Appendix of that paper, and the sections which discuss the properties and testing of fits, to understand the problems and algorithms. I would like to thank Greg Novak for a number of illuminating discussions on the subtleties of least-squares fitting.

Here I distribute three programs, LSTSQ, LSTSQ_BOOTSTRAP, and MLSFIT. The LSTSQ programs are essentially front-ends to Numerical Recipes chi-squared fitting routines, with an extension (which can be very important) to handle fitting linear relations that have significant intrinsic scatter. The MLSFIT program is an implementation of what is basically a maximum-likelihood method that is derived in the Appendix of Paper II. This turns out to be very similar to the extended NR routine when everything is Gaussian and some other conditions are met. However, MLSFIT can be generalized to non-Gaussian errors or intrinsic scatter, and can be used, modified and distributed freely since it is not encumbered by the restrictive Numerical Recipes license.

LSTSQ can handle fitting linear relations y = A + Bx to x,y data with: (0) no errors; (1) errors on y; (2) errors on x and y; or (3) errors on x and y and intrinsic scatter in y. The input file is a text file with columns: (0) x, y; (1) x, y, y-error; or (2,3) x, x-error, y, y-error. LSTSQ_BOOTSTRAP does the same fit but then offers to do N bootstrap realizations (sampling with replacement) and fits on your data set. This is useful for measuring errors on the fitted parameters A and B, and the covariance between them.

MLSFIT takes as input a file with x, x-error, y, y-error and does a brute-force grid search through the parameters A, B, and the scatter S in Y. This is slower than the chi-squared minimum finding fitting routines used in LSTSQ, of course, but provides a good visual indication of what the fit is doing. It can also be modified to work on non-Gaussian problems for which chi-squared minimization does not directly apply.