Readme file for lstsq.f and lstsq_bootstrap.f

Benjamin Weiner, July 2006

Introduction:

The LSTSQ programs are a simple front end to least-squares
routines for fitting straight lines to (X,Y) data with errors in
none, one, or both coordinates.  The fitting routines themselves
are from Numerical Recipes, which means you need a copy of the NR
routines to compile them from source.  You also need the PGPLOT
library for plotting.  Chapter 15, "Modeling of Data," in NR
edition 2, describes the fitting subroutines in greater detail
and should be read by everyone who uses them.  (Please note
that while I think the NR code license is oppressively cumbersome,
the book is quite useful as an introduction to the techniques.)

These programs are provided freely to allow better parameter
fitting for all.  They are licensed under the GNU GPL; you may
modify and redistribute them, but must retain the copyright notice.
The fit algorithms are described in Numerical Recipes ch. 15; the
desirability and derivation of an intrinsic scatter error-term
are discussed in Weiner et al. 2006 (ApJ, submitted, astro-ph/------);
if you rely heavily on this fitting method, refer to that 
paper as background.

To compile:

Extract the tar file to get Readme, source, Makefile, and
Linux binaries.  On a linux system you should be able to run
the binaries, but you need to setenv PGPLOT_DIR [local-pgplot-dir]
or PGPLOT won't find its fonts.  To compile, you need to
edit the Makefile to point at your local copies of pgplot 
and Numerical Recipes libraries.

Input:

The input format for lstsq is a text file with one data point per line.
Columns are, for the choices of problem type:
0.  X, Y
1.  X, Y, Y-error
2.  X, X-error, Y, Y-error
3.  X, X-error, Y, Y-error

The file does not have to be formatted but lines must be <80 chars long.
Lines that begin with '#' are ignored.  Note that due to max 
array dimensions in fitexy.f, fitting with errors on both X and Y
only works for 1000 points or fewer.

When you run lstsq, it asks you whether you want to use errors
on: none; Y; X and Y; or X and Y with intrinsic scatter added as
an extra component of Y-error.  Type 0, 1, 2, or 3 for your choice,
then the data file name.  [If you chose 3, you then also type
the scatter to add.]  lstsq then plots the data and asks what
X range you want to use in the fit, plots the fit, and prints the
best-fit values of A and B, the intercept and slope of the 
best-fit line.

lstsq_bootstrap operates the same way but after the fit, it
allows you to do N fits to bootstrap resamples (with replacement)
of the data set.  This is a good way of estimating errors
on the fitted intercept and slope, especially if your error bars 
are questionable or the best fit has a reduced chi-squared > 1 
(i.e. the sample has intrinsic scatter or your error bars are 
too small).  

Output:

lstsq writes an output file, lstsq.out, with data and fit values.
lstsq_bootstrap also writes a lstsq_bootstrap.out file with the
fitted values of A_boot and B_boot for each bootstrap resample.  
It is useful to plot A_boot and B_boot versus each other to
see the covariance of fitted parameters.  I strongly suggest
subtracting a constant zeropoint from X to make the mean/median 
X close to zero; this greatly reduces covariance between A and B.
For example, if you are fitting to radius as a function of 
magnitude and the magnitudes are all between 18 and 22, subtract
20 from the mags so you are fitting radius as a function of
(magnitude-20).

Fitting with scatter:

Choice 3 allows you to add a component of "intrinsic" scatter
to the fit.  This is unfortunately not discussed in Numerical
Recipes.  Many relations - for example the Tully-Fisher relation,
or black hole mass vs. galaxy properties - have scatter beyond
the observational errors, i.e. the "true" relation is not perfect.
This causes at least two problems when fitting lines to data.

1. The first is that the fitted line has chi-sq/N>1, sometimes
by a lot.  This indicates that the model (perfect line) is not
a good model of the data, and in practice, means that the 
error estimates on the fitted parameters a and b will be too small.
The true errors can be estimated by bootstrapping, but that doesn't
fix problem 2.

2. The second problem is that, because the model is not a good
model, it can be biased.  For example, if there is a fair amount
of intrinsic scatter (and especially if there are selection limits
on the X variable) it increases the range in Y of the data.  Fitting
without taking scatter into account, the algorithm will favor
a steeper slope of Y/X to account for the range in Y.  This can
be shown/tested for various fit algorithms with Monte Carlo simulations
(see Weiner et al. 2006 for an example; also discussed in Novak
et al. 2006).

One cure for this problem is to add the intrinsic scatter
as an extra error, which is added in quadrature to the Y-error.
The intrinsic scatter is determined such that total
chisq/N = 1.
Some discussions of this issue are in Tremaine et al.
(2002, ApJ, 574, 740) and Novak et al. (2006, ApJ, 637, 96).
The underlying assumptions and a derivation of the basis 
for adding scatter to the Y-error are given in Weiner et al. 
(2006, ApJ in press, astro-ph/------) and references therein.
A different approach to fitting with scatter was given
by Akritas & Bershady (1996, ApJ, 470, 706), although the
preceding papers found some issues with that method.

To conclude, if your relation has intrinsic scatter,
fitting with the scatter term is a good thing to do;
use trial and error to figure out what scatter you need
to make reduced chi-squared = 1.  HOWEVER, if you do this,
if your sample has selection limits, make the limited
variable the X variable.  For example if one of the
variables is magnitude and there is a mag limit, make magnitude
X and add the scatter to Y.  See Weiner et al. (2006)
for more discussion of these effects (also see papers
that discuss fitting the "inverse Tully-Fisher" relation,
i.e. velocity as function of magnitude, and discuss
"incompleteness bias," e.g. Giovanelli et al 1997, AJ 113, 53).

Summary:

- Figure out your variables and the errors
- Subtract a zeropoint so the X values average ~ zero to reduce
  covariance
- Fit
- Use bootstrapping to estimate errors and covariance
- If the best fit has chisq/N > 1, consider adding an intrinsic
  scatter term
- If you need/add an intrinsic scatter term, be careful about the
  effect of selection limits.  If you have selection limits on one
  variable, make that variable X.  Even if you don't need the
  intrinsic scatter, if you have big errors on the coordinate
  that has a selection limit, the fit can be biased.