When we perform a regression fit of a straight line to a set of (x,y) data points we typically minimize the sum of squares of the "vertical" distances between the data points and the line. In other words, taking x as the independent variable, we minimize the sum of squares of the errors in the dependent variable y. However, this isn't the only possible approach. For example, we might choose to minimize the "horizontal" distances from the points to the line (i.e., the errors in the x variable), or the "perpendicular" distances to the line.

If we regard each data point (x,y) as a sample, and if we assume the sample is taken at the precise value of the independent variable x, then it is sensible to regard each data point as being at exactly the correct x coordinate, with all the error in the sampled value of the dependent coordinate y. On the other hand, if there is some uncertainty in the value of x for each sample, then conceptually it could make sense to take this into account when performing the regression to get the "best" fit.

If the errors in both x and y are random (e.g., normally distributed) then one might think we could sweep up the error in x as just one more contribution to the measured error in y, so the fitted line should be the same. However, this is not generally the case, as can be seen by considering the simple example of three (x,y) data points (0,0), (10,4), (10,8). To minimize the sum of squares of the errors in the y variable, the line must clearly pass through (0,0) and (10,6), whereas to minimize the sum of squares of the errors in the x variable the optimum line must be tilted more steeply and not pass through (0,0): it has slope 4/5 rather than 3/5, and crosses the y axis at -4/3. Similarly, if we minimize the sum of squares of the "perpendicular" distances to the line, we get yet another line.
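The three-point example can be checked numerically. The following is only a minimal sketch, using the ordinary closed-form least-squares formulas; the function names fit_y_on_x and fit_x_on_y are hypothetical, chosen here for illustration. The horizontal fit is obtained by swapping the roles of x and y and then re-expressing the result as a line y = slope*x + intercept, which assumes the fitted line is neither horizontal nor vertical.

```python
def fit_y_on_x(points):
    """Minimize the squared vertical (y) errors; return (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_x_on_y(points):
    """Minimize the squared horizontal (x) errors; return (slope, intercept)
    of the same line written as y = slope*x + intercept."""
    # Fit x = a*y + b by swapping coordinates, then invert to y = (x - b)/a.
    a, b = fit_y_on_x([(y, x) for x, y in points])
    return 1 / a, -b / a


points = [(0, 0), (10, 4), (10, 8)]
print("vertical errors:  ", fit_y_on_x(points))
print("horizontal errors:", fit_x_on_y(points))
```

Up to floating-point rounding, the first fit has slope 0.6 and passes through the origin, while the second has slope 0.8 and an intercept of about -1.33, so the two criteria really do select different lines.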
However, the meaning of "perpendicular" is ambiguous because in general the units of x and y may be different, and so the "angles" of lines in the abstract "xy plane" do not have any absolute significance. For example, if x is time, and y is intensity, we can plot the data points with different scalings, so there is no unique notion of "perpendicular" in the time-intensity plane. In order to make the best fit, we need to scale the plot axes (conceptually) such that the variances of the errors in the x and y variables are numerically equal. Once we have done this, it makes sense to treat the results as geometrical points and find the line that minimizes the sum of squares of the perpendicular distances between the points and the line. Of course, this requires us to know the variances of the error distributions. If we don't, then the "best" line will be ambiguous. This is presumably why it is common practice to simply fit the dependent variable, since we usually don't have sufficient information to know, a priori, how the variances of the x and y errors are related.

If we are given a set of (x,y) data points, and we somehow have sufficient information to scale them so the distributions of errors in the x and y variables have equal variances, then we can proceed to fit a line using "perpendicular" regression. One way of approaching this is to find the "principal directions" of the data points. Let's say we have the suitably scaled (x,y) coordinates of n data points. To make it simple, let's first compute the average of the x values, and the average of the y values, calling them X and Y respectively. The point (X,Y) is the centroid of the set of points. Then we can subtract X from each of the x values, and Y from each of the y values, so now we have a list of n data points whose centroid is (0,0). To find the principal directions, imagine rotating the entire set of points about the origin through an angle q.
This sends the point (x,y) to the point (x',y') where

     x' =  x cos(q) + y sin(q)
     y' = -x sin(q) + y cos(q)

Now, for any fixed angle q, the sum of the squares of the vertical heights of the n transformed data points is S = SUM [y']^2, and we want to find the angle q that minimizes this. (We can look at this as rotating the regression line so the perpendicular corresponds to the vertical.) To do this, we take the derivative with respect to q and set it equal to zero. The derivative of [y']^2 is 2y'(dy'/dq), so we have

     dS/dq = 2 SUM [-x sin(q) + y cos(q)][-x cos(q) - y sin(q)]

We set this to zero, so we can immediately divide out the factor of 2. Then, expanding out the product and collecting terms into separate summations gives

     [SUM xy] sin(q)^2 + [SUM (x^2 - y^2)] sin(q)cos(q) - [SUM xy] cos(q)^2 = 0

Dividing through by cos(q)^2, we get a quadratic equation in tan(q):

     {xy} tan(q)^2 + {x^2 - y^2} tan(q) - {xy} = 0

where the "curly braces" indicate that we take the sum of the contents over all n data points (x,y). Dividing through by the sum {xy} (assuming it is not zero) gives

     tan(q)^2 + A tan(q) - 1 = 0

where A = {x^2 - y^2}/{xy}. Solving this quadratic for tan(q) gives two solutions, which correspond to the "principal directions", i.e., the directions in which the "scatter" is maximum and minimum. Since the product of the two roots is -1, these two directions are perpendicular to each other. We want the minimum, i.e., the root that gives the smaller value of S.

Just to illustrate on a trivial example, suppose we have three data points (5,5), (6,6), and (7,7). First compute the centroid, which is (6,6), and then subtract this from each point to give the new set of points (-1,-1), (0,0), and (1,1). Then we can tabulate the sums:

       x     y     x^2 - y^2     xy
      ---   ---    ---------    ----
      -1    -1         0          1
       0     0         0          0
       1     1         0          1
                     -----      -----
      sums:            0          2

In this simple example we have {x^2 - y^2} = 0 and {xy} = 2, which means that A = 0, so our equation for the principal directions is simply

     tan(q)^2 - 1 = 0

Thus the two roots are tan(q) = 1 and tan(q) = -1, which correspond to the angles +45 degrees and -45 degrees.
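As a quick numerical check of this trivial example, we can solve the quadratic in tan(q) and evaluate the scatter S at each root. This is only a sketch: the function names scatter and principal_tangents are hypothetical, and the second function assumes the points are already centered and that {xy} is not zero.

```python
import math


def scatter(points, q):
    """S(q): sum of squared heights y' after rotating centered points by q."""
    return sum((-x * math.sin(q) + y * math.cos(q)) ** 2 for x, y in points)


def principal_tangents(points):
    """Both roots of tan(q)^2 + A tan(q) - 1 = 0 for centered points.

    Assumes {xy} != 0, since A = {x^2 - y^2}/{xy}.
    """
    sxy = sum(x * y for x, y in points)
    a = sum(x * x - y * y for x, y in points) / sxy
    disc = math.sqrt(a * a + 4.0)
    return (-a + disc) / 2.0, (-a - disc) / 2.0


centered = [(-1, -1), (0, 0), (1, 1)]
for t in principal_tangents(centered):
    q = math.atan(t)
    print(f"tan(q) = {t:+.3f}   q = {math.degrees(q):+.1f} deg   "
          f"S = {scatter(centered, q):.3f}")
```

The root tan(q) = +1 gives S equal to zero (up to rounding), so it is the minimum, while tan(q) = -1 gives S = 4, the maximum.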
These two angles make sense, because our original data points make a 45 degree line, so if we rotate them 45 degrees clockwise they are flat, whereas if we rotate them 45 degrees the other way they are vertically arranged. These are the two principal directions of this set of 3 points. The "best" fit through the original three points is a 45 degree line through the centroid - which is obvious in this trivial example, but the method works in general with arbitrary sets of points.

For another example, suppose we have four data points (2,6), (4,2), (16,8), and (14,12). The centroid of these points is (9,7), so we can subtract this from each point to give the new set of points (-7,-1), (-5,-5), (7,1), and (5,5). Then we can tabulate the sums:

       x     y      xy     x^2 - y^2
      ---   ---    ----    ---------
      -7    -1       7        48
      -5    -5      25         0
       7     1       7        48
       5     5      25         0
                   ----      ----
      sums:         64        96

In this case we have {xy} = 64 and {x^2 - y^2} = 96, which gives A = 3/2, so our equation for the principal directions is

     tan(q)^2 + (3/2) tan(q) - 1 = 0

The two roots are tan(q) = 1/2 and tan(q) = -2, which correspond to the angles +26.565 degrees and -63.435 degrees. This is consistent with the fact that our original four data points are the vertices of a rectangle whose edges have the slopes 1/2 and -2. The "best" fit through these four points is a line through the centroid with a slope of 1/2.

(It's interesting that the two quantities which characterize the points, namely xy and x^2 - y^2, are both hyperbolic conic forms, and they constitute the invariants of Minkowski spacetime when expressed in terms of null coordinates and spatio-temporal coordinates, respectively.)
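The whole procedure (centroid, sums, minimizing direction) can be collected into one short routine. The following is only a sketch under stated assumptions: the name perpendicular_fit is hypothetical, and the points are assumed to be already scaled so that the x and y errors have equal variance. Instead of solving the quadratic and comparing the two roots, it uses the double-angle form tan(2q) = 2{xy}/{x^2 - y^2}, which follows from the same condition dS/dq = 0 but is not written out in the text above. Taking the branch with atan2 selects the direction of minimum scatter and also covers the cases where {xy} or {x^2 - y^2} is zero.

```python
import math


def perpendicular_fit(points):
    """Return ((X, Y), q): the centroid and the angle q (in radians) of the
    line through it that minimizes the sum of squared perpendicular distances.

    Assumes x and y are scaled so their error variances are equal.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Sums over the centered points: {xy} and {x^2 - y^2}.
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    d = sum((x - cx) ** 2 - (y - cy) ** 2 for x, y in points)
    # tan(2q) = 2{xy}/{x^2 - y^2}; atan2 picks the minimum-scatter branch.
    # If both sums are zero the direction is undefined and q comes out as 0.
    q = 0.5 * math.atan2(2.0 * sxy, d)
    return (cx, cy), q


centroid, q = perpendicular_fit([(2, 6), (4, 2), (16, 8), (14, 12)])
print("centroid:", centroid)
print("slope:", math.tan(q), " angle (deg):", math.degrees(q))
```

For the four rectangle vertices this reproduces the centroid (9,7) and, up to rounding, the slope 1/2 at about 26.565 degrees. Returning the angle rather than the slope lets the routine also represent a vertical best-fit line, for which the slope would be infinite.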