With an understanding of the method of least squares and QR decomposition, we’re now ready to look at one of its most practical applications: least squares fitting. In this tutorial, we’ll explore straight-line fitting and polynomial least squares fitting, both by hand and with Python. But before we get to the problems, we need a little background and some more theory.

What Is Least Squares Fitting?

In least squares fitting, we have some function ff that takes nn-vectors as its inputs and maps them to real numbers. We don’t really know anything about the function itself and what it does under the hood. It’s your classic black box: You feed some vector xx to the function, and it spits out a yy in response:

The black-box model of a function.

Our goal in least squares fitting is to try to model ff as closely as possible, based on the input-output data pairs that we’re given. Typically, we use the following notation for our data, with (x(i),y(i))(x^{(i)}, y^{(i)}) denoting the ii-th data pair:

x(1),x(2),...,x(N)y(1),y(2),...,y(N)x^{(1)}, x^{(2)}, ..., x^{(N)}\\ y^{(1)}, y^{(2)}, ..., y^{(N)}

Here, NN is the number of data points (i.e., the size of our data set), while nn is the size of each input vector, x(i)x^{(i)}. Keep that in mind because these two are not necessarily the same.

Note: We use superscripts in parentheses to denote data pairs. Note that subscripts are usually reserved for the elements of a vector. So in this case, the first element of the second input vector would be represented as x1(2)x^{(2)}_1.

The typical example used in an introductory machine learning class is the house price index data set. You have NN data pairs of the form (x(i),y(i))(x^{(i)}, y^{(i)}). You feed your feature vector x(i)x^{(i)} to your function, and it produces some corresponding scalar value, y(i)y^{(i)}, in response. In this case, x(i)x^{(i)} may be a set of measurements for the home: the number of bedrooms, the number of bathrooms, its age, and so on. The corresponding output is y(i)y^{(i)}, which denotes the price of the home—the real price, not a prediction.

y=f(x)y = f(x)

Now, as I mentioned earlier, we rarely ever know what ff is. So what we’ll do is model the relationship between each x(i)x^{(i)} and y(i)y^{(i)} as closely as we can. We approximate their relationship with a model function that we call f^\hat{f}:

yf^(x)y \approx \hat{f}(x)

Note that yy and xx are simply placeholders. Since we’re really given NN data points—x(i)x^{(i)} and y(i)y^{(i)}—we should write out the expanded form of the above by plugging in each data pair. That’ll give us a clearer picture of what’s going on:

y(1)f^(x(1))y(2)f^(x(2))y(N)f^(x(N))y^{(1)} \approx \hat{f}(x^{(1)})\\ y^{(2)} \approx \hat{f}(x^{(2)})\\ \ldots\\ y^{(N)} \approx \hat{f}(x^{(N)})

This is starting to look more like a system of equations. But we’re not quite there yet. How exactly do we pick f^\hat{f}?

Summary: Our goal in data fitting is to model the relationship between the inputs, x(i)x^{(i)}, and the outputs, y(i)y^{(i)}, as closely as possible using a model function, f^\hat{f}. Because remember, we don’t know the true relationship, ff.

Picking a Model Function for Data Fitting

Below is the general form of the model function f^\hat{f} used in least squares fitting:

f^(x)=θ1f1(x)+θ2f2(x)++θpfp(x)\hat{f}(x) = \theta_1 f_1(x) + \theta_2 f_2(x) + \ldots + \theta_p f_p(x)

If this looks confusing, don’t worry—it’s actually very simple.

First, notice that the model is a function like any other. In this case, though, it’s composed of pp parameters, θ1,,θp\theta_1, \ldots, \theta_p, as well as pp functions, f1(x),,fp(x)f_1(x), \ldots, f_p(x). In data fitting, these functions are called basis functions.

Okay, so how can we make this least squares model function more concrete?

Well, the good news is that we get to pick the basis functions fif_i based on how we think the real function, f(x)f(x), behaves. As we’ll see shortly, if ff appears to be linear in behavior, then we may decide to pick our basis functions such that f^\hat{f} ends up resembling a straight line. On the other hand, if ff appears to be quadratic, then we may pick our basis functions such that f^\hat{f} ends up being some sort of a polynomial.

Takeaway: We pick the basis functions based on how we think f(x)f(x) behaves. This is a key step in engineering a model function. In picking the basis functions, we also decide how many of them we’ll need. This is pp.

We pick the basis functions. The θ\theta values—the model parameters—are what we need to solve for.

Data Fitting: The System of Equations

Let’s plug this general form of f^\hat{f} into the earlier set of equations that we saw:

y(1)f^(x(1))=θ1f1(x(1))+θ2f2(x(1))++θpfp(x(1))y(2)f^(x(2))=θ1f1(x(2))+θ2f2(x(2))++θpfp(x(2))y(N)f^(x(N))=θ1f1(x(N))+θ2f2(x(N))++θpfp(x(N))y^{(1)} \approx \hat{f}(x^{(1)}) = \theta_1 f_1(x^{(1)}) + \theta_2 f_2(x^{(1)}) + \ldots + \theta_p f_p(x^{(1)})\\ y^{(2)} \approx \hat{f}(x^{(2)}) = \theta_1 f_1(x^{(2)}) + \theta_2 f_2(x^{(2)}) + \ldots + \theta_p f_p(x^{(2)})\\ \ldots\\ y^{(N)} \approx \hat{f}(x^{(N)}) = \theta_1 f_1(x^{(N)}) + \theta_2 f_2(x^{(N)}) + \ldots + \theta_p f_p(x^{(N)})

Now that’s more like it—this is a linear system of equations! Let’s represent it in matrix form:

[y(1)y(2)y(N)][f1(x(1))f2(x(1))fp(x(1))f1(x(2))f2(x(2))fp(x(2))f1(x(N))f2(x(N))fp(x(N))][θ1θ2θp]\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix} \approx \begin{bmatrix} f_1(x^{(1)}) & f_2(x^{(1)}) & \ldots & f_p(x^{(1)}) \\ f_1(x^{(2)}) & f_2(x^{(2)}) & \ldots & f_p(x^{(2)}) \\ \vdots & \ldots & \ddots & \vdots \\ f_1(x^{(N)}) & f_2(x^{(N)}) & \ldots & f_p(x^{(N)}) \\ \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{bmatrix}

Notice that our matrix has dimensions N×pN \times p. In practice, NN is often much, much larger than pp. Sound familiar? All this really means is that we have an overdetermined system—there’s no exact solution θ\theta to Aθ=yA\theta = y. This means that many data fitting problems are actually least squares problems—we need to find the θ^\hat{\theta} that gets us as close as possible to yy.

To summarize:

  • There exists some unknown relationship, ff, between x(i)x^{(i)} and y(i)y^{(i)}, such that f(x(i))=y(i)f(x^{(i)}) = y^{(i)}.
  • We approximate ff using f^(x)=θ1f1(x)+θ2f2(x)++θpfp(x)\hat{f}(x) = \theta_1 f_1(x) + \theta_2 f_2(x) + \ldots + \theta_p f_p(x).
  • We pick the basis functions f1,,fpf_1, \ldots, f_p based on how we think the real function ff behaves.
  • We solve for the parameters of our model—θ1,,θp\theta_1, \ldots, \theta_p—using the least squares method.

General Strategy for Solving Least Squares Problems

Here’s a simple five-step strategy you can use to solve least squares problems:

  1. Visualize the problem. For example, you may be given a set of data points that you can plot.
  2. Pick an appropriate model. Based on what we learned, this involves choosing the basis functions and pp.
  3. Identify the equations involved. Write them out explicitly based on your input and output pairs.
  4. Solve the overdetermined system using the least squares method.
  5. (Optional) Visualize the solution. This is a useful way to sanity check your answer, though it’s not fool-proof.

Note: For all the examples that follow, we’ll let n=1n = 1. That is, our input x(i)x^{(i)}s will just be scalar values. In reality, this changes nothing about the least squares method.

Example 1: Least Squares Straight-Line Fit

Suppose we’re given these data points for a least squares line fitting problem:

(1,1),(2,3),(3,3)=(x(1),y(1)),(x(2),y(2)),(x(3),y(3))(1, 1), (2, 3), (3, 3) = (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)})

We’re asked to model the relationship between xx and yy. Let’s take it step by step.

Step 1: Visualize the Problem

First, we’ll plot the points:

Plotting the three data points we were given.

We note that the points, while scattered, appear to have a linear pattern. Clearly, it’s not possible to fit an actual straight line to the points, so we’ll do our best to get as close as possible—using least squares, of course.

Step 2: Pick an Appropriate Model

We know ff appears linear, like a y=mx+by = mx + b equation. We want our model function to look something like this:

f^(x)=θ1+θ2x\hat{f}(x) = \theta_1 + \theta_2x

Note: Alternatively, you could just as well pick f^(x)=θ1x+θ2\hat{f}(x) = \theta_1 x + \theta_2. It won’t change the solution.

So, we revisit our general model:

f^(x)=θ1f1(x)+θ2f2(x)++θpfp(x)\hat{f}(x) = \theta_1 f_1(x) + \theta_2 f_2(x) + \ldots + \theta_p f_p(x)

And we pick our basis functions, as promised, to give f^\hat{f} a linear shape. We pick pp to be 22, such that we have:

f^(x)=θ1f1(x)+θ2f2(x)\hat{f}(x) = \theta_1 f_1(x) + \theta_2 f_2(x)

Next, we define our basis functions:

f1(x)=1f2(x)=xf_1(x) = 1\\ f_2(x) = x

What does this do for us? Let’s plug them into the general formula:

f^(x)=θ1+θ2x\hat{f}(x) = \theta_1 + \theta_2 x

That gives us precisely the function we wanted.

Note: You don’t have to be this explicit about how you select your basis functions. However, I recommend doing so because it allows you to verify that your reasoning is sound.

Step 3: Identify the Equations Involved

Here are all three equations for our problem:

y(1)=θ1+θ2x(1)y(2)=θ1+θ2x(2)y(3)=θ1+θ2x(3)y^{(1)} = \theta_1 + \theta_2 x^{(1)} \\ y^{(2)} = \theta_1 + \theta_2 x^{(2)} \\ y^{(3)} = \theta_1 + \theta_2 x^{(3)}

Let’s plug in our points:

1=θ1+θ2(1)3=θ1+θ2(2)3=θ1+θ2(3)1 = \theta_1 + \theta_2 (1) \\ 3 = \theta_1 + \theta_2 (2) \\ 3 = \theta_1 + \theta_2 (3)

And, in matrix form, this looks like the following:

[133]=[111213][θ1θ2]\begin{bmatrix} 1 \\ 3 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}

Step 4: Solve the Overdetermined System Using Least Squares

Three equations and two unknowns—this is an overdetermined system. How do we solve this system? Well, as we know, there’s no exact solution. But we can get the least squares solution by solving for θ\theta in this equation:

(ATA)θ=ATy(A^TA)\theta = A^Ty

Of course, we shouldn’t solve this directly without first using QR decomposition. If you perform the necessary steps for QR decomposition, you’ll get that:

A=QR=[13121301312][36302]A = QR = \begin{bmatrix} \frac{1}{\sqrt{3}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{3}} & 0 \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} \sqrt{3} & \frac{6}{\sqrt{3}} \\ 0 & \sqrt{2} \end{bmatrix}

You can verify this by performing matrix multiplication to see that you do in fact get AA back. It looks pretty nasty with all those square root terms, but they actually cancel out quite nicely as we’ll see here in a second.

Let’s plug A=QRA = QR into the least squares equation. Doing so yields the following simplified form:

Rθ=QTyR\theta = Q^Ty

Let’s plug in the actual matrices:

[36302][θ1θ2]=[13131312012][133]\begin{bmatrix} \sqrt{3} & \frac{6}{\sqrt{3}} \\ 0 & \sqrt{2} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \\ -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} 1 \\ 3 \\ 3 \end{bmatrix}

And let’s simplify the right-hand side of the equation:

[36302][θ1θ2]=[7322]\begin{bmatrix} \sqrt{3} & \frac{6}{\sqrt{3}} \\ 0 & \sqrt{2} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} \frac{7}{\sqrt{3}} \\ \frac{2}{\sqrt{2}} \end{bmatrix}

This is a square system! Even better, it’s an upper-triangular system—this means we can solve for θ2\theta_2 really easily and then plug it back into the first equation to solve for θ1\theta_1 (recall that this strategy is known as back-substitution). First, let’s explicitly write out the two equations:

3θ1+63θ2=730θ1+2θ2=22\sqrt{3}\theta_1 + \frac{6}{\sqrt{3}}\theta_2 = \frac{7}{\sqrt{3}} \\ 0\theta_1 + \sqrt{2}\theta_2 = \frac{2}{\sqrt{2}}

Solving the second equation, we get that θ2=1\theta_2 = 1.

Plug that into the first equation:

3θ1+63θ2=3θ1+63=73\sqrt{3}\theta_1 + \frac{6}{\sqrt{3}}\theta_2 = \sqrt{3}\theta_1 + \frac{6}{\sqrt{3}} = \frac{7}{\sqrt{3}}

Solving yields θ1=13\theta_1 = \frac{1}{3}. Excellent!

Step 5: Visualize the Solution

We have the following solution:

θ=[θ1θ2]=[131]\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} \frac{1}{3} \\ 1 \end{bmatrix}

Remember our model?

f^(x)=θ1+θ2x\hat{f}(x) = \theta_1 + \theta_2 x

Plugging those in yields the following straight-line equation:

f^(x)=13+x\hat{f}(x) = \frac{1}{3} + x

Let’s plot the best-fit line along with the points:

The best-fit line to the data we were given.

Awesome! This is the best-line fit for the data points we were given.

As you add more points, data fitting (particularly the QR factorization portion) becomes more difficult to do by hand. Fortunately, you can use languages like MATLAB or Python to solve these problems. But now, when you do rely on computers, you’ll at least know what they’re doing behind the scenes.

Example 2: Least Squares Polynomial Fitting (with Python!)

Let’s not stop there! Suppose instead that we are given these five data points:

(4,5),(0,1),(1,3),(2,9),(6,10)=(x(1),y(1)),(x(2),y(2)),(x(3),y(3)),(x(4),y(4)),(x(5),y(5))(-4, 5), (0, 1), (1, 3), (2, 9), (-6, 10) = (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), (x^{(4)}, y^{(4)}), (x^{(5)}, y^{(5)})

Let’s repeat the process.

Step 1: Visualize the Problem

Here’s a graph of our points:

Plotting the four data points we were given.

To me, these points seems to take on the shape of a parabola. Based on that observation, I’m going to perform a least squares polynomial fit using a polynomial of degree two (a quadratic, basically).

Step 2: Pick an Appropriate Model

Since we’re modeling a quadratic equation (degree-two polynomial), this is the general form of the model function we’ll aim for:

f^(x)=θ1+θ2x+θ3x2\hat{f}(x) = \theta_1 + \theta_2 x + \theta_3 x^2

To get that, we’ll start with the original form again:

f^(x)=θ1f1(x)+θ2f2(x)++θpfp(x)\hat{f}(x) = \theta_1 f_1(x) + \theta_2 f_2(x) + \ldots + \theta_p f_p(x)

And we’ll pick p=3p=3 with:

f1(x)=1f2(x)=xf3(x)=x2f_1(x) = 1\\ f_2(x) = x\\ f_3(x) = x^2

There we go! It’s that simple.

Note: Again, you could reverse the order of the polynomial to be f^(x)=θ1x2+θ2x+θ3\hat{f}(x) = \theta_1x^2 + \theta_2 x + \theta_3. This would change nothing except the order of the elements in your resulting matrix.

Step 3: Identify the Equations Involved

Here are all five equations for our polynomial fitting problem:

y(1)=θ1+θ2x(1)+θ3(x(1))2y(2)=θ1+θ2x(2)+θ3(x(2))2y(3)=θ1+θ2x(3)+θ3(x(3))2y(4)=θ1+θ2x(4)+θ3(x(4))2y(5)=θ1+θ2x(5)+θ3(x(5))2y^{(1)} = \theta_1 + \theta_2 x^{(1)} + \theta_3 (x^{(1)})^2 \\ y^{(2)} = \theta_1 + \theta_2 x^{(2)} + \theta_3 (x^{(2)})^2 \\ y^{(3)} = \theta_1 + \theta_2 x^{(3)} + \theta_3 (x^{(3)})^2 \\ y^{(4)} = \theta_1 + \theta_2 x^{(4)} + \theta_3 (x^{(4)})^2 \\ y^{(5)} = \theta_1 + \theta_2 x^{(5)} + \theta_3 (x^{(5)})^2

Let’s plug in the data we were given:

5=θ1+θ2(4)+θ3(4)21=θ1+θ2(0)+θ3(0)23=θ1+θ2(1)+θ3(1)29=θ1+θ2(2)+θ3(2)210=θ1+θ2(6)+θ3(6)25 = \theta_1 + \theta_2 (-4) + \theta_3 (-4)^2 \\ 1 = \theta_1 + \theta_2 (0) + \theta_3 (0)^2 \\ 3 = \theta_1 + \theta_2 (1) + \theta_3 (1)^2 \\ 9 = \theta_1 + \theta_2 (2) + \theta_3 (2)^2 \\ 10 = \theta_1 + \theta_2 (-6) + \theta_3 (-6)^2

I’ll simplify things a bit and represent this as a matrix equation:

[513910]=[14161001111241636][θ1θ2θ3]\begin{bmatrix} 5 \\ 1 \\ 3 \\ 9 \\ 10 \end{bmatrix} = \begin{bmatrix} 1 & -4 & 16 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & -6 & 36 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}

Step 4: Solve the Overdetermined System Using Least Squares

Straight-line fitting is pretty simple by hand, but polynomial least squares fitting is where it gets kind of difficult. So I’m going to “cheat” and use Python! You can use MATLAB instead if you’d prefer; the language doesn’t really matter once you know the theory.

Here’s a script that uses QR factorization explicitly:

lsq.py
import numpy as np
from numpy import linalg as LA

# Our data
A = np.array([[1, -4, 16], [1, 0, 0], [1, 1, 1], [1, 2, 4], [1, -6, 36]])

y = np.array([[5], [1], [3], [9], [10]])

# QR factorize A
Q, R = LA.qr(A)

# R (theta) = Q^T (y)
QT = np.transpose(Q)

theta = LA.solve(R, QT.dot(y))

print(theta)

However, this is really equivalent to the following code, which simply uses the LA.lstsq function:

lsq.py
import numpy as np
from numpy import linalg as LA

# Our data
A = np.array([[1, -4, 16], [1, 0, 0], [1, 1, 1], [1, 2, 4], [1, -6, 36]])

y = np.array([[5], [1], [3], [9], [10]])

theta = LA.lstsq(A, y)

print(theta)

Regardless of which version we run, we’ll get the same answer for the θ\theta vector:

θ=[1.861059041.809044050.55014058]\theta = \begin{bmatrix} 1.86105904 \\ 1.80904405 \\ 0.55014058 \end{bmatrix}

Plugging this into our model, we arrive at the following polynomial function:

f^(x)=1.86105904+1.80904405x+0.55014058x2\hat{f}(x) =1.86105904+1.80904405x+0.55014058x^{2}

Step 5: Visualize the Solution

And here’s the resulting graph with our polynomial fit to the data:

The best-fit parabola to the data we were given.

Looks like a pretty good fit to me!

Conclusion

That about does it for this series on the least squares method! I hope you found this helpful.

💬 Comments

Post comment
Loading...