The Method of Least Squares

The method of least squares is a technique for solving systems of equations, but it can be difficult for beginners to grasp if not explained well. It’s something that you’ll remember by heart once you understand the intuition behind how it’s derived. In this post, we’ll look at the problem that motivates the least squares method and gain an intuitive understanding for how it works under the hood.

What Is the Least Squares Method?

The method of least squares finds an approximate solution to a system of equations for which there is no exact solution. Let’s look at why we need least squares with this simple example:

x1+x2=2x1+2x2=3x1+3x2=3x1+4x2=5x_1 + x_2 = 2 \\ x_1 + 2x_2 = 3 \\ x_1 + 3x_2 = 3 \\ x_1 + 4x_2 = 5

We have four equations and two unknowns. We could perform row reduction, but since we only have two unknowns in this case (x1x_1 and x2x_2), we can also solve by substitution.

Let’s rewrite the first equation in terms of x1x_1:

x1=2x2x_1 = 2 - x_2

Then, let’s plug this into the second equation:

(x1)+2x2=3(2x2)+2x2=32+x2=3x2=1(x_1) + 2x_2 = 3\\ (2 - x_2) + 2x_2 = 3\\ 2 + x_2 = 3\\ x_2 = 1

We have x2x_2, so now let’s find x1x_1:

x1=2x2x1=1x_1 = 2 - x_2\\ x_1 = 1

At this point, we may be tempted to conclude that our solution is the following:

x1=1,x2=1x_1 = 1, x_2 = 1

This happens to satisfy the first and second equations by design since we used them directly. It also satisfies the fourth equation by pure coincidence. But if you plug x1=1x_1 = 1 and x2=1x_2 = 1 into the third equation, you’ll get a contradiction:

x1+3x2=1+3=43x_1 + 3x_2 = 1 + 3 = 4 \neq 3

So what went wrong?

The truth is that there is no unique solution to this system of equations. We can verify this visually by plotting our equations, which are really just lines in 2D (you can think of x2x_2 as our “yy” and x1x_1 as our “xx,” or vice versa):

The graph of the four equations above.

As this graph illustrates, there is no single point at which all of the lines intersect, meaning there’s no solution that satisfies all four equations simultaneously. What we found is the red dot, (1,1)(1, 1). As we saw earlier, there are three equations for which this is a solution (blue, purple, and red in the graph above).

Overdetermined Systems Don’t Have a Unique Solution

The system of equations above is called an overdetermined system; such a system has more equations (44) than unknowns (22). If we denote the matrix’s dimensions as m×nm \times n, then an overdetermined system is one where m>nm > n. Visually, this looks like a tall and thin matrix.

Here’s the matrix form of our system above:

Ax=b:[11121314][x1x2]=[1335]Ax = b: \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \end{bmatrix}

As an exercise, try to find the inverse of the matrix AA. (Spoiler: It’s not possible!) This is because AA is not a square matrix. In general, the matrix of an overdetermined system is not invertible, meaning we can’t solve for xx using the following traditional method:

x=A1bx = A^{-1}b

This leads us to an important conclusion: An overdetermined system does not have a single, unique solution.

As a direct consequence of this, an overdetermined system can either have:

  • Infinitely many solutions, or
  • No solutions at all (like in our example).

Here’s an example where there are infinitely many solutions, if you’re wondering how that’s possible:

x1+x2=32x1+2x2=63x1+3x2=9x_1 + x_2 = 3\\ 2x_1 + 2x_2 = 6\\ 3x_1 + 3x_2 = 9

This is an overdetermined system because there are three equations but only two unknowns. Notice that the second and third equations are just scaled up (×2\times 2 and ×3\times 3) versions of the first one. This means that they’re all the same line, and thus there are infinitely many points of “intersection.”

How Do We Solve Overdetermined Systems?

So far, we’ve seen that we can’t solve an overdetermined system in the traditional sense. So what do we do? Let’s look at the general form of an overdetermined system Ax=bAx = b. As a matrix equation, it looks like this:

[a11a12a1na21a22a2nam1am2amn][x1x2xn]=[b1b2bm] \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \dots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}

As we’ve seen, this system has no unique solution—either there’s no solution whatsoever, or there are infinitely many solutions. From a computational standpoint, neither of those are simple problems to solve. We like working with rectangular systems, where m=nm = n, because they are computationally simpler to solve (relatively speaking).

But we don’t just throw up our arms in defeat if we’re given an overdetermined system. Instead, we try to do our best—to find a value for xx such that no other xx will get us closer to approximating bb. This is known as the least squares solution to Ax=bAx = b. It’s not an actual solution like what we’re used to. Rather, a least squares solution is about as close to a “real” solution as we can hope to get.

Least Squares Visualized

Before we look at the generalized graph of the least squares method, let’s look at a simple example that you’ll come across frequently in statistics classes on least squares regression. Suppose you’re given this two-dimensional data set of (x,y)(x, y) pairs:

(1,1),(2,3),(3,3),(4,5),(6,4)(1, 1), (2, 3), (3, 3), (4, 5), (6, 4)

Don’t confuse these xx variables with the ones we saw earlier—just pretend for a second that you’re back in algebra class and learning how to plot in the xyxy-plane.

If we plot these data points, we’ll get the following graph:

Plotting the five data points we were given.

These points appear to follow a linear shape, but it’s not possible to plot a straight line that fits all of the points. But clearly, we can draw a best-fit line that at least gets as close to all of the points as possible:

A best-fit line through the points we plotted.

We won’t concern ourselves right now with how that equation was actually obtained. Let’s just represent it in its most generic form:

y=θ1(x)+θ2y = \theta_1(x) + \theta_2

We were given several (x,y)(x, y) pairs, so we can plug those into the above equation to get a set of equations that we want to solve in order to find the values for θ1\theta_1 and θ2\theta_2:

1=θ1(1)+θ23=θ1(2)+θ23=θ1(3)+θ25=θ1(4)+θ24=θ1(6)+θ21 = \theta_1(1) + \theta_2\\ 3 = \theta_1(2) + \theta_2\\ 3 = \theta_1(3) + \theta_2\\ 5 = \theta_1(4) + \theta_2\\ 4 = \theta_1(6) + \theta_2

Our goal is to find values for θ1\theta_1 and θ2\theta_2 that will minimize the error of estimating the trend in the data with the imperfect equation y=θ1x+θ2y = \theta_1x + \theta_2. In other words, this is a least squares problem—we can’t get an exact solution, so we try to find a “solution” that’s as good as possible.

Let’s represent this problem in matrix form:

[1121314161][θ1θ2]=[13354]\begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \\ 4 & 1 \\ 6 & 1 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \\ 4 \end{bmatrix}

This was really an overdetermined system, Aθ=bA\theta = b, disguised as a bunch of (x,y)(x, y) data pairs! We have more equations than unknowns. This alone tells us that there is no unique solution and that we must find a least squares solution instead.

Let’s generalize this problem. We start with an m×nm \times n system:

[a11a12a1na21a22a2nam1am2amn][x1x2xn]=[b1b2bm]\begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \dots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}

And we consider the case where n=2n = 2—that is, our solution will have just two components, x1x_1 and x2x_2 (these were θ1\theta_1 and θ2\theta_2 in the concrete example above):

[a11a12a21a22am1am2][x1x2]=[b1b2bm]\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ \vdots & \vdots \\ a_{m1} & a_{m2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}

If n=2n = 2 like it is here, AxAx ends up being a plane in a 3D space:

A graph of Ax=b.

Notice that the vector bb falls outside this plane. In plain English, this means that there’s no xx such that Ax=bAx = b. If we imagine for a second that we live in the 3D space of this graph, our walking surface would be limited to the plane itself. In other words, we would have no way of reaching bb; the best that we could do is to navigate the plane itself.

Okay, so we can’t get to bb… Bummer. But can we get close? We sure can! And in fact, geometrically, that’s exactly what the method of least squares does—it finds the point in the plane AxAx that’s closest to bb. From the image, we see that the closest point to bb is right under it—where the orthogonal projection of bb onto the plane actually touches the plane.

Let’s define two things:

  • x^\hat{x} is the point in the plane AxAx that, when plugged in for xx, gets us closest to bb. It’s what we call the least squares solution to Ax=bAx = b. Visually, it’s on the plane AxAx and directly under bb.
  • r=bAx^r = b - A\hat{x} is the residual vector, which is orthogonal to the plane AxAx as you can see above.

Now, let’s revisit our matrix equation, Ax=bAx = b. We’ll rewrite it using AA’s column vectors, a1a_1 and a2a_2. Note that this changes absolutely nothing:

[a1a2][x1x2]=[b1b2]\begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}

As we saw in the image earlier, the vector rr is clearly orthogonal to the plane AxAx. By definition, this means that rr must be orthogonal to all vectors in the plane. And we know two vectors that lie in the plane: a1a_1 and a2a_2.

Now, if two vectors are orthogonal, then their dot product is zero. So let’s write that out explicitly (using the matrix notation for a dot product):

a1Tr=0a2Tr=0a_1^T r = 0\\ a_2^T r = 0

These two are simultaneously true. We can express this using the following equivalent notation:

[a1a2]Tr=ATr=0\begin{bmatrix} a_1 & a_2 \end{bmatrix}^T r = A^T r = 0

Time to start plugging some things in. Recall from above that we defined rr to be bAx^b - A\hat{x}:

ATr=0AT(bAx^)=0ATbATAx^=0A^T r = 0\\ A^T(b - A\hat{x}) = 0\\ A^Tb - A^TA\hat{x} = 0

Now move ATAx^A^TA\hat{x} to the other side. Doing so gives us this important equation:

ATAx^=ATbA^TA\hat{x} = A^Tb

At this point, you may have two related questions:

1. Why didn’t we just multiply both sides of Ax=bAx = b with ATA^T and replace xx with x^\hat{x} in the first place?

Well, sure—we could’ve done that. And in fact, now that we know that ATAx^=ATbA^TA\hat{x} = A^Tb gives us the least squares solution x^\hat{x}, this is indeed what we’ll do in the future. In hindsight, though, we had no good reason to do so from the get-go. We only arrived at this trivial conclusion after considering the geometry of the problem, and by working through some algebraic substitution and simplification.

2. How does this actually change the problem?

This is a good question! After all, we multiplied both sides by the same quantity, ATA^T.

Recall that the matrix AA of an overdetermined system Ax=bAx = b is tall and thin (m>nm > n), and is therefore not invertible. How does multiplying both sides of the equation by AA’s transpose change the situation?

Well, if we do that, here’s what will happen to the dimensions of the problem:

The dimensions of A^TAx = A^Tb

Since this is a rectangular n×nn \times n system, we can solve for it by multiplying both sides by the inverse of ATAA^TA:

ATAx^=ATbx^=(ATA)1ATbA^TA\hat{x} = A^Tb\\ \hat{x} = (A^TA)^{-1}A^Tb

The Pseudoinverse of a Matrix

And on that note, it’s time for a quick definition.

This is the least squares solution to Ax=bAx = b when it’s an overdetermined system:

x^=(ATA)1ATb\hat{x} = (A^TA)^{-1}A^Tb

This is the solution to Ax=bAx = b when it’s a rectangular system:

x=(A1)bx = (A^{-1})b

Notice the similarity? For this reason, we call (ATA)1AT(A^TA)^{-1}A^T the pseudo-inverse of AA.

Least Squares and QR Decomposition

In practice, when working with floating-point systems, you should never compute the inverse of a matrix; doing so can result in a considerable loss of significance.

With a rectangular system Ax=bAx = b, it’s true that the solution is x=A1bx = A^{-1}b, but that’s not how we actually compute it. Instead, we use a process known as Gaussian Elimination, which avoids computing A1A^{-1} altogether.

Similarly, with an overdetermined system Ax=bAx = b, it’s again true that the least squares solution is x=(ATA)1ATbx = (A^TA)^{-1}A^Tb. But we shouldn’t use this directly. In practice, what is often done instead is to first rewrite AA as a product of two special matrices to avoid the need for computing AA’s pseudoinverse. Those two matrices are:

  • QQ, an m×nm \times n orthonormal matrix.
  • RR, an n×nn \times n upper-triangular matrix.

This process is known as QR factorization (or “decomposition”). It’s exactly analogous to the process of factorizing numbers that you’re already familiar with—for example, that 8=4×28 = 4 \times 2. Except now, it’s just with matrices: A=QRA = QR. QR factorization is what we call a “numerically stable algorithm,” which is desirable because it helps us minimize the loss of significance due to our computations.

There’s a pretty good YouTube tutorial on QR factorization if you’re interested in learning more about it; I’ll leave that up to you as an exercise since the explanation would require its own blog post.

But here’s the important takeaway: Once you perform QR factorization and get A=QRA = QR, you can substitute this into the equation we saw earlier.

ATAx=ATb(QR)T(QR)x=(QR)TbA^TAx = A^Tb \\ (QR)^T(QR)x = (QR)^Tb

Recall that (XY)T=YTXT(XY)^T = Y^TX^T:


Next, we can use an important property of orthonormal matrices. If QQ is an orthonormal matrix, then QTQ=IQ^TQ = I, the identity matrix. And that simplifies the above quite a bit:


We have an RTR^T on both sides, so those cancel and leave us with this:

Rx=QTbRx = Q^Tb

Now, we have an upper-triangular, rectangular system of equations that can easily be solved via back-substitution. What are the dimensions here?

  • RR is an upper-triangular n×nn \times n matrix.
  • xx is an nn-vector.
  • QQ is an m×nm \times n matrix (and thus QTQ^T is n×mn \times m).
  • bb is an mm-vector.

Once again, the dimensions work out to produce an n×nn \times n system. Notice how this system is conveniently set up such that all the computer has to do is work from the bottom up, first solving for xnx_n, then substituting that in the equation above and solving for xn1x_{n-1}, and so on, all the way up until x1x_1:

Visualization of the equation Rx = Q^Tb

And that’s a much more numerically stable process than using AA’s pseudo-inverse directly. In fact, notice that factorizing AA as QRQR helped us to avoid computing the pseudo-inverse altogether!


Here’s the gist of what we covered in this post:

  • Least squares finds the best approximation to a system of equations for which a unique solution does not exist.
  • You find the least squares solution by solving for x^\hat{x} in ATAx^=ATbA^TA\hat{x} = A^Tb.
  • You should use QR decomposition to rewrite AA as the product of two matrices: A=QRA = QR.
  • Plug A=QRA = QR into ATAx^=ATbA^TA\hat{x} = A^Tb and find x^\hat{x} with back-substitution in Rx^=QTbR\hat{x} = Q^Tb.

And that’s it! Hopefully things are starting to make a little more sense now.

In another post, we’ll look at practical least squares applications and solve least squares data fitting problems by hand (and with Python!).


Social media preview: Photo by Kelly Sikkema (Unsplash).

Comment system powered by the GitHub Issues API. You can learn more about how I built it or post a comment over on GitHub, and it'll show up below once you reload this page.