The method of least squares is a simple and elegant technique, but itâ€™s often explained poorly. Itâ€™s something that youâ€™ll remember by heart if you understand the intuitionâ€”you wonâ€™t have to memorize a single equation.

In this comprehensive post, weâ€™ll look at the problem that motivates the least squares method and gain an intuitive understanding for how it works under the hood.

Thereâ€™s a lot of theory coming up, but Iâ€™ll break it down into simple explanations.

##
What Is the Least Squares Method

In simple terms, the **method of least squares** finds an approximate solution to a system of equations for which there is no exact solution. Letâ€™s look at why we need least squares with this simple example:

We have four equations and two unknowns. We could perform row reduction, but since we only have two unknowns in this case ( and ), we can also solve by substitution.

At this point, you should be asking yourself: Which two pairs of equations should we use to solve for and ? Hang onto that thoughtâ€”itâ€™s part of the issue of having more equations than unknowns. If we just had two equations and two unknowns, then we wouldnâ€™t run into this problem.

Letâ€™s rewrite the first equation in terms of :

Then, letâ€™s plug this into the second equation:

We have , so now letâ€™s find :

At this point, we may be tempted to conclude that our solution is the following:

This happens to satisfy the first and second equations by design since we used them directly. It also satisfies the fourth equation by pure coincidence. But if you plug and into the third equation, youâ€™ll get a contradiction:

So what went wrong?

The truth is that **there is no unique solution to this system of equations**. We can verify this visually by plotting our equations, which are really just lines in 2D (you can think of as our â€śâ€ť and as our â€ś,â€ť or vice versa):

Clearly, there is no single point at which all of the lines intersect, meaning thereâ€™s no solution that satisfies all four equations simultaneously. What we found is the red dot, . As we saw earlier, there are three equations for which this is a solution (blue, purple, and red in the graph above).

##
Overdetermined Systems Don't Have a Unique Solution

The system of equations above is called an **overdetermined system**; such a system has more equations () than unknowns (). If we denote the matrixâ€™s dimensions as , then an overdetermined system is one where . Visually, this looks like a tall and thin matrix.

Hereâ€™s the matrix form of our system above:

As an exercise, try to find the inverse of the matrix . Youâ€™ll notice that itâ€™s not possible! This is simply because is not a square matrix. In general, **the matrix of an overdetermined system is not invertible**, meaning we canâ€™t solve for using the following traditional method:

This leads us to an important conclusion: An overdetermined system does not have a single, unique solution.

Takeaway: If is an overdetermined system, then is not an invertible matrix. Consequently, thereâ€™s no unique solution .

As a direct consequence of this, an overdetermined system can either have:

- Infinitely many solutions, or
- No solutions at all (like in our example).

Hereâ€™s an example where there are infinitely many solutions, if youâ€™re wondering how thatâ€™s possible:

This is an overdetermined system because there are three equations but only two unknowns. Notice that the second and third equations are just scaled up ( and ) versions of the first one. This means that theyâ€™re all the same line, and thus there are infinitely many points of â€śintersection.â€ť

##
How Do We Solve Overdetermined Systems?

So far, weâ€™ve seen that we canâ€™t solve an overdetermined system in the traditional sense. So what do we do? Letâ€™s look at the general form of an overdetermined system . As a matrix equation, it looks like this:

Note: Although this system looks rectangular because of my notation, keep in mind that .

As weâ€™ve seen, this system has no *unique* solutionâ€”either thereâ€™s no solution whatsoever, or there are infinitely many solutions. From a *computational* standpoint, neither of those are simple problems to solve. We like working with rectangular systems, where , because they are computationally simpler to solve (relatively speaking).

Takeaway: Overdetermined systems present a computationally difficult problem to solve because they have either infinitely many solutions or no solutions at all.

But we donâ€™t just throw up our arms in defeat if weâ€™re given an overdetermined system. Instead, we try to do our bestâ€”to find a value for such that no other will get us closer to approximating . This is known as the **least squares solution** to . Itâ€™s not an *actual* solution like what weâ€™re used to. Rather, a least squares solution is about as close to a â€śrealâ€ť solution as we can hope to get.

Takeaway: A least squares solution is anapproximatesolution to a system of equations for which a unique solution does not exist.

##
Least Squares Visualized

Before we look at the generalized graph of the least squares method, letâ€™s look at a simple example that youâ€™ll come across frequently in statistics classes on least squares regression. Suppose youâ€™re given this two-dimensional data set of pairs:

Donâ€™t confuse these s with the ones we saw earlierâ€”just pretend for a second that youâ€™re back in algebra class and learning how to plot in the -plane.

If we plot these data points, weâ€™ll get the following graph:

These points appear to follow a linear shape, but itâ€™s simply not possible to plot a straight line that fits all of the points. But clearly, we can draw a **best-fit line** that at least gets as close to all of the points as possible:

Spoiler: The solution is .

We wonâ€™t concern ourselves right now with how that equation was actually obtained. Letâ€™s just represent it in its most generic form:

We were given several pairs, so we can plug those into the above equation to get a *set of equations* that we want to solve in order to find the values for and :

Our goal is to find values for and that will minimize the error of estimating the trend in the data with the imperfect equation . In other words, this is a least squares problemâ€”we canâ€™t get an exact solution, so we try to find a â€śsolutionâ€ť thatâ€™s as good as possible.

Letâ€™s represent this problem in matrix form:

This was really an overdetermined system, , disguised as a bunch of data pairs! We have more equations than unknowns. This alone tells us that there is no unique solution and that we must find a least squares solution instead.

Curious how I plotted the line for this particular problem? Youâ€™ll learn more about least squares curve fitting in the next post in this two-part series.

Letâ€™s generalize this problem. We start with an system:

And we consider the case where â€”that is, our solution will have just two components, and (these were and in the concrete example above):

If like it is here, ends up being a plane in a 3D space:

Notice that the vector falls outside this plane. In plain English, this means that thereâ€™s no such that . If we imagine for a second that we live in the 3D space of this graph, our walking surface would be limited to the plane itself. In other words, we would have no way of reaching ; the best that we could do is to navigate the plane itself.

Okay, so we canâ€™t get to â€¦ Bummer. But can we get close? We sure can! And in fact, geometrically, thatâ€™s exactly what the method of least squares doesâ€”it finds the point in the plane thatâ€™s closest to . From the image, we see that the closest point to is right under itâ€”where the orthogonal projection of onto the plane actually touches the plane.

Letâ€™s define two things:

- is the point in the plane that, when plugged in for , gets us closest to . Itâ€™s what we call the least squares solution to . Visually, itâ€™s on the plane and directly under .
- is the residual vector, which is orthogonal to the plane as you can see above.

Now, letâ€™s revisit our matrix equation, . Weâ€™ll rewrite it using â€™s column vectors, and . Note that this changes absolutely nothing:

As we saw in the image earlier, the vector is clearly orthogonal to the plane . By definition, this means that must be orthogonal to all vectors in the plane. And we know two vectors that lie in the plane: and .

Now, if two vectors are orthogonal, then their dot product is zero. So letâ€™s write that out explicitly (using the matrix notation for a dot product):

These two are simultaneously true. We can express this using the following equivalent notation:

Time to start plugging some things in. Recall from above that we defined to be :

Now move to the other side. Doing so gives us this important equation:

Summary: The here is known as theleast squares solutionto the system . Itâ€™s the point of intersection between the plane and the orthogonal projection of onto that plane.

At this point, you may have two related questions:

### 1. Why didnâ€™t we just multiply both sides of with and replace with in the first place?

Well, sureâ€”we couldâ€™ve done that. And in fact, now that we know that gives us the least squares solution , this is indeed what weâ€™ll do in the future. In hindsight, though, we had no good reason to do so from the get-go. We only arrived at this trivial conclusion after considering the geometry of the problem, and by working through some algebraic substitution and simplification.

### 2. How does this actually change the problem?

This is a good question! After all, we simply multiplied both sides by the same quantity, .

Recall that the matrix of an overdetermined system is tall and thin (), and is therefore not invertible. How does multiplying both sides of the equation by â€™s transpose change the situation?

Well, if we do that, hereâ€™s what will happen to the dimensions of the problem:

Since this is a rectangular system, we can solve for it by multiplying both sides by the inverse of :

###
The Pseudoinverse of a Matrix

And on that note, itâ€™s time for a quick definition.

This is the least squares solution to when itâ€™s an overdetermined system:

This is the solution to when itâ€™s a rectangular system:

Notice the similarity? For this reason, we call the **pseudo-inverse** of .

##
Least Squares and QR Decomposition

In practice, when working with floating-point systems, you should **never** compute the inverse of a matrix; doing so can result in a considerable loss of significance.

With a rectangular system , itâ€™s true that the solution is , but thatâ€™s not how we actually compute it. Instead, we use a process known as Gaussian Elimination, which avoids computing altogether.

Similarly, with an overdetermined system , itâ€™s again true that the least squares solution is . But we shouldnâ€™t use this directly. In practice, what is often done instead is to first rewrite as a product of two special matrices to avoid the need for computing â€™s pseudoinverse. Those two matrices are:

- , an orthonormal matrix.
- , an upper-triangular matrix.

This process is known as QR factorization (or â€śdecompositionâ€ť). Itâ€™s exactly analogous to the process of factorizing numbers that youâ€™re already familiar withâ€”for example, that . Except now, itâ€™s just with matrices: . QR factorization is what we call a â€śnumerically stable algorithm,â€ť which is desirable because it helps us minimize the loss of significance due to our computations.

Thereâ€™s a pretty good YouTube tutorial on QR factorization if youâ€™re interested in learning more about it; Iâ€™ll leave that up to you as an exercise since the explanation would require its own blog post.

But hereâ€™s the important takeaway: Once you perform QR factorization and get , you can substitute this into the equation we saw earlier.

Recall that :

Next, we can use an important property of orthonormal matrices. If is an orthonormal matrix, then , the identity matrix. And that simplifies the above quite a bit:

We have an on both sides, so those cancel and leave us with this:

Now, we have an upper-triangular, rectangular system of equations that can easily be solved via back-substitution. What are the dimensions here?

- is an upper-triangular matrix.
- is an -vector.
- is an matrix (and thus is ).
- is an -vector.

Once again, the dimensions work out to produce an system. Notice how this system is conveniently set up such that all the computer has to do is work from the bottom up, first solving for , then substituting that in the equation above and solving for , and so on, all the way up until :

And thatâ€™s a much more numerically stable process than using â€™s pseudo-inverse directly. In fact, notice that factorizing as helped us to avoid computing the pseudo-inverse altogether!

##
Summary

Hereâ€™s the gist of what we covered in this post:

- Least squares finds the best approximation to a system of equations for which a unique solution does not exist.
- You find the least squares solution by solving for in .
- You should use QR decomposition to rewrite as the product of two matrices: .
- Plug into and find with back-substitution in .

And thatâ€™s it! Hopefully things are starting to make a little more sense now.

In the next post, weâ€™ll look at practical least squares applications and solve least squares data fitting problems by hand (and with Python!).

## đź’¬ Comments

Post comment