----
> [!theorem] Theorem. ([[least squares|linear least squares]])
> Let $\|\cdot\|_{2}$ denote the [[Euclidean inner product|Euclidean norm]], $A \in \mathbb{F}^{M \times N}$, and $y \in \mathbb{F}^{M}$. Let $\hat{x} \in \mathbb{F}^{N}$.
> Then the following are equivalent.
> 1. $\hat{x}=\text{arg min}_{x \in \mathbb{F}^{N}} \|Ax-y\|^{2}_{2}$;
> 2. $\hat{x}$ solves the [[normal equations]] $A'Ax = A'y$;
> 3. $\hat{x}=V_{r}\Sigma_{r}^{-1}U_{r}'y + V_{0}z_{0}$, where $U_{r},V_{r}, V_{0}$ are obtained from the [[compact svd]] of $A$ and $z_{0}$ is *any* vector of the appropriate length;
> 4. $\hat{x} \in A^{+}y + \ker A$.
>
> Moreover, the [[Moore-Penrose pseudoinverse]] solution $\hat{x}=A^{+}y=\left(\sum_{k=1}^{r} \frac{1}{\sigma_{k}}v_{k}u_{k}'\right)y$ always solves $(2)$ and hence is a solution, but it is not the unique one unless $A$ has [[linearly independent]] columns. It does, however, solve the [[bilevel optimization]] problem of selecting the solution with smallest [[norm]]. ($\sigma_{k}$ denotes the $k^{th}$ [[singular values|singular value]] of $A$.)
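As a quick numerical sanity check of the claim that $A^{+}y$ always solves the [[normal equations]], here is a minimal NumPy sketch (the random matrix, seed, and sizes are illustrative assumptions, not part of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] + A[:, 1]      # force linearly dependent columns (rank < 4)
y = rng.standard_normal(6)

x_hat = np.linalg.pinv(A) @ y    # Moore-Penrose solution A^+ y

# A^+ y satisfies the normal equations A'A x = A'y even though A'A is singular
residual = A.T @ A @ x_hat - A.T @ y
```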

> [!NOTE] Note.
> [[kernel and image of the adjoint|Note that]] if $y \in (\ker A')^{\perp}$, then $y \in \im A$ and so an exact solution to $Ax=y$ exists. This is especially useful if $A$ is symmetric, e.g., a [[graph Laplacian]].
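To illustrate the note, a small sketch (NumPy assumed; the path-graph Laplacian is just an example): on a connected graph, the [[graph Laplacian]] $L$ is symmetric with $\ker L$ spanned by the all-ones vector, so any $y$ whose entries sum to zero is orthogonal to $\ker L'$ and $Lx=y$ is solvable exactly.

```python
import numpy as np

# Laplacian of the path graph on 4 vertices (symmetric, ker L = span{1})
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])

y = np.array([1.0, -2.0, 0.5, 0.5])  # entries sum to 0, so y is orthogonal to ker L'
x = np.linalg.pinv(L) @ y            # an exact solution, not just a least-squares fit
exact = np.allclose(L @ x, y)
```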

> [!theorem] Tikhonov Regularized Least Squares
> Set $\beta > 0$. Then the following are equivalent, with $\hat{x}$ unique:
> 1. $\hat{x} = \arg \min_{x \in \mathbb{F}^{N}} \|Ax-y\|_{2}^{2} + \beta \|x\|_{2}^{2}$;
> 2. $\hat{x}=(\tilde{A}'\tilde{A})^{-1}\tilde{A}'\tilde{y}$, where $\tilde{A}:= \begin{bmatrix} A \\ \sqrt{ \beta }I \end{bmatrix}$ and $\tilde{y}:=\begin{bmatrix}y \\ 0\end{bmatrix}$;
> 3. $\hat{x}=(A'A + \beta I)^{-1} A' y$ ([[ordinary least-squares linear regression|compare]]);
> 4. $\hat{x}=V_{r}(\Sigma_{r}^{2} + \beta I_{r})^{-1} \Sigma_{r}U_{r}' y$;
> 5. $\hat{x}=\sum_{k=1}^{r}\frac{\sigma_{k}}{\sigma_{k}^{2} + \beta}v_{k}u_{k}'y$,
>where $U_{r}$, $\Sigma_{r}$, $V_{r}$ are from the [[compact svd]] of $A$ and the $\sigma_{i}$ denote the [[singular values]] of $A$.
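A hedged sketch checking numerically that characterizations $(2)$, $(3)$, and $(5)$ agree (NumPy assumed; the sizes, seed, and $\beta$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
beta = 0.3

# (2) stacked least-squares formulation with A-tilde, y-tilde
A_tilde = np.vstack([A, np.sqrt(beta) * np.eye(5)])
y_tilde = np.concatenate([y, np.zeros(5)])
x2 = np.linalg.lstsq(A_tilde, y_tilde, rcond=None)[0]

# (3) closed form (A'A + beta I)^{-1} A' y
x3 = np.linalg.solve(A.T @ A + beta * np.eye(5), A.T @ y)

# (5) SVD form: sum over k of sigma_k/(sigma_k^2 + beta) v_k u_k' y
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x5 = Vt.T @ ((s / (s**2 + beta)) * (U.T @ y))

agree = np.allclose(x2, x3) and np.allclose(x3, x5)
```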

> [!proposition] Corollary.
> If the columns of $A$ are [[linearly independent]], then $A'A$ is [[inverse matrix|invertible]] by [[columns linearly independent iff gram matrix is invertible]] and hence from $(2)$ we see that $\hat{x}$ is given by the closed form expression $\hat{x}=(A'A)^{-1}A'y.$
> \
> If the columns of $A$ are *not* [[linearly independent]], then $\ker A$ is nontrivial and characterization $(4)$ above shows that infinitely many $\hat{x}$ exist (though $A^{+}y$ is the unique choice minimizing $\|\hat{x}\|_{2}$).
> \
> So uniqueness of $\hat{x}$ holds *iff* the columns of $A$ are [[linearly independent]].
> \
> **Warning.** Forming $(A'A)^{-1}$ explicitly is NOT efficient or numerically stable in code; prefer a QR- or SVD-based least-squares solver.
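To make the warning concrete, a sketch (NumPy assumed): on a well-conditioned problem the closed form agrees with a stable solver, but in code one should call an SVD-based routine such as `np.linalg.lstsq` rather than invert $A'A$, whose condition number is the square of that of $A$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))   # full column rank with probability 1
y = rng.standard_normal(100)

x_stable = np.linalg.lstsq(A, y, rcond=None)[0]   # SVD-based, preferred
x_normal = np.linalg.solve(A.T @ A, A.T @ y)      # normal equations, avoid

same = np.allclose(x_stable, x_normal)
```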

> [!proof]- Proof. ([[least squares]])
> We treat only the real case for now, deriving $(1) \iff (2) \iff (3) \iff (4)$.
> So suppose $(1)$ holds. Since $\|\cdot\|$ and $f(t)=t^{2}$ are [[convex function|convex]], [[vector norm is convex]] and [[convex image of Ax + b is a convex function]] tell us that $\|Ax-y\|_{2}^{2}$ is [[convex function|convex]]. Moreover, it is [[derivative|differentiable]], since $\|Ax-y\|_{2}^{2}=(Ax-y)'(Ax-y)=x'A'Ax-2\overbrace{y'Ax}^{=(A'y)'x}+y'y \ \ (\ast)$
> is a [[quadratic form]]. Thus, using the [[characterization of extrema for differentiable convex functions]] we see that setting the [[gradient|gradient]] of $(\ast)$ equal to $0$ yields an equation whose solutions are [[global extrema|global minimizers]] of the program. By [[derivative of a quadratic form|gradient of a quadratic form]] (note that the [[Gram matrix]] $A'A$ is [[symmetric matrix|symmetric]]), this means setting $2A'Ax - 2A'y=0.$
> Simplifying and rearranging yields the [[normal equations]] $A'Ax=A'y$, thus $(1) \iff (2)$.
>
> Next we show $(2) \iff (3)$. Factoring $A$ using the [[compact svd]] as $A=U_{r}\Sigma_{r}V_{r}'$, we find $\begin{align}
> \|Ax-y\|_2^{2}= & \|U_{r}\Sigma_{r}V_{r}'x-y\|_{2}^{2} \\
> = & \|U_{r}\Sigma_{r}V_{r}'x\|_{2}^{2} - 2\re\big((U_{r}\Sigma_{r}V_{r}'x)'y\big) + \|y\|_{2}^{2} \\
> = & \|\Sigma_{r}V_{r}'x\|_{2}^{2} - 2\re\big((\Sigma_{r}V_{r}'x)'U_{r}'y\big) + \|U_{r}'y\|_{2}^{2} - \|U_{r}'y\|_{2}^{2} + \|y\|_{2}^{2} \\
> = & \|\Sigma_{r}V_{r}'x - U_{r}'y\|_{2}^{2} + \|y\|_{2}^{2} - \|U_{r}'y\|_{2}^{2},
> \end{align}$
> where the second equality is [[polarization identities|just algebra]], the third equality uses [[group-invariant function|unitary invariance]] of [[semi-isometric matrix|semi-unitary matrices]] and [[completing the square]]. Assume $A\neq0$; then $r >0.$ The final two terms are independent of $x$, and so our minimization program becomes $\hat{x} =\text{arg min} \|\Sigma_{r} V_{r}'x - U_{r}'y\|_{2}^{2}.$
> Recall that $\Sigma_{r}$ is [[inverse matrix|invertible]] and $V_{r}'$ has [[orthonormal]] rows.
>
> Clearly, setting $\hat{x}=V_{r}\Sigma_{r}^{-1}U_{r}'y$ yields $0$ above and this choice is therefore a [[global extrema|global minimizer]]. If $A$ has [[linearly independent]] columns ($r=N$), then this minimizer is *unique* since $A'A$ is [[inverse matrix|invertible]] and the [[normal equations]] can be rearranged into a unique solution. Otherwise, $r<N$ and there are other minimizers. It turns out that all minimizers have the form $x=V_{r}\Sigma_{r}^{-1}U_{r}'y + V_{0}z_{0}$ where $z_{0}$ is *any* vector of the appropriate length.
>
> Finally, since $AA^{+}=U_{r}U_{r}'$ and $U_{r}'U_{r}=I_{r}$, notice that $A'AA^{+}=V_{r}\Sigma_{r}U_{r}'(U_{r}U_{r}')=V_{r}\Sigma_{r}U_{r}'=A'$
> implies $A^{+}y$ is *a* solution because it solves the [[normal equations]]: $\begin{align}
> A'AA^{+}y= & A'y.
> \end{align}$
> Hence the [[Moore-Penrose pseudoinverse]] gives a solution $A^{+}y$ (not unique unless $A$ has [[linearly independent]] columns). $A^{+}y$ does, however, have minimal norm. (Should be reminded from [[Statistical Learning Theory MOC.canvas|Statistical Learning Theory MOC]] why this is desirable.) This is straightforward because we showed above that $x=\overbrace{V_{r}\Sigma_{r}^{-1}U_{r}'y}^{A ^{+}y} + V_{0}z_{0}$
> and $V_{0}z_{0}$ ranges over $\ker A$ as $z_{0}$ ranges over all possible inputs (see [[compact svd]]). Since $A^{+}y$ lies in the row space of $A$, which is [[orthogonal]] to $\ker A$, the Pythagorean theorem gives $\|x\|_{2}^{2}=\|A^{+}y\|_{2}^{2}+\|V_{0}z_{0}\|_{2}^{2}\geq \|A^{+}y\|_{2}^{2}$.
>
> **Tikhonov Regularization.** Set $\beta>0$. We want to solve $\arg \min_{x \in \mathbb{F}^{N}} \|Ax-y\|_{2}^{2} + \beta \|x\|_{2}^{2}$. Using the [[stacking property of the Euclidean norm]], rewrite this as $\arg \min_{x \in \mathbb{F}^{N}} \left\| \begin{bmatrix}
> A \\ \sqrt{ \beta }I_{N}
> \end{bmatrix}x - \begin{bmatrix}
> y \\ 0
> \end{bmatrix}\right\|_{2}^{2}=\arg \min_{x \in \mathbb{F}^{N}} \|\tilde{A}x-\tilde{y}\|_{2}^{2}, \ \ (*)$
> where $\tilde{A}:= \begin{bmatrix} A \\ \sqrt{ \beta }I \end{bmatrix}$ and $\tilde{y}:=\begin{bmatrix}y \\ 0\end{bmatrix}$. Noting that $\tilde{A}$ has $N$ [[linearly independent]] rows (those of $\sqrt{ \beta }I$), and hence $N$ [[linearly independent]] columns (full [[column rank]]), we conclude from our work above that the unique solution to $(*)$ is $\hat{x}=(\tilde{A}'\tilde{A})^{-1}\tilde{A}'\tilde{y}.$
> Substituting and performing [[block matrix]] multiplication yields the second expression listed in the theorem: $\hat{x}=(A'A+ \beta I)^{-1} A'y.$
> Next, we derive the SVD expressions using the full [[Singular Value Decomposition of a Matrix|SVD]] $A=U\Sigma V'$. We see that $\begin{align}
> A'A+\beta I = & V\Sigma'U'U\Sigma V'+ \beta I \\
> = & V\Sigma'\Sigma V' + \beta \overbrace{VV'}^{I} \\
> = & V(\Sigma'\Sigma + \beta I)V',
> \end{align}$
> so that $\begin{align}
> \hat{x}= & (A'A+ \beta I)^{-1} A' y \\
> = & V(\Sigma'\Sigma + \beta I)^{-1} V'V \Sigma' U'y \\
> = & V(\Sigma'\Sigma + \beta I)^{-1} \Sigma'U' y \\
> = & V_{r}(\Sigma _{r}^{2} + \beta I_{r})^{-1} \Sigma_{r}U_{r}' y \\
> = & \sum_{k=1}^{r} \frac{\sigma_{k}}{\sigma_{k}^{2} + \beta}v_{k}u_{k}' y,
> \end{align}$
> where we used a [[compact svd]] in the penultimate step. The final two expressions are the ones we wanted to show.
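A numerical sketch of the uniqueness discussion in the proof (NumPy assumed; the rank-deficient $A$ is an illustrative construction): every $x = A^{+}y + V_{0}z_{0}$ attains the same residual, and $A^{+}y$ has the smallest norm among them.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0]                    # rank-deficient: ker A is nontrivial
y = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))           # numerical rank (here r = 3)
x_min = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])   # A^+ y = V_r Sigma_r^{-1} U_r' y
V0 = Vt[r:].T                        # columns span ker A

x_other = x_min + V0 @ rng.standard_normal(4 - r)   # another minimizer

# same residual, since A V_0 z_0 = 0 ...
same_residual = np.isclose(np.linalg.norm(A @ x_min - y),
                           np.linalg.norm(A @ x_other - y))
# ... but A^+ y has the smaller norm, by orthogonality of row space and kernel
minimal_norm = np.linalg.norm(x_min) <= np.linalg.norm(x_other)
```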
----
#### References
> [!backlink]
> ```dataview
> TABLE rows.file.link as "Further Reading"
> FROM [[]]
> FLATTEN file.tags
> GROUP BY file.tags as Tag
> ```

> [!frontlink]
> ```dataview
> TABLE rows.file.link as "Further Reading"
> FROM outgoing([[]])
> FLATTEN file.tags as Tag
> WHERE Tag = "#definition" OR Tag = "#theorem" OR Tag = "#MOC" OR Tag = "#proposition" OR Tag = "#axiom"
> GROUP BY Tag
> ```