----
> [!theorem] Theorem. ([[least squares|linear least squares]])
> Let $\|\cdot\|_{2}$ denote the [[Euclidean inner product|Euclidean norm]], $A \in \mathbb{F}^{M \times N}$, and $y \in \mathbb{F}^{M}$. Let $\hat{x} \in \mathbb{F}^{N}$.
> Then the following are equivalent.
> 1. $\hat{x}=\text{arg min}_{x \in \mathbb{F}^{N}} \|Ax-y\|^{2}_{2}$;
> 2. $\hat{x}$ solves the [[normal equations]] $A'Ax = A'y$;
> 3. $\hat{x}=V_{r}\Sigma_{r}^{-1}U_{r}'y + V_{0}z_{0}$, where $U_{r},V_{r}, V_{0}$ are obtained from the [[compact svd]] of $A$ and $z_{0}$ is *any* vector of the appropriate length;
> 4. $\hat{x} \in A^{+}y + \ker A$.
>
> Moreover, the [[Moore-Penrose pseudoinverse]] solution $\hat{x}=A^{+}y=\left(\sum_{k=1}^{r} \frac{1}{\sigma_{k}}v_{k}u_{k}'\right)y$ always solves $(2)$ and hence is a solution, but it is not the unique one unless $A$ has [[linearly independent]] columns. It does, however, solve the [[bilevel optimization]] problem of selecting the solution with smallest [[norm]]. ($\sigma_{k}$ denotes the $k^{th}$ [[singular values|singular value]] of $A$.)
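As a quick numerical sanity check of the claim that $A^{+}y$ always solves the [[normal equations]], here is a minimal NumPy sketch (the random matrix, seed, and sizes are illustrative assumptions, not part of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] + A[:, 1]      # force linearly dependent columns (rank < 4)
y = rng.standard_normal(6)

x_hat = np.linalg.pinv(A) @ y    # Moore-Penrose solution A^+ y

# A^+ y satisfies the normal equations A'A x = A'y even though A'A is singular
residual = A.T @ A @ x_hat - A.T @ y
```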

> [!NOTE] Note.
> [[kernel and image of the adjoint|Note that]] if $y \in (\ker A')^{\perp}$, then $y \in \im A$ and so an exact solution to $Ax=y$ exists. This is especially useful if $A$ is symmetric, e.g., a [[graph Laplacian]].
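To illustrate the note, a small sketch (NumPy assumed; the path-graph Laplacian is just an example): on a connected graph, the [[graph Laplacian]] $L$ is symmetric with $\ker L$ spanned by the all-ones vector, so any $y$ whose entries sum to zero is orthogonal to $\ker L'$ and $Lx=y$ is solvable exactly.

```python
import numpy as np

# Laplacian of the path graph on 4 vertices (symmetric, ker L = span{1})
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])

y = np.array([1.0, -2.0, 0.5, 0.5])  # entries sum to 0, so y is orthogonal to ker L'
x = np.linalg.pinv(L) @ y            # an exact solution, not just a least-squares fit
exact = np.allclose(L @ x, y)
```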

> [!theorem] Tikhonov Regularized Least Squares
> Set $\beta > 0$. Then the following are equivalent, with $\hat{x}$ unique:
> 1. $\hat{x} = \arg \min_{x \in \mathbb{F}^{N}} \|Ax-y\|_{2}^{2} + \beta \|x\|_{2}^{2}$;
> 2. $\hat{x}=(\tilde{A}'\tilde{A})^{-1}\tilde{A}'\tilde{y}$, where $\tilde{A}:= \begin{bmatrix} A \\ \sqrt{ \beta }I \end{bmatrix}$ and $\tilde{y}:=\begin{bmatrix}y \\ 0\end{bmatrix}$;
> 3. $\hat{x}=(A'A + \beta I)^{-1} A' y$ ([[ordinary least-squares linear regression|compare]]);
> 4. $\hat{x}=V_{r}(\Sigma_{r}^{2} + \beta I_{r})^{-1} \Sigma_{r}U_{r}' y$;
> 5. $\hat{x}=\sum_{k=1}^{r}\frac{\sigma_{k}}{\sigma_{k}^{2} + \beta}v_{k}u_{k}'y$,
>where $U_{r}$, $\Sigma_{r}$, $V_{r}$ are from the [[compact svd]] of $A$ and the $\sigma_{i}$ denote the [[singular values]] of $A$.
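A hedged sketch checking numerically that characterizations $(2)$, $(3)$, and $(5)$ agree (NumPy assumed; the sizes, seed, and $\beta$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
beta = 0.3

# (2) stacked least-squares formulation with A-tilde, y-tilde
A_tilde = np.vstack([A, np.sqrt(beta) * np.eye(5)])
y_tilde = np.concatenate([y, np.zeros(5)])
x2 = np.linalg.lstsq(A_tilde, y_tilde, rcond=None)[0]

# (3) closed form (A'A + beta I)^{-1} A' y
x3 = np.linalg.solve(A.T @ A + beta * np.eye(5), A.T @ y)

# (5) SVD form: sum over k of sigma_k/(sigma_k^2 + beta) v_k u_k' y
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x5 = Vt.T @ ((s / (s**2 + beta)) * (U.T @ y))

agree = np.allclose(x2, x3) and np.allclose(x3, x5)
```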

> [!proposition] Corollary.
> If the columns of $A$ are [[linearly independent]], then $A'A$ is [[inverse matrix|invertible]] by [[columns linearly independent iff gram matrix is invertible]] and hence from $(2)$ we see that $\hat{x}$ is given by the closed form expression $\hat{x}=(A'A)^{-1}A'y.$
> \
> If the columns of $A$ are *not* [[linearly independent]], then $\ker A$ is nontrivial and characterization $(4)$ above shows that infinitely many $\hat{x}$ exist (though $A^{+}y$ is the unique choice minimizing $\|\hat{x}\|_{2}$).
> \
> So uniqueness of $\hat{x}$ holds *iff* the columns of $A$ are [[linearly independent]].
> \
> **Warning.** Forming $(A'A)^{-1}$ explicitly is NOT efficient or numerically stable in code; prefer a QR- or SVD-based least-squares solver.
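To make the warning concrete, a sketch (NumPy assumed): on a well-conditioned problem the closed form agrees with a stable solver, but in code one should call an SVD-based routine such as `np.linalg.lstsq` rather than invert $A'A$, whose condition number is the square of that of $A$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))   # full column rank with probability 1
y = rng.standard_normal(100)

x_stable = np.linalg.lstsq(A, y, rcond=None)[0]   # SVD-based, preferred
x_normal = np.linalg.solve(A.T @ A, A.T @ y)      # normal equations, avoid

same = np.allclose(x_stable, x_normal)
```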

> [!proof]- Proof. ([[least squares]])
> We treat only the real case for now, deriving $(1) \iff (2) \iff (3) \iff (4)$.
> So suppose $(1)$ holds. Since $\|\cdot\|$ and $f(t)=t^{2}$ are [[convex function|convex]], [[vector norm is convex]] and [[convex image of Ax + b is a convex function]] tell us that $\|Ax-y\|_{2}^{2}$ is [[convex function|convex]]. Moreover, it is [[derivative|differentiable]], since $\|Ax-y\|_{2}^{2}=(Ax-y)'(Ax-y)=x'A'Ax-2\overbrace{y'Ax}^{=(A'y)'x}+y'y \ \ (\ast)$
> is a [[quadratic form]]. Thus, using the [[characterization of extrema for differentiable convex functions]] we see that setting the [[gradient|gradient]] of $(\ast)$ equal to $0$ yields an equation whose solutions are [[global extrema|global minimizers]] of the program. By [[derivative of a quadratic form|gradient of a quadratic form]] (note that the [[Gram matrix]] $A'A$ is [[symmetric matrix|symmetric]]), this means setting $2A'Ax - 2A'y=0.$
> Simplifying and rearranging yields the [[normal equations]] $A'Ax=A'y$, thus $(1) \iff (2)$.
>
> Next we show $(2) \iff (3)$. Factoring $A$ using the [[compact svd]] as $A=U_{r}\Sigma_{r}V_{r}'$, we find $\begin{align}
> \|Ax-y\|_2^{2}= & \|U_{r}\Sigma_{r}V_{r}'x-y\|_{2}^{2} \\
> = & \|U_{r}\Sigma_{r}V_{r}'x\|_{2}^{2} - 2\re\big((U_{r}\Sigma_{r}V_{r}'x)'y\big) + \|y\|_{2}^{2} \\
> = & \|\Sigma_{r}V_{r}'x\|_{2}^{2} - 2\re\big((\Sigma_{r}V_{r}'x)'U_{r}'y\big) + \|U_{r}'y\|_{2}^{2} - \|U_{r}'y\|_{2}^{2} + \|y\|_{2}^{2} \\
> = & \|\Sigma_{r}V_{r}'x - U_{r}'y\|_{2}^{2} + \|y\|_{2}^{2} - \|U_{r}'y\|_{2}^{2},
> \end{align}$
> where the second equality is [[polarization identities|just algebra]], the third equality uses [[group-invariant function|unitary invariance]] of [[semi-isometric matrix|semi-unitary matrices]] and [[completing the square]]. Assume $A\neq0$; then $r >0.$ The final two terms are independent of $x$, and so our minimization program becomes $\hat{x} =\text{arg min} \|\Sigma_{r} V_{r}'x - U_{r}'y\|_{2}^{2}.$
> Recall that $\Sigma_{r}$ is [[inverse matrix|invertible]] and $V_{r}'$ has [[orthonormal]] rows.
>
> Clearly, setting $\hat{x}=V_{r}\Sigma_{r}^{-1}U_{r}'y$ yields $0$ above and this choice is therefore a [[global extrema|global minimizer]]. If $A$ has [[linearly independent]] columns ($r=N$), then this minimizer is *unique* since $A'A$ is [[inverse matrix|invertible]] and the [[normal equations]] can be rearranged into a unique solution. Otherwise, $r<N$ and there are other minimizers. It turns out that all minimizers have the form $x=V_{r}\Sigma_{r}^{-1}U_{r}'y + V_{0}z_{0}$ where $z_{0}$ is *any* vector of the appropriate length.
>
> Finally, since $AA^{+}=U_{r}U_{r}'$ and $U_{r}'U_{r}=I_{r}$, notice that $A'AA^{+}=V_{r}\Sigma_{r}U_{r}'(U_{r}U_{r}')=V_{r}\Sigma_{r}U_{r}'=A'$
> implies $A^{+}y$ is *a* solution because it solves the [[normal equations]]: $\begin{align}
> A'AA^{+}y= & A'y.
> \end{align}$
> Hence the [[Moore-Penrose pseudoinverse]] gives a solution $A^{+}y$ (not unique unless $A$ has [[linearly independent]] columns). $A^{+}y$ does, however, have minimal norm. (Should be reminded from [[Statistical Learning Theory MOC.canvas|Statistical Learning Theory MOC]] why this is desirable.) This is straightforward because we showed above that $x=\overbrace{V_{r}\Sigma_{r}^{-1}U_{r}'y}^{A ^{+}y} + V_{0}z_{0}$
> and $V_{0}z_{0}$ ranges over $\ker A$ as $z_{0}$ ranges over all possible inputs (see [[compact svd]]). Since $A^{+}y$ lies in the row space of $A$, which is [[orthogonal]] to $\ker A$, the Pythagorean theorem gives $\|x\|_{2}^{2}=\|A^{+}y\|_{2}^{2}+\|V_{0}z_{0}\|_{2}^{2}\geq \|A^{+}y\|_{2}^{2}$.
>
> **Tikhonov Regularization.** Set $\beta>0$. We want to solve $\arg \min_{x \in \mathbb{F}^{N}} \|Ax-y\|_{2}^{2} + \beta \|x\|_{2}^{2}$. Using the [[stacking property of the Euclidean norm]], rewrite this as $\arg \min_{x \in \mathbb{F}^{N}} \left\| \begin{bmatrix}
> A \\ \sqrt{ \beta }I_{N}
> \end{bmatrix}x - \begin{bmatrix}
> y \\ 0
> \end{bmatrix}\right\|_{2}^{2}=\arg \min_{x \in \mathbb{F}^{N}} \|\tilde{A}x-\tilde{y}\|_{2}^{2}, \ \ (*)$
> where $\tilde{A}:= \begin{bmatrix} A \\ \sqrt{ \beta }I \end{bmatrix}$ and $\tilde{y}:=\begin{bmatrix}y \\ 0\end{bmatrix}$. Noting that $\tilde{A}$ has $N$ [[linearly independent]] rows (those of $\sqrt{ \beta }I$), and hence $N$ [[linearly independent]] columns (full [[column rank]]), we conclude from our work above that the unique solution to $(*)$ is $\hat{x}=(\tilde{A}'\tilde{A})^{-1}\tilde{A}'\tilde{y}.$
> Substituting and performing [[block matrix]] multiplication yields the second expression listed in the theorem: $\hat{x}=(A'A+ \beta I)^{-1} A'y.$
> Next, we derive the SVD expressions using the full [[Singular Value Decomposition of a Matrix|SVD]] $A=U\Sigma V'$. We see that $\begin{align}
> A'A+\beta I = & V\Sigma'U'U\Sigma V'+ \beta I \\
> = & V\Sigma'\Sigma V' + \beta \overbrace{VV'}^{I} \\
> = & V(\Sigma'\Sigma + \beta I)V',
> \end{align}$
> so that $\begin{align}
> \hat{x}= & (A'A+ \beta I)^{-1} A' y \\
> = & V(\Sigma'\Sigma + \beta I)^{-1} V'V \Sigma' U'y \\
> = & V(\Sigma'\Sigma + \beta I)^{-1} \Sigma'U' y \\
> = & V_{r}(\Sigma _{r}^{2} + \beta I_{r})^{-1} \Sigma_{r}U_{r}' y \\
> = & \sum_{k=1}^{r} \frac{\sigma_{k}}{\sigma_{k}^{2} + \beta}v_{k}u_{k}' y,
> \end{align}$
> where we used a [[compact svd]] in the penultimate step. The final two expressions are the ones we wanted to show.
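A numerical sketch of the uniqueness discussion in the proof (NumPy assumed; the rank-deficient $A$ is an illustrative construction): every $x = A^{+}y + V_{0}z_{0}$ attains the same residual, and $A^{+}y$ has the smallest norm among them.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0]                    # rank-deficient: ker A is nontrivial
y = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))           # numerical rank (here r = 3)
x_min = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])   # A^+ y = V_r Sigma_r^{-1} U_r' y
V0 = Vt[r:].T                        # columns span ker A

x_other = x_min + V0 @ rng.standard_normal(4 - r)   # another minimizer

# same residual, since A V_0 z_0 = 0 ...
same_residual = np.isclose(np.linalg.norm(A @ x_min - y),
                           np.linalg.norm(A @ x_other - y))
# ... but A^+ y has the smaller norm, by orthogonality of row space and kernel
minimal_norm = np.linalg.norm(x_min) <= np.linalg.norm(x_other)
```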
----
#### References
> [!backlink]
> ```dataview
> TABLE rows.file.link as "Further Reading"
> FROM [[]]
> FLATTEN file.tags
> GROUP BY file.tags as Tag
> ```

> [!frontlink]
> ```dataview
> TABLE rows.file.link as "Further Reading"
> FROM outgoing([[]])
> FLATTEN file.tags as Tag
> WHERE Tag = "#definition" OR Tag = "#theorem" OR Tag = "#MOC" OR Tag = "#proposition" OR Tag = "#axiom"
> GROUP BY Tag
> ```