Further properties (important for our analyses later)

Let us study again \( \boldsymbol{X}^T\boldsymbol{X} \) in terms of our SVD,

$$ \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{U}^T\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{\Sigma}\boldsymbol{V}^T. $$

If we now multiply from the right with \( \boldsymbol{V} \) (using the orthogonality of \( \boldsymbol{V} \)) we get

$$ \left(\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{V}=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{\Sigma}. $$

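Since \( \boldsymbol{\Sigma}^T\boldsymbol{\Sigma} \) is a \( p\times p \) diagonal matrix with entries \( \sigma_i^2 \), this means that the column vectors \( \boldsymbol{v}_i \) of the orthogonal matrix \( \boldsymbol{V} \) are the eigenvectors of the matrix \( \boldsymbol{X}^T\boldsymbol{X} \), with eigenvalues given by the singular values squared, that is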

$$ \left(\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{v}_i=\boldsymbol{v}_i\sigma_i^2. $$
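The eigenvalue relation above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small random design matrix; the sizes `n = 6`, `p = 3` and the seed are arbitrary choices for illustration and are not taken from the lectures.

```python
import numpy as np

# Small random design matrix (n rows, p columns); sizes are illustrative only
rng = np.random.default_rng(2024)
n, p = 6, 3
X = rng.normal(size=(n, p))

# Full SVD: U is n x n, s holds the p singular values, Vt is V^T (p x p)
U, s, Vt = np.linalg.svd(X, full_matrices=True)
V = Vt.T

XtX = X.T @ X

# (X^T X) V should equal V Sigma^T Sigma, i.e. column i of V scaled by s[i]**2
lhs = XtX @ V
rhs = V * s**2  # broadcasting scales column i of V by s[i]**2
residual_V = np.max(np.abs(lhs - rhs))

# Independent check: eigenvalues of X^T X (returned ascending), sorted descending
eigvals = np.linalg.eigvalsh(XtX)[::-1]

print("max |(X^T X) V - V Sigma^T Sigma| =", residual_V)
print("eigenvalues of X^T X:    ", eigvals)
print("singular values squared: ", s**2)
```

The residual should be at the level of floating-point round-off, and the two printed arrays should agree.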

Similarly, if we use the SVD for the matrix \( \boldsymbol{X}\boldsymbol{X}^T \), we have

$$ \boldsymbol{X}\boldsymbol{X}^T=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{U}^T=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T\boldsymbol{U}^T. $$

If we now multiply from the right with \( \boldsymbol{U} \) (using the orthogonality of \( \boldsymbol{U} \)) we get

$$ \left(\boldsymbol{X}\boldsymbol{X}^T\right)\boldsymbol{U}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T. $$

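This means that the column vectors \( \boldsymbol{u}_i \) of the orthogonal matrix \( \boldsymbol{U} \) are the eigenvectors of the matrix \( \boldsymbol{X}\boldsymbol{X}^T \). Here \( \boldsymbol{\Sigma}\boldsymbol{\Sigma}^T \) is an \( n\times n \) diagonal matrix whose first \( p \) diagonal entries are \( \sigma_i^2 \) and, when \( n > p \), whose remaining \( n-p \) entries are zero. For \( i \le p \) the eigenvalues are thus given by the singular values squared, that is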

$$ \left(\boldsymbol{X}\boldsymbol{X}^T\right)\boldsymbol{u}_i=\boldsymbol{u}_i\sigma_i^2. $$
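The same kind of numerical check works for \( \boldsymbol{X}\boldsymbol{X}^T \). The sketch below assumes NumPy and an arbitrary random \( 6\times 3 \) matrix (illustrative sizes only). Because \( n > p \) in this example, \( \boldsymbol{\Sigma}\boldsymbol{\Sigma}^T \) is \( n\times n \) and its diagonal holds the \( p \) squared singular values followed by \( n-p \) zeros.

```python
import numpy as np

# Small random design matrix with n > p; sizes are illustrative only
rng = np.random.default_rng(7)
n, p = 6, 3
X = rng.normal(size=(n, p))

# Full SVD so that U is the complete n x n orthogonal matrix
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Diagonal of Sigma Sigma^T: s**2 padded with n - p zeros
sigma_sq_padded = np.zeros(n)
sigma_sq_padded[:p] = s**2

XXt = X @ X.T

# (X X^T) U should equal U Sigma Sigma^T
residual_U = np.max(np.abs(XXt @ U - U * sigma_sq_padded))

# Eigenvalues of X X^T, sorted descending: p values s**2, then n - p (numerical) zeros
eigvals_XXt = np.linalg.eigvalsh(XXt)[::-1]

print("max |(X X^T) U - U Sigma Sigma^T| =", residual_U)
print("eigenvalues of X X^T:", eigvals_XXt)
```

The last \( n-p \) printed eigenvalues are zero only up to round-off, so they may show up as tiny positive or negative numbers.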

Important note: we have defined our design matrix \( \boldsymbol{X} \) to be an \( n\times p \) matrix. In most supervised learning cases we have \( n \ge p \), and quite often \( n \gg p \). For linear-algebra-based methods like ordinary least squares or Ridge regression, this leads to a matrix \( \boldsymbol{X}^T\boldsymbol{X} \) of dimension \( p\times p \), which is small compared with the \( n\times n \) matrix \( \boldsymbol{X}\boldsymbol{X}^T \) and thereby easier to handle from a computational point of view (in terms of the number of floating point operations).
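To make the size argument concrete, here is a minimal sketch, assuming NumPy and synthetic data with \( n = 1000 \) and \( p = 5 \). The coefficients `beta_true` and the noise level are made up for illustration. It compares the shapes of \( \boldsymbol{X}^T\boldsymbol{X} \) and \( \boldsymbol{X}\boldsymbol{X}^T \), and solves ordinary least squares through the small \( p\times p \) system.

```python
import numpy as np

# Synthetic data with many more rows (data points) than columns (features)
rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1.0)           # hypothetical coefficients 1, 2, ..., p
y = X @ beta_true + 0.01 * rng.normal(size=n)  # small made-up noise

gram = X.T @ X    # p x p, the matrix that enters OLS and Ridge
outer = X @ X.T   # n x n, much larger when n >> p

# OLS via the normal equations: solve (X^T X) beta = X^T y
beta_normal = np.linalg.solve(gram, X.T @ y)

# Reference solution from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("shape of X^T X:", gram.shape)
print("shape of X X^T:", outer.shape)
print("OLS via normal equations:", beta_normal)
print("OLS via lstsq:           ", beta_lstsq)
```

For a well-conditioned design matrix like this one the two solutions should agree closely. For nearly collinear columns the normal equations can lose accuracy, which is one reason SVD-based solvers are often preferred.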

In our lectures, the number of columns will always refer to the number of features in our data set, while the number of rows represents the number of data points. Note that in other texts you may find the opposite convention. This has consequences for the definition of, for example, the covariance matrix and its relation to the SVD.
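As an illustration of how the row/column convention enters, the sketch below assumes NumPy, random data, and the convention used here (rows are data points, columns are features). With the column means subtracted, the sample covariance matrix is \( \boldsymbol{X}_c^T\boldsymbol{X}_c/(n-1) \), so its eigenvalues are \( \sigma_i^2/(n-1) \), where \( \sigma_i \) are the singular values of the centered matrix \( \boldsymbol{X}_c \). The name `Xc` and the sizes are illustrative choices.

```python
import numpy as np

# Random data: n data points (rows), p features (columns); sizes are illustrative
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))

# Center each feature (column) by subtracting its mean
Xc = X - X.mean(axis=0)

# Singular values of the centered design matrix
s_c = np.linalg.svd(Xc, compute_uv=False)

# rowvar=False tells NumPy that columns are the variables (features);
# with the opposite convention one would use the default rowvar=True
cov = np.cov(X, rowvar=False)

# Eigenvalues of the covariance matrix, sorted descending
cov_eigvals = np.linalg.eigvalsh(cov)[::-1]

print("eigenvalues of covariance matrix:", cov_eigvals)
print("sigma_i^2 / (n - 1):             ", s_c**2 / (n - 1))
```

In a text where rows are features and columns are data points, the roles of \( \boldsymbol{X}^T\boldsymbol{X} \) and \( \boldsymbol{X}\boldsymbol{X}^T \) (and of \( \boldsymbol{V} \) and \( \boldsymbol{U} \)) are swapped in this relation.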