We can rewrite
$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = 0 = \boldsymbol{X}^T\left( \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right), $$
as
$$ \boldsymbol{X}^T\boldsymbol{y} = \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta}, $$
and if the matrix \( \boldsymbol{X}^T\boldsymbol{X} \) is invertible we have the solution
$$ \boldsymbol{\beta} =\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$
We note also that since our design matrix is defined as \( \boldsymbol{X}\in {\mathbb{R}}^{n\times p} \), the product \( \boldsymbol{X}^T\boldsymbol{X} \in {\mathbb{R}}^{p\times p} \). In the above case we have \( p \ll n \); with \( p=5 \) we end up inverting a small \( 5\times 5 \) matrix. This is a rather common situation: in many cases the matrix we need to invert is low-dimensional. The methods discussed here, like many other supervised learning algorithms such as classification with logistic regression or support vector machines, involve dimensionalities that allow the use of direct linear algebra methods such as LU decomposition or Singular Value Decomposition (SVD) to find the inverse of the matrix \( \boldsymbol{X}^T\boldsymbol{X} \).
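To make the dimensions concrete, here is a minimal NumPy sketch of the normal equations. It uses synthetic data, not the nuclear binding energies: the sample size `n = 100`, the random features, the noise level and the coefficients in `beta_true` are all assumptions made purely for illustration. It solves \( \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X}^T\boldsymbol{y} \) with an LU-based solver and compares the result with an SVD-based pseudoinverse.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the binding-energy data)
rng = np.random.default_rng(2021)
n, p = 100, 5

# Hypothetical design matrix: an intercept column plus four random features
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])  # made-up coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# The p x p matrix X^T X and the right-hand side X^T y
XtX = X.T @ X
Xty = X.T @ y
print("Shape of X^T X:", XtX.shape)

# Solve the normal equations directly (LAPACK LU factorization),
# which avoids forming the inverse explicitly
beta_lu = np.linalg.solve(XtX, Xty)

# SVD-based alternative: Moore-Penrose pseudoinverse of X
beta_svd = np.linalg.pinv(X) @ y

# The gradient condition X^T (y - X beta) = 0 should hold up to round-off
gradient = X.T @ (y - X @ beta_lu)
print("Largest gradient component:", np.max(np.abs(gradient)))
print("LU and SVD solutions agree:", np.allclose(beta_lu, beta_svd))
```

Calling `np.linalg.solve` rather than `np.linalg.inv` is a deliberate choice in this sketch: the explicit inverse is not needed to obtain \( \boldsymbol{\beta} \), and solving the linear system is generally cheaper and numerically better behaved.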
Small question: Do you think the example we have at hand here (the nuclear binding energies) can lead to problems in inverting the matrix \( \boldsymbol{X}^T\boldsymbol{X} \)? What kind of problems can we expect?