11. Basic ideas of the Principal Component Analysis (PCA)#
The principal component analysis deals with the problem of fitting a low-dimensional affine subspace of dimension $d$, with $d$ much smaller than the dimension $p$ of the full data set.

We have a data set defined by a design/feature matrix $\boldsymbol{X}$ (see below for its definition). Each data point is determined by $p$ extrinsic (measurement) variables. We may want to ask the following question: Are there fewer intrinsic variables (say $d \ll p$) that still approximately describe the data? If so, these intrinsic variables may tell us something important, and finding them is what dimension reduction methods do.
A good read is for example Vidal, Ma and Sastry.
11.1. Introducing the Covariance and Correlation functions#
Before we discuss the PCA theorem, we need to remind ourselves about the definitions of the covariance and the correlation function. These are quantities we will use repeatedly in the analysis below.

Suppose we have defined two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$, each with $n$ elements. The covariance matrix is defined as

$$
\boldsymbol{C}[\boldsymbol{x},\boldsymbol{y}] = \begin{bmatrix} \mathrm{cov}[\boldsymbol{x},\boldsymbol{x}] & \mathrm{cov}[\boldsymbol{x},\boldsymbol{y}] \\ \mathrm{cov}[\boldsymbol{y},\boldsymbol{x}] & \mathrm{cov}[\boldsymbol{y},\boldsymbol{y}] \end{bmatrix},
$$

where for example

$$
\mathrm{cov}[\boldsymbol{x},\boldsymbol{y}] = \frac{1}{n}\sum_{i=0}^{n-1}(x_i-\overline{x})(y_i-\overline{y}).
$$

With this definition and recalling that the variance is defined as

$$
\mathrm{var}[\boldsymbol{x}] = \frac{1}{n}\sum_{i=0}^{n-1}(x_i-\overline{x})^2,
$$

we can rewrite the covariance matrix as

$$
\boldsymbol{C}[\boldsymbol{x},\boldsymbol{y}] = \begin{bmatrix} \mathrm{var}[\boldsymbol{x}] & \mathrm{cov}[\boldsymbol{x},\boldsymbol{y}] \\ \mathrm{cov}[\boldsymbol{x},\boldsymbol{y}] & \mathrm{var}[\boldsymbol{y}] \end{bmatrix}.
$$

The covariance elements are not bounded and may take very large absolute values, which can lead to loss of numerical precision. It is therefore common to scale the covariance matrix by introducing instead the correlation matrix, defined via the so-called correlation function

$$
\mathrm{corr}[\boldsymbol{x},\boldsymbol{y}] = \frac{\mathrm{cov}[\boldsymbol{x},\boldsymbol{y}]}{\sqrt{\mathrm{var}[\boldsymbol{x}]\,\mathrm{var}[\boldsymbol{y}]}}.
$$

The correlation function takes values in the interval $[-1,1]$, and the correlation matrix is then

$$
\boldsymbol{K}[\boldsymbol{x},\boldsymbol{y}] = \begin{bmatrix} 1 & \mathrm{corr}[\boldsymbol{x},\boldsymbol{y}] \\ \mathrm{corr}[\boldsymbol{y},\boldsymbol{x}] & 1 \end{bmatrix}.
$$
In the above example this is the function we constructed using pandas.
In our derivation of the various regression algorithms like Ordinary Least Squares or Ridge regression, we defined the design/feature matrix $\boldsymbol{X}\in \mathbb{R}^{n\times p}$ as

$$
\boldsymbol{X} = \begin{bmatrix}
x_{0,0} & x_{0,1} & \dots & x_{0,p-1} \\
x_{1,0} & x_{1,1} & \dots & x_{1,p-1} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n-1,0} & x_{n-1,1} & \dots & x_{n-1,p-1}
\end{bmatrix},
$$

with the predictors/features $p$ referring to the column numbers and the $n$ entries being the row elements. We can rewrite the design/feature matrix in terms of its column vectors as

$$
\boldsymbol{X} = \begin{bmatrix} \boldsymbol{x}_0 & \boldsymbol{x}_1 & \dots & \boldsymbol{x}_{p-1} \end{bmatrix},
$$

with a given vector

$$
\boldsymbol{x}_i^T = \begin{bmatrix} x_{0,i} & x_{1,i} & \dots & x_{n-1,i} \end{bmatrix}.
$$

With these definitions, we can now rewrite our $2\times 2$ covariance and correlation matrices in terms of a more general design/feature matrix $\boldsymbol{X}\in \mathbb{R}^{n\times p}$. This leads to a $p\times p$ covariance matrix for the column vectors $\boldsymbol{x}_i$ with $i=0,1,\dots,p-1$,

$$
\boldsymbol{C}[\boldsymbol{x}] = \begin{bmatrix}
\mathrm{var}[\boldsymbol{x}_0] & \mathrm{cov}[\boldsymbol{x}_0,\boldsymbol{x}_1] & \dots & \mathrm{cov}[\boldsymbol{x}_0,\boldsymbol{x}_{p-1}] \\
\mathrm{cov}[\boldsymbol{x}_1,\boldsymbol{x}_0] & \mathrm{var}[\boldsymbol{x}_1] & \dots & \mathrm{cov}[\boldsymbol{x}_1,\boldsymbol{x}_{p-1}] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}[\boldsymbol{x}_{p-1},\boldsymbol{x}_0] & \mathrm{cov}[\boldsymbol{x}_{p-1},\boldsymbol{x}_1] & \dots & \mathrm{var}[\boldsymbol{x}_{p-1}]
\end{bmatrix},
$$

and the correlation matrix

$$
\boldsymbol{K}[\boldsymbol{x}] = \begin{bmatrix}
1 & \mathrm{corr}[\boldsymbol{x}_0,\boldsymbol{x}_1] & \dots & \mathrm{corr}[\boldsymbol{x}_0,\boldsymbol{x}_{p-1}] \\
\mathrm{corr}[\boldsymbol{x}_1,\boldsymbol{x}_0] & 1 & \dots & \mathrm{corr}[\boldsymbol{x}_1,\boldsymbol{x}_{p-1}] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{corr}[\boldsymbol{x}_{p-1},\boldsymbol{x}_0] & \mathrm{corr}[\boldsymbol{x}_{p-1},\boldsymbol{x}_1] & \dots & 1
\end{bmatrix}.
$$

The NumPy function np.cov calculates the covariance elements using the factor $1/(n-1)$ instead of $1/n$, since it assumes that we do not know the exact mean values and estimates them from the data. The following simple code uses np.vstack, which stacks the two vectors of dimension $1\times n$ into a $2\times n$ matrix $\boldsymbol{W}$, which in turn is converted into the $2\times 2$ covariance matrix $\boldsymbol{C}$ by np.cov.
# Importing various packages
import numpy as np
n = 100
x = np.random.normal(size=n)
print(np.mean(x))
y = 4+3*x+np.random.normal(size=n)
print(np.mean(y))
W = np.vstack((x, y))
C = np.cov(W)
print(C)
-0.02745698767039481
3.9527157086177156
[[0.80249705 2.35440603]
[2.35440603 7.83541057]]
11.2. Correlation Matrix#
The previous example can be converted into a correlation matrix by simply scaling the matrix elements with the variances. We should also subtract the mean values for each column. This leads to the following code, which sets up the correlation matrix for the previous example in a more brute-force way. Here we subtract the mean values for each column, calculate the relevant variances and covariances, and then finally set up the $2\times 2$ correlation matrix (the matrix C in the code below).
import numpy as np
n = 100
# define two vectors
x = np.random.random(size=n)
y = 4+3*x+np.random.normal(size=n)
#scaling the x and y vectors
x = x - np.mean(x)
y = y - np.mean(y)
variance_x = np.sum(x@x)/n
variance_y = np.sum(y@y)/n
print(variance_x)
print(variance_y)
cov_xy = np.sum(x@y)/n
cov_xx = np.sum(x@x)/n
cov_yy = np.sum(y@y)/n
C = np.zeros((2,2))
C[0,0]= cov_xx/variance_x
C[1,1]= cov_yy/variance_y
C[0,1]= cov_xy/np.sqrt(variance_y*variance_x)
C[1,0]= C[0,1]
print(C)
0.07408022521552643
1.9519435372439522
[[1. 0.56323995]
[0.56323995 1. ]]
We see that the matrix elements along the diagonal are one as they should be and that the matrix is symmetric. Furthermore, diagonalizing this matrix we easily see that it is a positive definite matrix.
The above procedure with NumPy can be made more compact if we use pandas. We show here how we can set up the correlation matrix using pandas, as done in this simple code:
import numpy as np
import pandas as pd
n = 10
x = np.random.normal(size=n)
x = x - np.mean(x)
y = 4+3*x+np.random.normal(size=n)
y = y - np.mean(y)
X = (np.vstack((x, y))).T
print(X)
Xpd = pd.DataFrame(X)
print(Xpd)
correlation_matrix = Xpd.corr()
print(correlation_matrix)
[[ 0.53340467 1.72530188]
[-0.56167889 -3.2544002 ]
[ 1.85988539 5.87145508]
[-1.23561281 -4.57354138]
[ 0.52739566 0.93489572]
[ 0.89775088 2.87036267]
[-0.71834014 -0.65667853]
[ 0.38699758 1.30537358]
[-1.48772565 -3.9664681 ]
[-0.20207669 -0.25630072]]
0 1
0 0.533405 1.725302
1 -0.561679 -3.254400
2 1.859885 5.871455
3 -1.235613 -4.573541
4 0.527396 0.934896
5 0.897751 2.870363
6 -0.718340 -0.656679
7 0.386998 1.305374
8 -1.487726 -3.966468
9 -0.202077 -0.256301
0 1
0 1.000000 0.966262
1 0.966262 1.000000
We expand this model to the Franke function discussed above.
# Common imports
import numpy as np
import pandas as pd
def FrankeFunction(x,y):
    term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
    term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
    term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
    term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
    return term1 + term2 + term3 + term4

def create_X(x, y, n):
    if len(x.shape) > 1:
        x = np.ravel(x)
        y = np.ravel(y)

    N = len(x)
    l = int((n+1)*(n+2)/2)  # Number of elements in beta
    X = np.ones((N,l))

    for i in range(1,n+1):
        q = int((i)*(i+1)/2)
        for k in range(i+1):
            X[:,q+k] = (x**(i-k))*(y**k)

    return X
# Making meshgrid of datapoints and compute Franke's function
n = 4
N = 100
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
Xpd = pd.DataFrame(X)
# subtract the mean values and set up the covariance matrix
Xpd = Xpd - Xpd.mean()
covariance_matrix = Xpd.cov()
print(covariance_matrix)
0 1 2 3 4 5 6 7 \
0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.0 0.073438 0.077471 0.075458 0.076186 0.076594 0.069080 0.068939
2 0.0 0.077471 0.083242 0.080919 0.082408 0.083412 0.074516 0.074729
3 0.0 0.075458 0.080919 0.083069 0.084410 0.085279 0.079308 0.079408
4 0.0 0.076186 0.082408 0.084410 0.086139 0.087327 0.080792 0.081105
5 0.0 0.076594 0.083412 0.085279 0.087327 0.088786 0.081792 0.082290
6 0.0 0.069080 0.074516 0.079308 0.080792 0.081792 0.077847 0.078070
7 0.0 0.068939 0.074729 0.079408 0.081105 0.082290 0.078070 0.078429
8 0.0 0.068716 0.074791 0.079363 0.081242 0.082589 0.078134 0.078618
9 0.0 0.068441 0.074753 0.079219 0.081258 0.082750 0.078091 0.078687
10 0.0 0.061939 0.066921 0.073159 0.074603 0.075599 0.073240 0.073520
11 0.0 0.061580 0.066741 0.072883 0.074456 0.075571 0.073054 0.073431
12 0.0 0.061213 0.066524 0.072574 0.074264 0.075487 0.072829 0.073296
13 0.0 0.060849 0.066290 0.072252 0.074047 0.075371 0.072584 0.073135
14 0.0 0.060497 0.066053 0.071930 0.073822 0.075239 0.072334 0.072965
8 9 10 11 12 13 14
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.068716 0.068441 0.061939 0.061580 0.061213 0.060849 0.060497
2 0.074791 0.074753 0.066921 0.066741 0.066524 0.066290 0.066053
3 0.079363 0.079219 0.073159 0.072883 0.072574 0.072252 0.071930
4 0.081242 0.081258 0.074603 0.074456 0.074264 0.074047 0.073822
5 0.082589 0.082750 0.075599 0.075571 0.075487 0.075371 0.075239
6 0.078134 0.078091 0.073240 0.073054 0.072829 0.072584 0.072334
7 0.078618 0.078687 0.073520 0.073431 0.073296 0.073135 0.072965
8 0.078919 0.079094 0.073650 0.073652 0.073602 0.073523 0.073430
9 0.079094 0.079368 0.073677 0.073765 0.073796 0.073795 0.073777
10 0.073650 0.073677 0.069911 0.069801 0.069653 0.069484 0.069308
11 0.073652 0.073765 0.069801 0.069768 0.069692 0.069593 0.069484
12 0.073602 0.073796 0.069653 0.069692 0.069686 0.069654 0.069611
13 0.073523 0.073795 0.069484 0.069593 0.069654 0.069688 0.069709
14 0.073430 0.073777 0.069308 0.069484 0.069611 0.069709 0.069791
We note here that the covariance is zero for the first row and column since all elements of the first column of the design matrix are equal to one (we are fitting the function in terms of a polynomial of degree $n=4$, and the first column is the intercept). A constant column has zero variance and therefore zero covariance with all other columns.
We can rewrite the covariance matrix in a more compact form in terms of the design/feature matrix $\boldsymbol{X}$ as

$$
\boldsymbol{C}[\boldsymbol{x}] = \frac{1}{n}\boldsymbol{X}^T\boldsymbol{X} = \mathbb{E}[\boldsymbol{X}^T\boldsymbol{X}],
$$

assuming that the columns of $\boldsymbol{X}$ are mean-centered. To see this, let us simply look at a design matrix $\boldsymbol{X}\in \mathbb{R}^{2\times 2}$,

$$
\boldsymbol{X} = \begin{bmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{bmatrix} = \begin{bmatrix} \boldsymbol{x}_0 & \boldsymbol{x}_1 \end{bmatrix}.
$$

If we then compute the expectation value (with the $1/n$ normalization and $n=2$)

$$
\mathbb{E}[\boldsymbol{X}^T\boldsymbol{X}] = \frac{1}{n}\boldsymbol{X}^T\boldsymbol{X} = \frac{1}{2}\begin{bmatrix} x_{00}^2+x_{10}^2 & x_{00}x_{01}+x_{10}x_{11} \\ x_{01}x_{00}+x_{11}x_{10} & x_{01}^2+x_{11}^2 \end{bmatrix},
$$

which is just

$$
\boldsymbol{C}[\boldsymbol{x}_0,\boldsymbol{x}_1] = \boldsymbol{C}[\boldsymbol{x}] = \begin{bmatrix} \mathrm{var}[\boldsymbol{x}_0] & \mathrm{cov}[\boldsymbol{x}_0,\boldsymbol{x}_1] \\ \mathrm{cov}[\boldsymbol{x}_1,\boldsymbol{x}_0] & \mathrm{var}[\boldsymbol{x}_1] \end{bmatrix},
$$

where we wrote $\boldsymbol{C}[\boldsymbol{x}_0,\boldsymbol{x}_1] = \boldsymbol{C}[\boldsymbol{x}]$ to indicate that this is the covariance of the (mean-centered) column vectors of the design/feature matrix $\boldsymbol{X}$.

It is easy to generalize this to a design/feature matrix $\boldsymbol{X}\in \mathbb{R}^{n\times p}$.
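As a small numerical check, we can verify that for a mean-centered design matrix the compact form $\boldsymbol{X}^T\boldsymbol{X}/n$ agrees with NumPy's np.cov when the latter is told to divide by $n$ (the bias=True option). The variable names below are chosen only for this sketch.

```python
import numpy as np

# numerical check: for mean-centered columns, X^T X / n equals the covariance matrix
rng = np.random.default_rng(2021)
n_points, p_features = 1000, 3
Xc = rng.normal(size=(n_points, p_features))
Xc = Xc - Xc.mean(axis=0)                       # center each column

C_compact = (Xc.T @ Xc) / n_points              # compact matrix expression
C_numpy = np.cov(Xc, rowvar=False, bias=True)   # bias=True divides by n instead of n-1
print(np.allclose(C_compact, C_numpy))          # expect True
```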
11.3. Towards the PCA theorem#
We have that the covariance matrix (the correlation matrix involves a simple rescaling) of the mean-centered data is given as

$$
\boldsymbol{C}[\boldsymbol{x}] = \frac{1}{n}\boldsymbol{X}^T\boldsymbol{X} = \mathbb{E}[\boldsymbol{X}^T\boldsymbol{X}].
$$

Let us now assume that we can perform a series of orthogonal transformations where we employ some orthogonal matrices $\boldsymbol{S}\in \mathbb{R}^{p\times p}$, which obey $\boldsymbol{S}^T\boldsymbol{S} = \boldsymbol{S}\boldsymbol{S}^T = \boldsymbol{I}$ and whose columns we denote by $\boldsymbol{s}_0,\boldsymbol{s}_1,\dots,\boldsymbol{s}_{p-1}$.

Assume also that there is a transformation $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{S}$ such that the covariance matrix of the new variables $\boldsymbol{y}$ is diagonal. That is, we have

$$
\boldsymbol{C}[\boldsymbol{y}] = \mathbb{E}[\boldsymbol{Y}^T\boldsymbol{Y}] = \mathbb{E}[\boldsymbol{S}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{S}] = \boldsymbol{S}^T\mathbb{E}[\boldsymbol{X}^T\boldsymbol{X}]\boldsymbol{S} = \boldsymbol{S}^T\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{S} = \mathrm{diag}(\lambda_0,\lambda_1,\dots,\lambda_{p-1}),
$$

since the matrix $\boldsymbol{S}$ does not depend on the data and can be taken outside the expectation value. Multiplying from the left with $\boldsymbol{S}$ and using the orthogonality of $\boldsymbol{S}$, we see that the columns $\boldsymbol{s}_i$ are the eigenvectors of the covariance matrix,

$$
\boldsymbol{C}[\boldsymbol{x}]\,\boldsymbol{s}_i = \lambda_i\,\boldsymbol{s}_i,
$$

and since $\boldsymbol{C}[\boldsymbol{x}]$ is real and symmetric, such an orthogonal matrix of eigenvectors always exists and the eigenvalues $\lambda_i$ are real and non-negative.

In the derivation of the PCA theorem we will assume that the eigenvalues are ordered in descending order, that is $\lambda_0 \geq \lambda_1 \geq \dots \geq \lambda_{p-1}$.
The eigenvalues tell us then how much we need to stretch the
corresponding eigenvectors. Dimensions with large eigenvalues have
thus large variations (large variance) and define therefore useful
dimensions. The data points are more spread out in the direction of
these eigenvectors. Smaller eigenvalues mean, on the other hand, that
the corresponding eigenvectors are shrunk accordingly: the data
points are tightly bunched together and there is not much variation in
these specific directions. Hopefully we can then leave out the
dimensions where the eigenvalues are very small. If only the first $d$
eigenvalues are significantly different from zero, we may be able to
describe the data approximately with these $d$ dimensions/features.
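A small sketch of this diagonalization, using np.linalg.eigh (which is tailored to symmetric matrices and returns the eigenvalues in ascending order, so we reverse them to obtain the descending order assumed here); the random data are only for illustration.

```python
import numpy as np

# sketch: diagonalize a sample covariance matrix with an orthogonal matrix S
rng = np.random.default_rng(42)
Xc = rng.normal(size=(1000, 4))
Xc = Xc - Xc.mean(axis=0)
C = Xc.T @ Xc / Xc.shape[0]

eigvals, S = np.linalg.eigh(C)            # ascending eigenvalues for a symmetric matrix
eigvals, S = eigvals[::-1], S[:, ::-1]    # reorder to descending eigenvalues

print(np.allclose(S.T @ S, np.eye(4)))    # S is orthogonal
print(np.round(S.T @ C @ S, 6))           # diagonal matrix with the eigenvalues
```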
11.3.1. The Algorithm before the Theorem#
Here is how we would proceed in setting up the algorithm for the PCA; see also the discussion below (a compact code sketch of these steps follows the list).

1. Set up the data points in the design/feature matrix $\boldsymbol{X}\in \mathbb{R}^{n\times p}$, with the features/predictors $p$ referring to the column numbers and the $n$ entries being the row elements.

2. Center the data by subtracting the mean value for each column. This leads to a new matrix $\overline{\boldsymbol{X}}$.

3. Compute then the covariance/correlation matrix $\mathbb{E}[\overline{\boldsymbol{X}}^T\overline{\boldsymbol{X}}]$.

4. Find the eigenpairs of $\boldsymbol{C}$, with eigenvalues $[\lambda_0,\lambda_1,\dots,\lambda_{p-1}]$ and eigenvectors $[\boldsymbol{s}_0,\boldsymbol{s}_1,\dots,\boldsymbol{s}_{p-1}]$.

5. Order the eigenvalues (and the eigenvectors accordingly) in decreasing order.

6. Keep only the $d$ eigenvalues larger than a selected threshold value, thus discarding the remaining $p-d$ features since we expect small variations in the data along those directions.
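Collecting these steps, a minimal sketch of the algorithm could look as follows (my_pca is a hypothetical helper written only for illustration, not the implementation we develop below):

```python
import numpy as np

def my_pca(X, d):
    # step 2: center the data column by column
    X_centered = X - X.mean(axis=0)
    # step 3: sample covariance matrix
    C = X_centered.T @ X_centered / (X.shape[0] - 1)
    # step 4: eigenpairs of the symmetric matrix C (eigh returns ascending order)
    eigvals, eigvecs = np.linalg.eigh(C)
    # step 5: reorder in descending order of the eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # step 6: keep the d leading directions and project the data onto them
    W = eigvecs[:, :d]
    return eigvals[:d], W, X_centered @ W

# usage on random two-dimensional data (step 1: setting up the design matrix)
X_demo = np.random.multivariate_normal([-1, 2], [[4, 2], [2, 2]], size=1000)
eigvals, W, X_projected = my_pca(X_demo, d=1)
print(eigvals, W.shape, X_projected.shape)
```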
11.3.2. Writing our own PCA code#
We will use a simple example first, with two-dimensional data drawn from a multivariate normal distribution with the following mean and covariance matrix (we have fixed these quantities but you can play around with them below):

$$
\boldsymbol{\mu} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}.
$$

Note that the mean refers to each column of data. We will generate $n=10000$ data points $\boldsymbol{x}_i\in \mathbb{R}^2$ from this distribution.

The following Python code sets up the data and writes out the design matrix. Note that the sample covariance we compute further below (for example with np.cov or pandas) is defined by dividing by $n-1$ rather than $n$.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
n = 10000
mean = (-1, 2)
cov = [[4, 2], [2, 2]]
X = np.random.multivariate_normal(mean, cov, n)
Now we are going to implement the PCA algorithm. We will break it down into various substeps.

The first step of PCA is to compute the sample mean of the data and use it to center the data. Recall that the sample mean is

$$
\boldsymbol{\mu}_n = \frac{1}{n}\sum_{i=1}^{n}\boldsymbol{x}_i,
$$

and the mean-centered data points are

$$
\overline{\boldsymbol{x}}_i = \boldsymbol{x}_i - \boldsymbol{\mu}_n.
$$

When you are done with these steps, print out the mean values of the centered data in order to check that they are (numerically) zero.
df = pd.DataFrame(X)
# Pandas does the centering for us
df = df -df.mean()
# we center it ourselves
X_centered = X - X.mean(axis=0)
Alternatively, we could use the functions we discussed earlier for
scaling the data set. That is, we could have used the
StandardScaler function in Scikit-Learn, a function which ensures
that for each feature/predictor we study, the mean value is zero and
the variance is one (for every column in the design/feature matrix). You
would then not get the same results, since each column is also divided
by its standard deviation. The diagonal covariance matrix elements will then be one,
while the off-diagonal ones are divided by $\sqrt{\mathrm{var}[\boldsymbol{x}]\,\mathrm{var}[\boldsymbol{y}]}$, which means that the covariance matrix of the scaled data is simply the correlation matrix of the original data.
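As a small sketch of this point (reusing the array X generated above), the covariance matrix of data transformed with StandardScaler is, up to the $n/(n-1)$ factor, just the correlation matrix of the original data:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# scale each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

print(np.cov(X_scaled.T))    # diagonal elements close to one
print(np.corrcoef(X.T))      # correlation matrix of the original data, for comparison
```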
Now we are going to use the mean-centered data to compute the sample covariance of the data by using the following equation

$$
\boldsymbol{C} = \frac{1}{n-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\boldsymbol{\mu}_n)(\boldsymbol{x}_i-\boldsymbol{\mu}_n)^T,
$$

where the data points $\boldsymbol{x}_i\in \mathbb{R}^p$ (here $p=2$) and $\boldsymbol{\mu}_n$ is the sample mean defined above.
print(df.cov())
print(np.cov(X_centered.T))
0 1
0 4.066103 2.050297
1 2.050297 2.019626
[[4.06610301 2.0502966 ]
[2.0502966 2.01962561]]
Note that the way we define the covariance matrix here includes a factor $1/(n-1)$ instead of $1/n$, in line with the sample covariance used by np.cov and pandas. The code below sets up the same covariance matrix explicitly with our own code, extracting the two columns of the centered design matrix.
# extract the relevant columns from the centered design matrix of dim n x 2
x = X_centered[:,0]
y = X_centered[:,1]
Cov = np.zeros((2,2))
Cov[0,1] = np.sum(x.T@y)/(n-1.0)
Cov[0,0] = np.sum(x.T@x)/(n-1.0)
Cov[1,1] = np.sum(y.T@y)/(n-1.0)
Cov[1,0]= Cov[0,1]
print("Centered covariance using own code")
print(Cov)
plt.plot(x, y, 'x')
plt.axis('equal')
plt.show()
Centered covariance using own code
[[4.06610301 2.0502966 ]
[2.0502966 2.01962561]]

Depending on the number of data points $n$, the sample covariance will be more or less close to the covariance matrix we used to generate the data. Try to increase or decrease $n$ and see how the numbers change.
11.3.3. Diagonalize the sample covariance matrix to obtain the principal components#
Now we are ready to solve for the principal components! To do so we
diagonalize the sample covariance matrix $\boldsymbol{C}$ and obtain its eigenvalues and eigenvectors.

We then compute the percentage of the total variance captured by the first principal component, $\lambda_0/(\lambda_0+\lambda_1)$.

We plot the mean-centered data and lines along the first and second principal components.

Then we project the mean-centered data onto the first and second principal components, and plot the projected data.

Finally, we approximate the data using only the first principal component as

$$
\boldsymbol{x}_i \approx \boldsymbol{\mu}_n + \left(\boldsymbol{s}_0^T(\boldsymbol{x}_i-\boldsymbol{\mu}_n)\right)\boldsymbol{s}_0,
$$

where $\boldsymbol{s}_0$ is the eigenvector corresponding to the largest eigenvalue $\lambda_0$.
Collecting all these steps we can write our own PCA function and compare this with the functionality included in Scikit-Learn.
The code here outlines some of the elements we could include in the analysis. Feel free to extend upon this in order to address the above questions.
# diagonalize and obtain eigenvalues, not necessarily sorted
EigValues, EigVectors = np.linalg.eig(Cov)
# to sort eigenvalues and eigenvectors in descending order, uncomment the following:
#permute = EigValues.argsort()[::-1]
#EigValues = EigValues[permute]
#EigVectors = EigVectors[:,permute]
print("Eigenvalues of Covariance matrix")
for i in range(2):
print(EigValues[i])
FirstEigvector = EigVectors[:,0]
SecondEigvector = EigVectors[:,1]
print("First eigenvector")
print(FirstEigvector)
print("Second eigenvector")
print(SecondEigvector)
#thereafter we do a PCA with Scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2Dsl = pca.fit_transform(X)
print("Eigenvector of largest eigenvalue")
print(pca.components_.T[:, 0])
Eigenvalues of Covariance matrix
5.3343122336302145
0.7514163800957537
First eigenvector
[0.85045481 0.5260481 ]
Second eigenvector
[-0.5260481 0.85045481]
Eigenvector of largest eigenvalue
[0.85045481 0.5260481 ]
This code does not contain all the above elements, but it shows how we can use Scikit-Learn to extract the eigenvector which corresponds to the largest eigenvalue. Try to address the questions we pose before the above code. Try also to change the values of the covariance matrix by making one of the diagonal elements much larger than the other. What do you observe then?
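A possible sketch of the remaining steps, reusing EigValues, FirstEigvector, SecondEigvector, X_centered and X from the cells above (here we simply assume, as in the printed output, that the first eigenvalue returned is the largest one):

```python
import numpy as np
import matplotlib.pyplot as plt

# percentage of the total variance captured by the first principal component
print("Variance captured by the first PC:", EigValues[0]/np.sum(EigValues))

# project the mean-centered data onto the two principal components
z0 = X_centered @ FirstEigvector
z1 = X_centered @ SecondEigvector

# approximate the data using only the first principal component
X_approx = X.mean(axis=0) + np.outer(z0, FirstEigvector)

# plot the mean-centered data together with the two principal directions
plt.plot(X_centered[:, 0], X_centered[:, 1], 'x', alpha=0.2)
for vec, lam in zip((FirstEigvector, SecondEigvector), EigValues):
    plt.plot([0, 3*np.sqrt(lam)*vec[0]], [0, 3*np.sqrt(lam)*vec[1]], linewidth=3)
plt.axis('equal')
plt.show()
```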
11.4. Classical PCA Theorem#
We assume now that we have a design matrix $\boldsymbol{X}$ which is mean-centered, that is each column has mean value zero. We wish to approximate each data point $\boldsymbol{x}_i\in \mathbb{R}^p$ by its projection onto a subspace spanned by $d<p$ orthonormal vectors, and we quantify the quality of the approximation by the reconstruction error, the sum of the squared distances between the original data points and their projections.

The PCA theorem states that minimizing the above reconstruction error corresponds to projecting onto the $d$ eigenvectors of the covariance matrix $\boldsymbol{C}[\boldsymbol{x}]$ with the largest eigenvalues $\lambda_0,\lambda_1,\dots,\lambda_{d-1}$.

To show the PCA theorem, let us start with the assumption that there is one vector $\boldsymbol{s}_0$ of unit length onto which we project the data. The variance of the projected data is then $\boldsymbol{s}_0^T\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{s}_0$, and the mean squared reconstruction error is the total variance minus this projected variance.
We are almost there, we have obtained a relation between minimizing the reconstruction error and the variance and the covariance matrix. Minimizing the error is equivalent to maximizing the variance of the projected data.
We could trivially maximize the variance of the projection (and thereby minimize the error in the reconstruction) by letting the norm of $\boldsymbol{s}_0$ go to infinity. We therefore impose the normalization constraint $\boldsymbol{s}_0^T\boldsymbol{s}_0=1$ and introduce a Lagrange multiplier $\lambda_0$, maximizing instead

$$
\boldsymbol{s}_0^T\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{s}_0 + \lambda_0\left(1-\boldsymbol{s}_0^T\boldsymbol{s}_0\right).
$$

Taking the derivative with respect to $\boldsymbol{s}_0$ and setting it equal to zero we obtain

$$
\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{s}_0 = \lambda_0\boldsymbol{s}_0,
$$

meaning that $\boldsymbol{s}_0$ must be an eigenvector of the covariance matrix with eigenvalue $\lambda_0$.

The direction that maximizes the variance (or minimizes the reconstruction error) is thus an eigenvector of the covariance matrix! If we left-multiply with $\boldsymbol{s}_0^T$ we get

$$
\boldsymbol{s}_0^T\boldsymbol{C}[\boldsymbol{x}]\boldsymbol{s}_0 = \lambda_0,
$$

that is, the variance of the projected data equals the eigenvalue itself.

If we want to maximize the variance (minimize the reconstruction error)
we simply pick the eigenvector of the covariance matrix with the
largest eigenvalue. This establishes the link between the minimization
of the reconstruction error and the eigenvectors and eigenvalues of the covariance matrix.

The proof for the other eigenvectors $\boldsymbol{s}_1,\boldsymbol{s}_2,\dots,\boldsymbol{s}_{d-1}$ proceeds by induction: at each step we maximize the remaining variance subject to the additional constraint that the new direction is orthogonal to all directions already found, and the same Lagrange-multiplier argument then picks out the eigenvector with the next largest eigenvalue.
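A quick numerical illustration of this result, reusing the two-dimensional example from above: the variance of the data projected onto the leading eigenvector should equal the largest eigenvalue.

```python
import numpy as np

# variance along the leading eigenvector versus the largest eigenvalue
proj = X_centered @ FirstEigvector
print(proj.var(ddof=1), EigValues[0])   # the two numbers should agree
```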
For more details, see for example Vidal, Ma and Sastry, chapter 2.
11.5. Geometric Interpretation and link with Singular Value Decomposition#
For a detailed demonstration of the geometric interpretation, see Vidal, Ma and Sastry, section 2.1.2.
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
The following Python code uses NumPy’s svd() function to obtain all the principal components of the training set, then extracts the first two principal components. First we center the data using either pandas or our own code
import numpy as np
import pandas as pd
from IPython.display import display
np.random.seed(100)
# setting up a 10 x 5 vanilla matrix
rows = 10
cols = 5
X = np.random.randn(rows,cols)
df = pd.DataFrame(X)
# Pandas does the centering for us
df = df -df.mean()
display(df)
# we center it ourselves
X_centered = X - X.mean(axis=0)
# Then check the difference between pandas and our own set up
print(X_centered-df)
#Now we do an SVD
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]
c2 = V.T[:, 1]
W2 = V.T[:, :2]
X2D = X_centered.dot(W2)
print(X2D)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.574465 | 0.259153 | 1.197370 | 0.147400 | 0.649382 |
1 | 0.689519 | 0.137652 | -1.025709 | 0.210340 | -0.076938 |
2 | -0.282727 | 0.351636 | -0.539261 | 1.216683 | 0.340782 |
3 | 0.070889 | -0.614808 | 1.074067 | -0.038300 | -1.450257 |
4 | 1.794282 | 1.458078 | -0.207545 | -0.442600 | -0.147420 |
5 | 1.112383 | 0.647473 | 1.405890 | 0.073598 | -0.276263 |
6 | 0.397700 | -1.526744 | -0.712018 | 1.216290 | 0.418506 |
7 | -0.280647 | 1.106095 | -1.646283 | -0.956563 | -1.564374 |
8 | -0.369139 | -0.751699 | 0.051649 | -0.213103 | 0.967809 |
9 | -1.557795 | -1.066837 | 0.401842 | -1.213743 | 1.138775 |
0 1 2 3 4
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0
[[-1.5378811 0.94639099]
[ 0.86145244 -0.89288636]
[-0.00445655 -0.81633628]
[ 0.07145103 1.00433417]
[ 2.03707133 0.48476997]
[ 0.72174172 1.4557763 ]
[-0.55854694 -1.60673226]
[ 1.6999536 -0.43766686]
[-1.10405456 -0.31718909]
[-2.18673098 0.17953942]]
PCA assumes that the dataset is centered around the origin. Scikit-Learn’s PCA classes take care of centering the data for you. However, if you implement PCA yourself (as in the preceding example), or if you use other libraries, don’t forget to center the data first.
Once you have identified all the principal components, you can reduce the dimensionality of the dataset
down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components. The following lines project the centered data onto the plane defined by the first two principal components:
W2 = V.T[:, :2]
X2D = X_centered.dot(W2)
11.6. PCA and scikit-learn#
Scikit-Learn’s PCA class implements PCA using SVD decomposition just like we did before. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it automatically takes care of centering the data):
#thereafter we do a PCA with Scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
print(X2D)
[[ 1.5378811 -0.94639099]
[-0.86145244 0.89288636]
[ 0.00445655 0.81633628]
[-0.07145103 -1.00433417]
[-2.03707133 -0.48476997]
[-0.72174172 -1.4557763 ]
[ 0.55854694 1.60673226]
[-1.6999536 0.43766686]
[ 1.10405456 0.31718909]
[ 2.18673098 -0.17953942]]
After fitting the PCA transformer to the dataset, you can access the principal components using the components_ attribute (note that it contains the principal components as horizontal vectors, so, for example, the first principal component is equal to pca.components_.T[:, 0]):
pca.components_.T[:, 0]
array([-0.62373464, -0.5303329 , 0.317367 , 0.01873344, 0.47815203])
Another very useful piece of information is the explained variance ratio of each principal component,
available via the explained_variance_ratio_ attribute. It indicates the proportion of the dataset's variance that lies along the axis of each principal component.
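For the small example above this could be inspected as follows (the actual numbers depend on the random data):

```python
# proportion of the variance carried by each of the two components kept above
print(pca.explained_variance_ratio_)
```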
11.7. Back to the Cancer Data#
We can now repeat the above but applied to real data, in this case our breast cancer data. Here we compute performance scores on the training data using logistic regression.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data,cancer.target,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Train set accuracy from Logistic Regression: {:.2f}".format(logreg.score(X_train,y_train)))
# We scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Then perform again a log reg fit
logreg.fit(X_train_scaled, y_train)
print("Train set accuracy scaled data: {:.2f}".format(logreg.score(X_train_scaled,y_train)))
#thereafter we do a PCA with Scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D_train = pca.fit_transform(X_train_scaled)
# and finally compute the log reg fit and the score on the training data
logreg.fit(X2D_train,y_train)
print("Train set accuracy scaled and PCA data: {:.2f}".format(logreg.score(X2D_train,y_train)))
Train set accuracy from Logistic Regression: 0.95
Train set accuracy scaled data: 0.99
Train set accuracy scaled and PCA data: 0.96
/Users/mhjensen/miniforge3/envs/myenv/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
We see that the fit based on only the first two principal components of the scaled data performs about as well as the fit to the unscaled data, although not quite as well as the fit using all thirty scaled features.
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization — in that case you will generally want to reduce the dimensionality down to 2 or 3. The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set’s variance:
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
You could then set n_components=d and run PCA again. However, there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
11.7.1. Incremental PCA#
One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Fortunately, Incremental PCA (IPCA) algorithms have been developed: you can split the training set into mini-batches and feed an IPCA algorithm one minibatch at a time. This is useful for large training sets, and also to apply PCA online (i.e., on the fly, as new instances arrive).
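A minimal sketch using Scikit-Learn's IncrementalPCA on the scaled cancer data from above (the number of mini-batches is an arbitrary choice here):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 10
inc_pca = IncrementalPCA(n_components=2)
# feed the training data one mini-batch at a time
for X_batch in np.array_split(X_train_scaled, n_batches):
    inc_pca.partial_fit(X_batch)
X2D_train_inc = inc_pca.transform(X_train_scaled)
print(X2D_train_inc.shape)
```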
11.7.2. Randomized PCA#
Scikit-Learn offers yet another option to perform PCA, called Randomized PCA. This is a stochastic
algorithm that quickly finds an approximation of the first d principal components. Its computational
complexity is
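A short sketch, again on the scaled cancer data (random_state is set only to make the stochastic solver reproducible):

```python
from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=2, svd_solver="randomized", random_state=42)
X2D_rnd = rnd_pca.fit_transform(X_train_scaled)
print(rnd_pca.explained_variance_ratio_)
```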
11.7.3. Kernel PCA#
The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space. It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA). It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold. For example, the following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel (here with $\gamma=0.04$):
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
11.8. Other techniques#
There are many other dimensionality reduction techniques, several of which are available in Scikit-Learn.
Here are some of the most popular (a short usage sketch follows the list):

* Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between the instances.

* Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between the instances.

* t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).

* Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique for reducing dimensionality before running another classification algorithm such as the Support Vector Machine (SVM) classifier discussed in the SVM lectures.
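As a rough sketch of how these methods are invoked in Scikit-Learn, again using the scaled cancer data from above (default hyperparameters, chosen only for illustration):

```python
from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# each transformer reduces the scaled cancer data to two dimensions
X2D_mds = MDS(n_components=2).fit_transform(X_train_scaled)
X2D_iso = Isomap(n_components=2).fit_transform(X_train_scaled)
X2D_tsne = TSNE(n_components=2).fit_transform(X_train_scaled)
# LDA is supervised and needs the class labels; for two classes it allows at most one component
X2D_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_train_scaled, y_train)
```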