# Understanding Principal Component Analysis and its Application in Data Science — Part 2

## Learn the mathematical intuition behind PCA

In the first part of this article, I introduced the covariance matrix and discussed the mathematical foundation of PCA. In this part first, we discuss the application of singular value decomposition (SVD) in SVD and then we see a case study of PCA for the MNIST dataset. Finally, we discuss the multivariate normal distribution and the limitations of PCA. If I refer to a figure or equation or listing which cannot be found here, then you should look for that in the first part of the article. The Eqs. 1–69, Figures 1–20 and Listings 1–17 are given in the first part.

You can find the first part of this article here: Understanding Principal Component Analysis and its Application in Data Science — Part 1

PCA using Singular Value Decomposition (SVD)

Suppose that A is an m×n matrix. Then AᵀA will be a square m×m matrix. We can easily show that this matrix is symmetric. We only need to show that it is equal to its transpose.

This result also shows why in Eq. 68, the covariance matrix is always symmetric. The eigenvectors and eigenvalues are only defined for square matrices, so we cannot define the eigenvalues of A if mn. However, we can still calculate the eigenvalues of Aᵀ A. Recall that the rank of a symmetric matrix is equal to the number of its nonzero eigenvalues. Besides, we know that the rank of A and Aᵀ A are equal. So the rank of A is equal to the number of the nonzero eigenvalues of Aᵀ A.

Since Aᵀ A is symmetric, it has n real eigenvalues and n linear independent and orthogonal eigenvectors. We call these eigenvectors v₁, v₂, …, v and we assume they are normalized. For each of these eigenvectors, we can use the definition of the length of a vector (Eq. 35) and Eq. 28 to write

Here we used the fact that the eigenvectors are normal, so their length is 1. So based on this equation the eigenvalues of AᵀA are all positive. Now assume that we label them in decreasing order, so:

Let {v₁, v₂, …, v} be the eigenvectors of AᵀA corresponding to these eigenvectors. We define the singular value of a matrix A as the square root of λ (the eigenvalue of AᵀA), and we denote it with σᵢ (don’t confuse it with standard deviation). So we have

Now suppose that the rank of A is r. Then the number of the nonzero eigenvalues of AᵀA or the number of nonzero singular values of A is r.

Now the singular value decomposition (SVD) of A can be written as

V is an n×n matrix that its columns are vᵢ (the eigenvectors of AᵀA)

As mentioned before the eigenvectors form an orthonormal set, so V is an orthogonal matrix. Σ is an m×n diagonal matrix of the form:

So all the elements of Σ are zero except the first r diagonal elements which are equal to the singular values of A. We define the orthogonal matrix U as

We define u to u as

The other uᵢ are defined in a way that together they form a basis for an m-dimensional vector basis Rᵐ. We are not going to prove the SVD equation here. You can refer to this article to learn more about SVD. Now suppose that we get the SVD of a centered design matrix ().

By definition of SVD, v₁, v₂, …, vare the eigenvectors of X̃ᵀ and the singular values are the square root of their corresponding eigenvalues. So we can write

Now we can divide both sides of the previous equation by m-1 (where m is the number of data points) and use Eq. 68, to get

So vᵢ is also the eigenvector of the covariance matrix and its corresponding eigenvalue is the square of its corresponding singular value divided by m-1. Remember that only the first r (r is the rank of ) singular values are nonzero and the rest will be equal to zero, so for those singular values, the corresponding eigenvalue of the covariance matrix is zero too. So the V matrix in SVD gives the principal components and using the singular values in Σ, we can easily calculate the eigenvalues.

As a result, using SVD we can calculate the principal components of the dataset without having to form the covariance matrix (1/(m-1))X̃ᵀ X̃. Forming this matrix for large datasets is computationally expensive. On the other, there are efficient algorithms to calculate the SVD of which makes it the preferred numerical method to perform a PCA analysis. In fact, Scikit-learn uses SVD to calculate the principal components. Let’s go back to the dataset that we defined in Listing 10.

`np.random.seed(0)mu = [2, 2]Sigma = [[6, 4],         [4, 6]]points = np.random.multivariate_normal(mu, Sigma, 150)pca = PCA(n_components=2)pca.fit(points)`

The eigenvectors were

`pca.explained_variance_`

Output

`array([10.24723443, 1.96099192])`

We can also retrieve the calculated singular values of the centered dataset

`# Listing 18sigma = pca.singular_values_sigma`

Output

`array([39.07477357, 17.09350158])`

Now using Eq. 70 we can calculate the eigenvalues using these singular values

`M = 150sigma ** 2 / (m — 1)`

Output

`array([10.24723443, 1.96099192])`

As you see we get the same numbers that `pca.explained_variance_` returned.

Case Study: MNIST Dataset

The MNIST dataset (MNIST stands for Modified National Institute of Standards and Technology) is a preprocessed dataset of handwritten digits 0 to 9, and It is widely used in image processing and machine learning. It has a training set of 60,000 examples and a test set of 10,000 examples. Each example contains an array and label. The array represents a centered 28×28 grayscale image that shows a handwritten digit, and the label gives the number that this image represents. The intensity of each pixel is an integer between 0 and 255, where zero corresponds to black and 255 corresponds to white. We can use this dataset to see the application of PCA in dimensionality reduction. We can use the `fetch_openml()` in Scikit-learn to load this dataset.

`# Listing 20from sklearn.datasets import fetch_openmlXtil, y = fetch_openml('mnist_784', version=1, return_X_y=True)print(Xtil.shape)print(y.shape)`

Output

`(70000, 784)(70000,)`

Here `Xtil` is the design matrix. Each row of `Xtil` is a vector with 784 elements that represents a flattened 28×28 image (28×28=784). So by reshaping it into a 28×28 array and using the `matplotlib` function `imshow()` we can show the image. For example, to show the first image we can write

`plt.imshow(Xhat.reshape(28, 28), cmap=’gray’)plt.show()`

The i-th column of `Xtil` is the i-th feature of this dataset and represents the i-th pixel in a flattened 784x784 image (Figure 22).

The array `y` contains the label of each image. So the label of this image is

`y`

Output

`'5'`

First, we normalize the pixels by dividing them by 255. So now the intensity of each pixel is in the interval [0, 1].

`Xhat /= 255`

Now we can use PCA for this dataset

`pca = PCA().fit(Xhat)`

Since `Xtil` is 70000×784, the covariance matrix will be 784×784, and we will have 784 eigenvectors or principal components. We can check the eigenvalues to see if any of them is zero.

`len(pca.explained_variance_[pca.explained_variance_ <= 1e-15])`

Output:

`71`

So 71 eigenvalues are extremely small and we can assume that they are zero which means that the rank of the covariance matrix is 784–71=713. These eigenvectors won’t be exactly zero since we are using a numerical method to calculate them. Instead, they will be a very small number. Since the eigenvalues reported by `pca.explained_variance_` are sorted, the nonzero eigenvalues are the first 713 elements of `pca.explained_variance_`. Listing 21 takes the first 5 eigenvectors (corresponding to the 5 greatest eigenvalues), reshapes them into a 28×28 array, and plots them (Figure 23).

The eigenvectors corresponding to nonzero eigenvalues form a basis for this dataset. So, each vector that represents an image can be written in terms of the first 713 eigenvectors which have a nonzero eigenvalue (Figure 24).

Now we find the scalar projections of each vector on each principal component. These scalar projections are the coordinates of that vector relative to the basis formed by the principal components.

`coordinates = pca.transform(Xhat)`

We can reconstruct a vector using the first k principal components (eigenvectors). This is an example of dimensionality reduction. When we only use the first k principal components, we reduce the dimension of the original dataset from 713 to k. If the original vector is xᵢ, we can write it as

Where each vᵢ is one of the principal components. By taking the first k principal components, we approximate xᵢ as

Now we only have k features and a much smaller space is needed to store the dataset. Listing 22 reconstructs a sample vector using the first 2, 3, 10, 50, 100, 400, and 713 eigenvectors. After the reconstruction, the vector is reshaped into a 28×28 array to be shown as an image. The result is shown in Figure 25.

When we only have 2 principal components, the image is blurred and doesn’t look like a ‘3’. It means that we have lost too much information by dimensionality reduction. As the number of principal components increases, the reconstructed image becomes closer to the original image. In fact, when k=400, they look identical. Let’s discuss it in more detail. We know that each eigenvalue gives the variance of the dataset along its corresponding eigenvector. So the sum of all eigenvectors gives the total variance of the dataset. `pca.explained_variance_ratio_` gives the share of each eigenvalue in the total sum of them. The output of this function is a 1-d array. Remember that the i-th element of `pca.explained_variance_` is the i-th eigenvector λ. The i-th element of `pca.explained_variance_ratio_` is

where n is the total number of eigenvalues returned by `pca.explained_variance_`. So if we calculate the cumulative sum of all the elements of `pca.explained_variance_ratio_`, it will be 1.

However, if we ignore some of the eigenvalues (and their corresponding eigenvectors) the cumulative sum will be less than one. Listing 23 plots the cumulative sum of `pca.explained_variance_ratio_` for MNIST dataset. Indeed the cumulative sum of `pca.explained_variance_ratio_` gives the sum of the variance ratio of the eigenvalues that are used to do the dimensionality reduction.

`# Listing 23plt.plot(range(1, 785), np.cumsum(pca.explained_variance_ratio_), marker=”o”)plt.xlabel(‘Number of components’, fontsize=14)plt.ylabel(‘Explained variance ratio’, fontsize=14)plt.show()`

For each point in this plot, if the x-coordinate is k (number of principal components included), the y coordinate is

This ratio gives the total variance explained by the first k principal components divided by the total variance of the dataset. As you see in Figure 26, the first 100 eigenvectors, explains ~90% of the total variance of the dataset. As a result, the reconstructed image using k=100 is a very good approximation of the original image. The curve in the plot reaches a plateau and doesn’t change that much when k is greater than 100. This plot is a very useful tool to decide how many principal components are necessary to have a good approximation of the original dataset.

Listing 24 takes 10 sample images with labels from ‘0’ to ‘9’ and reconstructs them using the first 2 principal components. The result is shown in Figure 27.

As you see the images that represent digits ‘0’, ‘1’, and ‘9’ are reconstructed much better compared to the others. Listing 25 plots all the data points on a 2-d plane formed by the 1st and 2nd principal components, so these points are the projection of the original data points onto this 2-d plane. The points are colored according to their labels. The result is shown in Figure 28 (left).

Figure 28 (right ) zooms in on a portion of the left plot. As you see in this figure, the points that represent the digit ‘1’ (colored by orange) are rather isolated from the other points and occupy a specific region on this 2-d plane. However, the points that represent the other digits are mixed and don’t have a specific region. As a result, when we reconstruct the vector that represents ‘1’, the result is a rather good approximation of the original image. However, for a digit like ‘8’, the reconstructed image is a mixture of the image of ‘8’ with those of some other digits like ‘3’, ‘6’, and ‘9’.

Listing 26 creates a rectangular grid out of an array of the coordinates of the first principal component and an array of the coordinates of the second principal component (Figure 29 left). For each point on this grid, it reconstructs its image using the corresponding coordinates (Figure 29 right).

Based on this figure, the reconstructed vectors of the digits ‘0’, ‘1’, and ‘9’ have a separate region on this 2-d plane of the 1st and 2nd principal components. In fact, when we project the original vectors of these digits onto this plane, each of them is placed in a rather separate region. However, the other digits share the same region, so the reconstructed image is a mixture of the image of several digits (Figure 30).

Removing noise using PCA

PCA can be used to reduce the noise in the data. Listing 27 shows an example. Here we take a sample data point and add some random noise to it. Figure 31 compares the reconstructed image using the first 50 principal components for both the original image and the original image with noise.

The reconstruction of the noisy image can remove most of the noise present in that. To explain the reason, we plot the coordinates of the principal components for the reconstructed vectors of the original image with and without noise. The result is shown in Figure 32.

As the figure shows, PCA assigns most of the noise to the principal components with a higher index (which have a lower eigenvalue). So for the first 50 principal components, the coordinates are more or less similar, however, as the index of the principal component increases, the difference between them increases too. Since we are only using the first 50 principal components, we ignore the other principal components that contain most of the noise, and the noise is reduced significantly.

A closer look at the multivariate normal (MVN) distribution

So far we assumed we have a random vector X that can generate our data points. We use these data points to calculate its mean vector and covariance matrix. But what if we want to do the opposite? Suppose that we have a mean vector µ and a covariance matrix Σ. Now we want to find a random vector X such that if we generate all the possible vectors that X can take, the mean of these vectors is µ and their covariance matrix is Σ. A random vector is an abstract concept, and what actually generates the data points is the PDF that the random vector represents. A PDF is a mathematical function that gives the probabilities of occurrence of different possible values that the random vector can take. Using the PDF, we can sample a random vector and generate some of the possible values that it can take. So our task is to find the probability distribution of X using µ and Σ. As we will show in this section the MVN distribution serves this purpose.

First, we start with the standard normal distribution. Let X be a continuous random variable. We say that X has a standard normal distribution if it has the following PDF:

We can easily show that (the proof is given in the appendix):

Figure 33 (left) shows a plot of this PDF. In fact, X can take any real values, however, the chance of taking values greater than 2 or less than -2 is small.

So the standard normal distribution can give us a random variable that its mean is zero and its variance is 1. Now let Z be a random variable having a standard normal distribution. We say X has a normal distribution with parameters µ and σ if

So a random variable with a normal distribution is just a linear function of another random variable with a standard normal distribution. We can show that the PDF of X is

The proof is given in the appendix. Figure 33 (right) shows an example of this PDF with µ=2 and σ²=3. We know that the mean of Z is zero. We can calculate the mean of X using Eq. 3

and its variance using Eq. 6

So the mean of X is µ and its standard deviation is σ. In Eq. 71, if we have µ = 0 and σ² = 1, then X=Z which means that a standard normal distribution is just a normal distribution with a mean of zero and standard deviation (and variance) of 1. So a normal distribution can give you a random variable X such that if we randomly generate an infinitely large set of values that X can take (an infinitely large sample), the mean of these values is µ and their variance is σ².

Now we want to generalize the normal distribution to random vectors. Suppose that we have n independent random variables X₁, X₂, …, X, and each of them has a standard normal distribution. So for each X:

Since these random variables are assumed to be independent of each other. We can use Eq. 15 to write:

Standard multivariate normal distribution

Now we define the random vector X as:

When X=x, it means that we have X₁=x₁, X₂=x₂, …, X=x at the same time. Since we assumed that all these random variables are independent, we can use Eqs. 19 and 73 to calculate the PDF of X

This is the PDF of the random vector X. We can calculate the mean of X using Eq. 21:

The covariance matrix of X is:

where we used Eqs. 75 and 76. So the covariance matrix of X is the identity matrix. As you see X is a generalization of the standard normal distribution to a random vector. As a result, if a random vector has such a PDF, we say that it has standard multivariate normal (SMVN) distribution. Figure 34 (left) shows a plot of the PDF of a 2-d standard multivariate normal distribution. Here the PDF is

Now imagine all the vectors x that form a circle with a radius of c. We call them x𝒸. For these vectors we have

So all the vectors on this circle have the same value of the PDF. In fact, as Figure 34 (right) shows each contour line of this PDF is a circle.

MVN distribution

Now Let Z be a random vector with n elements that has an SMVN distribution, and X be a random vector with n elements which has an MVN distribution with the parameters µ and Σ. Recall that the PDF of X is

Now we can show that

where

This means that a random vector with an MVN distribution is just a linear transformation of a random vector with an SMVN distribution (proof is given in the appendix). Now we can calculate the mean of X. Since Z has an SMVN distribution, its mean is zero (Eq. 77). So we get

The covariance matrix of X can be calculated using Eq. 30

Since Z has an SMVN distribution, its mean is zero and its covariance matrix is the identity matrix. So we have

If we substitute this equation into the previous equation we have

So we conclude that µ and Σ are the mean vector and the covariance matrix of X respectively. In fact, the multivariate normal distribution can give you a random vector X such that if we randomly generate an infinitely large set of values that X can take, the mean of these values is µ and their covariance matrix is Σ.

It is important to note that when we get a small sample from an MVN, the covariance of the points in that sample is not exactly equal to Σ. To get Σ we need all the possible points which is an infinite set. However, by increasing the number of points in the sample, the covariance of the sample gets closer to Σ. For example for the dataset defined in Listing 10 we had

Our sample had 150 points and the covariance of the sample is

`np.random.seed(0)mu = [2, 2]Sigma = [[6, 4],         [4, 6]]points = np.random.multivariate_normal(mu, Sigma, 150)np.round(np.cov(points.T), 2)`

Output

`array([[5.82, 4.13],       [4.13, 6.39]])`

If we increase the number of points to 30000, the covariance matrix will be

`points = np.random.multivariate_normal(mu, Sigma, 30000)np.round(np.cov(points.T), 2)`

Output

`array([[6.03, 4.02],       [4.02, 5.96]])`

This matrix is much closer to Σ.

Let’s see how we can A get from Σ. First, we need to get the eigendecomposition of Σ. Based on Eq. 65 we can write

where he columns of P are the eigenvectors of Σ. Remember that each eigenvalue gives the variance of the random vector along its corresponding eigenvector and variance is always greater than or equal to zero. So we can take its square root. We define D ^1/2 as

It be can be easily shown that

Besides, we have

since D^ 1/2 is a symmetric matrix. Now we can show that

To prove it we need to show that

For the first identity, we can write

The second identity can be proved similarly. As Eq. 85 shows A is a symmetric matrix since it is equal to its transpose. So we can also write

Recall that in Eq. 71, we used the standard deviation (σ) of X to transform Z to X, and the variance of X is the square of its standard deviation. In Eq. 79 we use matrix A to transform Z to X, and the covariance matrix of X is the square of A. So A plays the same role of standard deviation for a random vector.

Recall that when we multiply the covariance matrix by the vectors of the unit circle, it transforms the unit circle by stretching or shrinking it along the principal components, and the amount of stretching along each principal component is proportional to its corresponding eigenvalue or the variance of the dataset along that principal component (Eq. 52). This is true for any circle with a different radius. Suppose that u₁ is an arbitrary vector on the unit vector (Figure 35), and suppose that we have a circle with a radius of c, and r₁ is an arbitrary vector on that circle which is also along u₁. So we have r₁=cu₁. If we multiply S by the vector r₁ we get

So the result is a scaled copy of the ellipse formed by Su₁ and the scale factor is c. For example, if the maximum of || Su₁ || is λ₁, then the maximum of || Sr₁ || is cλ₁ (Figure 35).

Suppose that Z has a 2-d SMVN distribution. Now imagine all the vectors z𝒸 that form a circle with a radius of c. As mentioned before, in the SMVN distribution, the value of the PDF for all these vectors is the same

Now let the random vector X have an MVN distribution with a mean of µ and covariance matrix of Σ. We know that

We also know that X = µ+AZ. Let’s see how z𝒸 is transformed. When we multiply A by z𝒸, the circle turns into an ellipse, and by adding µ to it, we only move the center of this ellipse from origin to µ (refer to Figure 36). Using Eq. 79 we can write

By replacing this equation into the PDF of MVN distribution (Eq. 78) we get

where Eqs. A.14 and A.15 are given in the appendix. So we have

Here x𝒸=Az𝒸+μ. Based on this result we see that the PDF value of all the vectors on this ellipse is the same, and their PDF is equal to the PDF of the original circle divided by the square root of the determinant of the covariance matrix (Figure 36).

We mentioned that when a matrix transforms a circle into an ellipse, the principal axes of the ellipse are along the eigenvectors of that matrix. So the principal axis of each ellipse that forms a contour of the MVN distribution is along the eigenvectors of A. Now if v is an eigenvector of A we have

Using Eq. 86 we can write

which means that Σ and A have the same eigenvectors. So, the principal axes of the covariance matrix are along the principal axes of the ellipse defined by Az𝒸. (Figure 37).

Listing 29 shows how a contour line of an SMVN distribution can be transformed to that of an MVN distribution. The result is showin in Figure 38.

Now let’s see how an SMVN distribution is transformed into an MVN distribution. Each contour line of the SMVN distribution is a circle and each circle is transformed to its corresponding ellipse using the matrix A. The addition of µ moves the center of each ellipse from origin to µ, and each ellipse is the contour line of the MVN distribution. Listing 30 shows this transformation for an arbitrary covariance matrix, and the result is shown in Figure 39.

On the bottom left of this figure, some of the contour lines of the SMVN distribution are plotted. To plot these lines the `multivariate_normal()` class in the Scipy library has been used. It takes µ and Σ as parameters. The `pdf()` method of this class returns the PDF of the distribution. We first calculate the PDF of the circles which have a radius of 0.5, 1, 1.5, 2, and 2.5. Each PDF is divided by the square root of the determinant of Σ to get the PDF of the ellipse which is the result of transforming that circle (Az). The addition of µ moves the center of each ellipse from origin to µ. These ellipses are the contour lines of the multivariate normal distribution with a mean of µ and covariance matrix of Σ. On the top of this figure, you can see the plot of the PDF of each distribution. The principal axes of each ellipse are along the eigenvectors of Σ (the red vectors in Figure 39).

As you see the peak of the PDF of SMVN distribution is higher than that of the MVN distribution. That is because we divide f_Z(z) by the square root of the determinant of Σ. We know that the integral of a joint PDF over the entire space must be 1. Here this integral is equal to the volume under the PDF surface. When matrix A is applied to a circle in the SMVN distribution, it will shrink or stretch that circle. So the resulting ellipse can have a lower or higher area. As a result, the volume under the PDF surface also changes, and it won’t be equal to 1 anymore. By dividing f_Z(z) by the square root of the determinant of Σ, we change the height of the surface and correct the volume to be equal to 1.

The shape of the contours of an MVN distribution is determined by the covariance matrix. In SMVN distribution, the mean vector is zero and the covariance matrix is the identity matrix, so the contours are circles centered at the origin (Figure 40). We can have a specific form of MVN distribution in which the covariance matrix is a multiple of the identity matrix. In that case, the contours are still circles since this covariance matrix transforms the contour circles of SMVN into smaller or bigger circles. These circles are centered at the mean vector (Figure 40). Now suppose that we have an MVN PDF

and its covariance matrix is diagonal

As mentioned before, this means that the covariance along the vectors of the standard basis is zero. Since the coordinate axes are represented by x₁ and x₂ we have

So X₁ and X₂ are uncorrelated, and e₁ and e₂ are the principal components. As we explained in Figure 16, we still have a non-zero covariance along any other pair of orthogonal vectors. That is why the contours are ellipses, not circles. Recall that the principal axes of the ellipse are always along the principal components. So when the axes of our coordinate system are parallel to the principal axes of the ellipse, it means that the random variables of the MVN PDF are uncorrelated (Figure 40).

Finally, these results can be generalized to an n-dimensional space for a random vector X with n elements X₁, X₂, …, X. Here the contours of the SMVN distribution are n-dimensional spheres, and a covariance matrix that is not a scalar multiple of the identity matrix turns these spheres sphere into ellipsoids. When the covariance matrix is diagonal, the principal axes of this ellipsoid are parallel to the axes of the coordinate system, and the random variables X₁, X₂, …, X are uncorrelated.

Correlation and independence

To understand the limitations of PCA, first, we need to discuss the relation between independence and correlation. Remember that in Eq. 15 we showed that two independent variables have a zero covariance. Based on Eq. 16, they also have a zero correlation. Hence independence implies the lack of correlation. However, you should note that the opposite is not true. The lack of correlation does not imply independence. In fact, if ρ(X₁, X₂)=0, it does not necessarily mean that X₁ and X₂ are independent.

Let’s see an example. Suppose that we have a random vector

and X₂=cos(X₁). Let X₁ have a uniform distribution on the interval [-π, π]. A uniform distribution on the interval [-π, π] is defined as

It can be shown that the mean of a uniform distribution on the interval [a, b] is (a+b)/2. So, E[X₁]=0. We can also write

Figure 41 shows the curve x₂=f(x₁). Each point on this curve is a possible value that the random vector X can take.

Now if we calculate the covariance of X₁ and X₂ we get

So X₁ and X₂ are uncorrelated, however, they depend on each other since X₂=cos(X₁). In addition, the vectors of the standard basis (e₁ and e₂) are the principal components. Why do we get this result? The reason is that cos(x) is a nonlinear function. x₁cos(x₁) is an even function (it is symmetric about the y-axis), so its integral between x₁=-a and x₁=a is always equal to zero.

Recall that a positive covariance of two random variables means that they tend to move in the same direction, and a negative covariance indicates that they tend to move in the opposite direction. If there is a nonlinear relationship between these variables, they can move in the same direction in one region and move in the opposite direction in another region. For example in Figure 41, x₁ and x₂ move in the same direction when x₁ is below zero, and move in the opposite direction when x₁ is above zero. Since the function is symmetric about the y-axis, these moves cancel each other, so on average x₁ doesn’t move when x₂ move, so their covariance and correlation becomes zero.

Now let’s go back to the MVN distribution. Suppose that we have an MVN distribution as defined in Eq. 78. Now we move the origin to µ to have a new coordinate system. In this new coordinate system, each vector x is replaced with x- µ. So in this coordinate system Eq. 78 is written as

We also want to change the coordinate system to be along the principal components of the covariance matrix of the MVN distribution. So now the principal components form our coordinate system. Figure 42 shows an example of this change of the coordinate system for a 2-d MVN distribution. Recall that we need to decompose the covariance matrix relative to the standard basis to get the covariance matrix relative to the basis formed by the principal components (Eq. 63), and remember that the covariance matrix in the new basis will be

Where each λ is an eigenvalue of Σ. The inverse of Σ_B is

and you can easily prove it by showing that

In addition, it can be easily shown that

Let’s assume that in the new coordinate system, the axes are called x_Bi. So in this coordinate system, the PDF is written as

If you compare this equation with Eq. 72, you see that this PDF is the product of the PDF of n normal distributions with a mean of zero and variance of λ. For the 2-d MVN distribution in Figure 42, these PDFs have been plotted too. Recall that the marginal PDF of each X can be derived from the joint PDF using this equation

where we used the fact that each normal distribution integrates to one over its space and only the i-th distribution is left. So each normal PDF in Eq. 87 is a marginal PDF of the MVN distribution, and we can write this equation as

From Eq. 19 we know that if the random variables X_B1, X_B2, …, X_Bn have a PDF like that of Eq. 88, then they are mutually independent. We know that the random variables (X_Bi) that are represented by the principal components are uncorrelated, so in the MVN distribution, the lack of correlation implies independence. This is an important result, and MVN is the only distribution with this property.

In fact, we need to correct our previous description of MVN distribution. MVN can give you a random vector X such that if we take an infinitely large sample of X, the sample mean is µ and its covariance matrix is Σ. In addition, if we find the principal components of X, the random variables that represent the eigenvectors of Σ are both uncorrelated and independent.

Figure 43 shows a 2-d MVN as an example. The axes x₁ and x₂ are uncorrelated, so the random variables X₁ and X₂ should be independent. This figure shows the cross-sections of this PDF at x₁=0 and x₁=3. The shape of these cross-sections is a function of x₁, so they look different. However, it is important to notice that a cross-section is not a PDF since it does not integrate to one. In fact

is a PDF and integrates to one, but

is only a slice of that PDF, and its integral is less than one. Similar to Eq. Equation 9 we can define the Bayes’ rule for a joint PDF of a random vector with 2 random variables:

where

is a conditional PDF. Now for the MVN distribution, we get

So for each cross-section, the conditional PDF is the same and independent of the specific value of x

which means that X₁ and X₂ are independent.

Listing 31 shows an example of uncorrelated random variables in an MVN distribution. It first randomy samples 2500 points samples from an MVN distribution (Figure 44-left).

As this figure shows the points along x₁=0 have a higher dispersion than the points along x₁=7.5. That is because the PDF of the MVN distribution has higher values along x₁=0. Now we benrate 10000 random points of the same distribution but only select the points for which x₁=7.5+/-0.1. Then we select the same number of points for which x₁=+/-0.1. These points are plotted in Figure 44-right. As you see the points along x₁=0 and x₁=7.5 now have a similar dispersion. Indeed, by restricting the value of x₁ to an iterval, we get the conditional PDF of the distribution, and this conditional distribution is the same for all values of x₁.

Where PCA fails?

PCA is based on two assumptions. The random vector X that represents our data set is a continuous random vector. So the data set is real-valued. Besides, we assume that there are no missing values in the dataset. We can always apply PCA to real-valued datasets with no missing values. Remember that we use PCA to find an orthonormal basis for a dataset in which the features of this basis are uncorrelated. random variables that represent the basis vectors are uncorrelated. We also know that if we sort the eigenvectors based on their corresponding eigenvalue {v₁, v₂, …, v}, the maximum of the variance of the dataset over all the vectors in D which are perpendicular to v₁, v₂, …, v₋₁ is along v and is equal to its corresponding eigenvalue.

If we have a real-valued dataset with no missing values, PCA can always do its task to find an orthogonal basis in which the features relative to this basis are uncorrelated (which means that the random variables that represent these features are uncorrelated). However, as we mentioned before, what we are looking for is the independent features of the dataset. So we want the uncorrelated features to be independent too. This only happens if the dataset has an MVN distribution.

As an exapmle, we sample 200 points from X₂=cos(X₁) on the interval [-π, π], and calculate the principal components of this dataset in Listing 32. The result is shown in Figure 45.

We already showed that the vectors of the standard basis (e₁ and e₂) are the principal components of X₂=cos(X₁). The calculated principal components are shown in Figure 45 (red vectors). They are not exactly equal to e₁ and e₂ since we only sampled 200 points from X₂=cos(X₁). However, the calculated vectors get closer to them as the number of points in the sample increases.

Here PCA does what it is supposed to do. It finds an orthogonal basis in which the features relative to this basis are uncorrelated. It also finds the vectos along which the dataset has the highest variance (v₁). However, the maximum variance of this dataset is not along the cosine curve not a line. If we change the basis to {v₁, v₂}, and call new random variables represented by the new basis, X_B1 and X_B2, then [these random variables are uncorrelated, but they are not independent.

In summary, PCA never fails to give an uncorrelated orthogonal basis for our data set. However the random variables represented by this basis are not necessarily independent. If the dataset is sampled from an MVN distribution, the uncorrelated random variable are independent too.

You can find the first part of this article here: Understanding Principal Component Analysis and its Application in Data Science — Part 1

Appendix

Properties of standard normal distribution

The mean of standard normal distribution is zero:

The variance of standard normal distribution is one:

The first term on the right-hand side has the indeterminate form at infinity.

So we should apply L’Hospital’s Rule to evaluate its limit:

The second term is equal to the integral of the PDF standard normal distribution over its space. So it integrates to one:

So finally we get

PDF of normal distribution

Suppose that we have two random variables X and Z and their PDFs are f_X and f_Z respectively. R_Z (also called the support of Z) is the interval at which f_Z(z)>0. Suppose that g(z) is a differentiable and strictly increasing function over R_Z

and R_X is defined as the range of g

Then we have

Now we can use this formula to derive the PDF of a normal distribution from the PDF of standard normal distribution. From Eq. 71 we know that

where

So we have

PDF of multivariate normal distribution

Suppose that we have two random vectors X and Z each with n elements and their PDFs are f_X and f_Z respectively. R_Z is a subset of Rⁿ at which f_Z(z)>0. Suppose that g(z) is a one-to-one and differentiable and function over R_Z

and R is defined as the range of g

The Jacobian matrix of g⁻¹(z) is defined as

where z is the i-th component of z, and x is the i-th component of x = g(z). The joint PDF of X is

Now we can use this formula to derive the PDF of a multivariate normal distribution from the PDF of standard normal distribution. From Eq. 71 we know that

where

We also have

So we have

Now we can write

To simplify this equation, we can use some rules. It can be shown that

Now we can substitute Eq. A.8 into Eq. A.10 and simplify it