# Understanding Principal Component Analysis and its Application in Data Science — Part 1

## Learn the mathematical intuition behind PCA

Principal Component Analysis (PCA) is a method used to reduce the dimensionality of large datasets. We will also study the covariance matrix and the multivariate normal distribution in detail, since understanding them leads to a better understanding of PCA. The Python scripts in this article show how PCA can be implemented both from scratch and with the Scikit-Learn library.

**Notation**

Currently, Medium supports subscripts and superscripts only for some characters. So to write the names of the variables, I use this notation: every character after ^ is a superscript character and every character after _ (and before ^ if it is present) is a subscript character. For example, the matrix $P_B^T$ is written as *P_B^T* in this notation. In this article, bold-face italic lower-case letters (like ***a***) refer to vectors. Bold-face italic capital letters (like ***A***) refer to matrices, and italic lower-case letters (like *a*) refer to scalars. Capital letters (like X) refer to random variables and bold-face capital letters (like **X**) refer to random vectors. The numbers above an equality sign refer to the numbers of the equations that were used to derive the expression on the right-hand side of that equality sign.

**Dimensionality reduction**

Let’s start with a simple example. Suppose that we have a small dummy dataset with two features *x* and *y*:

The dataset has two dimensions, and we can see that x and y depend on each other (*x*=*y*). This dataset is plotted in Figure 1, and the features *x* and *y* are represented by the axes *x* and *y* respectively.

Now suppose that we want to reduce its dimension to one so that we can describe it with only one feature, but with minimal loss of information. The first thing that we can try is simply removing one of the features. However, we lose a lot of information about the spread of the data in this way. In this dataset, the distance between every two consecutive data points is the square root of 2. Suppose that we remove the second feature *y* and only keep *x*. Now the distance between every two consecutive data points along *x* is 1. So we are underestimating the spread of the data points, and this happens because *x* and *y* depend on each other (I will explain the mathematical reason for that later).

We can try another method. Let’s define a new coordinate system as shown in Figure 2. Here one of the axes (*u*) passes through the data points and the other one (*v*) is perpendicular to that. In fact, the new coordinate system (*u*-*v*) is the result of the rotation of the original coordinate system (*x*-*y*) by 45 degrees.

Now we need to calculate the coordinates of the data points in the *u-v* coordinate system. To do that we project each data point onto each new axis. So, we have

The features *x* and *y* are now replaced by the features *u* and *v*. The value of *v* is not a function of *u* since it does not change with *u*. We say that *u* and *v* are independent of each other since if we change one of them, the other one won’t be affected. However, we still need to decide which one should be eliminated. Here it is clear that we should keep *u*. By eliminating *v*, we minimize the loss of information. Since *u* and *v* are independent, we get the correct distance between the consecutive data points, and the only piece of information that is lost is the *v*-coordinate of these points.

Now we can summarize the steps that we took to perform the dimensionality reduction. First, we need to define a new coordinate system (by rotating the original one) in which the transformed features are independent. Then we need to sort these features based on the amount of information that they hold and eliminate one or more features that are ranked lower than the others. PCA roughly follows the same procedure. The only difference is that it finds a coordinate system in which the features are uncorrelated. The difference between the uncorrelated and independent features will be explained later, but to understand them, first, we need to review some statistical concepts.
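The steps above can be sketched numerically. This is a minimal illustration (not from the original article) that assumes four points on the line *y*=*x* with unit spacing in *x*, consistent with the distances discussed above; rotating the coordinate system by 45 degrees makes the second coordinate (*v*) vanish:

```python
import numpy as np

# Dummy dataset where y = x (Figure 1): each row is a data point (x, y).
data = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)

# Rotating the coordinate system by 45 degrees gives the new axes u and v.
theta = np.pi / 4
R = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

# Coordinates of the points in the u-v system.
uv = data @ R.T
print(uv)  # the v column is all zeros; consecutive u values differ by sqrt(2)
```

Dropping the *v* column now loses nothing about the spread: the spacing sqrt(2) between consecutive points is preserved along *u*.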

**Random variables**

A *random variable* is a variable whose numerical value depends on the outcome of a random phenomenon. So its value is initially unknown, but it becomes known once the outcome of the random phenomenon is realized. We commonly use uppercase letters to denote a random variable, and lowercase italic letters to denote the particular values that a random variable can assume.

For example, the two outcomes of tossing a coin are heads or tails. We can assign values to each outcome and denote getting heads by 0 and tails by 1. Now we can define the random variable X with possible values of X=0 or X=1. A particular value that X can take on is denoted by *x*. This is an example of a discrete random variable. A discrete random variable can only take certain values. On the other hand, a continuous random variable can take any values within an interval. For example, the outcome of measuring a person’s weight can be described by a continuous random variable. If X is a continuous random variable, the probability that X takes a value in the interval [a, b] can be written as

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

where *f*(*x*) is called the *probability density function* (PDF) of X. We know that X must take some value in (−∞, ∞), so the integral of a PDF over the entire space must be equal to 1:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

**Mean**

Let X be a continuous random variable that can take the value *x* with the PDF *f*(*x*). The *mean* or *expectation* of X, denoted by *E*(X), is a number defined as follows:

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

The most important property of the mean is linearity. If X₁, X₂, …, Xₙ are *n* random variables and *a*₁, *a*₂, …, *a*ₙ and *c* are constants, then we can show that

$$E(a_1X_1 + a_2X_2 + \dots + a_nX_n + c) = a_1E(X_1) + a_2E(X_2) + \dots + a_nE(X_n) + c$$

The proof is given in the appendix.
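As a quick numerical sanity check (not from the original article), we can approximate the expectations with sample means of simulated variables; the sample values and coefficients below are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two (dependent) random variables, approximated by many samples.
x1 = rng.normal(3.0, 1.0, 100_000)
x2 = x1 + rng.uniform(0.0, 2.0, 100_000)

a1, a2, c = 2.0, -0.5, 4.0
lhs = np.mean(a1 * x1 + a2 * x2 + c)           # E[a1*X1 + a2*X2 + c]
rhs = a1 * np.mean(x1) + a2 * np.mean(x2) + c  # a1*E[X1] + a2*E[X2] + c
print(lhs, rhs)  # the two estimates agree up to floating-point error
```

Note that linearity holds even though X₁ and X₂ are dependent here; no independence is needed for this property.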

**Variance**

*Variance* is a measure of the spread of the data around its mean. If X is a random variable, then the variance of X, denoted by *Var*(X), is defined as follows:

$$Var(X) = E\left[(X - E[X])^2\right]$$

If the random variable X takes a constant value, then its variance is zero, since X is always equal to its mean. On the other hand, the larger X deviates from its mean, the larger the variance of X is. The *standard deviation* of X, denoted by *σ*_X, is the nonnegative square root of *Var*(X) if the variance exists:

$$\sigma_X = \sqrt{Var(X)}$$

If *a* is a constant, then we can show that

$$Var(aX) = a^2 Var(X)$$

To prove it, we first use Eq. 3 to get

$$E[aX] = aE[X]$$

Now we can use the definition of variance:

$$Var(aX) = E\left[(aX - aE[X])^2\right] = a^2 E\left[(X - E[X])^2\right] = a^2 Var(X)$$
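This scaling property is easy to check numerically (an illustrative sketch, not from the original article; the sample distribution and the constant are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, 100_000)  # samples of a random variable X

a = 3.0
# Var(aX) = a^2 * Var(X); adding a constant does not change the variance.
print(np.var(a * x), a**2 * np.var(x))
print(np.var(x + 10.0), np.var(x))
```

The second line also shows that shifting the data by a constant leaves the variance unchanged, since the deviations from the mean are unaffected.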

**Independence**

Suppose that event Y has occurred and we wish to compute the probability of another event X, taking into account that we know that Y has occurred. The probability of event X is denoted by *P*(X). The *conditional probability* of X is the probability of X given that Y has occurred and is denoted by *P*(X|Y). Two events are independent if the occurrence of one does not affect the probability of occurrence of the other. So if X and Y are independent, then we can write

$$P(X|Y) = P(X), \quad P(Y|X) = P(Y)$$

Eqs. 7 and 8 can be shown in another form using Bayes’ rule. Bayes’ rule states that

$$P(X|Y) = \frac{P(X, Y)}{P(Y)}$$

where *P*(X, Y) is the *joint probability* of X and Y, or the probability of X and Y occurring together. So using Bayes’ rule, we can say that two events X and Y are independent if and only if

$$P(X, Y) = P(X)P(Y)$$

If X and Y are two independent random variables, then it can be shown that

$$E[XY] = E[X]E[Y]$$

**Covariance**

Let X and Y be two random variables. The *covariance* of X and Y, denoted by *Cov*(X, Y), is defined as

$$Cov(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$

The covariance of X and Y provides a measure of the association between X and Y. A positive covariance indicates that the deviations of X and Y from their respective means tend to have the same sign. So when one of them (X or Y) is high, the other one tends to be high, and when one of them is low the other one tends to be low. It also means that they tend to increase or decrease together or move in the same direction. On the other hand, a negative covariance indicates that the deviations of X and Y from their respective means tend to have opposite signs. When one of them is high, the other one tends to be low, and when one of them increases, the other one tends to decrease. So they tend to move in the opposite direction.

The covariance of a random variable with itself is equal to its variance:

$$Cov(X, X) = E\big[(X - E[X])^2\big] = Var(X)$$

We can also write the covariance in a different form. If X is a random variable and *c* is a constant, then based on Eq. 3 we have

$$E[cX] = cE[X]$$

Now we can use the fact that E[X] and E[Y] are constants and write

$$Cov(X, Y) = E\big[XY - XE[Y] - YE[X] + E[X]E[Y]\big] = E[XY] - E[X]E[Y]$$

Now if X and Y are independent, we can use Eq. 11 and write

$$Cov(X, Y) = E[XY] - E[X]E[Y] = E[X]E[Y] - E[X]E[Y] = 0$$

So two independent random variables have a zero covariance.
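We can see both facts numerically (an illustrative sketch, not from the original article; the independent standard-normal samples are an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200_000)
y = rng.normal(0.0, 1.0, 200_000)  # generated independently of x

# Sample estimate of Cov(X, Y) = E[XY] - E[X]E[Y]
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_xy)  # close to 0 for independent variables

# Cov(X, X) = Var(X)
print(np.mean(x * x) - np.mean(x)**2, np.var(x))
```

With a finite sample, the estimated covariance of independent variables is not exactly zero, but it shrinks toward zero as the sample grows.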

**Correlation**

The *correlation* between X and Y, denoted by *ρ*(X, Y), is defined as follows:

$$\rho(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

Based on the Cauchy-Schwarz inequality we have

$$|Cov(X, Y)| \le \sigma_X \sigma_Y$$

So we can write

$$-\sigma_X \sigma_Y \le Cov(X, Y) \le \sigma_X \sigma_Y$$

and by dividing the previous inequality by *σ*_X *σ*_Y we get

$$-1 \le \rho(X, Y) \le 1$$

Based on Eq. 16, the correlation and covariance have the same sign (since standard deviation is always positive). When *ρ*(X, Y)>0, X and Y are positively correlated and tend to move together, and when *ρ*(X, Y)<0, they are negatively correlated and tend to move in the opposite direction. If *ρ*(X, Y)=0, X and Y are said to be *uncorrelated*.

Besides, it can be shown that if there is a linear relationship between X and Y

$$Y = aX + b$$

where *a* and *b* are constants and *a*≠0, then we have

$$\rho(X, Y) = \frac{a\,Var(X)}{\sigma_X\,|a|\sigma_X} = \frac{a}{|a|}$$

So we can write |*ρ*(X, Y)|=1. If *a*>0, then the line has a positive slope and X and Y increase or decrease together. So we have a positive covariance and correlation and *ρ*(X, Y)=1. Similarly, if *a*<0, X and Y move in opposite directions, and *ρ*(X, Y)=−1. Finally, if we have two independent random variables, their covariance is zero (Eq. 15), so their correlation is zero and we say they are uncorrelated.

In summary, the value of *ρ*(X, Y) provides a measure of the extent to which two random variables X and Y are linearly related. If the values that X and Y take are relatively concentrated around a straight line with a positive slope, then *ρ*(X, Y) will be close to 1. On the other hand, if these values are relatively concentrated around a straight line with a negative slope, then *ρ*(X, Y) will be close to −1.

In NumPy we can use the function `corrcoef()` to calculate the correlation between two arrays. We can think of each array as a list of the observed values of a random variable. Suppose that we have two random variables X and Y. The observed values of X (denoted by *x*) are 19 evenly spaced numbers on the interval [1, 10]. So *x*={1, 1.5, 2, …, 10}. Besides, suppose that Y=X, so they have a linear relationship and *y*=*x*. Figure 3.a shows the plot of *y* vs *x*.

We can now easily calculate the correlation between X and Y using Listing 1:

```python
# Listing 1
import numpy as np

x = np.linspace(1, 10, 19)
y = x
np.corrcoef(x, y)
```

The output is:

```
array([[1., 1.],
       [1., 1.]])
```

The return value of `corrcoef(x, y)` is a 2-d array that can be thought of as a matrix defined as follows:

$$\begin{bmatrix} \rho(X, X) & \rho(X, Y) \\ \rho(Y, X) & \rho(Y, Y) \end{bmatrix}$$

Using Eqs. 5, 13, and 16 we can write

$$\rho(X, X) = \frac{Cov(X, X)}{\sigma_X \sigma_X} = \frac{Var(X)}{Var(X)} = 1$$

So the diagonal elements of this matrix are always equal to 1. From the definition of covariance, we can see that *Cov*(X, Y)=*Cov*(Y, X) and *ρ*(X, Y)=*ρ*(Y, X). Hence, the off-diagonal elements of this matrix are always equal and either of them gives the correlation between X and Y. So based on the output of Listing 1, the calculated correlation is equal to 1, which is the expected result. Now we add some noise to the *y* values:

```python
# Listing 2
x = np.linspace(1, 10, 19)
y = np.linspace(1, 10, 19) + np.random.normal(0, 2, 19)
```

Figure 3.b shows the plot of *y* vs *x*. We can calculate the correlation again:

```python
# Listing 3
np.corrcoef(x, y)
```

Output:

```
array([[1.   , 0.826],
       [0.826, 1.   ]])
```

As you see, the correlation is now ~0.8. It is less than 1 since the data points no longer all lie on a straight line. However, they are relatively concentrated around a straight line with a positive slope, so *ρ*(X, Y) is close to 1.

**Random vectors**

Dimensionality reduction only applies to multi-dimensional data, which can be represented by random vectors. A random vector is a vector of random variables. If we have *n* random variables X₁, X₂, …, Xₙ, we can place them in the random vector **X**:

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}$$

We use bold uppercase letters to denote a random vector. To show a possible value of a random vector (which is also a vector) we use bold italic lowercase letters such as ***x***. So a possible value of **X** is

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

In a random vector **X**, the probability that X₁ ∈ [a₁, b₁], X₂ ∈ [a₂, b₂], …, Xₙ ∈ [aₙ, bₙ] can be written as

$$P(a_1 \le X_1 \le b_1, \dots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1} \dots \int_{a_n}^{b_n} f(x_1, \dots, x_n)\,dx_1 \dots dx_n$$

where *f*(*x*₁, *x*₂, …, *x*ₙ) is the *joint probability density function* of **X**. The integral of a joint PDF over the entire space must be equal to 1. If we have the joint PDF of the random vector **X**, we can derive the distribution of one of its components Xᵢ using *marginalization*. The *marginal probability density function* of Xᵢ can be derived from the joint PDF of **X** as follows

$$f_{X_i}(x_i) = \int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} f(x_1, \dots, x_n)\,dx_1 \dots dx_{i-1}\,dx_{i+1} \dots dx_n$$

which means that we integrate the joint PDF with respect to all the variables except *x*ᵢ. It can be shown that the random variables X₁, X₂, …, Xₙ are mutually independent if and only if the PDF of **X** can be written as

$$f(x_1, x_2, \dots, x_n) = f_{X_1}(x_1) f_{X_2}(x_2) \dots f_{X_n}(x_n)$$

Let’s see an example. Suppose that **X** is a random vector with *n* elements, ***µ*** is a column vector with *n* elements, and ***Σ*** is an *n*×*n* matrix. We say that **X** has a *multivariate normal distribution* (MVN) if its joint PDF is defined as

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

where ***x*** is a possible value of **X**, and det(***Σ***) is the determinant of ***Σ***. ***Σ*** and ***µ*** are the parameters of the MVN distribution (later we will discuss this distribution in more detail). Now suppose that *n*=2, and we have a random vector

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

with an MVN distribution. We can generate some possible values of **X** using NumPy’s function `random.multivariate_normal()`. We just need to determine the parameters of the distribution. Suppose that the parameters are:

$$\boldsymbol{\mu} = \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} 6 & 4 \\ 4 & 6 \end{bmatrix}$$

In Listing 4, we generate and plot 150 random values of **X**. The result is plotted in Figure 4.

```python
# Listing 4
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
mu = [2, 2]
Sigma = [[6, 4],
         [4, 6]]
points = np.random.multivariate_normal(mu, Sigma, 150)
x1, x2 = points.T

plt.scatter(x1, x2)
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.axis('equal')
plt.xlim([-5, 10])
plt.ylim([-7, 11])
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$x_2$", fontsize=16)
plt.show()
```

Random vectors can describe the statistical nature of a dataset. A dataset is usually in the form of a table like Table 3. Each column of this table is called a *feature* of the dataset and we denote it by *x*ᵢ, and each row is called a *record* that contains one observation for all the features. So, the dataset of Table 3 has *n* features and *m* records, and the value *x*ᵢⱼ in this dataset is the observed value of feature *j* in the *i*-th record.

Now we can assume that the *i*-th feature is represented by the random variable Xᵢ, and as usual, *x*ᵢ denotes the possible values that Xᵢ can take in general. In this dataset, the specific values of *x*ᵢ are *x*_1*i*, *x*_2*i*, …, *x*_*mi*. If we place X₁, X₂, …, Xₙ in a random vector **X**, then **X** describes the whole dataset. Again, the possible values that **X** can take are denoted by ***x***, and each record of this dataset is a specific value of ***x***. If we could plot this dataset in an *n*-dimensional space, then each axis of this space represents one of the features and its corresponding random variable, so we can call the *i*-th axis *x*ᵢ, and each record of this dataset is an *n*-dimensional point in that space.

**Random matrix**

Similar to a random vector, we can define a random matrix. A random matrix is a matrix whose entries are random variables. An *m*×*n* random matrix **X** can be written as

$$\mathbf{X} = \begin{bmatrix} X_{11} & \dots & X_{1n} \\ \vdots & \ddots & \vdots \\ X_{m1} & \dots & X_{mn} \end{bmatrix}$$

where each element Xᵢⱼ is a random variable.

**Mean of a random vector**

The mean of a random vector **X** is defined as the vector of the means of its elements:

$$E[\mathbf{X}] = \begin{bmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_n] \end{bmatrix}$$

**Mean of a random matrix**

The mean of a random matrix **X** (Eq. 20) is defined as the matrix of the means of the elements of **X**:

$$E[\mathbf{X}] = \begin{bmatrix} E[X_{11}] & \dots & E[X_{1n}] \\ \vdots & \ddots & \vdots \\ E[X_{m1}] & \dots & E[X_{mn}] \end{bmatrix}$$

If **X** is a random matrix, and ***A***, ***B***, and ***C*** are constant matrices, then we have:

$$E[\mathbf{A}\mathbf{X}\mathbf{B}] = \mathbf{A}E[\mathbf{X}]\mathbf{B}$$

$$E[\mathbf{X} + \mathbf{C}] = E[\mathbf{X}] + \mathbf{C}$$

The proof is given in the appendix. Vectors can be considered as matrices in which the number of columns is 1 (for row vectors, the number of rows is 1). So Eqs. 22 and 23 are also valid for vectors. If **X** is a random vector and ***a*** is a constant vector, then based on Eq. 22 we have

$$E[\mathbf{a}^T\mathbf{X}] = \mathbf{a}^TE[\mathbf{X}]$$

where ***a****ᵀ* is the transpose of ***a***. The transpose of an *m*×*n* matrix ***A*** is an *n*×*m* matrix whose columns are formed from the corresponding rows of ***A***. In fact, the (*i*, *j*) element of the transposed matrix is equal to the (*j*, *i*) element of the original matrix. So the transpose of a row vector is a column vector with the same elements, and vice versa.

**Variance of a random vector**

Defining the variance of a multi-dimensional dataset is a little more difficult than defining its mean. The variance defined in Eq. 4 is based on a scalar random variable X. So, to use it for a random vector, we first need to convert this random vector to a scalar random variable, and we can use the scalar projection for this purpose.

Suppose that we want to calculate the variance of the random vector **X** along the vector ***u***. The scalar projection of a random vector **X** onto another vector ***u*** is equal to

$$\frac{\mathbf{u} \cdot \mathbf{X}}{||\mathbf{u}||}$$

where ||***u***|| is the length of ***u*** and is defined as

$$||\mathbf{u}|| = \sqrt{u_1^2 + u_2^2 + \dots + u_n^2}$$

Since we only care about the direction of ***u***, we can assume that it is a unit vector, and its length is equal to 1:

$$||\mathbf{u}|| = 1$$

So ***u***·**X** = ***u****ᵀ***X** gives the scalar projection of **X** onto ***u***. Now we can use Eq. 4 to calculate the variance of **X** along ***u***, which is equal to the variance of the scalar random variable ***u****ᵀ***X**:

$$Var(\mathbf{u}^T\mathbf{X}) = E\left[\left(\mathbf{u}^T\mathbf{X} - E[\mathbf{u}^T\mathbf{X}]\right)^2\right]$$

Now if we have a dataset represented by **X**, each possible value that **X** can take is a vector ***x*** and is represented by a record in our dataset. If ***x*** has *n* elements, we can also think of ***x*** as a point in an *n*-dimensional space (Figure 5). In fact, in an *n*-dimensional space, a point can always be represented by a vector. We first project all these vectors (or points) onto ***u***. So for each vector ***x***, we have the scalar value ***u****ᵀ****x***, and we can simply use Eq. 4 to calculate the variance of these scalar values (Figure 5).

The term in the square brackets can be simplified further. We know that

$$\mathbf{u}^T(\mathbf{X} - E[\mathbf{X}])$$

is a scalar (since the result of a dot product is a scalar quantity). Besides, we know that the transpose of a scalar quantity is equal to itself. So we can write

$$\mathbf{u}^T(\mathbf{X} - E[\mathbf{X}]) = \left(\mathbf{u}^T(\mathbf{X} - E[\mathbf{X}])\right)^T$$

Finally, we know that for any vectors ***u*** and ***v***

$$(\mathbf{u}^T\mathbf{v})^T = \mathbf{v}^T\mathbf{u}$$

So we can write

$$Var(\mathbf{u}^T\mathbf{X}) = E\left[\mathbf{u}^T(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\mathbf{u}\right]$$

Since ***u*** and ***u****ᵀ* are constant vectors, we can use Eqs. 22 and 23 to get

$$Var(\mathbf{u}^T\mathbf{X}) = \mathbf{u}^T E\left[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\right]\mathbf{u}$$

where

$$\mathbf{S} = E\left[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\right]$$

is called the *covariance matrix*. So we can write the variance of **X** along ***u*** in terms of the covariance matrix:

$$Var(\mathbf{u}^T\mathbf{X}) = \mathbf{u}^T\mathbf{S}\mathbf{u}$$

Let’s take a more detailed look at the covariance matrix. If the random vector **X** has *n* elements (Eq. 18), *E*[**X**] will have the same number of elements (Eq. 21). So **X**−*E*[**X**] is a column vector with *n* elements, and (**X**−*E*[**X**])*ᵀ* is a row vector with *n* elements, and by multiplying them, we get an *n*×*n* matrix (so **S** is a square matrix). If we do the multiplication we have:

$$\mathbf{S} = \begin{bmatrix} E[(X_1 - E[X_1])^2] & \dots & E[(X_1 - E[X_1])(X_n - E[X_n])] \\ \vdots & \ddots & \vdots \\ E[(X_n - E[X_n])(X_1 - E[X_1])] & \dots & E[(X_n - E[X_n])^2] \end{bmatrix}$$

Now using the definitions of variance and covariance, we can write the previous equation as:

$$\mathbf{S} = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) & \dots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \dots & Cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \dots & Var(X_n) \end{bmatrix}$$

So the (*i*, *j*) element of the covariance matrix is the covariance of the variables Xᵢ and Xⱼ, and the *i*-th diagonal element gives the variance of Xᵢ.

A *symmetric matrix* is a matrix that is equal to its transpose. So its (*i*, *j*) element is equal to its (*j*, *i*) element. We know that *Cov*(Xᵢ, Xⱼ)=*Cov*(Xⱼ, Xᵢ), so the covariance matrix is symmetric and is equal to its transpose:

$$\mathbf{S}^T = \mathbf{S}$$
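We can verify both properties numerically. This sketch (not from the original article) estimates the covariance matrix from its definition, using a large sample drawn from the same distribution as Listing 4; for a large sample, the estimate should be close to the Σ we sampled from:

```python
import numpy as np

np.random.seed(0)
Sigma = [[6, 4], [4, 6]]
points = np.random.multivariate_normal([2, 2], Sigma, 100_000)

# S = E[(X - E[X])(X - E[X])^T], estimated from the samples
centered = points - points.mean(axis=0)
S = centered.T @ centered / len(points)

print(S)                    # close to Sigma for a large sample
print(np.allclose(S, S.T))  # True: the covariance matrix is symmetric
```

In practice the same estimate is available directly from `np.cov(points.T)` (which by default divides by n−1 rather than n).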

Now let’s take a second look at the dataset plotted in Figure 4. We can calculate its mean using Eq. 21:

`# Listing 5`

mean = np.array([x1.mean(), x2.mean()])

mean

Output

`array([1.98407831, 2.15752234])`

We can also calculate its variance along the two arbitrary lines *x*₁=*x*₂ and *x*₁=−*x*₂. We first need the unit vectors along these lines, which are

$$\mathbf{u}_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \quad \mathbf{u}_2 = \begin{bmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}$$

respectively. We can write:

```python
# Listing 6
u1 = np.array([[1/np.sqrt(2)], [1/np.sqrt(2)]])
u2 = np.array([[-1/np.sqrt(2)], [1/np.sqrt(2)]])

projected_points1 = np.dot(points, u1).T
var_X_u1 = np.var(projected_points1[0])

projected_points2 = np.dot(points, u2).T
var_X_u2 = np.var(projected_points2[0])

print("Variance of X along u1", round(var_X_u1, 3))
print("Variance of X along u2", round(var_X_u2, 3))
```

Output:

```
Variance of X along u1 10.169
Variance of X along u2 1.958
```

Here we used `np.dot()` to calculate the dot product of the data points with `u1` and `u2`, and `np.var()` to calculate the variance. The result shows that the spread of the data along *x*₁=*x*₂ is much larger than the spread along *x*₁=−*x*₂ (Figure 6).

Before we calculate the covariance of this random vector, we need to introduce the basis of a vector space.

**Basis**

A set of vectors {***v***₁, ***v***₂, …, ***v***ₙ} forms a *basis* for the vector space *V* if they are linearly independent and span *V*. When a set of vectors is linearly independent, it means that no vector in the set can be written as a linear combination of the other vectors. Mathematically, a set of vectors {***v***₁, ***v***₂, …, ***v***ₙ} is linearly independent if the vector equation

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \dots + c_n\mathbf{v}_n = \mathbf{0}$$

has only the trivial solution

$$c_1 = c_2 = \dots = c_n = 0$$

A set of vectors spans a vector space if every other vector in the space can be written as a linear combination of this set. So every value of the random vector **X** in *V* can be written as:

$$\mathbf{x} = a_1\mathbf{v}_1 + a_2\mathbf{v}_2 + \dots + a_n\mathbf{v}_n$$

where *a*₁, *a*₂, …, *a*ₙ are some constants. The vector space *V* can have many different bases, but each basis always has the same number of vectors. The number of vectors in a basis of a vector space is called the *dimension* of that vector space. For now, you can assume that a vector with *n* elements belongs to an *n*-dimensional vector space (with *n* basis vectors). But as we will show later, that is not always the case.
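Linear independence can be checked numerically (an illustration, not from the original article): a set of vectors is linearly independent exactly when the matrix having them as columns has full column rank. The specific vectors below are arbitrary assumptions:

```python
import numpy as np

# Columns are the candidate basis vectors of R^2.
v1, v2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
A = np.column_stack([v1, v2])
print(np.linalg.matrix_rank(A) == A.shape[1])  # True: independent, a basis of R^2

# v3 = 2*v1 is a linear combination of v1, so {v1, v3} is dependent.
B = np.column_stack([v1, 2 * v1])
print(np.linalg.matrix_rank(B) == B.shape[1])  # False: dependent, not a basis
```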

**Orthonormal basis**

Suppose that the random vector **X** belongs to an *n*-dimensional vector space *V*. A basis {***v***₁, ***v***₂, …, ***v***ₙ} is *orthonormal* when all the vectors are normalized (the length of a normalized vector is 1) and orthogonal (mutually perpendicular). When two vectors ***v***ᵢ and ***v***ⱼ are orthogonal, their inner product is zero:

$$\mathbf{v}_i \cdot \mathbf{v}_j = \mathbf{v}_i^T\mathbf{v}_j = 0$$

The length of ***v***ᵢ is

$$||\mathbf{v}_i|| = \sqrt{\mathbf{v}_i^T\mathbf{v}_i}$$

So for the orthonormal basis {***v***₁, ***v***₂, …, ***v***ₙ} we can write:

$$\mathbf{v}_i^T\mathbf{v}_j = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}$$

When the basis {***v***₁, ***v***₂, …, ***v***ₙ} is orthonormal, we can easily calculate the coefficients of Eq. 34. To get *a*ᵢ, we can write

$$\mathbf{v}_i^T\mathbf{x} = \mathbf{v}_i^T(a_1\mathbf{v}_1 + a_2\mathbf{v}_2 + \dots + a_n\mathbf{v}_n)$$

Now using Eq. 36, we have

$$a_i = \mathbf{v}_i^T\mathbf{x}$$

So *a*ᵢ is equal to the dot product of ***x*** and ***v***ᵢ, or the scalar projection of ***x*** onto ***v***ᵢ. The main reason to find a basis for a vector space is to have a coordinate system for it. If the set of vectors *B*={***v***₁, ***v***₂, …, ***v***ₙ} forms a basis for a vector space *V*, then every vector ***x*** in that space can be uniquely specified using those basis vectors:

$$\mathbf{x} = a_1\mathbf{v}_1 + a_2\mathbf{v}_2 + \dots + a_n\mathbf{v}_n$$

and the coordinate of ***x*** relative to this basis *B* is:

$$[\mathbf{x}]_B = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}$$

The vectors:

$$\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \dots, \quad \mathbf{e}_n = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

form the standard basis for *V* (in ***e***ᵢ the *i*-th element is one and all the other elements are zero). These vectors form an orthonormal basis since we have

$$\mathbf{e}_i^T\mathbf{e}_j = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}$$

**Covariance of random vectors**

Suppose that we have the random vector

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}$$

We know that the covariance matrix for **X** is given in Eq. 32. The covariance in Eq. 12 is defined based on two scalar random variables. So, if we have the random vector **X**, we need to convert it into two scalar random variables to be able to use Eq. 12 for it. We can use the scalar projection, but we need to project **X** onto two different vectors. The scalar projection of the whole dataset (which is represented by a random vector) onto each of these vectors results in a scalar random variable and using Eq.12 we can calculate the covariance of these two scalar random variables.

Suppose that we want to calculate the covariance of the random vector **X** using the unit vectors ***u***ᵢ and ***u***ⱼ (we only care about the direction of these vectors, so we assume they are both normalized). The scalar projections of **X** onto ***u***ᵢ and ***u***ⱼ will be ***u***ᵢ*ᵀ***X** and ***u***ⱼ*ᵀ***X** respectively. Now we can use Eq. 12 to calculate their covariance:

$$Cov(\mathbf{u}_i^T\mathbf{X}, \mathbf{u}_j^T\mathbf{X}) = E\left[\mathbf{u}_i^T(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\mathbf{u}_j\right] = \mathbf{u}_i^T\mathbf{S}\mathbf{u}_j$$

where we used the fact that the scalar quantity ***u***ⱼ*ᵀ*(**X**−*E*[**X**]) is equal to its transpose. So based on Eq. 39, ***u***ᵢ*ᵀ***S*****u***ⱼ gives the covariance of the scalar projection of **X** onto ***u***ᵢ and the scalar projection of **X** onto ***u***ⱼ (Figure 7).

*u*ⱼWe can easily show that:

To prove it, we use the fact that ** uᵢᵀ Suⱼ** is a scalar quantity, so we can write:
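A numerical check of this identity (a sketch, not from the original article, reusing the distribution of Listing 4 and the unit vectors of Listing 6): uᵢᵀ**S**uⱼ should equal the sample covariance of the two scalar projections:

```python
import numpy as np

np.random.seed(0)
points = np.random.multivariate_normal([2, 2], [[6, 4], [4, 6]], 150)

u1 = np.array([1, 1]) / np.sqrt(2)
u2 = np.array([-1, 1]) / np.sqrt(2)

# Covariance matrix S estimated from the samples (dividing by n)
centered = points - points.mean(axis=0)
S = centered.T @ centered / len(points)

# u_i^T S u_j vs. the covariance of the two scalar projections
lhs = u1 @ S @ u2
p1, p2 = points @ u1, points @ u2
rhs = np.mean((p1 - p1.mean()) * (p2 - p2.mean()))
print(lhs, rhs)                               # equal up to floating-point error
print(np.isclose(u1 @ S @ u2, u2 @ S @ u1))   # symmetry: True
```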

Now suppose that we have two scalar random variables X₁ and X₂. Their covariance is given by Eq. 12. Though these random variables are scalars, together they form a random vector

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

and the possible values of this random vector are

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

We can plot these values in a 2-d space. Then the horizontal and vertical axes represent *x*₁ and *x*₂ respectively. The vector **X** can be written in terms of the standard basis

$$\mathbf{X} = X_1\mathbf{e}_1 + X_2\mathbf{e}_2$$

where

$$\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \quad \text{and} \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

So X₁ and X₂ are the scalar projections of **X** onto ***e***₁ and ***e***₂ respectively, and we can write

$$Cov(X_1, X_2) = Cov(\mathbf{e}_1^T\mathbf{X}, \mathbf{e}_2^T\mathbf{X})$$

As a result, if we combine the scalar random variables X₁ and X₂ into the random vector **X**, their covariance is the covariance of the scalar projections of **X** onto ***e***₁ and ***e***₂. This result is shown in Figure 8. Recall that the covariance in Eq. 12 was initially defined for two scalar random variables. However, we can always combine them in a random vector and get the same result by calculating the covariance of that random vector along the vectors of the standard basis.

*e*The same result for an *n*-dimensional dataset can be obtained using Eq. 39. Let’s see what happens if we use two different vectors of the standard basis in Eq. 39. Using the defintion of covariance matrix (Eq. 32) we can write:

So ** eᵢᵀSeⱼ** gives the covariance of the random variables X

**ᵢ**and X

**ⱼ**or simply the (

*i*,

*j*) element of the covariance matrix. That is because each X

**ᵢ**is the scalar projection of

**X**onto

**. Now using Eqs. 39 and 40, the covariance of X**

*e*ᵢ**ᵢ**and X

**ⱼ**can be written as

So each covariance element in the covariance matrix is the covariance of X along two different vectors taken from the standard basis. Besides, if *i=j* in Eq. 40, then we get

**Change of basis**

Suppose that we have a dataset with 2 features. This dataset can be represented with the random vector

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

where each Xᵢ is a random variable that represents a feature of the dataset. As mentioned before, each value of **X** in this space can be written using the standard basis

$$\mathbf{x} = x_1\mathbf{e}_1 + x_2\mathbf{e}_2$$

So each *x*ᵢ is represented by an axis of the coordinate system defined by the standard basis, and ***e***ᵢ*ᵀ***S*****e***ⱼ gives the covariance of the random variables Xᵢ and Xⱼ. However, we are not limited to the standard basis. Indeed, we can choose a different basis {***u***₁, ***u***₂} with the constraint that these vectors are orthonormal. It is important to note that orthogonality is not a necessary condition for a basis, and we can have a basis in which the vectors are linearly independent but not orthogonal. However, in PCA, we only focus on orthogonal bases.

By choosing the basis {***u***₁, ***u***₂} we define a new coordinate system for **X**, and we can call the axes of this coordinate system *x*_*B*1 and *x*_*B*2. As Figure 9 shows, this new coordinate system is the result of the rotation of the original coordinate system. In the new coordinate system, ***u***₁ and ***u***₂ play the role of the standard basis and can be called ***e***₁ and ***e***₂. The random variables that represent the axes *x*_*B*1 and *x*_*B*2 are called X_*B*1 and X_*B*2 and can be placed in the random vector **X**_*B*.

By changing the coordinate system, the coordinate of each point in Figure 9, which can be generally written as ***x***, changes to ***x***_*B*. The features (*x*ᵢ) change to (*x*_*Bi*), their corresponding random variables change from Xᵢ to X_*Bi*, and the random vector that represents the whole dataset changes from **X** to **X**_*B* (Figure 9). The covariance matrix of the dataset also changes from ***S*** to ***S***_*B*. If we have a basis *B*={***u***₁, ***u***₂, …, ***u***ₙ}, the matrix

$$\mathbf{P}_B = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \dots & \mathbf{u}_n \end{bmatrix}$$

is called the *change-of-coordinate* matrix. The columns of this matrix are the vectors in basis *B*. Now if we know the coordinate of the vector ***x*** in Rⁿ (which is simply ***x*** itself), the equation

$$[\mathbf{x}]_B = \mathbf{P}_B^{-1}\mathbf{x}$$

gives its coordinate relative to basis *B*. We can also write

$$\mathbf{x} = \mathbf{P}_B[\mathbf{x}]_B$$

Let’s assume that in the new coordinate system, the axes are called *x*_*Bi*. The coordinate of a vector in the new coordinate system is written as

$$\mathbf{x}_B = [\mathbf{x}]_B$$

Now let’s focus on the matrix ***P***_*B*. A square matrix whose columns form an orthonormal set is called an *orthogonal matrix*, and the inverse of an orthogonal matrix is equal to its transpose. So if the basis *B* is orthonormal, then ***P***_*B* is an orthogonal matrix and

$$\mathbf{P}_B^{-1} = \mathbf{P}_B^T, \quad [\mathbf{x}]_B = \mathbf{P}_B^T\mathbf{x}$$

It is important to note that the covariance of a random vector along two arbitrary vectors doesn’t change by changing the coordinate system. Suppose that we have a vector ***u***ᵢ in the space of **X**, and after changing the coordinate system it is converted to ***u***_*iB* in the new dataset represented by **X**_*B*. Now we can write

$$\mathbf{u}_{iB}^T\mathbf{X}_B = \mathbf{u}_i^T\mathbf{X}$$

which means that the scalar projection of one vector onto another vector doesn’t change by changing the coordinate system (since it is just a distance). So using this equation we get

$$Cov(\mathbf{u}_{iB}^T\mathbf{X}_B, \mathbf{u}_{jB}^T\mathbf{X}_B) = Cov(\mathbf{u}_i^T\mathbf{X}, \mathbf{u}_j^T\mathbf{X})$$

For example, in Figure 9, changing the coordinate system transforms ***u***₁ and ***u***₂ to ***e***₁ and ***e***₂ respectively. However, we have

$$Cov(\mathbf{e}_1^T\mathbf{X}_B, \mathbf{e}_2^T\mathbf{X}_B) = Cov(\mathbf{u}_1^T\mathbf{X}, \mathbf{u}_2^T\mathbf{X})$$
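This invariance can be sketched numerically (an illustration, not from the original article, reusing the distribution of Listing 4 and the unit vectors of Listing 6): after rotating the data into the basis B, measuring the covariance along the rotated vectors (which become the standard basis) gives the same number:

```python
import numpy as np

np.random.seed(0)
points = np.random.multivariate_normal([2, 2], [[6, 4], [4, 6]], 150)

u1 = np.array([1, 1]) / np.sqrt(2)
u2 = np.array([-1, 1]) / np.sqrt(2)
P_B = np.column_stack([u1, u2])   # change-of-coordinate matrix

# Coordinates of every point relative to basis B: [x]_B = P_B^T x
points_B = points @ P_B           # row-wise application of P_B^T

def cov_along(data, a, b):
    """Covariance of the scalar projections of the data onto a and b."""
    pa, pb = data @ a, data @ b
    return np.mean((pa - pa.mean()) * (pb - pb.mean()))

# In the new system, u1 and u2 become the standard basis e1 and e2.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(cov_along(points, u1, u2), cov_along(points_B, e1, e2))  # equal
```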

*e***A closer look at the covariance matrix**

We can think of a matrix ***A*** as a transformation that acts on a vector ***x*** by multiplication to produce a new vector ***Ax***. By looking at Eq. 39

$$Cov(\mathbf{u}_i^T\mathbf{X}, \mathbf{u}_j^T\mathbf{X}) = \mathbf{u}_i^T(\mathbf{S}\mathbf{u}_j)$$

we see that the covariance is the inner product of ***u***ᵢ and the vector ***S*****u***ⱼ. This vector is the result of the transformation of ***u***ⱼ by the covariance matrix. Let’s see what the mathematical meaning of this transformation is and what the vector ***S*****u***ⱼ represents. We start with the vector space *R*² as an example. Assume that we have a circle that contains all the vectors that are one unit away from the origin. These vectors have the general form of

$$\mathbf{u} = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}$$

and the set of all such vectors is called *D*:

$$D = \{\mathbf{u} : ||\mathbf{u}|| = 1\}$$

Now we can calculate ***S*****u** for all the vectors in this circle (the vectors in *D*). The set of all these vectors is called *T*:

$$T = \{\mathbf{S}\mathbf{u} : \mathbf{u} \in D\}$$

These are the vectors in *D* which have been transformed by the covariance matrix ***S***. The transformation matrix can change the direction and magnitude of the vectors in *D*. So generally, the original circle turns into an ellipse. Listing 7 shows an example of this transformation, and the result is shown in Figure 10.

*u*A sample vector ** u**₁ in

*D*is transformed to

**₁ and we have**

*t*The vector ** t**₁ can have a different direction and magnitude than that of

**₁**

*u***(Figure 10).**

**₁ can be written in terms of the standard basis, however, we are free to choose a different basis. Here we choose an orthonormal basis in which one of the basis vectors is along**

*t***₁. The other one is perpendicular to it and we call it**

*u***₂. So, the new basis vectors are**

*u***₁ and**

*u***₂. Now we can write**

*u***₁ using these basis vectors**

*t**c*₁ and *c*₂ are the scalar projections of ** t**₁ onto

**₁ and**

*u***₂, and to determine**

*u**them, we can write*

where we used the fact that the basis vectors form an orthonormal set, so they are normalized and perpendicular to each other. Now we can write

So we have

If we assume that ***S*** transforms ***u***₂ into a vector named ***t***₂

$$\mathbf{t}_2 = \mathbf{S}\mathbf{u}_2$$

then we can similarly show that

$$\mathbf{S}\mathbf{u}_2 = Cov(\mathbf{u}_1^T\mathbf{X}, \mathbf{u}_2^T\mathbf{X})\,\mathbf{u}_1 + Var(\mathbf{u}_2^T\mathbf{X})\,\mathbf{u}_2$$

So the vector ***S*****u***₁, which is the result of the transformation of ***u***₁ by the covariance matrix, can be written as the sum of two vectors. One is along ***u***₁ and its magnitude is equal to the variance of **X** along ***u***₁, and the other is along ***u***₂ (which is perpendicular to ***u***₁), and its magnitude is equal to the covariance of **X** along ***u***₁ and ***u***₂ (Figure 11).

This result can be extended to a dataset with *n* features, which has an *n×n* covariance matrix. Now let *D* be the set of all the unit vectors in an *n*-dimensional vector space. Here we choose an orthonormal basis {**u**₁, **u**₂, …, **u**ₙ}. We also assume that

Now for each **t**ᵢ we can write

And we have

So we can write

So by multiplying the covariance matrix with a unit vector **u**ᵢ, we get a new vector. The components of this vector relative to an orthonormal basis that includes **u**ᵢ are the variance of **X** along **u**ᵢ and the covariance of **X** along **u**ᵢ and each of the other basis vectors.

**Principal component analysis (PCA)**

Now we are ready to discuss PCA. Let **X** be a random vector with *n* elements that represents our dataset. Each element of **X** is a random variable that describes a feature of the dataset. We want to find an orthonormal basis such that the correlation of the dataset along any pair of vectors in this basis is zero. The vectors of this basis are called the *principal components* of the dataset. If we think of this new basis as a new coordinate system for the dataset, then each basis vector defines a new axis and represents a feature of the dataset in this coordinate system. So the goal of PCA is to find an orthonormal basis relative to which the features are uncorrelated (which means that the random variables that represent these features are uncorrelated).

Suppose that we have an orthonormal basis {**u**₁, **u**₂, …, **u**ₙ}. Remember that two random variables are uncorrelated when their covariance is zero. So the covariance of **X** along **u**ᵢ and **u**ⱼ for all different values of *i* and *j* should be zero

If we substitute Eq. 49 into Eq. 48, we get

for all the vectors **u**₁ to **u**ₙ. We know that the variance of **X** along **u**ᵢ is a scalar. If we assume it is equal to *λ*ᵢ, we can write the previous equation as

So when the covariance matrix transforms the vector **u**ᵢ, it does not change its direction and can only change its magnitude. This is not true for all the vectors in {**u**}. In fact, in each matrix like **S**, only some of the vectors have this property. These special vectors are called the *eigenvectors* of **S**, and their corresponding scalar quantity *λ*ᵢ is called an *eigenvalue* of **S** for that eigenvector. Generally, an eigenvector of an *n*×*n* matrix **A** is defined as a nonzero vector **v**ᵢ such that

where *λ*ᵢ is an eigenvalue (eigenvalues and eigenvectors are only defined for square matrices). If we assume that the eigenvectors in Eq. 51 are normalized, then each eigenvalue of the covariance matrix is equal to the variance of **X** along its corresponding eigenvector

The function `numpy.linalg.eig()` calculates the eigenvalues and eigenvectors of a square array in NumPy. For example, to calculate the eigenvalues and eigenvectors of the covariance matrix

we can write

```python
# Listing 8
S = np.array([[3, 1],
              [1, 2]])
lam, v = LA.eig(S)
lam, v
```

Output

```
(array([3.61803399, 1.38196601]),
 array([[ 0.85065081, -0.52573111],
        [ 0.52573111,  0.85065081]]))
```

It returns a 1-d array (`lam`) which contains the eigenvalues, and a 2-d array (`v`) that contains the normalized eigenvectors, such that the column `v[:, i]` is the eigenvector corresponding to the eigenvalue `lam[i]`. So for this matrix, we have

Listing 9 plots the transformed vectors of the covariance matrix and its eigenvectors. The result is shown in Figure 12.
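We can also confirm numerically that each returned column satisfies the definition of an eigenvector (Eq. 51). This quick check is a sketch, not one of the article's listings:

```python
import numpy as np
from numpy import linalg as LA

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])
lam, v = LA.eig(S)

# S v_i should equal lambda_i v_i for every eigenpair
for i in range(len(lam)):
    print(np.allclose(S @ v[:, i], lam[i] * v[:, i]))  # True for both columns
```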

It can be shown that an *n*×*n* real symmetric matrix has *n* linearly independent and orthogonal eigenvectors, and it also has *n* real eigenvalues corresponding to those eigenvectors (spectral theorem). So if we normalize the eigenvectors, they form an orthonormal set. It can also be shown that in an *n*-dimensional vector space, a set of *n* linearly independent vectors forms a basis for that space. So the set of the *n* eigenvectors of the *n*×*n* covariance matrix **S** (which is symmetric) is an orthonormal basis for an *n*-dimensional vector space. This is in agreement with our initial assumption that the basis for PCA should be orthonormal.
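A quick numerical check of this orthonormality (a sketch, reusing the symmetric matrix from Listing 8):

```python
import numpy as np
from numpy import linalg as LA

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # a symmetric (covariance-like) matrix
lam, v = LA.eig(S)

# for a symmetric matrix, the normalized eigenvectors (columns of v)
# form an orthonormal set, so v.T @ v is the identity matrix
print(np.allclose(v.T @ v, np.eye(2)))  # True
```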

So far, we have written **Su**₁ relative to an orthonormal basis in which one of the basis vectors is **u**₁ (the others, **u**₂ to **u**ₙ, are perpendicular to **u**₁ and to each other). Now we can use the set of the eigenvectors as a new basis. Since the covariance matrix is *n×n*, the original vector **u**₁ belongs to an *n*-dimensional vector space, and we can use the set of the eigenvectors of the covariance matrix as a basis for that space. Let's call these eigenvectors {**v**₁, **v**₂, …, **v**ₙ}. In fact, {**v**₁, **v**₂, …, **v**ₙ} is a subset of *D* (the circle of unit vectors), and we only use different letters for them to distinguish these eigenvectors from the other vectors in *D*. Now we can write

where each *c*ᵢ is the scalar projection of **u**₁ onto **v**ᵢ (Figure 13).

If we assume that **Su**₁ belongs to an *n*-dimensional vector space (as we will see later, this is not always the case), then we can write

If we substitute Eq. 53 into Eq. 54, we get

So **Su**₁ can be written in terms of the eigenvectors (Figure 14).

Now the variance of **X** along **u**₁ can be written as (Eq. 31)

Since the set of eigenvectors is orthonormal, we can use Eq. 36 to simplify the previous equation

We know that **u**₁ is a normalized vector, and its length is 1. Therefore, using Eq. 53 we can write

So, we have

As a result, in Eq. 57 the variance of **X** along **u**₁ is a weighted average of the variance of **X** along all the eigenvectors. As **u**₁ gets closer to one of the eigenvectors **v**ᵢ, *c*ᵢ increases (refer to Figure 13), and the weight for that eigenvector increases too. Suppose that we have sorted the eigenvectors based on their corresponding eigenvalues, so that *λ*₁ ≥ *λ*₂ ≥ … ≥ *λ*ₙ and

*λ*₁

is the greatest eigenvalue. Now in Eq. 57, the term on the right-hand side is maximized when *λ*₁ has the biggest weight in the average. So the variance of **X** along **u**₁ is maximized when *c*₁² = 1 and all the other *c*ᵢ are zero. This means that, based on Eq. 53, **u**₁ is along the first eigenvector **v**₁. Hence the greatest variance of **X** is along **v**₁ and is equal to its corresponding eigenvalue *λ*₁.

Now assume that we only focus on the vectors that are perpendicular to the greatest eigenvector (**v**₁). So for these vectors, there is no component along **v**₁, and in Eq. 53 we have *c*₁ = 0.

Now in Eq. 57, *λ*₁ is absent, and the term on the right-hand side is maximized when *λ*₂ has the biggest weight in the average. This means that the variance of **X** along **u**₁ is maximized when *c*₂² = 1 and all the other *c*ᵢ are zero. Hence the maximum of the variance of **X** over all the vectors in *D* which are perpendicular to **v**₁ is along **v**₂ and is equal to its corresponding eigenvalue *λ*₂.

Similarly, we can show that the maximum of the variance of **X** over all the vectors in *D* which are perpendicular to **v**₁, **v**₂, …, **v**ᵢ₋₁ is along **v**ᵢ and is equal to its corresponding eigenvalue *λ*ᵢ.
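We can check this claim numerically (a sketch using the covariance matrix of Listing 8, and using the fact from the equations above that the variance of **X** along a unit vector **u** is **u**ᵀ**Su**):

```python
import numpy as np
from numpy import linalg as LA

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])
lam, v = LA.eig(S)
lam_max = lam.max()
v1 = v[:, lam.argmax()]  # eigenvector with the greatest eigenvalue

# sample many random unit vectors u and compute the variance u^T S u along each
rng = np.random.default_rng(0)
u = rng.standard_normal((2, 1000))
u /= np.linalg.norm(u, axis=0)            # normalize every column
variances = np.einsum('ij,ik,kj->j', u, S, u)

# no unit vector gives a variance above the greatest eigenvalue,
# and the maximum is attained at the first eigenvector
print(variances.max() <= lam_max + 1e-9)  # True
print(np.isclose(v1 @ S @ v1, lam_max))   # True
```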

We can also calculate the length of the transformed vector **t**₁ using Eqs. 35 and 60

By following the same reasoning used for Eq. 57, we can show that the maximum of ||**Su**₁|| over all vectors in *D* which are perpendicular to **v**₁, **v**₂, …, **v**ᵢ₋₁ occurs at **u**₁ = **v**ᵢ and is equal to the corresponding eigenvalue of **v**ᵢ.

The ellipse shown in Figure 12 can be thought of as a circle stretched or shrunk along its principal axes. In fact, when we multiply the covariance matrix by the vectors of the unit circle, it transforms the unit circle by stretching or shrinking it along the principal components (Figure 15), and the principal axes of the ellipse (the major and minor axes) will be along these principal components.

The amount of stretching or shrinking along each principal component is proportional to its corresponding eigenvalue, or the variance of the dataset along that principal component (Eq. 52). If the covariance matrix is the identity matrix

then we have **Su**₁ = **u**₁. So the covariance matrix doesn't change the shape of the unit circle at all. If it is a scalar multiple of the identity matrix

then we get **Su**₁ = *c***u**₁ for any vector **u**₁ in the unit circle. So the result is a bigger or smaller circle, but not an ellipse. Now using Eq. 46 we can also write

where **u**₂ is perpendicular to **u**₁. We can get this result for any vector **u**₁ in the unit circle. This means that if we choose two arbitrary orthogonal vectors, the covariance along them will be zero. Hence a covariance matrix which is a scalar multiple of the identity matrix represents a dataset in which the covariance along any pair of orthogonal vectors is zero. On the other hand, when the covariance matrix transforms the unit circle into an ellipse, it means that we have covariance along some vectors in the dataset.

Suppose that we have a covariance matrix in which all the covariance elements (the non-diagonal elements) are zero, and the diagonal elements are different

This matrix still results in an ellipse. Recall that each covariance element in the covariance matrix is the covariance of **X** along two different vectors taken from the standard basis. Hence here the covariance of **X** along **e**₁ and **e**₂ is zero. Indeed, **e**₁ and **e**₂ are the principal components, and the principal axes of the ellipse are along them. However, it is important to note that the covariance of **X** along any other pair of orthogonal vectors is not zero (Figure 16), and that is why we have an ellipse.

Finally, these results can be generalized to an *n*-dimensional space in which we have a unit *n*-dimensional sphere instead of the unit circle, and a covariance matrix which is not a scalar multiple of the identity matrix turns the unit *n*-dimensional sphere into an ellipsoid.
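A small numerical illustration of this point (a sketch with an assumed diagonal covariance matrix): the covariance along the standard basis vectors is zero, but along a rotated orthogonal pair it is not.

```python
import numpy as np

S = np.array([[4.0, 0.0],
              [0.0, 1.0]])  # diagonal covariance with different variances

# covariance of X along two vectors u and w is u^T S w (Eq. 39)
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
print(e1 @ S @ e2)           # 0.0: zero covariance along the standard basis

# rotate the pair by 45 degrees: the vectors are still orthogonal,
# but the covariance along them is no longer zero
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
u = np.array([c, s])
w = np.array([-s, c])
print(u @ S @ w)             # about -1.5, not zero
```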

**Change of basis for the covariance matrix**

Suppose that **S** is the covariance matrix of the random vector **X**, and we want to choose a new basis *B* = {**u**₁, **u**₂, …, **u**ₙ} for **X**. What is the covariance of **X** relative to this new basis? First, we need to write **X** relative to the new basis. From Eqs. 43 and 44 we have

where

The same thing can be written for the mean of **X**

We can also write

Now using Eq. 30, the covariance matrix of **X** relative to basis *B* is

So we have

Now to get the (*i*, *j*) element of *S_B*, we can use Eq. 40

So the covariance matrix relative to basis *B* is

If we have the original covariance matrix and the basis vectors, we can easily use this equation to calculate the covariance matrix relative to the new basis. Using Eq. 61, we can write

So we get

which means that we can decompose the covariance matrix relative to the standard basis to get the covariance matrix relative to a new basis.
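For instance, we can verify this decomposition numerically (a sketch that assumes the covariance matrix of Listing 8 and a 30-degree rotation as the new orthonormal basis):

```python
import numpy as np

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# columns of P_B are the new orthonormal basis vectors (a 30-degree rotation)
t = np.pi / 6
P_B = np.array([[np.cos(t), -np.sin(t)],
                [np.sin(t),  np.cos(t)]])

# covariance matrix relative to the new basis
S_B = P_B.T @ S @ P_B
# decomposing it back recovers the covariance matrix relative to the standard basis
print(np.allclose(P_B @ S_B @ P_B.T, S))  # True
```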

Now let's see what happens if we use the eigenvectors of the covariance matrix as a basis. Suppose that {**v**₁, **v**₂, …, **v**ₙ} are the eigenvectors of **S**, and their corresponding eigenvalues are {*λ*₁, *λ*₂, …, *λ*ₙ}. Let

Since these eigenvectors are the principal components, from Eqs. 49 and 52 we have

By substituting them into Eq. 62 we have

**Eigendecomposition**

As mentioned before, an *n*×*n* real symmetric matrix has *n* linearly independent and orthogonal eigenvectors, and we assumed that they are normalized, so these eigenvectors form an orthonormal set and *P_B* is an orthogonal matrix. An *n*×*n* symmetric matrix **A** can be written as

**A** = **PDP**ᵀ

where **D** is an *n*×*n* diagonal matrix comprised of the *n* eigenvalues of **A**, and **P** is an orthogonal *n*×*n* matrix whose columns are the *n* linearly independent and orthogonal eigenvectors of **A** that correspond to those eigenvalues in **D**, respectively. The decomposition of **A** based on Eq. 65 is called *eigendecomposition*. So in Eq. 64, the eigendecomposition of the covariance matrix gives the covariance matrix relative to the principal components as a new basis. If we simplify the right-hand side of Eq. 64, we get

Now if we multiply both sides of this equation by **u**₁, we have

which gives **Su**₁ in terms of the eigenvectors, and the terms in the brackets are the components of **Su**₁ along each eigenvector. So the eigendecomposition of **S** gives the components of **Su**₁ along each eigenvector.
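We can verify the eigendecomposition numerically for the covariance matrix of Listing 8 (a quick sketch):

```python
import numpy as np
from numpy import linalg as LA

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])
lam, v = LA.eig(S)   # columns of v are the normalized eigenvectors
D = np.diag(lam)     # diagonal matrix of the eigenvalues

# eigendecomposition (Eq. 65): S = P D P^T with P = v
print(np.allclose(v @ D @ v.T, S))  # True
```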

**Design matrix**

Now let's see how we can implement PCA in Python. So far, we have used random vectors to derive the mathematical equations. But in practice, what we have is a set of data points or observations, and we are going to see how we can apply PCA to these data points. Suppose that we have a dataset with *n* features and *m* data points

Each data point is a vector

which is a possible value of the random vector **X** that represents the dataset. We place all the data points in a matrix defined as

which represents the whole dataset, and we call it the *design matrix*. In textbooks, the design matrix is usually denoted by **X**. Here I use **X̃** (read **X** tilde) to make it easier for the reader to distinguish it from a random vector. The *i*-th row of **X̃** is the transpose of data point **x**ᵢ, and the *j*-th column of **X̃** gives the different values that the random variable Xⱼ in **X** can take. Indeed, the design matrix is the result of placing the elements of the dataset in Table 3 into a matrix.

It is usual to *center* the dataset around zero before the PCA analysis. This means that we create a new dataset by subtracting the mean of **X** from **X**

which means that

Now we can see that the mean of **X**𝒸 is zero

This is like moving the origin to the mean of the dataset. The (*i, j*) element of the covariance matrix of **X** is

And the (*i, j*) element of the covariance matrix of **X**𝒸 is

which means that **X** and **X**𝒸 have the same covariance matrix, so their eigenvectors and eigenvalues will be the same. As a result, we can do the PCA analysis on **X**𝒸 instead of **X** and get the same results. From now on, we don't use the subscript *c* to denote a centered dataset and simply use **X** for a centered dataset. Now suppose that our dataset {**x**₁, **x**₂, …, **x**ₘ} is already centered. Since the mean of this dataset is zero, we have

The (*i, j*) element of the covariance matrix for this dataset is

Each *x*ᵢⱼ in this equation is the (*i*, *j*) element of **X̃**. Here, to get all the different values of Xᵢ and Xⱼ in the dataset, we take all the values of the *i*-th and *j*-th columns of **X̃**, which are *x*ₖᵢ and *x*ₖⱼ for *k* = 1..*m*. But to take the average, we divide the sum by *m*-1, not *m*. That is because *Cov*(Xᵢ, Xⱼ) is the sample covariance, which is used as an estimate of the population covariance. By dividing by *m*-1, we make sure that we get an *unbiased* estimate of the population covariance. Now we can write the previous equation as

By using the definition of matrix multiplication we get

Please note that this equation is only valid when the dataset used to generate **X̃** is centered. Let's use PCA for a real dataset. We use the same dataset that we created in Listing 4. We should first center it and calculate **X̃** (which is called `Xtil` in Listing 10). This listing shows the original dataset versus the centered dataset (Figure 17). We can now calculate the covariance matrix using **X̃**

```python
# Listing 11
S = (1/(Xtil.shape[0]-1)) * Xtil.T @ Xtil
S
```

Output

```
array([[5.81736517, 4.13318634],
       [4.13318634, 6.39086118]])
```

We can also use the function `cov()` in NumPy for this purpose. However, it accepts the transpose of `Xtil` as its argument:

```python
S = np.cov(Xtil.T)
```

We also need to calculate the eigenvalues and eigenvectors of the covariance matrix.

```python
# Listing 12
lam, v = LA.eig(S)
print('eigenvalues=', lam)
print('eigenvectors=', v)  # each column is an eigenvector
```

Output

```
eigenvalues= [ 1.96099192 10.24723443]
eigenvectors= [[-0.73116709 -0.68219842]
 [ 0.68219842 -0.73116709]]
```

Each column of `v` is an eigenvector. These eigenvectors form a new orthogonal basis. To get the coordinates of the data points relative to this basis, we project them onto the basis vectors in Listing 13. Only the values for the first 10 data points are shown here.

```python
# Listing 13
Xtil_v1 = np.dot(Xtil, v[:,0])
Xtil_v2 = np.dot(Xtil, v[:,1])
print("v1 coordinates=", Xtil_v1[:10])
print("v2 coordinates=", Xtil_v2[:10])
```

Output

```
v1 coordinates= [ 0.63962439  3.15526738 -1.29586127 -0.22899389  0.44992067
  1.95208948  0.13619991  0.40110269 -0.24546895 -1.29197232]
v2 coordinates= [ 5.65979639  3.08776513  6.05436485  3.11436688 -0.242004
  0.48833218  2.50352458  1.4907509   4.83621938  1.13555066]
```

Listing 14 plots the original dataset, the transformed vectors of the unit circle by the covariance matrix, and the projection of data points onto the principal components. The result is shown in Figure 18.

We can also use the Scikit-learn library to do PCA. We can use the original dataset created in Listing 4 (before centering) since Scikit-learn automatically centers the dataset before doing the PCA analysis. First, we need to give it the data points

```python
# Listing 15
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(points)
```

The argument `n_components` gives the number of principal components that we want to use during the calculations. We usually don't need all of them since we want to project our data points onto a lower-dimensional space. For example, if our dataset is 4-dimensional, it will have 4 principal components, but we can keep only the first two, which have the highest eigenvalues. If `n_components` is not set, all the components are kept. We can use the `explained_variance_` property to see the eigenvalues

```python
pca.explained_variance_
```

Output

```
array([10.24723443,  1.96099192])
```

The output is a 1-d array that gives the sorted eigenvalues, so the first element is the greatest eigenvalue. Remember that each eigenvalue of the covariance matrix is equal to the variance of **X** along its corresponding eigenvector (Eq. 52). That is why this property is called the explained variance. We can also use the `components_` property to get the principal components (eigenvectors).

```python
pca.components_
```

Output

```
array([[ 0.68219842,  0.73116709],
       [ 0.73116709, -0.68219842]])
```

The output is a 2-d array, and each row of this array gives a principal component of the dataset. The *i*-th row (`pca.components_[i, :]`) gives the eigenvector (principal component) corresponding to the *i*-th element of `pca.explained_variance_`. You may have noticed that the eigenvectors reported by this property point in the opposite direction of the eigenvectors that were obtained in Listing 12. For any matrix **A**, if **v**ᵢ is an eigenvector corresponding to *λ*ᵢ, then any multiple of **v**ᵢ, like *c***v**ᵢ, is also an eigenvector corresponding to *λ*ᵢ.

So in this case, if we call the eigenvectors obtained in Listing 12 **v**₁ and **v**₂, then `pca.components_` returns -**v**₁ and -**v**₂ as the eigenvectors. Finally, to get the scalar projection of the data points onto the principal components, we can use the `transform()` method.

```python
# Listing 16
pca.transform(points)[:10,:]
```

Output

```
array([[-5.65979639, -0.63962439],
       [-3.08776513, -3.15526738],
       [-6.05436485,  1.29586127],
       [-3.11436688,  0.22899389],
       [ 0.242004  , -0.44992067],
       [-0.48833218, -1.95208948],
       [-2.50352458, -0.13619991],
       [-1.4907509 , -0.40110269],
       [-4.83621938,  0.24546895],
       [-1.13555066,  1.29197232]])
```

It returns a 2-d array, and the data points in this array have the same index as the data points in the design matrix passed to `pca.fit()`. So the *i*-th row corresponds to the data point in the *i*-th row of the `points` array. The *j*-th column gives the scalar projection of the data points onto the *j*-th principal component (remember that the principal components are sorted based on their eigenvalues). In fact, each row of this array gives the coordinates of a data point relative to the basis defined by the principal components. Here we only show the first 10 rows of this array.

If you compare the results of Listing 16 with the ones obtained in Listing 13, you will notice that their absolute values are the same, but they have opposite signs. That is because our basis is now -**v**₁ and -**v**₂. Indeed, the outputs of Listings 13 and 16 still refer to the same projected points (red dots in Figure 18). We can reconstruct the vector **x**ᵢ that represents the *i*-th data point using these coordinates. Since the principal components are a basis for the dataset, we can write

where {**v**₁, **v**₂, …, **v**ₙ} are the principal components, and *c*ⱼ is the scalar projection of **x**ᵢ onto the *j*-th principal component. For example, the first data point in the design matrix in Listing 10 is represented by the vector

```python
Xtil[0]
```

Output

```
array([-4.32877648, -3.70190609])
```

The principal components are returned by `pca.components_`, and the coordinates of this vector are given by `pca.transform(points)[0,:]`. So to reconstruct this vector we can write

```python
# each row of pca.components_ is a principal component v_j, so the
# reconstruction sum_j(c_j * v_j) is the coordinate vector times this matrix
pca.transform(points)[0,:] @ pca.components_
```

Output

```
array([-4.32877648, -3.70190609])
```

which returns the same vector.

**The rank of the covariance matrix (optional)**

Suppose that we have a dataset with *n* features, so the number of columns of the design matrix is *n*, and the covariance matrix will be *n×n* (Eq. 68). As mentioned before, an *n×n* symmetric matrix has *n* real eigenvalues and *n* linearly independent and orthogonal corresponding eigenvectors (spectral theorem). So far we have assumed that these *n* eigenvectors are the principal components of the dataset and form an orthonormal basis for it. Indeed, we assumed that a dataset with *n* features belongs to an *n*-dimensional vector space. As we will see in this section, this is not always true.

Let’s go back to the first dataset that we saw at the beginning of this article (Table 1). In that dataset, all the data points are along the line *y*=*x*. We can now easily determine its principal components and eigenvalues.

```python
# Listing 17
x = np.linspace(1, 5, 5)
y = np.linspace(1, 5, 5)
points1 = np.vstack((x, y)).T
pca = PCA(n_components=2)
pca.fit(points1)
```

We can get the covariance matrix for this dataset

```python
S = np.cov(points1.T)
S
```

Output

```
array([[2.5, 2.5],
       [2.5, 2.5]])
```

The eigenvalues are

```python
pca.explained_variance_
```

Output

```
array([5.00000000e+00, 4.29747739e-32])
```

and the eigenvectors or principal components are

```python
pca.components_
```

Output

```
array([[-0.70710678, -0.70710678],
       [ 0.70710678, -0.70710678]])
```

So the first eigenvalue is 5, and its corresponding eigenvector is [-0.707 -0.707]ᵀ. This is a unit vector along the line *x*=*y*. The second eigenvalue is a tiny number, and we can assume it is zero (in fact, it should be exactly zero, but we get a tiny number since we are estimating it with a numerical method). Figure 19 shows how the vectors in the unit circle are transformed by the covariance matrix of this dataset.

As you see, *T* = {**Su**} is not an ellipse any more; it is now a straight line. Since the second eigenvalue is zero, the data points have no variance along the second eigenvector. So the second principal axis is of no importance, and we can ignore it. Here the first eigenvector **v**₁ = [-0.707, -0.707]ᵀ spans all the vectors in *T*, so the basis of *T* has only one vector, which is **v**₁. The dimension of *T* is 1 since the columns of **S** are not linearly independent. In this example, if we call the columns of **S** **c**₁ and **c**₂ respectively, then we have **c**₁ = **c**₂. So we get

Generally, an *m*×*n* matrix **A** does not necessarily transform an *n*-dimensional vector **u** into an *m*-dimensional vector **Au**. The dimension of the transformed vector **Au** can be lower if the columns of that matrix are not linearly independent. The column space of matrix **A**, written as *Col* **A**, is defined as the set of all linear combinations of the columns of **A**. If the columns of the matrix **A** are **c**₁, **c**₂, …, **c**ₙ, then for any vector like **u**, **Au** can also be written as

So **Au** is a linear combination of the columns of **A**, and *Col* **A** is the set of vectors that can be written as **Au**. The number of basis vectors of *Col* **A**, or the dimension of *Col* **A**, is called the *rank* of **A**. Indeed, the rank of **A** is the dimension of {**Au**}.

The rank of **A** is also the maximum number of linearly independent columns of **A**. That is because we can write all the dependent columns as a linear combination of these linearly independent columns. Since **Au** is a linear combination of all the columns, it can be written as a linear combination of these linearly independent columns. Hence these linearly independent columns span **Au** and form a basis for *Col* **A**, and the number of these vectors becomes the rank of **A**. In the previous example, the rank of **S** is 1.

But why, in this example, are the columns of the covariance matrix not linearly independent? If we print the centered design matrix, we get

```python
points1 - points1.mean(axis=0)
```

Output

```
array([[-2., -2.],
       [-1., -1.],
       [ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.]])
```

Here you see that the columns of the design matrix are equal, so they are not linearly independent. It can be shown that for a real matrix **A**, the ranks of **A** and **A**ᵀ**A** are equal (refer to the appendix for a proof), so according to Eq. 68, the ranks of **S** and **X̃** should be equal.
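We can check this equality numerically for the *y*=*x* dataset of this example (a sketch that rebuilds the centered design matrix):

```python
import numpy as np

# centered design matrix of the y = x dataset (5 points on the line)
x = np.linspace(1, 5, 5)
Xtil = np.vstack((x, x)).T - 3.0          # the mean of both columns is 3
S = Xtil.T @ Xtil / (Xtil.shape[0] - 1)   # Eq. 68

print(np.linalg.matrix_rank(Xtil))  # 1
print(np.linalg.matrix_rank(S))     # 1
```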

*X̃*The rank of ** X̃ **is also equal to the dimension of the random vector

**X**that it represents. Here is the proof. Let

**X**be a random vector with

*n*elements, and

**be a possible value that it can take. Now the**

*x**n*-dimensional standard basis can span

**.**

Now suppose that one of the elements of **x** is dependent on another one; for example, x₂ = *c*x₁. This means that in all our data points the second feature is equal to the first feature multiplied by *c*, and in the design matrix the second column is equal to *c* times the first column. Now we can write the previous equation as

Here we used a linear combination of **e**₁ and **e**₂ (**e**₁ + *c***e**₂) as a new basis vector. Since **e**₁ and **e**₂ were linearly independent of the other basis vectors, a linear combination of them will also be linearly independent of them. So now we have *n*-1 linearly independent vectors that span **x**. Hence these *n*-1 vectors form a basis for **x**, and the dimension of **x** will be *n*-1.

As a result, when one column of the design matrix (**X̃**) becomes dependent on the others, the dimension of **X** is decreased by one, and the number of the remaining linearly independent columns is equal to the dimension of **X**.

If you have a design matrix in which the number of rows (*m*) is smaller than the number of columns (*n*), then the rank of the matrix will be *m* at most. Each column of the design matrix is a vector with *m* elements, and any such vector can be spanned by the standard basis of an *m*-dimensional space, which has *m* linearly independent vectors. If all of them are needed to span the columns, the rank of the design matrix will be *m*; if some of the columns can be written in terms of the others (like what we did in Eq. 69), it will be smaller than *m*. So the dimension of **X** and the rank of the design matrix cannot be greater than *m*.

If the number of rows (*m*) is greater than the number of columns (*n*) then the rank of the design matrix will be *n* at most since this is the maximum number of linearly independent columns that the matrix can have.

There is an important relationship between the rank of a square matrix and the number of its zero eigenvalues. For a symmetric matrix, it can be shown that the rank of the matrix equals the number of its nonzero eigenvalues (the proof is given in the appendix). Since the covariance matrix is symmetric, we can use this theorem for it. In our previous example, we had two eigenvalues and one of them was non-zero, so the rank of **S** was 1.

So let me summarize the discussion. The covariance matrix is a symmetric matrix that has *n* linearly independent eigenvectors. The number of non-zero eigenvalues of the covariance matrix gives us the rank of the covariance matrix, the rank of the design matrix, and the dimension of the random vector that represents our dataset.
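As a numerical check of this summary (a sketch using the covariance matrix of the *y*=*x* example):

```python
import numpy as np
from numpy import linalg as LA

S = np.array([[2.5, 2.5],
              [2.5, 2.5]])  # covariance matrix of the y = x dataset
lam, v = LA.eig(S)

# count the eigenvalues that are not (numerically) zero
nonzero = int(np.sum(np.abs(lam) > 1e-10))
print(nonzero, np.linalg.matrix_rank(S))  # 1 1
```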

The dimension of this random vector is not necessarily equal to the number of its elements or the number of features that we have in the dataset. The number of data points in the dataset can also limit the dimension of this random vector.

As an example, suppose that we have a few data points with 3 features, but we only include 2 of them in the design matrix. Based on the previous discussion the rank of the covariance matrix cannot be greater than 2 here.

`points3 = np.array([[1,5,3],`

[3,2,7]])

S = np.cov(points3.T)

LA.eig(S)

Output:

`(array([ 0. , 14.5, 0. ]),`

array([[-9.28476691e-01, -3.71390676e-01, 1.53134210e-17],

[-2.22834406e-01, 5.57086015e-01, 8.00000000e-01],

[ 2.97112541e-01, -7.42781353e-01, 6.00000000e-01]]))

But you see that two eigenvalues are zero, so the dimension of the dataset is only 1 (since we have only one non-zero eigenvalue). Why do we get this result? Here we only have two data points, and two points can be connected with a straight line. If we center the dataset, this line passes through the origin, which means that the vector along this line is a basis vector for the dataset (Figure 20).

This is the eigenvector that has a non-zero eigenvalue and both the data points can be spanned by it (the blue vector in Figure 20). This example clearly shows the importance of the data points. The only thing that PCA sees is the data points that we sample from the random vector. If these data points are not a representative sample of that vector, PCA will underestimate its dimension.

I hope that you enjoyed reading this article. In the 2nd part of this article, a case study of the application of PCA in dimensionality reduction will be presented. In addition, the application of singular value decomposition (SVD) in PCA, the relation between the covariance matrix and multivariate normal distribution, and the limitations of PCA will be discussed. Please let me know if you have any suggestions.

**You can find the second part of this article here**: Understanding Principal Component Analysis and its Application in Data Science — Part 2

All the Code Listings in this article are available for download as a Jupyter notebook from GitHub at: https://github.com/reza-bagheri/PCA_article

**Appendix**

**Properties of mean**

1-If X₁, X₂, …, Xₙ are random variables and *a*₁, *a*₂, …, *a*ₙ and *c* are constants, then we can show that

Proof: We can first show that this is true for *n*=2:

where we have used marginalization in (a) and the fact that the PDF *f*(x₁, x₂) integrates to 1. Using this result, you can easily show that Eq. A.1 holds for every positive integer *n* by induction.
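A quick numerical sanity check of Eq. A.1 (a sketch with arbitrary constants; the sample mean is linear in exactly the same way as the expected value):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(2.0, 1.0, 100_000)    # samples of X1
x2 = rng.normal(-1.0, 3.0, 100_000)   # samples of X2
a1, a2, c = 3.0, 5.0, 7.0

# mean of a1*X1 + a2*X2 + c versus a1*E[X1] + a2*E[X2] + c
lhs = np.mean(a1 * x1 + a2 * x2 + c)
rhs = a1 * x1.mean() + a2 * x2.mean() + c
print(np.isclose(lhs, rhs))  # True
```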

2- If ***X*** is an *m×n* random matrix, and ***A*** is a constant *m×n* matrix:

E[***X*** + ***A***] = E[***X***] + ***A***    (Eq. A.2)

Proof: The (i, j) element of ***X*** + ***A*** is X_ij + a_ij, so using Eq. A.1 we have

E[X_ij + a_ij] = E[X_ij] + a_ij

which is the (i, j) element of E[***X***] + ***A***.

3- If ***X*** is an *m×n* random matrix, ***B*** is a constant *k×m* matrix, and ***C*** is a constant *n×p* matrix:

E[***BX***] = ***B***E[***X***]    (Eq. A.3)

E[***XC***] = E[***X***]***C***    (Eq. A.4)

Proof: For Eq. A.3 we can write: the (i, j) element of ***BX*** is Σ_l b_il X_lj, so using Eq. A.1 we have

E[Σ_l b_il X_lj] = Σ_l b_il E[X_lj]

which is the (i, j) element of ***B***E[***X***].

The proof for Eq. A.4 is similar. If we combine the results of parts 2 and 3, we can write

E[***BX*** + ***A***] = ***B***E[***X***] + ***A***

Similarly, we have

E[***XC*** + ***A***] = E[***X***]***C*** + ***A***
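As a sanity check, the identity E[BX + A] = BE[X] + A can be verified on simulated data (the matrix shapes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(2, 3))                 # constant 2x3 matrix
A = rng.normal(size=(2, 4))                 # constant 2x4 matrix

# A sample of 3x4 random matrices X; np.matmul broadcasts over the first axis
X = rng.normal(loc=1.0, size=(50_000, 3, 4))

lhs = np.mean(B @ X + A, axis=0)            # sample estimate of E[BX + A]
rhs = B @ np.mean(X, axis=0) + A            # B E[X] + A
print(np.allclose(lhs, rhs))                # → True
```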

**The rank of a symmetric matrix**

We want to show that the rank of a symmetric matrix equals its number of nonzero eigenvalues. The null space of a matrix ***A*** is the set of all vectors ***x*** such that ***Ax*** = **0**. The *nullity* of ***A*** is the dimension of the null space of ***A***. The rank theorem states that for an *m×n* matrix ***A***

rank ***A*** + nullity ***A*** = *n*    (Eq. A.5)

where *n* is the number of columns of ***A***. Based on the *spectral theorem*, an *n×n* symmetric matrix ***B*** has *n* real eigenvalues and *n* linearly independent corresponding eigenvectors. However, these eigenvalues are not necessarily distinct, and some of the eigenvectors can share the same eigenvalue. Here we count the eigenvalues separately even if they are equal. So when we say the number of zero eigenvalues, we mean the number of eigenvectors whose corresponding eigenvalue is zero.

Now the null space of ***B*** is the set of vectors ***v*** such that ***Bv*** = **0**, which can also be written as ***Bv*** = 0***v***. So the eigenvectors of ***B*** with a zero eigenvalue form the null space of ***B***, and since these eigenvectors are linearly independent, they form a basis for the null space of ***B***. Hence the number of eigenvectors with a zero eigenvalue, or simply the number of zero eigenvalues, is the nullity of ***B***. Now based on the rank theorem (Eq. A.5) we can write

rank ***B*** = *n* − nullity ***B*** = *n* − (number of zero eigenvalues)

and since the total number of eigenvalues is *n* we get

rank ***B*** = number of nonzero eigenvalues
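This result is easy to check in NumPy by constructing (for illustration) a symmetric matrix with a known spectrum:

```python
import numpy as np

rng = np.random.default_rng(2)

# Build a symmetric 5x5 matrix with eigenvalues (3, 1, 0.5, 0, 0),
# i.e. exactly three nonzero eigenvalues: B = Q diag(...) Qᵀ
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))    # random orthogonal matrix
B = Q @ np.diag([3.0, 1.0, 0.5, 0.0, 0.0]) @ Q.T

rank = np.linalg.matrix_rank(B, tol=1e-8)
nonzero = np.sum(np.abs(np.linalg.eigvalsh(B)) > 1e-8)
print(rank, nonzero)                            # → 3 3
```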

**The ranks of A and AᵀA are equal**

Let ***A*** be a real *m×n* matrix, and ***x*** be a real vector with *n* elements. First, we will show that

***Ax*** = **0**  if and only if  ***AᵀAx*** = **0**

If ***Ax*** = **0**, we can multiply both sides by ***Aᵀ*** to get ***AᵀAx*** = **0**. Now assume that ***AᵀAx*** = **0** and let ***y*** = ***Ax***, so we get ***Aᵀy*** = **0**. By multiplying both sides of the previous equation by ***xᵀ*** we have

***xᵀAᵀy*** = 0

Using the definition of ***y*** and Eq. 28 we get

***yᵀy*** = 0

Based on its definition, ***y*** has *m* elements

***y*** = [y₁  y₂  …  y_m]ᵀ

So using Eq. A.4 we can write

***yᵀy*** = y₁² + y₂² + … + y_m² = 0

Since ***A*** and ***x*** are real, ***y*** will be real too, and ***yᵀy*** is a sum of nonnegative real numbers. So the previous equation implies that

y₁ = y₂ = … = y_m = 0

and ***y*** = ***Ax*** = **0**. So we just proved that all the vectors ***x*** that make ***Ax*** = **0** also make (***AᵀA***)***x*** = **0** and vice versa. So ***A*** and ***AᵀA*** have the same null space and hence the same nullity. We know that both ***AᵀA*** and ***A*** have *n* columns, so based on the rank theorem (Eq. A.5) we have

rank ***A*** = *n* − nullity ***A***

rank ***AᵀA*** = *n* − nullity ***AᵀA***

Since the nullities of ***AᵀA*** and ***A*** are equal, their ranks are equal too.
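This, too, can be confirmed numerically. Below, a rank-deficient matrix is built (arbitrarily) as the product of a 6×2 and a 2×4 factor, so its rank is 2:

```python
import numpy as np

rng = np.random.default_rng(3)

# A 6x4 matrix of rank 2 (a product of full-rank 6x2 and 2x4 factors)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))

rank_A = np.linalg.matrix_rank(A)
rank_AtA = np.linalg.matrix_rank(A.T @ A)
print(rank_A, rank_AtA)                    # → 2 2
```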