The data.


The classic iris data set. Four columns containing features about a flower, and a target column with the actual species of the flower.

ToC

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
In [2]:
iris = load_iris()

df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df['target'] = iris['target']
df['target_name'] = df['target'].map(dict(zip(range(3), iris['target_names'])))

df
Out[2]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target target_name
0 5.1 3.5 1.4 0.2 0 setosa
1 4.9 3.0 1.4 0.2 0 setosa
2 4.7 3.2 1.3 0.2 0 setosa
3 4.6 3.1 1.5 0.2 0 setosa
4 5.0 3.6 1.4 0.2 0 setosa
... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2 virginica
146 6.3 2.5 5.0 1.9 2 virginica
147 6.5 3.0 5.2 2.0 2 virginica
148 6.2 3.4 5.4 2.3 2 virginica
149 5.9 3.0 5.1 1.8 2 virginica

150 rows × 6 columns

The data set is really simple: 150 observations, 4 feature columns and 1 target column.

There are no missing values or errors.

It is a perfect, simple example to illustrate PCA.
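As a quick sanity check (an added snippet, not part of the original notebook), pandas can confirm that no values are missing:

# Count missing values per column; every count should be 0.
df.isna().sum()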

In [3]:
print(f"Data set shape : {df.shape}.")
print(f"Target values : {iris['target_names']}.")
Data set shape : (150, 6).
Target values : ['setosa' 'versicolor' 'virginica'].

What is PCA?


PCA (Principal Component Analysis) is commonly used in data analysis and machine learning to simplify complex data sets and improve computational efficiency. It is also used in exploratory data analysis and visualization to identify patterns and relationships between variables.

ToC

Data transformation.


The main idea of PCA is to create a new coordinate system such that the first principal component corresponds to the direction of maximum variance in the data, the second principal component corresponds to the direction of second highest variance (orthogonal to the first), and so on.
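To make the idea concrete, here is a minimal sketch (not from the original notebook, with illustrative variable names): on standardized data, the principal components are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue, i.e. by decreasing variance.

# Sketch: principal components as eigenvectors of the covariance matrix.
import numpy as np

X = np.random.rand(100, 4)                        # any (n_samples, n_features) matrix
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize each feature

cov = np.cov(X, rowvar=False)                     # 4 x 4 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh handles the symmetric matrix

order = np.argsort(eigenvalues)[::-1]             # sort by decreasing variance
components = eigenvectors[:, order].T             # one principal component per row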

Dimension reduction.


By projecting the original data onto this new coordinate system, we can reduce the dimensionality of the data while retaining as much of the original variation as possible.
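Continuing the sketch above (again just an illustration, the names are not from the original notebook), keeping only the first two components with scikit-learn looks like this:

# Sketch: project the standardized data onto the 2 directions of highest variance.
from sklearn.decomposition import PCA

pca_2d = PCA(n_components=2)
X_reduced = pca_2d.fit_transform(X)   # shape goes from (n_samples, 4) to (n_samples, 2)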

Graphs.


As a consequence, reducing the number of dimensions is a great way to produce graphs: bringing the data down to 2 or 3 dimensions allows us to visualize it in the new coordinate system.

How to use PCA?


Now let's take a look at how to use this tool on the iris data set with Python.

ToC

In [4]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from matplotlib import pyplot as plt
import seaborn as sns

from conf import conf_graph, palette


In [5]:
l_features = iris['feature_names']

The data needs to be scaled before a PCA can be performed, so that no feature dominates the variance simply because of its unit.

In [6]:
scaler = StandardScaler()
df[l_features] = scaler.fit_transform(df[l_features])

PCA with Scikit-Learn.


A PCA can simply be created using the sklearn library.

Declaring a PCA object allows us to use it and reuse it later in the project. It is the PCA's "memory".

In [7]:
pca_iris = PCA()

The PCA needs to be "fitted" to the features, meaning the new coordinate system is determined and saved in the "pca_iris" object.

In [8]:
_ = pca_iris.fit(df[iris['feature_names']])

Data can now be transformed.

In this case I store the new data in the original data set, which might not always be the best practice but works fine here.
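An alternative, sketched below with an illustrative name (df_pcs), is to keep the principal components in a separate DataFrame:

# Sketch: store the transformed data in its own DataFrame instead of the original one.
df_pcs = pd.DataFrame(
    pca_iris.transform(df[l_features]),
    columns=['PC_' + str(i) for i in range(1, 5)],
    index=df.index,
)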

In [9]:
df[['PC_' + str(i) for i in range(1, 5)]] = pca_iris.transform(df[iris['feature_names']])
df
Out[9]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target target_name PC_1 PC_2 PC_3 PC_4
0 -0.900681 1.019004 -1.340227 -1.315444 0 setosa -2.264703 0.480027 -0.127706 -0.024168
1 -1.143017 -0.131979 -1.340227 -1.315444 0 setosa -2.080961 -0.674134 -0.234609 -0.103007
2 -1.385353 0.328414 -1.397064 -1.315444 0 setosa -2.364229 -0.341908 0.044201 -0.028377
3 -1.506521 0.098217 -1.283389 -1.315444 0 setosa -2.299384 -0.597395 0.091290 0.065956
4 -1.021849 1.249201 -1.340227 -1.315444 0 setosa -2.389842 0.646835 0.015738 0.035923
... ... ... ... ... ... ... ... ... ... ...
145 1.038005 -0.131979 0.819596 1.448832 2 virginica 1.870503 0.386966 0.256274 -0.389257
146 0.553333 -1.282963 0.705921 0.922303 2 virginica 1.564580 -0.896687 -0.026371 -0.220192
147 0.795669 -0.131979 0.819596 1.053935 2 virginica 1.521170 0.269069 0.180178 -0.119171
148 0.432165 0.788808 0.933271 1.448832 2 virginica 1.372788 1.011254 0.933395 -0.026129
149 0.068662 -0.131979 0.762758 0.790671 2 virginica 0.960656 -0.024332 0.528249 0.163078

150 rows × 10 columns

Entirely new data can be transformed as well.

In [10]:
pca_iris.transform(pd.DataFrame([[0.5, 0.5, 0.5, 0.5]], columns=iris['feature_names']))
Out[10]:
array([[0.69849405, 0.69607344, 0.15060727, 0.07003773]])

The transformation is completely reversible, since all four components were kept.

In [11]:
pd.DataFrame(pca_iris.inverse_transform(df[['PC_1', 'PC_2', 'PC_3', 'PC_4']]))
Out[11]:
0 1 2 3
0 -0.900681 1.019004 -1.340227 -1.315444
1 -1.143017 -0.131979 -1.340227 -1.315444
2 -1.385353 0.328414 -1.397064 -1.315444
3 -1.506521 0.098217 -1.283389 -1.315444
4 -1.021849 1.249201 -1.340227 -1.315444
... ... ... ... ...
145 1.038005 -0.131979 0.819596 1.448832
146 0.553333 -1.282963 0.705921 0.922303
147 0.795669 -0.131979 0.819596 1.053935
148 0.432165 0.788808 0.933271 1.448832
149 0.068662 -0.131979 0.762758 0.790671

150 rows × 4 columns
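As an added check (not in the original notebook), we can confirm that the round trip recovers the scaled features up to floating point error:

# Sketch: the reconstruction should match the scaled features.
np.allclose(
    pca_iris.inverse_transform(df[['PC_1', 'PC_2', 'PC_3', 'PC_4']]),
    df[l_features],
)  # expected: True, since all 4 components were kept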

Dimension reduction.


The new coordinate system is created iteratively: starting with the first component, optimizing it, then moving on to the second, and so on.

It is designed so the first component holds the most information, or explains the most variance. Each subsequent component explains a little less.

We can draw a scree plot displaying the cumulative explained variance to choose how many components to keep.

In [12]:
df_explained_var = pd.DataFrame([pca_iris.explained_variance_ratio_],  columns=['PC_1', 'PC_2', 'PC_3', 'PC_4']).T

df_explained_var = df_explained_var.rename(columns={0: 'explained_variance'})
df_explained_var['cumulative_explained_variance'] = df_explained_var['explained_variance'].cumsum()
In [13]:
plt.figure(figsize=(16, 8))
plt.gcf().subplots_adjust(bottom=0.15, left=0.3)

sns.barplot(y="explained_variance", x=df_explained_var.index, data=df_explained_var, color=palette[0])
sns.lineplot(y="cumulative_explained_variance", x=df_explained_var.index, data=df_explained_var, color=palette[1])

plt.title("Cumulative explained variance by component.", **conf_graph["title_style"])
plt.xlabel("Components", **conf_graph["label_style"])
plt.ylabel("Explained variance", **conf_graph["label_style"])
plt.xticks(**conf_graph["tick_style"])
plt.yticks(**conf_graph["tick_style"])
plt.show()

The first two components have a cumulative explained variance of 95%, which is more than enough for our purposes.
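As an aside (a sketch, not part of the original workflow), scikit-learn can also pick the number of components for us: passing a float between 0 and 1 to n_components keeps just enough components to reach that share of explained variance.

# Sketch: keep just enough components to explain 95% of the variance.
pca_95 = PCA(n_components=0.95)
pca_95.fit(df[l_features])
print(pca_95.n_components_)   # number of components actually kept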

Graph.


We can start by plotting a correlation circle. It helps us see the relations between the variables themselves and between the variables and our components.

Since the features are standardized, the correlation of a variable with a component is simply its loading multiplied by the square root of the component's eigenvalue, which is what the cell below computes.

Maths details here.
In [14]:
n = pca_iris.n_samples_
p = pca_iris.n_features_in_

# Convert the explained variance (ddof=1) to the ddof=0 convention used by StandardScaler,
# then take the square roots of the eigenvalues.
lst_eigenvalues = ((n - 1) / n) * pca_iris.explained_variance_
sqrt_eigval = np.sqrt(lst_eigenvalues)
corvar = np.zeros((p, p))

# Correlation of each variable with each component: loading * sqrt(eigenvalue).
for k in range(p):
    corvar[:, k] = pca_iris.components_[k, :] * sqrt_eigval[k]

pcs = corvar.T
d1, d2 = 0, 1                    # first factorial plane: components 1 and 2
labels = iris['feature_names']
In [15]:
fig, ax = plt.subplots(figsize=(7, 7))

plt.quiver(
    [0] * p, [0] * p, pcs[d1, :], pcs[d2, :],
    angles='xy', scale_units='xy', scale=1, color=palette[2]
)


for i, (x, y) in enumerate(pcs[[d1, d2]].T):
    plt.text(x, y, labels[i], fontsize='14', ha='center', va='center', color="black", alpha=0.75)

circle = plt.Circle((0, 0), 1, facecolor='none', edgecolor=palette[0])
plt.gca().add_artist(circle)

plt.plot([-1, 1], [0, 0], color=palette[0], ls='--')
plt.plot([0, 0], [-1, 1], color=palette[0], ls='--')

plt.xlim(-1, 1)
plt.ylim(-1, 1)

plt.title("Correlation circle.", **conf_graph["title_style"])
plt.xlabel("PC_1", **conf_graph["label_style"])
plt.ylabel("PC_2", **conf_graph["label_style"])
plt.xticks(**conf_graph["tick_style"])
plt.yticks(**conf_graph["tick_style"])
plt.show()

It seems that petal width and petal length are highly correlated, which we could have anticipated, while sepal length and sepal width are not.

We can also note that PC_1 is strongly correlated with sepal length, petal width and petal length, while PC_2 is mostly correlated with sepal width and slightly with sepal length.


Now that we know most of the variance is explained by the first two components, we can plot the data in the first factorial plane, which is simply the plane spanned by the first two components.

In [16]:
plt.figure(figsize=(16, 8))
plt.gcf().subplots_adjust(bottom=0.15, left=0.3)

sns.scatterplot(x="PC_1", y="PC_2", hue="target_name", data=df, palette=palette[:3])

plt.title("Projections of the flowers on the first two components.", **conf_graph["title_style"])
plt.xlabel("PC_1", **conf_graph["label_style"])
plt.ylabel("PC_2", **conf_graph["label_style"])
plt.xticks(**conf_graph["tick_style"])
plt.yticks(**conf_graph["tick_style"])
plt.show()

We can observe that at least one species of iris (Setosa) can be perfectly separated using the information in the first two components. The other two species are a little closer together, but the representation is not bad.

If we had more features, we could also draw the subsequent factorial planes if we wanted to.
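Even with only four components there is one more such plane; here is a sketch (reusing the notebook's plotting setup) of the projection on PC_3 and PC_4:

# Sketch: the second factorial plane (components 3 and 4).
plt.figure(figsize=(16, 8))
sns.scatterplot(x="PC_3", y="PC_4", hue="target_name", data=df, palette=palette[:3])
plt.title("Projections of the flowers on components 3 and 4.", **conf_graph["title_style"])
plt.xlabel("PC_3", **conf_graph["label_style"])
plt.ylabel("PC_4", **conf_graph["label_style"])
plt.show()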

Using what we know from the correlation circle, we can infer that the Setosa variety has slightly wider sepals but smaller petals, and that Virginica appears to have the biggest petals. We can verify that at a high level.

In [17]:
df[l_features] = scaler.inverse_transform(df[l_features])
df.groupby('target_name')[l_features].mean()
Out[17]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
target_name
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

We can easily verify our hypothesis: petals of the Setosa variety are indeed smaller, its sepals are wider, and petals of the Virginica variety are the biggest.