Principal Component Analysis - PCA
PCA stands for Principal Component Analysis. It is a statistical technique whose primary goal is to identify the underlying structure in the data by creating a new coordinate system that emphasizes its most important features.
It is often used to reduce the dimensions of a large data set by transforming it into a smaller set of uncorrelated variables called principal components.
Tristan
tristan.muscat@pm.me
The data.
The classic iris data set. Four columns containing features about a flower, and a target column with the actual species of the flower.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df['target'] = iris['target']
df['target_name'] = df['target'].map(dict(enumerate(iris['target_names'])))
df
The data set is really simple: 150 observations, 4 feature columns and 1 target column.
There are no missing values, errors or anything.
It is a perfect simple example to illustrate PCA.
print(f"Data set shape : {df.shape}.")
print(f"Target values : {iris['target_names']}.")
What is PCA?
PCA is commonly used in data analysis and machine learning to simplify complex data sets and improve computational efficiency. It is also used for exploratory data analysis and visualization to identify patterns and relationships between variables.
Data transformation.
The main idea of PCA is to create a new coordinate system such that the first principal component corresponds to the direction of maximum variance in the data, the second principal component corresponds to the direction of the second-highest variance, and so on.
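To make this concrete, here is a minimal NumPy sketch of what PCA does under the hood (just an illustration of the usual definition, not the scikit-learn workflow used below; the variable names are mine):
X = df[iris['feature_names']].to_numpy()
X_centered = X - X.mean(axis=0)  # center each feature (in practice the data is often scaled too, as done below)
cov = np.cov(X_centered, rowvar=False)  # 4 x 4 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # directions of variance and how much variance they carry
order = np.argsort(eigenvalues)[::-1]  # sort the directions by decreasing variance
components = eigenvectors[:, order]  # each column is a principal component
projected = X_centered @ components  # coordinates of the observations in the new system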
How to use PCA?
Now let's take a look at how to use this tool on the iris data set with Python.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
import seaborn as sns
from conf import conf_graph, palette
l_features = iris['feature_names']
The data needs to be scaled before a PCA can be performed.
scaler = StandardScaler()
df[l_features] = scaler.fit_transform(df[l_features])
pca_iris = PCA()
The PCA needs to be "fitted" to the features, meaning the new coordinate system is determined and saved in the "pca_iris" object.
_ = pca_iris.fit(df[iris['feature_names']])
Data can now be transformed.
In this case I store the new data in the original data set, which might not always be the best practice but works fine here.
df[['PC_' + str(i) for i in range(1, 5)]] = pca_iris.transform(df[iris['feature_names']])
df
Entirely new data can be transformed as well.
pca_iris.transform(pd.DataFrame([[0.5, 0.5, 0.5, 0.5]], columns=iris['feature_names']))
The transformation is completely reversible.
pd.DataFrame(pca_iris.inverse_transform(df[['PC_1', 'PC_2', 'PC_3', 'PC_4']]))
Dimension reduction.
The new coordinate system is built iteratively: the first component is determined first, then the second, and so on.
It is designed so that the first component holds the most information, i.e. explains the most variance, and each subsequent component is a little less important.
We can draw a scree plot displaying the cumulative explained variance to choose how many components to keep.
df_explained_var = pd.DataFrame([pca_iris.explained_variance_ratio_], columns=['PC_1', 'PC_2', 'PC_3', 'PC_4']).T
df_explained_var = df_explained_var.rename(columns={0: 'explained_variance'})
df_explained_var['cumulative_explained_variance'] = df_explained_var['explained_variance'].cumsum()
plt.figure(figsize=(16, 8))
plt.gcf().subplots_adjust(bottom=0.15, left=0.3)
sns.barplot(y="explained_variance", x=df_explained_var.index, data=df_explained_var, color=palette[0])
sns.lineplot(y="cumulative_explained_variance", x=df_explained_var.index, data=df_explained_var, color=palette[1])
plt.title("Cumulative explained variance by component.", **conf_graph["title_style"])
plt.xlabel("Components", **conf_graph["label_style"])
plt.ylabel("Explained variance", **conf_graph["label_style"])
plt.xticks(**conf_graph["tick_style"])
plt.yticks(**conf_graph["tick_style"])
plt.show()
The first two components have a cumulative explained variance of about 95%, which is more than enough for our purposes.
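As a side note, scikit-learn can pick the number of components for us: passing a float to n_components keeps just enough components to reach that share of explained variance. A quick sketch (pca_95 and reduced are illustrative names):
pca_95 = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
reduced = pca_95.fit_transform(df[l_features])
print(pca_95.n_components_)  # number of components actually kept
print(reduced.shape)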
Graph.
We can start by plotting a correlation circle. It helps us see the relationships between the variables themselves and between the variables and our components.
Maths details here.
n = pca_iris.n_samples_
p = pca_iris.n_features_in_  # n_features_ was removed in recent scikit-learn versions
lst_eigenvalues = ((n - 1) / n) * pca_iris.explained_variance_  # eigenvalues of the (biased) covariance matrix
sqrt_eigval = np.sqrt(lst_eigenvalues)
corvar = np.zeros((p, p))
for k in range(p):
    corvar[:, k] = pca_iris.components_[k, :] * sqrt_eigval[k]  # correlation of each feature with component k
pcs = corvar.T
d1, d2 = 0, 1
labels = iris['feature_names']
fig, ax = plt.subplots(figsize=(7, 7))
plt.quiver(
    [0] * p, [0] * p, pcs[d1, :], pcs[d2, :],
    angles='xy', scale_units='xy', scale=1, color=palette[2]
)
for i, (x, y) in enumerate(pcs[[d1, d2]].T):
    plt.text(x, y, labels[i], fontsize='14', ha='center', va='center', color="black", alpha=0.75)
circle = plt.Circle((0, 0), 1, facecolor='none', edgecolor=palette[0])
plt.gca().add_artist(circle)
plt.plot([-1, 1], [0, 0], color=palette[0], ls='--')
plt.plot([0, 0], [-1, 1], color=palette[0], ls='--')
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.title("Correlation circle.", **conf_graph["title_style"])
plt.xlabel("PC_1", **conf_graph["label_style"])
plt.ylabel("PC_2", **conf_graph["label_style"])
plt.xticks(**conf_graph["tick_style"])
plt.yticks(**conf_graph["tick_style"])
plt.show()
It seems that the petal width and length are highly correlated, which we could have anticipated, but the sepal length and width are not.
We can also note that PC_1 is strongly correlated with the sepal length, petal width and petal length, while PC_2 is mostly correlated with the sepal width and slightly with the sepal length.
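The same information can be read in table form from the corvar matrix computed above, where each row is a feature and each column a component (df_loadings is just an illustrative name):
df_loadings = pd.DataFrame(corvar, index=iris['feature_names'], columns=['PC_' + str(i) for i in range(1, 5)])
df_loadings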
Now that we know that most of the variance is explained by the first two components, we can plot the data in the first factorial plane which is simply the plane made up of the first two components.
plt.figure(figsize=(16, 8))
plt.gcf().subplots_adjust(bottom=0.15, left=0.3)
sns.scatterplot(x="PC_1", y="PC_2", hue="target_name", data=df, palette=palette[:3])
plt.title("Projections of the flowers on the first two components.", **conf_graph["title_style"])
plt.xlabel("PC_1", **conf_graph["label_style"])
plt.ylabel("PC_2", **conf_graph["label_style"])
plt.xticks(**conf_graph["tick_style"])
plt.yticks(**conf_graph["tick_style"])
plt.show()
We can observe that at least one species of iris (the Setosa) can be perfectly separated by the information in the first two components. The other two species are a little closer but the representation is not bad.
If we had more features, we could also draw the subsequent factorial planes if we wanted to.
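As an illustration of what that would look like, here is a sketch of the second factorial plane (PC_3 versus PC_4), drawn with the same conventions; since only about 5% of the variance is left there, we should not expect much separation.
plt.figure(figsize=(16, 8))
plt.gcf().subplots_adjust(bottom=0.15, left=0.3)
sns.scatterplot(x="PC_3", y="PC_4", hue="target_name", data=df, palette=palette[:3])
plt.title("Projections of the flowers on the third and fourth components.", **conf_graph["title_style"])
plt.xlabel("PC_3", **conf_graph["label_style"])
plt.ylabel("PC_4", **conf_graph["label_style"])
plt.xticks(**conf_graph["tick_style"])
plt.yticks(**conf_graph["tick_style"])
plt.show()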
Using what we know from the correlation circle, we can infer that the Setosa variety has slightly wider sepals but smaller petals, while Virginica appears to have bigger petals. We can verify that at a high level.
df[l_features] = scaler.inverse_transform(df[l_features])
df.groupby('target_name')[l_features].mean()
We can easily verify our hypothesis: petals of the Setosa variety are indeed smaller, its sepals are wider, and petals of the Virginica variety are bigger.