The data.


In order to simulate the income distribution of a population we use the generalized Pareto distribution (GPD).

It is a good model because most of the density is concentrated around typical incomes, while the probability of being far above the mean is not negligible. This heavy tail is important to represent very high incomes.
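
To make the shape of the GPD more concrete, here is a minimal sketch that samples from it with the standard library only, by pushing uniform draws through the quantile function. The helper names `gpd_quantile` and `gpd_sample` are hypothetical, and the sketch assumes that `loc`, `scale` and `concentration` play the usual roles of location, scale and shape (concentration different from 0). It is not the sampler used by TensorFlow Probability.

```python
import random


def gpd_quantile(p, loc, scale, concentration):
    """Quantile function (inverse CDF) of the GPD, for concentration != 0."""
    return loc + scale * ((1.0 - p) ** (-concentration) - 1.0) / concentration


def gpd_sample(n, loc, scale, concentration, seed=0):
    """Draw n values by applying the quantile function to uniform draws."""
    rng = random.Random(seed)
    return [gpd_quantile(rng.random(), loc, scale, concentration) for _ in range(n)]


incomes = gpd_sample(1500, loc=800, scale=1100, concentration=0.2)
theoretical_median = gpd_quantile(0.5, loc=800, scale=1100, concentration=0.2)

print(f"Theoretical median: {theoretical_median:,.2f}")  # about 1,617.84
print(f"Lowest simulated income: {min(incomes):,.2f}")
print(f"Highest simulated income: {max(incomes):,.2f}")
```

No simulated income falls below `loc`, while the largest draws sit far above the median, which is the heavy tail described above.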

In [1]:
from tensorflow_probability import distributions as tfd

import pandas as pd
import numpy as np

from matplotlib.colors import LinearSegmentedColormap
from matplotlib.colors import hex2color
from matplotlib import pyplot as plt
import seaborn as sns

from conf import conf_graph, palette
In [2]:
dist = tfd.GeneralizedPareto(loc=800, scale=1100, concentration=0.2)

s_pareto = pd.Series(dist.sample(1500), name='Pareto')
s_pareto = s_pareto[s_pareto != np.inf]
s_pareto = s_pareto.drop_duplicates()
In [3]:
plt.figure(figsize=(14, 10))

sns.kdeplot(
    s_pareto[s_pareto < 10_000], color=palette[0],
    fill=True, linewidth=0, alpha=0.5
)

plt.title("Income distribution.", **conf_graph["title_style"])
# A KDE plot shows the income values on the x-axis and their density on the y-axis.
plt.xlabel("Income", **conf_graph["label_style"])
plt.ylabel("Density", **conf_graph["label_style"])
plt.show()

A few statistics to better understand the data.

In [4]:
prop_incom_gt_3k = s_pareto[s_pareto >= 3_000].size / s_pareto.size
print(f"{prop_incom_gt_3k:.02%} of the population has an income above 3 000€ a month.")
print(f"The median income is {s_pareto.median():,.02f}€ (avg {s_pareto.mean():,.02f}€)")
19.33% of the population has an income above 3 000€ a month.
The median income is 1,618.52€ (avg 2,215.95€)

For reference, the median in France is 1,800€ and 17% of the population has an income above 3,000€.

The model is not perfect, but it is close enough to these reference figures for our purpose.
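
As a cross-check on the sample statistics, the sketch below computes the theoretical mean and the share of incomes above 3,000€ from the closed-form expressions of the GPD. The function names are hypothetical, and the formulas assume the standard location, scale and shape parameterisation with a concentration strictly between 0 and 1.

```python
def gpd_mean(loc, scale, concentration):
    """Mean of the GPD, finite only when concentration < 1."""
    return loc + scale / (1.0 - concentration)


def gpd_survival(x, loc, scale, concentration):
    """P(X > x) for x >= loc and concentration != 0."""
    return (1.0 + concentration * (x - loc) / scale) ** (-1.0 / concentration)


params = dict(loc=800, scale=1100, concentration=0.2)
mean_income = gpd_mean(**params)
share_above_3k = gpd_survival(3_000, **params)

print(f"Theoretical mean: {mean_income:,.2f}")         # about 2,175.00
print(f"Theoretical share above 3,000: {share_above_3k:.2%}")  # about 18.59%
```

Both values are in the same range as the figures measured on the 1,500 simulated incomes, which suggests the sample is a reasonable picture of the underlying distribution.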

What are the Lorenz curve and the Gini coefficient?


These two tools are used to measure inequality in the distribution of income or wealth within a given population.

Let's take an example.

This example is the distribution of income for a population. It is easy to get a sense of the median and the average, but the extreme values, the very high incomes, are not visible on this graph.

A Lorenz curve gives a graphical representation of this distribution. It plots the cumulative percentage of income or wealth on the y-axis against the cumulative percentage of households or individuals on the x-axis, ranked from poorest to richest.

Lorenz curve.


The green curve is the Lorenz curve.

The blue line is the case of perfect equality. The further the green curve is from the blue line, the greater the inequality.


Read correctly, it gives important information.

Starting from 50% of the population on the x-axis, we go up to the Lorenz curve and read the coordinate on the y-axis, approximately 20%.

We can say that the 50% poorest people share 20% of the wealth.
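
The same reading can be reproduced numerically. The sketch below builds the points of a Lorenz curve for a small, made-up list of five incomes (illustrative values, not taken from the simulated data) and reads off the share held by the poorest part of the population. The helper names `lorenz_points` and `bottom_share` are hypothetical.

```python
from itertools import accumulate


def lorenz_points(incomes):
    """Return (population share, income share) pairs, starting at (0, 0)."""
    values = sorted(incomes)
    total = sum(values)
    n = len(values)
    points = [(0.0, 0.0)]
    for k, cumulative in enumerate(accumulate(values), start=1):
        points.append((k / n, cumulative / total))
    return points


def bottom_share(incomes, population_share):
    """Income share held by the poorest `population_share` of the population."""
    values = sorted(incomes)
    count = int(population_share * len(values))  # number of people included
    return sum(values[:count]) / sum(values)


toy_incomes = [3000, 1000, 12500, 1500, 2000]  # made-up values
for pop, inc in lorenz_points(toy_incomes):
    print(f"{pop:.0%} of the population holds {inc:.1%} of the income")

print(f"Poorest 60%: {bottom_share(toy_incomes, 0.6):.1%}")  # 22.5%
```

In this toy population, the three poorest people out of five share 22.5% of the total income, which is the kind of statement read off the curve above.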

Gini coefficient.


The Gini coefficient is a convenient way to reduce the Lorenz curve to a single number that can be compared between distributions.

It is calculated as the area between the Lorenz curve and the line of perfect equality (the diagonal line from the origin to the top right corner of the graph) divided by the total area under the line of perfect equality.

The Gini coefficient ranges from 0 to 1, with 0 indicating perfect equality (i.e., all individuals receive an equal share of income) and 1 indicating perfect inequality (i.e., one individual receives all the income while the rest receive none).
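
The two extreme cases can be checked with a short sketch. It assumes the area under the Lorenz curve is approximated with the trapezoidal rule, which is one possible choice among others, and the function name `gini` is hypothetical.

```python
from itertools import accumulate


def gini(incomes):
    """Gini coefficient as (area between diagonal and Lorenz curve) / (area under diagonal)."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    # Lorenz curve ordinates, starting at 0 and ending at 1.
    lorenz = [0.0] + [c / total for c in accumulate(values)]
    # Trapezoidal rule: every segment has width 1 / n.
    area_under_lorenz = sum((lorenz[k] + lorenz[k + 1]) / 2 for k in range(n)) / n
    area_under_diagonal = 0.5
    return (area_under_diagonal - area_under_lorenz) / area_under_diagonal


equal_incomes = [10, 10, 10, 10]        # everyone receives the same income
one_takes_all = [0] * 99 + [1]          # one person out of 100 receives everything

print(f"Perfect equality: {gini(equal_incomes):.2f}")
print(f"One person holds everything: {gini(one_takes_all):.2f}")
```

With a finite population of n people, the second case gives 1 - 1/n rather than exactly 1, so the upper bound is only reached in the limit of a very large population.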

How to use these tools?


Lorenz curve.


In order to compute the Lorenz curve, we just need to sort the values, take the cumulative sum, and divide it by the total so that the curve ends at 1.

In [5]:
# Calculating the Lorenz curve.
s_lorenz = s_pareto.sort_values().cumsum() / s_pareto.sum()
s_lorenz = pd.concat([pd.Series([0]), s_lorenz])

# Calculating the x-axis: the cumulative share of the population, evenly
# spaced from 0 to 1. It also serves as the perfect equality line (y = x).
s_xaxis = pd.Series(np.linspace(0, 1, s_lorenz.size))
In [6]:
plt.figure(figsize=(14, 10))

sns.lineplot(x=s_xaxis.values, y=s_xaxis.values, color=palette[0], errorbar=None)
sns.lineplot(x=s_xaxis.values, y=s_lorenz.values, color=palette[2], errorbar=None)

plt.title("Lorenz Curve.", **conf_graph["title_style"])
plt.xlabel("Cumulative population", **conf_graph["label_style"])
plt.ylabel("Cumulative income", **conf_graph["label_style"])
plt.legend(["Perfect equality line", "Actual income distribution"])
plt.show()

Gini coefficient.


To compute the Gini coefficient, we first need the Area Under the Lorenz curve (AUL) and the Area Under the Bisector (AUB). AUB - AUL then gives the area between the bisector and the Lorenz curve, and the Gini coefficient is the ratio of that area to the AUB.

The AUL is approximated by summing the value of each point of the curve multiplied by its width, 1 / s_lorenz.size. The AUB is always 1 / 2: it is half of a square of side 1.

In [7]:
AUL = (s_lorenz / s_lorenz.size).sum()
AUB = 1 / 2
gini = ((AUB - AUL) / AUB) * 100

print(f"The Gini coefficient for this distribution is {gini:.2f}%.")
The Gini coefficient for this distribution is 35.93%.

For reference, the Gini coefficient for France was 32.8% in 2008.
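
The sum used for the AUL is a rectangle approximation of the area. As a rough sanity check, the sketch below compares it with a trapezoidal approximation, first on five made-up incomes and then on 1,500 evenly spread values. The function names are hypothetical, and `gini_rectangle` only mirrors the logic of the cell above in plain Python.

```python
from itertools import accumulate


def lorenz_ordinates(incomes):
    """Lorenz curve ordinates, with a leading 0 like in the notebook."""
    values = sorted(incomes)
    total = sum(values)
    return [0.0] + [c / total for c in accumulate(values)]


def gini_rectangle(incomes):
    """Rectangle sum: every point counts for a width of 1 / (number of points)."""
    lorenz = lorenz_ordinates(incomes)
    aul = sum(lorenz) / len(lorenz)
    return (0.5 - aul) / 0.5


def gini_trapezoid(incomes):
    """Trapezoidal rule: every segment has a width of 1 / (number of incomes)."""
    lorenz = lorenz_ordinates(incomes)
    n = len(lorenz) - 1
    aul = sum((lorenz[k] + lorenz[k + 1]) / 2 for k in range(n)) / n
    return (0.5 - aul) / 0.5


small = [1000, 1500, 2000, 3000, 12500]  # made-up values
large = list(range(1, 1501))             # 1,500 evenly spread values

for name, data in [("5 incomes", small), ("1,500 incomes", large)]:
    gap = abs(gini_trapezoid(data) - gini_rectangle(data))
    print(f"{name}: rectangle {gini_rectangle(data):.4f}, "
          f"trapezoid {gini_trapezoid(data):.4f}, gap {gap:.4f}")
```

The two estimates differ noticeably on a handful of values but nearly coincide on 1,500 values, so the simpler sum is a reasonable choice for a sample of the size used here.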