Country Data Analysis Project

This project analyzes data related to countries, using Python libraries like pandas and scikit-learn. The Jupyter Notebook included here explores the data and performs Principal Component Analysis (PCA) to identify key factors affecting the data.

Read and Load data

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns



# To avoid warning messages

import warnings

warnings.filterwarnings('ignore')

%matplotlib inline

df = pd.read_csv("Country-data.csv", index_col=0)

df.head()

Display the correlation matrix (heatmap) of the different variables of df

correlation = df.corr()

plt.figure(figsize=(10, 10))  # Figure Size

# Draw a seaborn heatmap of the correlation matrix computed above, with each coefficient annotated.
sns.heatmap(correlation, annot=True, cmap='viridis')

The Correlation Matrix:

The heatmap of df.corr() helps visualize the relationships between the variables in df. Strong positive correlations, like the one between "child_mort" and "total_fer", suggest that these variables tend to move together (countries with higher child mortality tend to have higher fertility rates).
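Reading the strongest relationships off a heatmap by eye gets harder as the number of variables grows. The sketch below shows one way to rank variable pairs by absolute correlation. It runs on a small synthetic DataFrame whose column names and values are invented for illustration; they are not taken from Country-data.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df; column names and values are made up for illustration.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),   # nearly a copy of "a"
    "c": -a + 0.5 * rng.normal(size=200),  # moves against "a"
    "d": rng.normal(size=200),             # unrelated noise
})

corr = demo.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=np.abs, ascending=False)
print(ranked.head(3))
```

Sorting on the absolute value keeps strong negative relationships near the top, which a plain descending sort would push to the bottom.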

PCA identifies the most important directions (axes) of variation in the data, allowing us to focus on the patterns that capture the most information with potentially fewer variables.

# Store the column 'life_expec' in a new object, life, and delete it from df.

life = df.life_expec

df = df.drop('life_expec', axis=1)

# StandardScaler standardizes each column to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler on df and return the standardized data as a NumPy array
Z = scaler.fit_transform(df)
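As a rough illustration of what the scaling step does, the NumPy-only sketch below applies the same per-column transformation by hand: subtract the column mean, then divide by the population standard deviation (ddof=0). To my understanding this matches StandardScaler with its default settings. The matrix X is made up for the example.

```python
import numpy as np

# Made-up 4x2 matrix whose columns live on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 7000.0]])

# Center each column, then divide by its population standard deviation (ddof=0).
Z_demo = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z_demo.mean(axis=0))  # approximately 0 for both columns
print(Z_demo.std(axis=0))   # 1 for both columns
```

After scaling, both columns carry the same weight in the PCA, whatever their original units.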

Import the PCA function from the sklearn.decomposition module.

Create pca, an instance of PCA.

Apply pca to the data and create Coord containing the PCA coordinates using the `fit_transform()` method.

# PCA class, used for dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA()  # Creation of instance PCA

Coord = pca.fit_transform(Z)  # Calculation of the coordinates of the PCA

The value of PCA lies in the independence of its axes: the principal components are uncorrelated, so each axis brings out a different type of information and spatial organization. The components are also ordered by decreasing share of variance, so the first axes generally concentrate most of the information, which makes the analysis even easier. We will now look at the share of variance explained by each component:
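Both properties, uncorrelated axes and decreasing shares of variance, can be checked numerically. The sketch below does so with a hand-rolled PCA (eigen-decomposition of the covariance matrix) on synthetic data. It is not the notebook's code, and the data are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 300 rows, 4 correlated columns (values made up for illustration).
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Zc = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA by eigen-decomposition of the covariance matrix of the standardized data.
cov = np.cov(Zc, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigval)[::-1]       # reorder to descending
eigval, eigvec = eigval[order], eigvec[:, order]

scores = Zc @ eigvec                   # coordinates on the principal axes

# The covariance of the scores is diagonal: the axes are uncorrelated.
print(np.round(np.cov(scores, rowvar=False), 6))
# The shares of variance are non-increasing and sum to 1.
print(eigval / eigval.sum())
```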

Display the explained variance for each component using the `explained_variance_` attribute of PCA.

Draw the graph representing the explained variance as a function of the number of components.

print('The eigenvalues are :', pca.explained_variance_)

plt.plot(np.arange(1, 9), pca.explained_variance_)

plt.xlabel('Number of factors')

plt.ylabel('Eigenvalues');

The eigenvalues are : [3.48753851 1.47902877 1.15061758 0.93557048 0.65529084 0.15140052 0.11588049 0.07286556]

# Display the ratio of explained variance for each component, using the explained_variance_ratio_ attribute.
print('Ratio :', pca.explained_variance_ratio_)

# Plot the cumulative sum of the explained variance ratio against the number of components.
plt.plot(np.arange(1, 9), np.cumsum(pca.explained_variance_ratio_))

plt.xlabel('Factor number')

plt.ylabel('Cumsum');

We observe here that for 2 axes, the explained variance is about 62%. We can now draw the correlation circle. It allows us to evaluate the influence of each variable for each axis of representation.
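The 62% figure can be recomputed from the eigenvalues printed earlier. The sketch below assumes that, with all components kept, each ratio is simply the eigenvalue divided by the sum of all eigenvalues.

```python
import numpy as np

# Eigenvalues copied from the printed output of pca.explained_variance_.
eigenvalues = np.array([3.48753851, 1.47902877, 1.15061758, 0.93557048,
                        0.65529084, 0.15140052, 0.11588049, 0.07286556])

ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

# Share held by the first axis alone, then by the first two axes together.
print(np.round(cumulative[:2], 3))  # about 0.433 and 0.617
```

The first axis alone holds roughly 43% of the variance, and adding the second brings the total to roughly 62%.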

sqrt_eigval = np.sqrt(pca.explained_variance_)

corvar = np.zeros((8, 8))

for k in range(8):

    corvar[:, k] = pca.components_[k, :] * sqrt_eigval[k]

# corvar

fig, axes = plt.subplots(figsize=(9, 9))

axes.set_xlim(-1, 1)

axes.set_ylim(-1, 1)

# display of labels (variable names)

for j in range(8):

    plt.annotate(df.columns[j], (corvar[j, 0], corvar[j, 1]), color='#091158')

    plt.arrow(0, 0, corvar[j, 0]*0.9, corvar[j, 1]*0.9,

              alpha=0.5, head_width=0.03, color='b')



# add axes

plt.plot([-1, 1], [0, 0], color='silver', linestyle='-', linewidth=1)

plt.plot([0, 0], [-1, 1], color='silver', linestyle='-', linewidth=1)


cercle = plt.Circle((0, 0), 1, color='#16E4CA', fill=False)

axes.add_artist(cercle)

plt.xlabel('AXIS 1')

plt.ylabel('AXIS 2')

plt.show()
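The coordinates drawn in the correlation circle are loadings: eigenvector entries scaled by the square root of the matching eigenvalue. The NumPy-only sketch below, run on synthetic data, computes them in vectorized form and checks that they equal the correlation between each variable and each component. With the notebook's objects, the equivalent one-liner would presumably be `pca.components_.T * np.sqrt(pca.explained_variance_)`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 200 rows, 3 correlated columns (made up for illustration).
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
Zs = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the correlation matrix (ddof=0, matching the scaling above).
eigval, eigvec = np.linalg.eigh(Zs.T @ Zs / len(Zs))
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Vectorized form of the loop: column k is eigenvector k times sqrt(eigenvalue k).
loadings = eigvec * np.sqrt(eigval)

# Direct check: correlation between variable j and the scores on axis k.
scores = Zs @ eigvec
direct = np.array([[np.corrcoef(Zs[:, j], scores[:, k])[0, 1]
                    for k in range(3)] for j in range(3)])

print(np.allclose(loadings, direct))             # True
print(np.round((loadings ** 2).sum(axis=1), 6))  # each row sums to 1
```

Because the squared loadings of a variable sum to 1 across all axes, its arrow in a two-axis plot can never leave the unit circle. One caveat, as far as I know: scikit-learn computes explained_variance_ with n - 1 in the denominator while StandardScaler uses n, so the notebook's values are larger than the exact correlations by a factor of sqrt(n / (n - 1)), which is negligible here.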

We notice that variables such as 'income', 'gdpp' and 'health' are positively correlated with the first axis, whereas 'child_mort' and 'total_fer' are negatively correlated with it. We can then look at the representation of the countries on the two axes chosen by the PCA and observe the influence of the variable 'life_expec' on their positions.

# Transform the life variable into a 3-class categorical variable named lifex, using the qcut function.
q = [0, 0.33, 0.66, 1]



lifex = pd.qcut(life, q)



lifex.value_counts()
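To make the binning step concrete, here is a small sketch of pd.qcut on made-up values. The country names, the numbers, and the labels "low", "mid" and "high" are all hypothetical; the notebook itself keeps the default interval categories.

```python
import numpy as np
import pandas as pd

# Made-up life-expectancy-like values; the country names are placeholders.
values = pd.Series([52.0, 58.5, 61.0, 66.2, 70.1, 72.4, 75.0, 78.3, 81.9],
                   index=[f"country_{i}" for i in range(9)])

# Same quantile edges as in the notebook; the labels are hypothetical bin names.
classes = pd.qcut(values, [0, 0.33, 0.66, 1], labels=["low", "mid", "high"])

print(classes.value_counts().sort_index())
# Integer code (0, 1 or 2) per row, handy for indexing a list of colors.
print(classes.cat.codes.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The integer codes give a direct way to map each row to a color without testing interval membership row by row.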

Represent each country on the two axes chosen by the PCA by assigning a color according to the different lifex classes.

#position the individuals (countries) on the first factorial plane

fig, axes = plt.subplots(figsize=(12,12))

axes.set_xlim(-3,3) #same limits on the x-axis

axes.set_ylim(-3,3) #and on the y-axis

#placement of observation labels, colored by lifex class

colors = ['#7FCFF1', '#16E4CA', '#091158']

codes = lifex.cat.codes.to_numpy()  # 0, 1 or 2 for each country

for i in range(len(df)):  # loop over every country in df

    plt.annotate(df.index[i], (Coord[i,0], Coord[i,1]), color=colors[codes[i]])

            

#add axes

plt.plot([-6,6],[0,0],color='silver',linestyle='-',linewidth=1)

plt.plot([0,0],[-6,6],color='silver',linestyle='-',linewidth=1)

plt.xlabel('AXIS 1')

plt.ylabel('AXIS 2')

#display

plt.show()

The visualization reveals three distinct groups of countries. This suggests that life expectancy plays a role in how countries are positioned on the PCA axes. Countries with higher life expectancy tend to be located in the lower right corner of the graph. This reinforces the idea that the chosen axes effectively represent the relationships between the countries.
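A visual impression like this can be backed by a simple number, for instance the mean coordinate of each class on the first axis. The sketch below uses synthetic stand-ins; with the notebook's objects, the inputs would presumably be `Coord[:, 0]` and `lifex.cat.codes`, which is an assumption and not something the notebook computes.

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up stand-ins: a class code (0, 1, 2) per row and a coordinate on axis 1.
codes = np.repeat([0, 1, 2], 50)
pc1 = rng.normal(loc=codes - 1.0, scale=0.5)  # class 0 centered at -1, class 2 at +1

# Mean position of each class on the first axis.
class_means = np.array([pc1[codes == c].mean() for c in range(3)])
print(np.round(class_means, 2))
```

Clearly ordered class means along an axis would support the claim that the axis separates the life expectancy classes; means that sit close together would suggest the grouping is weaker than it looks.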

Instructions:

Clone this repository to your local machine using Git.
Open the country_analysis.ipynb file in a Jupyter Notebook environment.
Make sure you have the required libraries installed (pandas, scikit-learn, etc.).
Run the notebook cells to reproduce the analysis.
