Skip to content

Latest commit

 

History

History
371 lines (248 loc) · 16 KB

tutorial-air-quality-analysis.md

File metadata and controls

371 lines (248 loc) · 16 KB
jupytext kernelspec
formats text_representation
ipynb,md:myst
extension format_name format_version jupytext_version
.md
myst
0.13
1.11.5
display_name language name
Python 3 (ipykernel)
python
python3

Analyzing the impact of the lockdown on air quality in Delhi, India

A grid showing the India Gate in smog above and clear air below

What you'll do

Calculate Air Quality Indices (AQI) and perform paired Student's t-test on them.

What you'll learn

  • You'll learn the concept of moving averages

  • You'll learn how to calculate Air Quality Index (AQI)

  • You'll learn how to perform a paired Student's t-test and find the t and p values

  • You'll learn how to interpret these values

What you'll need

  • SciPy installed in your environment

  • Basic understanding of statistical terms like population, sample, mean, standard deviation etc.


+++

The problem of air pollution

Air pollution is one of the most prominent types of pollution we face that has an immediate effect on our daily lives. The COVID-19 pandemic resulted in lockdowns in different parts of the world; offering a rare opportunity to study the effect of human activity (or lack thereof) on air pollution. In this tutorial, we will study the air quality in Delhi, one of the worst affected cities by air pollution, before and during the lockdown from March to June 2020. For this, we will first compute the Air Quality Index for each hour from the collected pollutant measurements. Next, we will sample these indices and perform a paired Student's t-test on them. It will statistically show us that the air quality improved due to the lockdown, supporting our intuition.

Let's start by importing the necessary libraries into our environment.

import numpy as np
from numpy.random import default_rng
from scipy import stats

Building the dataset

We will use a condensed version of the Air Quality Data in India dataset. This dataset contains air quality data and AQI (Air Quality Index) at hourly and daily level of various stations across multiple cities in India. The condensed version available with this tutorial contains hourly pollutant measurements for Delhi from May 31, 2019 to June 30, 2020. It has measurements of the standard pollutants that are required for Air Quality Index calculation and a few other important ones: Particulate Matter (PM 2.5 and PM 10), nitrogen dioxide (NO2), ammonia (NH3), sulfur dioxide (SO2), carbon monoxide (CO), ozone (O3), oxides of nitrogen (NOx), nitric oxide (NO), benzene, toluene, and xylene.

Let's print out the first few rows to have a glimpse of our dataset.

! head air-quality-data.csv

For the purpose of this tutorial, we are only concerned with standard pollutants required for calculating the AQI, viz., PM 2.5, PM 10, NO2, NH3, SO2, CO, and O3. So, we will only import these particular columns with np.loadtxt. We'll then slice and create two sets: pollutants_A with PM 2.5, PM 10, NO2, NH3, and SO2, and pollutants_B with CO and O3. The two sets will be processed slightly differently, as we'll see later on.

pollutant_data = np.loadtxt("air-quality-data.csv", dtype=float, delimiter=",",
                            skiprows=1, usecols=range(1, 8))
pollutants_A = pollutant_data[:, 0:5]
pollutants_B = pollutant_data[:, 5:]

print(pollutants_A.shape)
print(pollutants_B.shape)

Our dataset might contain missing values, denoted by NaN, so let's do a quick check with np.isfinite.

np.all(np.isfinite(pollutant_data))

With this, we have successfully imported the data and checked that it is complete. Let's move on to the AQI calculations!

+++

Calculating the Air Quality Index

We will calculate the AQI using the method adopted by the Central Pollution Control Board of India. To summarize the steps:

  • Collect 24-hourly average concentration values for the standard pollutants; 8-hourly in case of CO and O3.

  • Calculate the sub-indices for these pollutants with the formula:

    $$ Ip = \dfrac{\text{IHi – ILo}}{\text{BPHi – BPLo}}\cdot{\text{Cp – BPLo}} + \text{ILo} $$

    Where,

    Ip = sub-index of pollutant p
    Cp = averaged concentration of pollutant p
    BPHi = concentration breakpoint i.e. greater than or equal to Cp
    BPLo = concentration breakpoint i.e. less than or equal to Cp
    IHi = AQI value corresponding to BPHi
    ILo = AQI value corresponding to BPLo

  • The maximum sub-index at any given time is the Air Quality Index.

The Air Quality Index is calculated with the help of breakpoint ranges as shown in the chart below.

Chart of the breakpoint ranges

Let's create two arrays to store the AQI ranges and breakpoints so that we can use them later for our calculations.

AQI = np.array([0, 51, 101, 201, 301, 401, 501])

breakpoints = {
    'PM2.5': np.array([0, 31, 61, 91, 121, 251]),
    'PM10': np.array([0, 51, 101, 251, 351, 431]),
    'NO2': np.array([0, 41, 81, 181, 281, 401]),
    'NH3': np.array([0, 201, 401, 801, 1201, 1801]),
    'SO2': np.array([0, 41, 81, 381, 801, 1601]),
    'CO': np.array([0, 1.1, 2.1, 10.1, 17.1, 35]),
    'O3': np.array([0, 51, 101, 169, 209, 749])
}

Moving averages

For the first step, we have to compute moving averages for pollutants_A over a window of 24 hours and pollutants_B over a window of 8 hours. We will write a simple function moving_mean using np.cumsum and sliced indexing to achieve this.

To make sure both the sets are of the same length, we will truncate the pollutants_B_8hr_avg according to the length of pollutants_A_24hr_avg. This will also ensure we have concentrations for all the pollutants over the same period of time.

def moving_mean(a, n):
    ret = np.cumsum(a, dtype=float, axis=0)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

pollutants_A_24hr_avg = moving_mean(pollutants_A, 24)
pollutants_B_8hr_avg = moving_mean(pollutants_B, 8)[-(pollutants_A_24hr_avg.shape[0]):]

Now, we can join both sets with np.concatenate to form a single data set of all the averaged concentrations. Note that we have to join our arrays column-wise so we pass the axis=1 parameter.

pollutants = np.concatenate((pollutants_A_24hr_avg, pollutants_B_8hr_avg), axis=1)

Sub-indices

The subindices for each pollutant are calculated according to the linear relationship between the AQI and standard breakpoint ranges with the formula as above:

$$ Ip = \dfrac{\text{IHi – ILo}}{\text{BPHi – BPLo}}\cdot{\text{Cp – BPLo}} + \text{ILo} $$

The compute_indices function first fetches the correct upper and lower bounds of AQI categories and breakpoint concentrations for the input concentration and pollutant with the help of arrays AQI and breakpoints we created above. Then, it feeds these values into the formula to calculate the sub-index.

def compute_indices(pol, con):
    bp = breakpoints[pol]
    
    if pol == 'CO':
        inc = 0.1
    else:
        inc = 1
    
    if bp[0] <= con < bp[1]:
        Bl = bp[0]
        Bh = bp[1] - inc
        Ih = AQI[1] - inc
        Il = AQI[0]

    elif bp[1] <= con < bp[2]:
        Bl = bp[1]
        Bh = bp[2] - inc
        Ih = AQI[2] - inc
        Il = AQI[1]

    elif bp[2] <= con < bp[3]:
        Bl = bp[2]
        Bh = bp[3] - inc
        Ih = AQI[3] - inc
        Il = AQI[2]

    elif bp[3] <= con < bp[4]:
        Bl = bp[3]
        Bh = bp[4] - inc
        Ih = AQI[4] - inc
        Il = AQI[3]

    elif bp[4] <= con < bp[5]:
        Bl = bp[4]
        Bh = bp[5] - inc
        Ih = AQI[5] - inc
        Il = AQI[4]

    elif bp[5] <= con:
        Bl = bp[5]
        Bh = bp[5] + bp[4] - (2 * inc)
        Ih = AQI[6]
        Il = AQI[5]

    else:
        print("Concentration out of range!")
        
    return ((Ih - Il) / (Bh - Bl)) * (con - Bl) + Il

We will use np.vectorize to utilize the concept of vectorization. This simply means we don't have loop over each element of the pollutant array ourselves. Vectorization is one of the key advantages of NumPy.

vcompute_indices = np.vectorize(compute_indices)

By calling our vectorized function vcompute_indices for each pollutant, we get the sub-indices. To get back an array with the original shape, we use np.stack.

sub_indices = np.stack((vcompute_indices('PM2.5', pollutants[..., 0]),
                        vcompute_indices('PM10', pollutants[..., 1]),
                        vcompute_indices('NO2', pollutants[..., 2]),
                        vcompute_indices('NH3', pollutants[..., 3]),
                        vcompute_indices('SO2', pollutants[..., 4]),
                        vcompute_indices('CO', pollutants[..., 5]),
                        vcompute_indices('O3', pollutants[..., 6])), axis=1)

Air quality indices

Using np.max, we find out the maximum sub-index for each period, which is our Air Quality Index!

aqi_array = np.max(sub_indices, axis=1)

With this, we have the AQI for every hour from June 1, 2019 to June 30, 2020. Note that even though we started out with the data from 31st May, we truncated that during the moving averages step.

+++

Paired Student's t-test on the AQIs

Hypothesis testing is a form of descriptive statistics used to help us make decisions with the data. From the calculated AQI data, we want to find out if there was a statistically significant difference in average AQI before and after the lockdown was imposed. We will use the left-tailed, paired Student's t-test to compute two test statistics- the t statistic and the p value. We will then compare these with the corresponding critical values to make a decision.

Normal distribution plot showing area of rejection in one-tailed test (left tailed)

Sampling

We will now import the datetime column from our original dataset into a datetime64 dtype array. We will use this array to index the AQI array and obtain subsets of the dataset.

datetime = np.loadtxt("air-quality-data.csv", dtype='M8[h]', delimiter=",",
                         skiprows=1, usecols=(0, ))[-(pollutants_A_24hr_avg.shape[0]):]

Since total lockdown commenced in Delhi from March 24, 2020, the after-lockdown subset is of the period March 24, 2020 to June 30, 2020. The before-lockdown subset is for the same length of time before 24th March.

after_lock = aqi_array[np.where(datetime >= np.datetime64('2020-03-24T00'))]

before_lock = aqi_array[np.where(datetime <= np.datetime64('2020-03-21T00'))][-(after_lock.shape[0]):]

print(after_lock.shape)
print(before_lock.shape)

To make sure our samples are approximately normally distributed, we take samples of size n = 30. before_sample and after_sample are the set of random observations drawn before and after the total lockdown. We use random.Generator.choice to generate the samples.

rng = default_rng()

before_sample = rng.choice(before_lock, size=30, replace=False)
after_sample = rng.choice(after_lock, size=30, replace=False)

Defining the hypothesis

Let us assume that there is no significant difference between the sample means before and after the lockdown. This will be the null hypothesis. The alternative hypothesis would be that there is a significant difference between the means and the AQI improved. Mathematically,

$H_{0}: \mu_\text{after-before} = 0$
$H_{a}: \mu_\text{after-before} &lt; 0$

+++

Calculating the test statistics

We will use the t statistic to evaluate our hypothesis and even calculate the p value from it. The formula for the t statistic is:

$$ t = \frac{\mu_\text{after-before}}{\sqrt{\sigma^{2}/n}} $$

where,

$\mu_\text{after-before}$ = mean differences of samples
$\sigma^{2}$ = variance of mean differences
$n$ = sample size

def t_test(x, y):
    diff = y - x
    var = np.var(diff, ddof=1)
    num = np.mean(diff)
    denom = np.sqrt(var / len(x))
    return np.divide(num, denom)

t_value = t_test(before_sample, after_sample)

For the p value, we will use SciPy's stats.distributions.t.cdf() function. It takes two arguments- the t statistic and the degrees of freedom (dof). The formula for dof is n - 1.

dof = len(before_sample) - 1

p_value = stats.distributions.t.cdf(t_value, dof)

print("The t value is {} and the p value is {}.".format(t_value, p_value))

What do the t and p values mean?

We will now compare the calculated test statistics with the critical test statistics. The critical t value is calculated by looking up the t-distribution table.

Table of selected t values at different confidence levels. T value for 29 dof at 95% confidence level is highlighted with a yellow square

From the table above, the critical value is 1.699 for 29 dof at a confidence level of 95%. Since we are using the left tailed test, our critical value is -1.699. Clearly, the calculated t value is less than the critical value so we can safely reject the null hypothesis.

The critical p value, denoted by $\alpha$, is usually chosen to be 0.05, corresponding to a confidence level of 95%. If the calculated p value is less than $\alpha$, then the null hypothesis can be safely rejected. Clearly, our p value is much less than $\alpha$, so we can reject the null hypothesis.

Note that this does not mean we can accept the alternative hypothesis. It only tells us that there is not enough evidence to reject $H_{a}$. In other words, we fail to reject the alternative hypothesis so, it may be true.

+++


In practice...

  • The pandas library is preferable to use for time-series data analysis.

  • The SciPy stats module provides the stats.ttest_rel function which can be used to get the t statistic and p value.

  • In real life, data are generally not normally distributed. There are tests for such non-normal data like the Wilcoxon test.

Further reading