jupytext | kernelspec | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Calculate Air Quality Indices (AQI) and perform paired Student's t-test on them.
-
You'll learn the concept of moving averages
-
You'll learn how to calculate Air Quality Index (AQI)
-
You'll learn how to perform a paired Student's t-test and find the
t
andp
values -
You'll learn how to interpret these values
-
SciPy installed in your environment
-
Basic understanding of statistical terms like population, sample, mean, standard deviation etc.
+++
Air pollution is one of the most prominent types of pollution we face that has an immediate effect on our daily lives. The COVID-19 pandemic resulted in lockdowns in different parts of the world; offering a rare opportunity to study the effect of human activity (or lack thereof) on air pollution. In this tutorial, we will study the air quality in Delhi, one of the worst affected cities by air pollution, before and during the lockdown from March to June 2020. For this, we will first compute the Air Quality Index for each hour from the collected pollutant measurements. Next, we will sample these indices and perform a paired Student's t-test on them. It will statistically show us that the air quality improved due to the lockdown, supporting our intuition.
Let's start by importing the necessary libraries into our environment.
import numpy as np
from numpy.random import default_rng
from scipy import stats
We will use a condensed version of the Air Quality Data in India dataset. This dataset contains air quality data and AQI (Air Quality Index) at hourly and daily level of various stations across multiple cities in India. The condensed version available with this tutorial contains hourly pollutant measurements for Delhi from May 31, 2019 to June 30, 2020. It has measurements of the standard pollutants that are required for Air Quality Index calculation and a few other important ones: Particulate Matter (PM 2.5 and PM 10), nitrogen dioxide (NO2), ammonia (NH3), sulfur dioxide (SO2), carbon monoxide (CO), ozone (O3), oxides of nitrogen (NOx), nitric oxide (NO), benzene, toluene, and xylene.
Let's print out the first few rows to have a glimpse of our dataset.
! head air-quality-data.csv
For the purpose of this tutorial, we are only concerned with standard pollutants required for calculating the AQI, viz., PM 2.5, PM 10, NO2, NH3, SO2, CO, and O3. So, we will only import these particular columns with np.loadtxt. We'll then slice and create two sets: pollutants_A
with PM 2.5, PM 10, NO2, NH3, and SO2, and pollutants_B
with CO and O3. The
two sets will be processed slightly differently, as we'll see later on.
pollutant_data = np.loadtxt("air-quality-data.csv", dtype=float, delimiter=",",
skiprows=1, usecols=range(1, 8))
pollutants_A = pollutant_data[:, 0:5]
pollutants_B = pollutant_data[:, 5:]
print(pollutants_A.shape)
print(pollutants_B.shape)
Our dataset might contain missing values, denoted by NaN
, so let's do a quick check with np.isfinite.
np.all(np.isfinite(pollutant_data))
With this, we have successfully imported the data and checked that it is complete. Let's move on to the AQI calculations!
+++
We will calculate the AQI using the method adopted by the Central Pollution Control Board of India. To summarize the steps:
-
Collect 24-hourly average concentration values for the standard pollutants; 8-hourly in case of CO and O3.
-
Calculate the sub-indices for these pollutants with the formula:
$$ Ip = \dfrac{\text{IHi – ILo}}{\text{BPHi – BPLo}}\cdot{\text{Cp – BPLo}} + \text{ILo} $$ Where,
Ip
= sub-index of pollutantp
Cp
= averaged concentration of pollutantp
BPHi
= concentration breakpoint i.e. greater than or equal toCp
BPLo
= concentration breakpoint i.e. less than or equal toCp
IHi
= AQI value corresponding toBPHi
ILo
= AQI value corresponding toBPLo
-
The maximum sub-index at any given time is the Air Quality Index.
The Air Quality Index is calculated with the help of breakpoint ranges as shown in the chart below.
Let's create two arrays to store the AQI ranges and breakpoints so that we can use them later for our calculations.
AQI = np.array([0, 51, 101, 201, 301, 401, 501])
breakpoints = {
'PM2.5': np.array([0, 31, 61, 91, 121, 251]),
'PM10': np.array([0, 51, 101, 251, 351, 431]),
'NO2': np.array([0, 41, 81, 181, 281, 401]),
'NH3': np.array([0, 201, 401, 801, 1201, 1801]),
'SO2': np.array([0, 41, 81, 381, 801, 1601]),
'CO': np.array([0, 1.1, 2.1, 10.1, 17.1, 35]),
'O3': np.array([0, 51, 101, 169, 209, 749])
}
For the first step, we have to compute moving averages for pollutants_A
over a window of 24 hours and pollutants_B
over a
window of 8 hours. We will write a simple function moving_mean
using np.cumsum and sliced indexing to achieve this.
To make sure both the sets are of the same length, we will truncate the pollutants_B_8hr_avg
according to the length of
pollutants_A_24hr_avg
. This will also ensure we have concentrations for all the pollutants over the same period of time.
def moving_mean(a, n):
ret = np.cumsum(a, dtype=float, axis=0)
ret[n:] = ret[n:] - ret[:-n]
return ret[n - 1:] / n
pollutants_A_24hr_avg = moving_mean(pollutants_A, 24)
pollutants_B_8hr_avg = moving_mean(pollutants_B, 8)[-(pollutants_A_24hr_avg.shape[0]):]
Now, we can join both sets with np.concatenate to form a single data set of all the averaged concentrations. Note that we have to join our arrays column-wise so we pass the
axis=1
parameter.
pollutants = np.concatenate((pollutants_A_24hr_avg, pollutants_B_8hr_avg), axis=1)
The subindices for each pollutant are calculated according to the linear relationship between the AQI and standard breakpoint ranges with the formula as above:
The compute_indices
function first fetches the correct upper and lower bounds of AQI categories and breakpoint concentrations for the input concentration and pollutant with the help of arrays AQI
and breakpoints
we created above. Then, it feeds these values into the formula to calculate the sub-index.
def compute_indices(pol, con):
bp = breakpoints[pol]
if pol == 'CO':
inc = 0.1
else:
inc = 1
if bp[0] <= con < bp[1]:
Bl = bp[0]
Bh = bp[1] - inc
Ih = AQI[1] - inc
Il = AQI[0]
elif bp[1] <= con < bp[2]:
Bl = bp[1]
Bh = bp[2] - inc
Ih = AQI[2] - inc
Il = AQI[1]
elif bp[2] <= con < bp[3]:
Bl = bp[2]
Bh = bp[3] - inc
Ih = AQI[3] - inc
Il = AQI[2]
elif bp[3] <= con < bp[4]:
Bl = bp[3]
Bh = bp[4] - inc
Ih = AQI[4] - inc
Il = AQI[3]
elif bp[4] <= con < bp[5]:
Bl = bp[4]
Bh = bp[5] - inc
Ih = AQI[5] - inc
Il = AQI[4]
elif bp[5] <= con:
Bl = bp[5]
Bh = bp[5] + bp[4] - (2 * inc)
Ih = AQI[6]
Il = AQI[5]
else:
print("Concentration out of range!")
return ((Ih - Il) / (Bh - Bl)) * (con - Bl) + Il
We will use np.vectorize to utilize the concept of vectorization. This simply means we don't have loop over each element of the pollutant array ourselves. Vectorization is one of the key advantages of NumPy.
vcompute_indices = np.vectorize(compute_indices)
By calling our vectorized function vcompute_indices
for each pollutant, we get the sub-indices. To get back an array with the original shape, we use np.stack.
sub_indices = np.stack((vcompute_indices('PM2.5', pollutants[..., 0]),
vcompute_indices('PM10', pollutants[..., 1]),
vcompute_indices('NO2', pollutants[..., 2]),
vcompute_indices('NH3', pollutants[..., 3]),
vcompute_indices('SO2', pollutants[..., 4]),
vcompute_indices('CO', pollutants[..., 5]),
vcompute_indices('O3', pollutants[..., 6])), axis=1)
Using np.max, we find out the maximum sub-index for each period, which is our Air Quality Index!
aqi_array = np.max(sub_indices, axis=1)
With this, we have the AQI for every hour from June 1, 2019 to June 30, 2020. Note that even though we started out with the data from 31st May, we truncated that during the moving averages step.
+++
Hypothesis testing is a form of descriptive statistics used to help us make decisions with the data. From the calculated AQI data, we want to find out if there was a statistically significant difference in average AQI before and after the lockdown was imposed. We will use the left-tailed, paired Student's t-test to compute two test statistics- the t statistic
and the p value
. We will then compare these with the corresponding critical values to make a decision.
We will now import the datetime
column from our original dataset into a datetime64 dtype array. We will use this array to index the AQI array and obtain subsets of the dataset.
datetime = np.loadtxt("air-quality-data.csv", dtype='M8[h]', delimiter=",",
skiprows=1, usecols=(0, ))[-(pollutants_A_24hr_avg.shape[0]):]
Since total lockdown commenced in Delhi from March 24, 2020, the after-lockdown subset is of the period March 24, 2020 to June 30, 2020. The before-lockdown subset is for the same length of time before 24th March.
after_lock = aqi_array[np.where(datetime >= np.datetime64('2020-03-24T00'))]
before_lock = aqi_array[np.where(datetime <= np.datetime64('2020-03-21T00'))][-(after_lock.shape[0]):]
print(after_lock.shape)
print(before_lock.shape)
To make sure our samples are approximately normally distributed, we take samples of size n = 30
. before_sample
and after_sample
are the set of random observations drawn before and after the total lockdown. We use random.Generator.choice to generate the samples.
rng = default_rng()
before_sample = rng.choice(before_lock, size=30, replace=False)
after_sample = rng.choice(after_lock, size=30, replace=False)
Let us assume that there is no significant difference between the sample means before and after the lockdown. This will be the null hypothesis. The alternative hypothesis would be that there is a significant difference between the means and the AQI improved. Mathematically,
+++
We will use the t
statistic to evaluate our hypothesis and even calculate the p value
from it. The formula for the t
statistic is:
where,
def t_test(x, y):
diff = y - x
var = np.var(diff, ddof=1)
num = np.mean(diff)
denom = np.sqrt(var / len(x))
return np.divide(num, denom)
t_value = t_test(before_sample, after_sample)
For the p
value, we will use SciPy's stats.distributions.t.cdf()
function. It takes two arguments- the t statistic
and the degrees of freedom (dof
). The formula for dof
is n - 1
.
dof = len(before_sample) - 1
p_value = stats.distributions.t.cdf(t_value, dof)
print("The t value is {} and the p value is {}.".format(t_value, p_value))
We will now compare the calculated test statistics with the critical test statistics. The critical t
value is calculated by looking up the t-distribution table.
From the table above, the critical value is 1.699 for 29 dof
at a confidence level of 95%. Since we are using the left tailed test, our critical value is -1.699. Clearly, the calculated t
value is less than the critical value so we can safely reject the null hypothesis.
The critical p
value, denoted by p
value is less than p
value is much less than
Note that this does not mean we can accept the alternative hypothesis. It only tells us that there is not enough evidence to reject
+++
-
The pandas library is preferable to use for time-series data analysis.
-
The SciPy stats module provides the stats.ttest_rel function which can be used to get the
t statistic
andp value
. -
In real life, data are generally not normally distributed. There are tests for such non-normal data like the Wilcoxon test.
-
There are a host of statistical tests you can choose according to the characteristics of the given data. Read more about them at A Gentle Introduction to Statistical Data Distributions.
-
There are various versions of the Student's t-test that you can adopt according to your needs.