Posted on Mar 29, 2023 • Edited on Nov 1, 2024

Quick tip: Analysing Air Quality Data with SingleStoreDB

Abstract

Air quality has become a significant concern in many cities globally due to increased pollution levels. In this short article, we'll see how to calculate the Air Quality Index (AQI), store air quality data in SingleStoreDB and quickly visualise the AQI.

The notebook file used in this article is available from GitHub.

Introduction

In this article, we'll obtain air quality data from the Department for Environment, Food and Rural Affairs (Defra) in the UK and store it in SingleStoreDB. SingleStoreDB integrates with many external tools and libraries.

Create a SingleStoreDB Cloud account

A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

Obtain Air Quality Data

We'll acknowledge Defra and uk-air.defra.gov.uk as the source of information as well as the licence:

© Crown 2023 copyright Defra via uk-air.defra.gov.uk, licenced under the Open Government Licence (OGL).

We'll use the following code to obtain the metadata:

metadata_url = "http://uk-air.defra.gov.uk/openair/R_data/AURN_metadata.RData" with urlopen(metadata_url) as metadataset: metadata = rdata.read_rda(metadataset) metadata_df = metadata["AURN_metadata"] metadata_df.info()

Example output:

<class 'pandas.core.frame.DataFrame'> RangeIndex: 2836 entries, 1 to 2836 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 site_id 2836 non-null string 1 site_name 2836 non-null string 2 location_type 2836 non-null string 3 latitude 2836 non-null float64 4 longitude 2836 non-null float64 5 parameter 2836 non-null string 6 Parameter_name 2836 non-null string 7 start_date 2836 non-null string 8 end_date 2836 non-null string 9 ratified_to 2836 non-null string 10 zone 2836 non-null string 11 agglomeration 1656 non-null object 12 local_authority 2815 non-null object dtypes: float64(2), object(2), string(9) memory usage: 288.2+ KB

Next, let's search for "Kensington" in local_authority and retrieve the site_id values:

# Filter the DataFrame for rows where 'Kensington' is included in 'local_authority' (case-insensitive) kensington_sites = metadata_df[ metadata_df["local_authority"].str.contains("Kensington", case = False, na = False) ] kensington_site_ids = kensington_sites["site_id"].unique() kensington_site_ids

Example output:

<StringArray> ['CRD', 'CRD2', 'KC1', 'WL'] Length: 4, dtype: string

We'll use the value KC1 and the year 2022 and download the data and store it in a Dataframe:

site_id = kensington_site_ids[2] year = 2022 dataset_url = f"https://uk-air.defra.gov.uk/openair/R_data/{site_id}_{year}.RData" with urlopen(dataset_url) as dataset: data = rdata.read_rda(dataset) aqi_df = data[f"{site_id}_{year}"]

We'll ensure that the date is in the correct format:

aqi_df["date"] = pd.to_datetime(aqi_df["date"], unit = "s")

Let's take a quick look at the data:

aqi_df.head()

Example output:

 date O3 NO NO2 NOXasNO2 SO2 CO PM10 PM2.5 wd ws temp site code 1 2022-01-01 00:00:00 39.11572 0.24946 9.94500 10.51875 0.26609 0.209556 14.9 12.076 212.0 3.0 10.4 London N. Kensington KC1 2 2022-01-01 01:00:00 38.71658 0.12473 7.84125 8.22375 -0.53218 0.197914 16.3 12.642 214.8 3.7 10.5 London N. Kensington KC1 3 2022-01-01 02:00:00 41.51056 0.12473 6.88500 7.07625 0.26609 0.221198 15.4 11.604 212.9 3.6 10.2 London N. Kensington KC1 4 2022-01-01 03:00:00 45.70153 0.12473 4.97250 5.16375 0.26609 0.186272 13.6 10.189 208.8 3.9 10.6 London N. Kensington KC1 5 2022-01-01 04:00:00 48.49551 0.12473 4.59000 4.59000 0.26609 0.186272 10.9 8.302 206.7 4.0 11.3 London N. Kensington KC1

We can see the range of values:

aqi_df.describe()

Example output:

 date O3 NO NO2 NOXasNO2 SO2 CO PM10 PM2.5 wd ws temp count 8760 8617.000000 8593.000000 8593.000000 8593.000000 7572.000000 7737.000000 8622.000000 8622.000000 8664.000000 8664.000000 8664.000000 mean 2022-07-02 11:30:00 50.337057 4.756186 18.884452 26.177035 0.557447 0.179206 14.600139 8.870449 193.805794 3.286034 10.889578 min 2022-01-01 00:00:00 -0.399140 -0.124730 0.382500 0.382500 -1.330450 0.034926 0.700000 0.377000 0.100000 0.000000 -5.900000 25% 2022-04-02 05:45:00 33.527760 0.249460 7.650000 8.606250 0.266090 0.104778 8.200000 4.151000 107.475000 2.200000 6.500000 50% 2022-07-02 11:30:00 51.688630 0.748380 13.196250 14.535000 0.532180 0.139704 11.900000 6.226000 219.100000 2.900000 10.700000 75% 2022-10-01 17:15:00 66.855950 1.870950 24.480000 27.157500 0.532180 0.197914 17.400000 10.283000 265.000000 4.100000 15.300000 max 2022-12-31 23:00:00 166.441380 368.701880 108.438750 652.927500 10.377510 1.723016 110.400000 84.623000 360.000000 12.900000 31.900000 std NaN 26.469337 18.191091 16.328962 39.461979 0.677925 0.141622 10.499228 8.222276 98.460865 1.567511 6.410335

We can see data for the polluting gases:

Ozone (O3)
Nitrogen monoxide (NO)
Nitrogen dioxide (NO2)
Sulphur dioxide (SO2)
Carbon monoxide (CO)
Particulates (PM10 and PM2.5)

We'll check for missing data:

aqi_df.isna().sum()

Example output:

date 0 O3 143 NO 167 NO2 167 NOXasNO2 167 SO2 1188 CO 1023 PM10 138 PM2.5 138 wd 96 ws 96 temp 96 site 0 code 0

For our initial analysis, we'll remove the rows with missing data:

aqi_df.dropna(inplace = True)

Calculate Air Quality Index (AQI)

Next, we'll use the python-aqi library to calculate the AQI. This library uses algorithms from:

The United States Environmental Protection Agency (EPA)
The China Ministry of Environmental Protection (MEP)

We'll initially use the particulates (PM10 and PM2.5) to calculate the AQI using the EPA algorithm and create a new column in the Pandas Dataframe to store the result:

aqi_df["aqi"] = aqi_df.apply( lambda row: aqi.to_aqi([ (aqi.POLLUTANT_PM25, row["PM2.5"]), (aqi.POLLUTANT_PM10, row["PM10"]) ]), axis = 1 )

Now we'll store the Pandas Dataframe in SingleStoreDB.

Store data in SingleStoreDB

We'll now create a connection to SingleStoreDB:

from sqlalchemy import * db_connection = create_engine(connection_url)

Now we'll create a new table in SingleStoreDB using the Pandas Dataframe:

aqi_df.to_sql( "aqi", con = db_connection, if_exists = "replace", index = False, chunksize = 1000 )

We'll check some data in the table:

SELECT * FROM aqi LIMIT 5;

Example output:

+---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+ | date | O3 | NO | NO2 | NOXasNO2 | SO2 | CO | PM10 | PM2.5 | wd | ws | temp | site | code | aqi | +---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+ | 2022-01-05 02:00:00 | 55.0813 | 0.12473 | 8.22375 | 8.415 | 0.53218 | 0.11642 | 7.7 | 4.151 | 299.3 | 5.2 | 1.2 | London N. Kensington | KC1 | 17 | | 2022-01-09 11:00:00 | 53.4848 | 3.7419 | 22.185 | 27.9225 | 0.79827 | 0.17463 | 11.1 | 5.755 | 296.9 | 3.5 | 5.8 | London N. Kensington | KC1 | 24 | | 2022-01-09 19:00:00 | 3.39269 | 9.10529 | 69.615 | 83.5762 | 0.79827 | 0.512248 | 28.4 | 22.264 | 271.5 | 2.5 | 1.9 | London N. Kensington | KC1 | 72 | | 2022-01-10 11:00:00 | 5.38839 | 29.3116 | 70.1888 | 115.132 | 1.33045 | 0.46568 | 24.2 | 11.698 | 131.5 | 1.5 | 1.2 | London N. Kensington | KC1 | 48 | | 2022-01-11 09:00:00 | 30.9333 | 1.87095 | 26.3925 | 29.2612 | 0 | 0.197914 | 6.8 | 5.472 | 208.4 | 2.1 | 8.1 | London N. Kensington | KC1 | 22 | +---------------------+---------+---------+---------+----------+---------+----------+------+--------+-------+------+------+----------------------+------+------+

A variety of tools could now be used with SingleStoreDB to perform analytics on the data. For example, we have previously discussed the ease with which we can use analytics and visualisation tools with SingleStoreDB, such as Apache Superset, Metabase and Looker.

Bonus: Create visualisation

From our Python environment, we can also create a quick visualisation using Plotly:

# Define the ranges for each AQI category aqi_bounds = [0, 51, 101, 151, 201, 301, 500] # Define the names of each AQI category aqi_categories = [ "Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy", "Very Unhealthy", "Hazardous", "Very Hazardous" ] # Define the colours for each AQI category aqi_colors = [ [0.0, "green" ], [0.1, "green" ], [0.102, "yellow"], [0.2, "yellow"], [0.202, "orange"], [0.3, "orange"], [0.302, "red" ], [0.4, "red" ], [0.402, "purple"], [0.6, "purple"], [0.602, "maroon"], [1.0, "maroon"] ] # Define the tick values and labels for the colorbar tickvals = [(aqi_bounds[i] + aqi_bounds[i + 1]) / 2 for i in range(len(aqi_bounds)-1)] ticktext = aqi_categories[:-1] # Create a new figure object with a heatmap trace fig = go.Figure(data = go.Heatmap( x = aqi_df["date"].to_numpy(), y = aqi_df["site"], z = aqi_df["aqi"], colorscale = aqi_colors, zmin = aqi_bounds[0], zmax = aqi_bounds[-1], colorbar = dict(title = "Air Quality Index", tickmode = "array", ticktext = ticktext, tickvals = tickvals ) ) ) # Customise the layout of the figure fig.update_layout( xaxis_title = "Date", yaxis_title = "Location", title = "Air Quality Index by Location and Date", width = 700, height = 600, yaxis = dict(tickangle = -90) ) # Show the figure fig.show()

This produces the heatmap shown in Figure 1.

Figure 1. AQI for KC1.

The overall quality of air throughout 2022 was Good. The next steps would be to drill down and further analyse the data, check the data for the other polluting gases, and so on. We could also make the heatmap more interesting by adding additional weather data.

Summary

In this short article, we have quickly analysed some air quality data from Defra in the UK, using the US EPA algorithm with data for particulates (PM10 and PM2.5). However, we note that many countries use different algorithms to calculate their AQI, although the approach described in this article could be easily adapted for other countries.