
Geospatial exploratory data analysis with GeoPandas and DuckDB

Which UK city has the safest drivers?

Image by AI (Nano Banana)

In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK.

I was a relatively early adopter of DuckDB, the fast OLAP database, but only recently realised that, through an extension, it offers a large number of potentially useful geospatial functions.

GeoPandas was new to me. It’s a Python library that makes working with geographic data feel like working with regular pandas, but with geometry (points, lines, polygons) built in. You can read and write standard GIS formats (GeoJSON, Shapefile, GeoPackage), manipulate attributes and geometries together, and quickly visualise layers with Matplotlib.
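To give a feel for that pandas-style workflow, here is a minimal, self-contained sketch. The coordinates and attribute values are toy data of my own invention, not part of this project:

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy data (hypothetical): two point features plus a regular attribute column
gdf = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[Point(-0.1, 51.5), Point(-3.2, 55.9)],
    crs="EPSG:4326",
)

# Attribute filters and geometry operations work together, pandas-style
near_london = gdf[gdf.geometry.x > -1]
print(near_london["name"].tolist())  # ['A']
```

The same `gdf` could be written out with `gdf.to_file(..., driver="GeoJSON")` and plotted with `gdf.plot()`, which is essentially all we do later at a larger scale.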

Eager to try out the capabilities of both, I set about researching a useful mini-project that would be both interesting and a helpful learning experience.

Long story short, I decided to try using both libraries to determine the safest city in the UK for driving or walking around. It’s possible that you could do everything I’m about to show using GeoPandas on its own, but unfortunately, I’m not as familiar with it as I am with DuckDB, which is why I used both.

A few ground rules.

The accident and casualty data I’ll be using is from an official UK government source that covers the whole country. However, I’ll be focusing on the data for only six of the UK’s largest cities: London, Edinburgh, Cardiff, Glasgow, Birmingham, and Manchester. 

My method for determining the “safest” city will be to calculate the total number of road traffic-related casualties in each city over a five-year period and divide this number by the area of each city in km². This will be my “safety index”, and the lower that number is, the safer the city.

Getting our city boundary data

This was arguably the most challenging part of the entire process. 

Rather than treating a “city” as a single administrative polygon, which leads to various anomalies in terms of city areas, I modelled each one by its built-up area (BUA) footprint. I did this using the Office for National Statistics (ONS) Built-Up Areas map data layer and then aggregated all BUA parts that fall within a sensible administrative boundary for that location. The mask comes from the official ONS boundaries and is chosen to reflect each wider urban area:

  • London → the London Region (Regions Dec 2023).
  • Manchester → the union of the 10 Greater Manchester Local Authority Districts (LADs), May 2024.
  • Birmingham → the union of the 7 West Midlands Combined Authority LADs.
  • Glasgow → the union of the 8 Glasgow City Region councils.
  • Edinburgh and Cardiff → their single LAD.

How the polygons are built in code

I downloaded boundary data directly from the UK Office for National Statistics (ONS) ArcGIS FeatureServer in GeoJSON format using the requests library. For each city we first build a mask from official ONS admin layers: the London Region (Dec 2023) for London; the union of Local Authority Districts (LAD, May 2024) for Greater Manchester (10 LADs), the West Midlands core (7 LADs) and the Glasgow City Region (8 councils); and the single LAD for Edinburgh and Cardiff.

Next, I query the ONS Built-Up Areas (BUA 2022) layer for polygons intersecting the mask’s bounding box, keeping only those that intersect the mask, and dissolve (merge) the results to create a single multipart polygon per city (“BUA 2022 aggregate”). Data are stored and plotted in EPSG:4326 (WGS84), i.e. latitude and longitude are expressed as degrees. When reporting areas, we reproject to EPSG:27700 (OSGB) and calculate the area in square kilometres to avoid distortions.
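To illustrate why that reprojection matters, here is a small sketch measuring a hypothetical 0.1° × 0.1° square near London. In EPSG:4326 its “area” would be in square degrees, which is meaningless; reprojected to OSGB metres it comes out at roughly 77 km² at this latitude:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A hypothetical 0.1-degree square near London, stored in EPSG:4326 (degrees)
square = gpd.GeoSeries(
    [Polygon([(-0.05, 51.45), (0.05, 51.45), (0.05, 51.55), (-0.05, 51.55)])],
    crs="EPSG:4326",
)

# Reproject to OSGB (EPSG:27700, metres) before measuring area
area_km2 = square.to_crs("EPSG:27700").area.iloc[0] / 1e6
print(f"{area_km2:.1f} km^2")  # roughly 77 km^2
```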

Each city’s boundary data is downloaded to a GeoJSON file and loaded into Python using the geopandas and requests libraries.

To verify that the boundary data is correct, the individual city layers are then combined into a single GeoDataFrame, reprojected into a consistent coordinate reference system (EPSG:4326), and plotted on top of a clean UK outline derived from the Natural Earth dataset (via a GitHub mirror). To focus only on the mainland, we crop the UK outline to the bounding box of the cities, excluding distant overseas territories. The area of each city is also calculated.

Boundary data licensing

All the boundary datasets I’ve used are open data with permissive reuse terms.

London, Edinburgh, Cardiff, Glasgow, Birmingham & Manchester

  • Source: Office for National Statistics (ONS) — Counties and Unitary Authorities (May 2023), UK BGC and Regions (December 2023) EN BGC.
  • License: Open Government Licence v3.0 (OGL).
  • Terms: You are free to use, modify, and share the data (even commercially) as long as you provide attribution.

UK Outline

  • Source: Natural Earth — Admin 0 — Countries.
  • License: Public Domain (no restrictions).
  • Citation: Natural Earth. Admin 0 — Countries. Public domain. Available from: https://www.naturalearthdata.com

City boundary code

Here is the Python code you can use to download the data files for each city and verify the boundaries on a map.

Before running the main code, please ensure you have installed the following libraries. You can use pip or another method of your choice for this.

pip install geopandas matplotlib requests pandas duckdb jupyter
import requests, json
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

# ---------- ONS endpoints ----------
LAD24_FS = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Local_Authority_Districts_May_2024_Boundaries_UK_BGC/FeatureServer/0/query"
REGION23_FS = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Regions_December_2023_Boundaries_EN_BGC/FeatureServer/0/query"
BUA_FS = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/BUA_2022_GB/FeatureServer/0/query"

# ---------- Helpers ----------
def arcgis_geojson(url, params):
    r = requests.get(url, params={**params, "f": "geojson"}, timeout=90)
    r.raise_for_status()
    return r.json()

def sql_quote(s: str) -> str:
    return "'" + s.replace("'", "''") + "'"

def fetch_lads_by_names(names):
    where = "LAD24NM IN ({})".format(",".join(sql_quote(n) for n in names))
    data = arcgis_geojson(LAD24_FS, {
        "where": where,
        "outFields": "LAD24NM",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    gdf = gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")
    if gdf.empty:
        raise RuntimeError(f"No LADs found for: {names}")
    return gdf

def fetch_lad_by_name(name):
    return fetch_lads_by_names([name])

def fetch_region_by_name(name):
    data = arcgis_geojson(REGION23_FS, {
        "where": f"RGN23NM={sql_quote(name)}",
        "outFields": "RGN23NM",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    gdf = gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")
    if gdf.empty:
        raise RuntimeError(f"No Region feature for: {name}")
    return gdf

def fetch_buas_intersecting_bbox(minx, miny, maxx, maxy):
    data = arcgis_geojson(BUA_FS, {
        "where": "1=1",
        "geometryType": "esriGeometryEnvelope",
        "geometry": json.dumps({
            "xmin": float(minx), "ymin": float(miny),
            "xmax": float(maxx), "ymax": float(maxy),
            "spatialReference": {"wkid": 4326}
        }),
        "inSR": "4326",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "BUA22NM,BUA22CD,Shape__Area",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    return gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")

def aggregate_bua_by_mask(mask_gdf: gpd.GeoDataFrame, label: str) -> gpd.GeoDataFrame:
    """
    Fetch BUA 2022 polygons intersecting a mask (LAD/Region union)
    and dissolve to one polygon.
    Uses Shapely 2.x union_all() to build the mask geometry.
    """
    # Union the mask polygon(s)
    mask_union = mask_gdf.geometry.union_all()

    # Fetch candidate BUAs via mask bbox, then filter by real intersection with the union
    minx, miny, maxx, maxy = gpd.GeoSeries([mask_union], crs="EPSG:4326").total_bounds
    buas = fetch_buas_intersecting_bbox(minx, miny, maxx, maxy)
    if buas.empty:
        raise RuntimeError(f"No BUAs intersecting bbox for {label}")

    buas = buas[buas.intersects(mask_union)]
    if buas.empty:
        raise RuntimeError(f"No BUAs actually intersect mask for {label}")

    dissolved = buas[["geometry"]].dissolve().reset_index(drop=True)
    dissolved["city"] = label
    return dissolved[["city", "geometry"]]

# ---------- Group definitions ----------
GM_10 = ["Manchester","Salford","Trafford","Stockport","Tameside",
         "Oldham","Rochdale","Bury","Bolton","Wigan"]
WMCA_7 = ["Birmingham","Coventry","Dudley","Sandwell","Solihull","Walsall","Wolverhampton"]
GLASGOW_CR_8 = ["Glasgow City","East Dunbartonshire","West Dunbartonshire",
                "East Renfrewshire","Renfrewshire","Inverclyde",
                "North Lanarkshire","South Lanarkshire"]
EDINBURGH = "City of Edinburgh"
CARDIFF = "Cardiff"

# ---------- Build masks ----------
london_region = fetch_region_by_name("London")   # Region mask for London
gm_lads = fetch_lads_by_names(GM_10)             # Greater Manchester (10)
wmca_lads = fetch_lads_by_names(WMCA_7)          # West Midlands (7)
gcr_lads = fetch_lads_by_names(GLASGOW_CR_8)     # Glasgow City Region (8)
edi_lad = fetch_lad_by_name(EDINBURGH)           # Single LAD
cdf_lad = fetch_lad_by_name(CARDIFF)             # Single LAD

# ---------- Aggregate BUAs by each mask ----------
layers = []

london_bua = aggregate_bua_by_mask(london_region, "London (BUA 2022 aggregate)")
london_bua.to_file("london_bua_aggregate.geojson", driver="GeoJSON")
layers.append(london_bua)

man_bua = aggregate_bua_by_mask(gm_lads, "Manchester (BUA 2022 aggregate)")
man_bua.to_file("manchester_bua_aggregate.geojson", driver="GeoJSON")
layers.append(man_bua)

bham_bua = aggregate_bua_by_mask(wmca_lads, "Birmingham (BUA 2022 aggregate)")
bham_bua.to_file("birmingham_bua_aggregate.geojson", driver="GeoJSON")
layers.append(bham_bua)

glas_bua = aggregate_bua_by_mask(gcr_lads, "Glasgow (BUA 2022 aggregate)")
glas_bua.to_file("glasgow_bua_aggregate.geojson", driver="GeoJSON")
layers.append(glas_bua)

edi_bua = aggregate_bua_by_mask(edi_lad, "Edinburgh (BUA 2022 aggregate)")
edi_bua.to_file("edinburgh_bua_aggregate.geojson", driver="GeoJSON")
layers.append(edi_bua)

cdf_bua = aggregate_bua_by_mask(cdf_lad, "Cardiff (BUA 2022 aggregate)")
cdf_bua.to_file("cardiff_bua_aggregate.geojson", driver="GeoJSON")
layers.append(cdf_bua)

# ---------- Combine & report areas ----------
cities = gpd.GeoDataFrame(pd.concat(layers, ignore_index=True), crs="EPSG:4326")

# ---------- Plot UK outline + the six aggregates ----------
# UK outline (Natural Earth 1:10m, simple countries)
ne_url = "https://raw.githubusercontent.com/nvkelso/natural-earth-vector/master/geojson/ne_10m_admin_0_countries.geojson"
world = gpd.read_file(ne_url)
uk = world[world["ADMIN"] == "United Kingdom"].to_crs(4326)

# Crop frame to our cities
minx, miny, maxx, maxy = cities.total_bounds
uk_crop = uk.cx[minx-5 : maxx+5, miny-5 : maxy+5]

fig, ax = plt.subplots(figsize=(9, 10), dpi=150)
uk_crop.boundary.plot(ax=ax, color="black", linewidth=1.2)
cities.plot(ax=ax, column="city", alpha=0.45, edgecolor="black",
            linewidth=0.8, legend=True)

# Label each polygon using an interior point
label_pts = cities.representative_point()
for (x, y), name in zip(label_pts.geometry.apply(lambda p: (p.x, p.y)), cities["city"]):
    ax.text(x, y, name, fontsize=8, ha="center", va="center")

ax.set_title("BUA 2022 Aggregates: London, Manchester, Birmingham, Glasgow, Edinburgh, Cardiff",
             fontsize=12)
ax.set_xlim(minx-1, maxx+1)
ax.set_ylim(miny-1, maxy+1)
ax.set_aspect("equal", adjustable="box")
ax.set_xlabel("Longitude"); ax.set_ylabel("Latitude")
plt.tight_layout()
plt.show()

After running this code, you should have six GeoJSON files stored in your current directory, and you should also see an output like this, which confirms visually that our city boundary files contain valid data.

Getting our accident data

Our final piece of the data puzzle is the accident data. The UK government publishes vehicle accident data as a rolling five-year file; the latest covers the period up to 2024. This data set covers the entire UK, so we will need to process it to extract only the data for the six cities we are interested in. That’s where DuckDB will come in, but more on that later.

To view or download the accident data in CSV format (Source: Department for Transport — Road Safety Data), click the link below

https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv

As with the city boundary data, this is also published under the Open Government Licence v3.0 (OGL 3.0) and as such has the same licence conditions.

The accident data set contains a large number of fields, but for our purposes, we are only interested in three of them:

  • latitude
  • longitude
  • number_of_casualties

Obtaining the casualty counts for each city is now just a three-step process.

1/ Loading the accident data set into DuckDB

If you’ve never encountered DuckDB before, the TL;DR is that it is a super-fast, in-process database written in C++ and designed for analytical SQL workloads. It runs entirely in memory by default but can also persist to disk.

One of the main reasons I like it is its speed. It’s one of the fastest third-party data analytics libraries I’ve used. It is also very extensible through the use of extensions such as the geospatial one, which we’ll use now.

Now we can load the accident data like this.

import requests
import duckdb

# Remote CSV (last 5 years collisions)
url = "https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv"
local_file = "collisions_5yr.csv"

# Download the file
r = requests.get(url, stream=True)
r.raise_for_status()
with open(local_file, "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)
print(f"Downloaded {local_file}")

# Connect once
con = duckdb.connect(database=':memory:')

# Install + load spatial on THIS connection
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Create accidents table with geometry
con.execute("""
CREATE TABLE accidents AS
SELECT
    TRY_CAST(Latitude AS DOUBLE) AS latitude,
    TRY_CAST(Longitude AS DOUBLE) AS longitude,
    TRY_CAST(Number_of_Casualties AS INTEGER) AS number_of_casualties,
    ST_Point(TRY_CAST(Longitude AS DOUBLE), TRY_CAST(Latitude AS DOUBLE)) AS geom
FROM read_csv_auto('collisions_5yr.csv', header=True, nullstr='NULL')
WHERE TRY_CAST(Latitude AS DOUBLE) IS NOT NULL
  AND TRY_CAST(Longitude AS DOUBLE) IS NOT NULL
  AND TRY_CAST(Number_of_Casualties AS INTEGER) IS NOT NULL
""")

# Quick preview
print(con.execute("DESCRIBE accidents").df())
print(con.execute("SELECT * FROM accidents LIMIT 5").df())

You should see the following output.

Downloaded collisions_5yr.csv
            column_name column_type null   key default extra
0              latitude      DOUBLE  YES  None    None  None
1             longitude      DOUBLE  YES  None    None  None
2  number_of_casualties     INTEGER  YES  None    None  None
3                  geom    GEOMETRY  YES  None    None  None

    latitude  longitude  number_of_casualties  \
0  51.508057  -0.153842                     3
1  51.436208  -0.127949                     1
2  51.526795  -0.124193                     1
3  51.546387  -0.191044                     1
4  51.541121  -0.200064                     2

                                                geom
0  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
1  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
2  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
3  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
4  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...

2/ Loading the city boundary data using DuckDB spatial functions

The function we’ll be using to do this is called ST_READ, which can read and import a variety of geospatial file formats using the GDAL library.

city_files = {
    "London": "london_bua_aggregate.geojson",
    "Edinburgh": "edinburgh_bua_aggregate.geojson",
    "Cardiff": "cardiff_bua_aggregate.geojson",
    "Glasgow": "glasgow_bua_aggregate.geojson",
    "Manchester": "manchester_bua_aggregate.geojson",
    "Birmingham": "birmingham_bua_aggregate.geojson"
}

for city, file in city_files.items():
    con.execute(f"""
        CREATE TABLE {city.lower()} AS
        SELECT '{city}' AS city, geom
        FROM ST_Read('{file}')
    """)

con.execute("""
CREATE TABLE cities AS
SELECT * FROM london
UNION ALL SELECT * FROM edinburgh
UNION ALL SELECT * FROM cardiff
UNION ALL SELECT * FROM glasgow
UNION ALL SELECT * FROM manchester
UNION ALL SELECT * FROM birmingham
""")

3/ Joining accidents to the city polygons and counting casualties

The key geospatial function we use this time is called ST_WITHIN. It returns TRUE if the first geometry lies completely within the second.
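The same point-in-polygon predicate can be sanity-checked locally with Shapely, which GeoPandas uses under the hood. The square “city” and the two points below are toy geometries, invented for illustration:

```python
from shapely.geometry import Point, Polygon

# A toy unit-square "city" and two accident points
city = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
inside = Point(0.5, 0.5)
outside = Point(2.0, 2.0)

print(inside.within(city))   # True
print(outside.within(city))  # False
```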

casualties_per_city = con.execute("""
SELECT c.city,
       SUM(a.number_of_casualties) AS total_casualties,
       COUNT(*) AS accident_count
FROM accidents a
JOIN cities c
  ON ST_Within(a.geom, c.geom)
GROUP BY c.city
ORDER BY total_casualties DESC
""").df()

print("Casualties per city:")
print(casualties_per_city)

Note that I ran the above query on a powerful desktop and it still took a few minutes to return results, so please be patient. Eventually, you should see output similar to this.

Casualties per city:
         city  total_casualties  accident_count
0      London          134328.0          115697
1  Birmingham           14946.0           11119
2  Manchester            4518.0            3502
3     Glasgow             3978.0            3136
4   Edinburgh             3116.0            2600
5     Cardiff             1903.0            1523

Analysis

It’s no surprise that London has the highest overall number of casualties. But, relative to its size, is it more or less dangerous to drive or walk in than the other cities?

There is clearly an issue with the casualty figures for Manchester and Glasgow: both should be larger, given the cities’ sizes. My suspicion is that many of the busy ring roads and fringe motorways associated with each city (M8/M74/M77; M60/M62/M56/M61) sit just outside their tight BUA polygons, leading to a significant under-representation of accident and casualty data. I’ll leave that investigation as an exercise for the reader!
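If you do take up that exercise, one possible approach is to buffer each BUA aggregate outwards before the spatial join, so accidents on a ring road just outside the footprint are captured. A sketch with a hypothetical stand-in polygon (a 10 km square defined directly in EPSG:27700 so units are metres; the 500 m distance is my own guess, not a tested value):

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Stand-in for a tight BUA footprint: a 10 km x 10 km square in OSGB metres
bua = gpd.GeoSeries(
    [Polygon([(0, 0), (10_000, 0), (10_000, 10_000), (0, 10_000)])],
    crs="EPSG:27700",
)

# Widen the polygon by 500 m in every direction
widened = bua.buffer(500)
print(round(widened.area.iloc[0] / 1e6, 1))  # ~120.8 km2, up from 100 km2
```

The real polygons would need reprojecting from EPSG:4326 to EPSG:27700 first, since buffering by “500” in degrees would be nonsense.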

For our final determination of driver safety, we need to know each city’s area size so we can calculate the casualty rate per km². 

Luckily, DuckDB has a function for that. ST_AREA computes the area of a geometry.

# --- Compute areas in km² (CRS84 -> OSGB 27700) ---
print("\nCalculating areas in km^2...")
areas = con.execute("""
SELECT city,
       ST_Area(
           ST_MakeValid(
               ST_Transform(
                   -- ST_Simplify(geom, 0.001),  -- experiment with the epsilon value (e.g. 0.001 degrees)
                   geom,
                   'OGC:CRS84', 'EPSG:27700'
               )
           )
       ) / 1e6 AS area_km2
FROM cities
ORDER BY area_km2 DESC;
""").df()

print("City Areas:")
print(areas.round(2))

I obtained this output, which appears to be about right.

         city  area_km2
0      London   1321.45
1  Birmingham    677.06
2  Manchester    640.54
3     Glasgow     481.49
4   Edinburgh     123.00
5     Cardiff      96.08

We now have all the data we need to declare which city has the safest drivers in the UK. Remember, the lower the “safety_index” number, the safer.
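The final join isn’t shown as code in the article, but a minimal pandas sketch, recreating the two result tables from the earlier steps using the figures already reported, might look like this:

```python
import pandas as pd

# Recreate the casualty and area results from the earlier steps
casualties = pd.DataFrame({
    "city": ["London", "Birmingham", "Manchester", "Glasgow", "Edinburgh", "Cardiff"],
    "total_casualties": [134328, 14946, 4518, 3978, 3116, 1903],
})
areas = pd.DataFrame({
    "city": ["London", "Birmingham", "Manchester", "Glasgow", "Edinburgh", "Cardiff"],
    "area_km2": [1321.45, 677.06, 640.54, 481.49, 123.00, 96.08],
})

# Join on city and compute casualties per km²
result = casualties.merge(areas, on="city")
result["safety_index"] = (result["total_casualties"] / result["area_km2"]).round(2)
print(result.sort_values("safety_index"))
```

In a real run you would merge the `casualties_per_city` and `areas` DataFrames returned by the DuckDB queries rather than retyping the numbers.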

         city  area_km2  casualties  safety_index (casualties/area)
0      London   1321.45      134328                          101.65
1  Birmingham    677.06       14946                           22.07
2  Manchester    640.54        4518                            7.05
3     Glasgow     481.49        3978                            8.26
4   Edinburgh     123.00        3116                           25.33
5     Cardiff      96.08        1903                           19.81

I’m not comfortable including the results for Manchester and Glasgow due to the doubts about their casualty counts mentioned earlier.

Taking that into account, and because I’m the boss of this article, I’m declaring Cardiff as the winner of the prize for the safest city from a driver and pedestrian perspective. What do you think of these results? Do you live in one of the cities I looked at? If so, do the results support your experience of driving or being a pedestrian there?

Summary

We examined the feasibility of performing exploratory data analysis on a geospatial dataset. Our goal was to determine which of six of the UK’s leading cities was the safest to drive or walk in. Using a combination of GeoPandas and DuckDB, we were able to:

  • Use GeoPandas to download city boundary data from an official government source that reasonably represents the extent of each city.
  • Download and post-process an extensive five-year accident survey CSV to obtain the three fields of interest: latitude, longitude and number of road accident casualties.
  • Join the accident data to the city boundary data on latitude/longitude using DuckDB geospatial functions, giving the total number of casualties for each city over five years.
  • Use DuckDB geospatial functions to calculate the size in km² of each city.
  • Calculate a safety index for each city: the number of casualties divided by its area. We disregarded two results due to doubts about the accuracy of some of the data, and found that Cardiff had the lowest safety index score, making it the safest of the cities we surveyed.

Given the right input data, geospatial analysis with the tools I’ve described can be an invaluable aid across many industries. Beyond traffic and accident analysis, applications such as flood, deforestation, forest fire and drought monitoring come to mind. Essentially, any system or process tied to a spatial coordinate system is ripe for this kind of exploratory data analysis.

