Navigating GeoParquet: Lessons Learned from the eMOTIONAL Cities Project

The eMOTIONAL Cities project has set out to understand how the natural and built environment can shape the feelings and emotions of those who experience it. At its core lies a Spatial Data Infrastructure (SDI) which combines a variety of datasets from the Urban Health domain. These datasets should be available to urban planners, neuroscientists and other stakeholders, for analysis, creating data products and eventually making decisions based upon them.

Although the average size of a dataset is small (with few exceptions), scientists often want to combine several of these datasets in the same analysis, which creates a use case where we could benefit from format efficiency. For this reason, we recently decided to offer GeoParquet as an alternate encoding for the 100+ vector datasets published in the SDI.

What is GeoParquet?

For those who haven't been following, GeoParquet is a format which encodes vector data in Apache Parquet. There is no reinventing the wheel here: Apache Parquet is a free and open-source column-oriented data storage format, which provides efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk; GeoParquet simply adds spatial support on top of Parquet, leveraging the fact that most cloud data warehouses already understand Parquet to achieve interoperability. Although GeoParquet started as a community effort, it is now on the path to becoming an OGC Standard, and you can follow (or even contribute to) the spec at: https://github.com/opengeospatial/geoparquet

GeoParquet extends Parquet by adding some metadata about the file as a whole and about each geometry column; the amount of mandatory metadata is kept to a minimum, with some nice-to-have optional fields.
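To make this more concrete, here is a rough sketch (not part of our pipeline) of how the "geo" metadata can be inspected with pyarrow; the local filename is just a placeholder for one of the published datasets after downloading it.

    # Minimal sketch: inspect the GeoParquet "geo" metadata with pyarrow.
    # The filename is a placeholder for a locally downloaded dataset.
    import json

    import pyarrow.parquet as pq

    metadata = pq.ParquetFile("hex350_grid_obesity_1920.parquet").metadata.metadata
    geo = json.loads(metadata[b"geo"])  # GeoParquet stores its metadata under the "geo" key

    print("Spec version:", geo["version"])
    print("Primary geometry column:", geo["primary_column"])
    for name, info in geo["columns"].items():
        print(name, "->", info.get("encoding"), (info.get("crs") or {}).get("id"))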

Converting & Publishing the Data

Although a relatively new format, GeoParquet already sports a vibrant ecosystem of implementations to choose from. After a few experiments, we decided to use the GDAL library to convert the datasets, as it integrates better with our existing pipeline.

It should be noted that our source datasets are hosted in an S3 bucket in GeoJSON format, and we wanted to place the resulting GeoParquet files in S3 as well, so the idea was to read and write the files directly from S3.

Our pipeline uses GDAL wrapped in a bash script, which reads all the GeoJSON files in a folder of an S3 bucket and places the resulting GeoParquet files in a different folder of the same bucket.
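Our actual script wraps the GDAL command-line tools, but roughly the same conversion can be sketched with GDAL's Python bindings; the S3 prefixes below are placeholders and AWS credentials are expected in the environment.

    # Rough sketch of the GeoJSON -> GeoParquet conversion using GDAL's Python bindings;
    # our actual pipeline wraps the GDAL CLI in bash. The S3 prefixes are placeholders,
    # and GDAL picks up AWS credentials from the environment for /vsis3/ paths.
    from osgeo import gdal

    gdal.UseExceptions()

    SRC = "/vsis3/emotional-cities/geojson/hex350_grid_obesity_1920.geojson"     # hypothetical input prefix
    DST = "/vsis3/emotional-cities/geoparquet/hex350_grid_obesity_1920.parquet"

    # Requires a GDAL build with the (Geo)Parquet driver enabled.
    gdal.VectorTranslate(DST, SRC, format="Parquet")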

It should be noted that in order to support GeoParquet, GDAL > 3.8.4 should be used; to make things easier, we run the right GDAL version from a Docker container. The script is available in the etl-tools repository of the eMOTIONAL Cities project under an MIT license.

The GeoParquet files were validated directly from the S3 bucket using gpq, a lightweight tool written in Go which both creates and validates GeoParquet files. The gpq CLI was wrapped in another bash script, available here. It should be noted that no validation errors were found in the created GeoParquet files.
In order to make the GeoParquet datasets discoverable, they were added to each collection record of the eMOTIONAL Cities catalogue as a link with relation type "item" and media type "application/vnd.apache.parquet", which can be negotiated by clients. See an example below for the hex350_grid_obesity_1920 collection.

        {
            "href": "https://emotional-cities.s3.eu-central-1.amazonaws.com/geoparquet/hex350_grid_obesity_1920.parquet",
            "rel": "item",
            "type": "application/vnd.apache.parquet",
            "title": "GeoParquet download link for hex350_grid_obesity_1920"
        },

The chart above shows the size of one of our largest datasets (activity_level_ldn) in different formats. GeoParquet translates into a smaller size, even when compared with binary formats such as Shapefile or GeoPackage. These smaller sizes, especially when multiplied by a large number of datasets, translate into cost savings for hosting the data; they also provide a better experience for users who stream these datasets over the web for analysis.

The chart above shows the total size of eMOTIONAL Cities datasets in various formats.

Socializing the Results

Although the GeoParquet files are discoverable by machines through the OGC API – Records catalogue, more work needs to be done to ensure that humans are aware of them. These are a few initiatives that we have done, or plan to do, in order to socialise these results and encourage users to leverage the GeoParquet datasets that we expose in the SDI:

  • May 2024: GeomobLX – Lightning Talk about “GeoParquet”.
  • October 2024: eMOTIONAL Cities webinar about GeoParquet (TBD)
  • December 2024: FOSS4G World – “Adding GeoParquet to a Spatial Data Infrastructure: What, Why and How” (submitted talk)

The image below shows an eMOTIONAL Cities GeoParquet dataset in QGIS. The out-of-the-box support in widely used tools like QGIS is one of the most exciting things about GeoParquet, but we need to make sure that users know about it.

If you are curious to get your hands on some nice examples, the entire eMOTIONAL Cities catalogue is available to you. In each metadata record, you will find a link to the corresponding GeoParquet file, which you can download locally or stream to your Jupyter notebook.
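If you want a quick start, the snippet below is a minimal sketch of how one of those files could be pulled into a notebook with geopandas; the URL is the one from the catalogue record shown earlier.

    # Minimal sketch: download one of the published GeoParquet files and load it with geopandas.
    # The URL comes from the catalogue record shown earlier; geopandas with pyarrow is assumed.
    import urllib.request

    import geopandas as gpd

    URL = (
        "https://emotional-cities.s3.eu-central-1.amazonaws.com/"
        "geoparquet/hex350_grid_obesity_1920.parquet"
    )
    local_path, _ = urllib.request.urlretrieve(URL)  # download to a temporary file

    gdf = gpd.read_parquet(local_path)
    print(gdf.shape)
    print(gdf.head())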

This blog post, and the work leading to it, was made possible with the collaboration of my colleague Pascallike. A big thanks to him!

Creating Responsive Maps with Vector Tiles

Vector tiles have been around for a while, and they seem to combine the best of both worlds. They provide design flexibility, something we usually associate with vector data, while enabling fast delivery, as we generally see in raster services. The MVT specification, based on Google’s protobuf format, packages geographic data into pre-defined, roughly square-shaped “tiles” for transfer over the web.

The OGC API – Tiles standard enables sharing vector tiles while ensuring interoperability among services. It is a very simple standard, which formalizes what most applications are already doing in terms of tiling, while adding some interesting (optional) features. You can find more information at tiles.developer.ogc.org.
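To give an idea of what a client request looks like, here is a small sketch of fetching a single tile over HTTP; the base URL and tile indices are placeholders, while the path template comes from the standard.

    # Sketch of requesting one vector tile from an OGC API - Tiles endpoint.
    # The base URL and tile indices are placeholders; the path template
    # /collections/{collectionId}/tiles/{tileMatrixSetId}/{tileMatrix}/{tileRow}/{tileCol}
    # is the one defined by the standard.
    import requests

    BASE = "https://example.org/ogcapi"  # hypothetical pygeoapi deployment
    COLLECTION = "hex350_grid_cardio_1920"

    url = f"{BASE}/collections/{COLLECTION}/tiles/WebMercatorQuad/12/1362/2045"
    response = requests.get(url, headers={"Accept": "application/vnd.mapbox-vector-tile"})
    response.raise_for_status()

    print(len(response.content), "bytes of MVT data")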

If you want to publish vector tiles using this standard, you could use pygeoapi, which is a Python server implementation of the OGC API suite of standards and a reference implementation of OGC API – Tiles. With its plugin architecture, pygeoapi supports many different providers to render the tiles in the backend. One option is the Elasticsearch backend (MVT-elastic), which renders vector tiles on the fly from any index stored in Elasticsearch. Recently, this provider also gained support for retrieving the properties (i.e. fields) along with the geometry, which is needed for client-side styling.

You can check some OGC API – Tiles collections in the eMOTIONAL Cities catalogue. On this map, we show the results of urban health outcomes (Prevalence rates of cardiovascular diseases) in 350m hexagonal grids of Inner London. It is rendered according to the mean value.

On the developer console, we can inspect how the attribute values of the vector tiles are exposed to the client.

Another option for interactive maps that require access to attributes would be to retrieve GeoJSON from an OGC API – Features endpoint. In that case, the client would need to load all the features at the start and then carry these features in memory. With a high number of features, or many different layers, this could result in a less responsive application.

As an experiment, we loaded a web application with a base layer and two collections with 3480 and 3517 features (“hex350_grid_cardio_1920” and “hex350_grid_pm10_2019”). When the collections were loaded as vector tiles, the application took 20 milliseconds to load. On the other hand, when the collections were loaded as features it took 6887 milliseconds.

You can check out the code for this experiment at: https://github.com/emotional-cities/vtiles-example/tree/ecities and a map showing the vector tile layers at: https://emotional-cities.github.io/vtiles-example/demo-oat.htm

Geocoding in QGIS with OpenCage

Anyone working with geospatial data has probably encountered, at some point, the need for geocoding. The task of transforming an address (e.g. a placename, city or postcode) into a pair of coordinates (i.e. a point geometry) is called forward geocoding, while the task of transforming a pair of coordinates into an address is called reverse geocoding.

As of today, there is some support for geocoding in QGIS, using third-party geocoding APIs. A geocoding API is a service which receives as input an address or a pair of coordinates and returns a point or an address as a result. There are many commercial geocoding APIs on the market (including the well-known Google Maps API) and there is one free API (Nominatim) which relies on OSM data. There is no silver bullet when it comes to geocoding, and you should carefully evaluate the option that best suits your use case.

The table below shows different QGIS plugins which support geocoding. Some of them are focused on geocoding, while others do a bunch of other things.

Plugin       | Downloads | Last Release | Forward | Reverse | API Key | Focus on geocoding | Geocoding API
MMQGIS       | 157418    | 2021         | y       | y       | n       | n                  | Google/OSM/…
GeoCoding    | 146960    | 2018         | y       | y       | y       | y                  | OSM, Google
GoogleMaps   | 52717     | 2021         | y       | n       | y       | y                  | Google
Maptiler     | 15696     | 2022         | y       | n       | y       | n                  | Maptiler
Nominatim LF | 9883      | 2021         | y       | y       | n       | y                  | OSM
TravelTime   | 7460      | 2023         | y       | y       | y       | n                  | TravelTime
TomTom       | 1450      | 2020         | y       | n       | y       | y                  | TomTom
Comparison between geocoding plugins in QGIS (data from 09/01/2023)

After reviewing these plugins, it became clear that there was room for a plugin which would address the following items:

  • Bulk processing: Although on some occasions it may be useful to geocode a single instance, this is rarely the case in GIS projects. Moreover, this functionality can be accomplished by an online tool or even through bulk processing. This line of thought renders the locator filter less interesting than a bulk tool.
  • Responsive and performant: Some of the existing geocoding tools become unresponsive while handling a large number of rows. The ability to perform batch (i.e. asynchronous) geocoding can address some of these issues.
  • Forward/reverse geocoding: Forward geocoding is implemented far more often than reverse geocoding. This could be due to market demand, but also to technological reasons (e.g. reverse geocoding is not implemented in the QGIS core). Still, if it does not require too much effort, it would be nice to offer reverse geocoding to users, even if it is just for a few use cases.
  • Support for options: It would be nice to expose some of the options offered by the API through the plugin. These could include restricting results to a country (or bounding box) and the ability to control the output fields.
  • Help/Documentation: A lot of the existing plugins have UIs which are not intuitive and do not offer any useful help or documentation. This makes using the plugins (or even finding them) very challenging. Even some resources like a tutorial or a README page on GitHub, referenced from the plugin, could improve this situation.
  • Intuitive UI: One of the problems with QGIS plugins is the lack of standardisation of the UI. Some plugins add icons to the toolbar, others add entries to the plugins menu or even to other menus. Some plugins add all of these things and, instead of one widget, add multiple widgets. This can make finding, setting up and using a plugin very complicated. One way of overcoming this is to use the processing UI, which is more or less standard. Although the menu entries can be configured, the look & feel is always the same, and the plugin can always be found through the processing toolbox.

The OpenCage Geocoding plugin is a processing plugin that offers forward and reverse geocoding within QGIS. Being a processing plugin, it benefits from many features out-of-the-box, such as batch/asynchronous processing, integration with the modeller, or the ability to run from the Python console. It also features a standard UI, with inputs, outputs, options and feedback which should be familiar to processing users.
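For instance, a processing algorithm can also be called from the QGIS Python console; the algorithm id and parameter names below are illustrative guesses rather than the plugin's documented interface, so check processing.algorithmHelp() first.

    # Illustrative sketch of calling a processing algorithm from the QGIS Python console.
    # The algorithm id and parameter names are hypothetical placeholders; check the
    # plugin's actual interface with processing.algorithmHelp() before relying on them.
    from qgis import processing

    result = processing.run(
        "opencage:forwardgeocode",        # hypothetical algorithm id
        {
            "INPUT": "addresses",         # layer containing an address field
            "ADDRESS_FIELD": "address",   # hypothetical parameter name
            "OUTPUT": "memory:geocoded",
        },
    )
    print(result["OUTPUT"])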

This plugin relies on the OpenCage geocoding API, an API that offers geocoding worldwide based on different datasets. While OpenCage makes extensive use of Nominatim, it is worth mentioning that they contribute back to the project, both in terms of funding and of actual code.

Being a commercial API, you will need to sign up for a key before using this plugin. You can check the different plans on their website. If you choose a trial key, you can sign up without using a credit card, which is not always the case with other providers.

Although the plugin can be run with minimal configuration using the default options, the configuration parameters leverage the capabilities of the underlying API to generate results that best fit your use case. For instance, if you want to geocode addresses and you know that they are all within a given region, you can feed the algorithm a country name or even a bounding box. This bounding box can be hardcoded, but it can also be calculated from the layer extent, the canvas extent, or even drawn by hand.

Apart from the formatted address and the coordinates, the algorithm can optionally return additional structured information about the location in the results. This includes, for instance, the timezone, the flag of the country and the currency (you can read here about the different annotations that the API returns). As this may slow down the response, it is switched off by default, to ensure people only request it if they are really interested in this feature.

Whether you want to geocode addresses or coordinates, you may want the resulting address to be in a specific language. If you set the language parameter, the API will do its best to return results in that language.

I hope this plugin can be useful to users with different degrees of expertise: from the simplest use case, to the more advanced ones (through the options). Overall, the merits of this plugin are largely due to the capabilities of the processing toolbox and of the OpenCage API.

If you find any issues, please report them in the issue tracker of the project. This plugin is released under GPLv2. Feel free to fork it, look at the code and modify it for other use cases. If you feel like contributing back to the project, pull requests are also welcome (:

Happy geocoding!

Mapping the IVAucher

As a reaction to record-high fuel prices, the Portuguese government has updated the IVAucher program to allow each citizen to recover 10 cents per liter of fuel, up to a maximum of 5 EUR/month. This blog post is not going to discuss whether this is a good way of spending the public budget, or whether it will make a real impact on the lives of the people who manage to subscribe to this program. Instead, I want to focus on the data.

Once you subscribe to the program as a consumer, you just need to fill the tank at one of the gas stations that subscribed to the program as businesses. The IVAucher website publishes a list of subscribed stations, which seems to be updated from time to time. The list is published as a PDF, with 2746 records, ordered by “distrito” and “concelho” administrative units.

When I looked for the stations around me, in the “concelho” of Lisbon, I found 67 records. In order to know where to go, I would literally need to go through each one and check if I knew the address or the name of the station. Lisbon is a big city, and I admit that there are lots of street names that I don’t know – and I don’t need to, because this is “why” we have maps. My first thought was that this data belonged on a map, and my second thought was that the data should be published in such a way that it would enable other people to create maps – and this is how this project was born.

In the five-star deployment scheme for Open Data, PDF is at the very bottom, and it is easy to understand why. There is only so much you can do with a format that is largely unstructured.

In order to be able to process these data, I had to transform them into a structured format, preferably a non-proprietary one, so I chose CSV (3 stars). This was achieved using a combination of command-line processing tools (e.g. pdftotext, sed and grep).

The next step was to publish these data following the FAIR principles, so that they are Findable, Accessible, Interoperable and Reusable. In order to do that, I chose the OGC API Features standard, which allows publishing vector geospatial data on the web. This standard defines a RESTful API with JSON encodings, which fits the expectations of modern web applications. I used a Python implementation of OGC API Features, called pygeoapi.

Before getting the data into pygeoapi, I had to georeference it. In order to do forward geocoding, I used the OpenCage API, and more specifically its Python client, which is one of the many supported SDKs. After tweaking the parameters, the results were quite good, and I was even able to georeference some incomplete addresses, something that was not possible using the Nominatim OSM API.
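As an illustration of what the SDK looks like (with a placeholder API key and parameter values chosen just for the example), a forward geocoding call is roughly as follows.

    # Minimal forward-geocoding sketch with the OpenCage Python SDK (pip install opencage).
    # The API key is a placeholder; countrycode and language are two of the API
    # parameters that can be tweaked to improve results.
    from opencage.geocoder import OpenCageGeocode

    geocoder = OpenCageGeocode("YOUR-API-KEY")

    results = geocoder.geocode("Avenida da Liberdade 1, Lisboa", countrycode="pt", language="pt")
    if results:
        best = results[0]
        print(best["formatted"])
        print(best["geometry"]["lat"], best["geometry"]["lng"])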

The next thing was to get the data into a format which supports geometry. The CSV was transformed into GeoJSON using GDAL/ogr2ogr. I could have published it as GeoJSON in pygeoapi, but indexing it into a database adds support for more functionality, so I decided to store it in a MongoDB NoSQL data store. Everything was virtualized into Docker containers and orchestrated using this docker-compose file.
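The loading step itself is fairly simple; a rough sketch with pymongo (connection string, database and collection names are placeholders) would look like this.

    # Rough sketch of loading the GeoJSON features into MongoDB with pymongo.
    # The connection string, database and collection names are placeholders.
    import json

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    stations = client["ivaucher"]["gas_stations"]

    with open("gas_stations.geojson") as f:
        features = json.load(f)["features"]

    stations.insert_many(features)
    # A 2dsphere index lets MongoDB answer spatial queries on the point geometries.
    stations.create_index([("geometry", "2dsphere")])
    print(stations.count_documents({}), "stations loaded")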

The application was deployed in AWS and the collection is available at this endpoint:

https://features.byteroad.net/collections/gas_stations

This means that anyone is able to consume this data and create their own maps, whether they are using QGIS, ArcGIS, JavaScript, Python, etc. All they need is an application which implements the OGC API Features standard.
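As a simple example, here is a sketch of pulling the collection with plain HTTP from Python; the /items path and the f/limit query parameters are the ones defined by OGC API Features.

    # Sketch of consuming the published collection over plain HTTP.
    # /collections/{id}/items returns a GeoJSON FeatureCollection, as defined
    # by OGC API Features; f and limit are standard query parameters.
    import requests

    URL = "https://features.byteroad.net/collections/gas_stations/items"
    response = requests.get(URL, params={"f": "json", "limit": 100})
    response.raise_for_status()

    feature_collection = response.json()
    for feature in feature_collection["features"][:5]:
        print(feature["properties"], feature["geometry"]["coordinates"])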

I also created a map, using React.js and the Leaflet library. Although Leaflet does not support OGC API Features natively, I was able to fetch the data as GeoJSON, by following this approach.

The resulting application is available here:

https://ivaucher.byteroad.net

Now you can navigate through the map until you find your area of interest, or even type an address in the search box to let the map fly to that location.

Hopefully, this application will make the user experience of the IVAucher program a bit easier, while also demonstrating the importance of using standards to leverage geospatial information. Making data available on the web is good, but it is time we moved a step forward and questioned “how” we are making the data available, in order to ensure that its full potential is unlocked.

DevRel – What is that?

Almost a year ago, I heard the term DevRel for the first time when Sara Safavi, from Planet, gave a talk at CodeOp and used that word to describe her new role. I knew Sara as a developer, like myself, so I was curious to learn what this role entailed and understand how it could attract someone with a strong technical background.

It turns out that DevRel – Developer Relations – is as close as you can be to the developer world without actually writing code. All these things that I used to do in my spare time, like participating in hackathons, writing blog posts, joining conversations on Twitter and speaking at events, are now the core part of my job. I did them because they are fun, and also because I believe that, ultimately, writing code has an impact on society, and in order to run that last mile we need to get out of our compilers and reach out to the world. Technology is like a piece of art – it only fulfills its mission when it leaves the artist’s basement and reaches the museums, or at least the living room of someone who appreciates it.

I am happy to say that I am now the DevRel at the Open Geospatial Consortium. In a way, it is a bit ironic that I ended up taking this role in an organization that does not actually produce software as its main outcome. But in a way OGC is the ultimate software facilitator, producing the standards that developers use to build their interoperable, geospatially aware products and services. If you are reading this and you are not a geogeek, you may think of the W3C as a somewhat similar organization: it produces the HTML specification, which is not itself software, but how could we build all these frontend applications using React, Vue and so many other frameworks without HTML? It is that important. Now you may be thinking, “so tell me an OGC standard that I use, or at least know”, and, again, if you are not a geogeek, maybe you won’t know any of the standards I could mention – even if you use, or have used at some point, location data. And this is part of the reason why I am at OGC.

Location data is increasingly part of the mainstream. We all carry devices in our pockets that produce georeferenced data with an accuracy that was undreamed of ten years ago. Getting hold of these data opens a world of possibilities for data scientists and data engineers, but in order for all these applications to understand each other we need sound, well-articulated standards in place. My main goal as DevRel at OGC will be to bring the OGC standards closer to the developer community, by making them easier to use, and by making sure that they are actually used. And maybe, just maybe, I will also get to write some code along the way.