DEV Community

Marat Latypov
Marat Latypov

Posted on • Edited on

Understanding the PostgreSQL Crosstab Function

Usually piece of data in SQL is represented as a row in a table. But often it's convenient to represent it as a cell in pivot table. And crosstab could help us in it.

Let's see how it works.

Preparing data

First of all let's prepare data. Here I created tables a and b, and so-called "join table" c, which is related to a and b.

create sequence myseq start 1; create table a ( id bigint default nextval('myseq'::regclass) not null primary key, name text NOT NULL ); create table b ( id bigint default nextval('myseq'::regclass) not null primary key, name text NOT NULL ); create table c ( a_id bigint not null references a(id), b_id bigint not null references b(id), value text ); 
Enter fullscreen mode Exit fullscreen mode

Now let's seed our tables with data.

insert into a (name) values ('a1'), ('a2'), ('a3'); -- Result: -- +--+----+ -- |id|name| -- +--+----+ -- |1 |a1 | -- |2 |a2 | -- |3 |a3 | -- +--+----+ insert into b (name) values ('b1'), ('b2'), ('b3'); -- Result: -- +--+----+ -- |id|name| -- +--+----+ -- |4 |b1 | -- |5 |b2 | -- |6 |b3 | -- +--+----+ 
Enter fullscreen mode Exit fullscreen mode

I used one shared increment sequence for both tables, which is why the id in table b starts from 4 instead of 1.

Just keep this in mind.

Next let's populate the join table c.

insert into c (a_id, b_id, value) values (1, 4, 'v14'), (1, 5, 'v15'), (1, 6, 'v16'), (2, 4, 'v24'), (2, 5, 'v25'), (2, 6, 'v26'), (3, 4, 'v34'), (3, 5, 'v35'), (3, 6, 'v36'); 
Enter fullscreen mode Exit fullscreen mode

Case #0 Cartesian product

Now let's make a simple inner join.

select a.name as a_name, b.name as b_name, c.value from c inner join a on a.id = c.a_id inner join b on b.id = c.b_id order by 1, 2; -- Result: -- +------+------+-----+ -- |a_name|b_name|value| -- +------+------+-----+ -- |a1 |b1 |v14 | -- |a1 |b2 |v15 | -- |a1 |b3 |v16 | -- |a2 |b1 |v24 | -- |a2 |b2 |v25 | -- |a2 |b3 |v26 | -- |a3 |b1 |v34 | -- |a3 |b2 |v35 | -- |a3 |b3 |v36 | -- +------+------+-----+ 
Enter fullscreen mode Exit fullscreen mode

In the output result we see all the data in rows. Each row is a piece of information.

There is nothing special here

Case #1 Our pivot table using crosstab

Now let's make a pivot table from it

select * from crosstab( 'select a.name as a_name, b.name as b_name, c.value from c inner join a on a.id = c.a_id inner join b on b.id = c.b_id order by 1', 'select name from b order by 1' ) as crosstab_alias (a_name text, b1 text, b2 text, b3 text); -- Result: -- +------+---+---+---+ -- |a_name|b1 |b2 |b3 | -- +------+---+---+---+ -- |a1 |v14|v15|v16| -- |a2 |v24|v25|v26| -- |a3 |v34|v35|v36| -- +------+---+---+---+ 
Enter fullscreen mode Exit fullscreen mode

Wow! This result looks much better!

Let's look at the query

crosstab(text source_sql, text category_sql) 
Enter fullscreen mode Exit fullscreen mode

Crosstab function produces table and have 2 parameters : source_sql and category_sql.

  • source_sql - SQL used to retrieve the source data, in our case it's our previously written query with inner joins.
  • category_sql - SQL used to retrieve the category names that will be used as column headers.

Source data has some restrictions:

  • The first column must be the row identifier (in our example a_name). It should be unique for each row in pivot table.
  • After row identifier column could be some extra fields. In our example we don't have them.
  • The next mandatory column is the category identifier (in our example b_name). It should be unique for each column in pivot table.
  • The last mandatory column is the value (in our example value). It can be any data type.

One more important thing is table alias and column aliases in our example it's the line

as crosstab_alias (a_name text, b1 text, b2 text, b3 text) 
Enter fullscreen mode Exit fullscreen mode

The alias crosstab_alias is quite common thing in SELECT queries. It's nothing special — aliases are often assigned to tables in queries for easier reference later on.

The next mandatory thing is column description with types. So it should contain:

  • row identifier (for ex. a_name text)
  • optional extra fields, if you have some
  • categories columns (for ex. b1 text, b2 text, b3 text)

That's it!

Case #2 Nonexistent data in the pivot table

Now let's play with the data. What if we don't have full table?

delete from c; --delete previous data insert into c (a_id, b_id, value) values (1, 4, 'v14'), -- (1, 5, 'v15'), removed line (1, 6, 'v16'), (2, 4, 'v24'), (2, 5, 'v25'), -- (2, 6, 'v26'), removed line -- (3, 4, 'v34'), removed line (3, 5, 'v35'), (3, 6, 'v36'); 
Enter fullscreen mode Exit fullscreen mode

When we run our previous crosstab query on such data, we will get:

-- Result: -- +------+----+----+----+ -- |a_name|b1 |b2 |b3 | -- +------+----+----+----+ -- |a1 |v14 |null|v16 | -- |a2 |v24 |v25 |null| -- |a3 |null|v35 |v36 | -- +------+----+----+----+ 
Enter fullscreen mode Exit fullscreen mode

So now we see that empty slots with no data are replaced by NULLs!

Looks great for me!

Case #3 Duplicated data in pivot table

Previously we noted that

  • row identifier of pivot table should be unique
  • category column of pivot table should be unique It's quite reasonable! Otherwise what should be shown in the cell?

But what if in our database we have two entries for the same row and column?

delete from c; --delete previous data if you need insert into c (a_id, b_id, value) values (1, 4, 'v14'), (1, 5, 'v15'), (1, 6, 'v16'), (1, 6, 'v16 duplicate'), (2, 4, 'v24'), (2, 5, 'v25'), (2, 6, 'v26'), (3, 4, 'v34'), (3, 5, 'v35'), (3, 6, 'v36'); 
Enter fullscreen mode Exit fullscreen mode

What does mean this duplicated data entry in our case, it looks like a nonsense as for me.

But assume we have something like that and let's see what will happen.

Our crosstab query will return result:

-- Result: -- +------+---+---+-------------+ -- |a_name|b1 |b2 |b3 | -- +------+---+---+-------------+ -- |a1 |v14|v15|v16 duplicate| -- |a2 |v24|v25|v26 | -- |a3 |v34|v35|v36 | -- +------+---+---+-------------+ 
Enter fullscreen mode Exit fullscreen mode

Ouch! From two data entries for relation a1-b3 only the latest is shown.

Why the latest? Why not the first? No idea, I don't know, it's on PostgreSQL implementation. Maybe it uses the latest value, maybe chooses randomly.

Anyway it sounds bad. You shouldn't have any duplicated entries in you table. Maybe entries should be aggregated somehow, or filtered.

That's all!

Links

Here is a link to the code used in this article:

Top comments (0)