Usually piece of data in SQL is represented as a row in a table. But often it's convenient to represent it as a cell in pivot table. And crosstab
could help us in it.
Let's see how it works.
Preparing data
First of all let's prepare data. Here I created tables a
and b
, and so-called "join table" c
, which is related to a
and b
.
create sequence myseq start 1; create table a ( id bigint default nextval('myseq'::regclass) not null primary key, name text NOT NULL ); create table b ( id bigint default nextval('myseq'::regclass) not null primary key, name text NOT NULL ); create table c ( a_id bigint not null references a(id), b_id bigint not null references b(id), value text );
Now let's seed our tables with data.
insert into a (name) values ('a1'), ('a2'), ('a3'); -- Result: -- +--+----+ -- |id|name| -- +--+----+ -- |1 |a1 | -- |2 |a2 | -- |3 |a3 | -- +--+----+ insert into b (name) values ('b1'), ('b2'), ('b3'); -- Result: -- +--+----+ -- |id|name| -- +--+----+ -- |4 |b1 | -- |5 |b2 | -- |6 |b3 | -- +--+----+
I used one shared increment sequence for both tables, which is why the id
in table b
starts from 4 instead of 1.
Just keep this in mind.
Next let's populate the join table c
.
insert into c (a_id, b_id, value) values (1, 4, 'v14'), (1, 5, 'v15'), (1, 6, 'v16'), (2, 4, 'v24'), (2, 5, 'v25'), (2, 6, 'v26'), (3, 4, 'v34'), (3, 5, 'v35'), (3, 6, 'v36');
Case #0 Cartesian product
Now let's make a simple inner join.
select a.name as a_name, b.name as b_name, c.value from c inner join a on a.id = c.a_id inner join b on b.id = c.b_id order by 1, 2; -- Result: -- +------+------+-----+ -- |a_name|b_name|value| -- +------+------+-----+ -- |a1 |b1 |v14 | -- |a1 |b2 |v15 | -- |a1 |b3 |v16 | -- |a2 |b1 |v24 | -- |a2 |b2 |v25 | -- |a2 |b3 |v26 | -- |a3 |b1 |v34 | -- |a3 |b2 |v35 | -- |a3 |b3 |v36 | -- +------+------+-----+
In the output result we see all the data in rows. Each row is a piece of information.
There is nothing special here
Case #1 Our pivot table using crosstab
Now let's make a pivot table from it
select * from crosstab( 'select a.name as a_name, b.name as b_name, c.value from c inner join a on a.id = c.a_id inner join b on b.id = c.b_id order by 1', 'select name from b order by 1' ) as crosstab_alias (a_name text, b1 text, b2 text, b3 text); -- Result: -- +------+---+---+---+ -- |a_name|b1 |b2 |b3 | -- +------+---+---+---+ -- |a1 |v14|v15|v16| -- |a2 |v24|v25|v26| -- |a3 |v34|v35|v36| -- +------+---+---+---+
Wow! This result looks much better!
Let's look at the query
crosstab(text source_sql, text category_sql)
Crosstab
function produces table and have 2 parameters : source_sql
and category_sql
.
-
source_sql
- SQL used to retrieve the source data, in our case it's our previously written query with inner joins. -
category_sql
- SQL used to retrieve the category names that will be used as column headers.
Source data has some restrictions:
- The first column must be the row identifier (in our example
a_name
). It should be unique for each row in pivot table. - After row identifier column could be some extra fields. In our example we don't have them.
- The next mandatory column is the category identifier (in our example
b_name
). It should be unique for each column in pivot table. - The last mandatory column is the value (in our example
value
). It can be any data type.
One more important thing is table alias and column aliases in our example it's the line
as crosstab_alias (a_name text, b1 text, b2 text, b3 text)
The alias crosstab_alias
is quite common thing in SELECT
queries. It's nothing special — aliases are often assigned to tables in queries for easier reference later on.
The next mandatory thing is column description with types. So it should contain:
- row identifier (for ex.
a_name text
) - optional extra fields, if you have some
- categories columns (for ex.
b1 text, b2 text, b3 text
)
That's it!
Case #2 Nonexistent data in the pivot table
Now let's play with the data. What if we don't have full table?
delete from c; --delete previous data insert into c (a_id, b_id, value) values (1, 4, 'v14'), -- (1, 5, 'v15'), removed line (1, 6, 'v16'), (2, 4, 'v24'), (2, 5, 'v25'), -- (2, 6, 'v26'), removed line -- (3, 4, 'v34'), removed line (3, 5, 'v35'), (3, 6, 'v36');
When we run our previous crosstab
query on such data, we will get:
-- Result: -- +------+----+----+----+ -- |a_name|b1 |b2 |b3 | -- +------+----+----+----+ -- |a1 |v14 |null|v16 | -- |a2 |v24 |v25 |null| -- |a3 |null|v35 |v36 | -- +------+----+----+----+
So now we see that empty slots with no data are replaced by NULL
s!
Looks great for me!
Case #3 Duplicated data in pivot table
Previously we noted that
- row identifier of pivot table should be unique
- category column of pivot table should be unique It's quite reasonable! Otherwise what should be shown in the cell?
But what if in our database we have two entries for the same row and column?
delete from c; --delete previous data if you need insert into c (a_id, b_id, value) values (1, 4, 'v14'), (1, 5, 'v15'), (1, 6, 'v16'), (1, 6, 'v16 duplicate'), (2, 4, 'v24'), (2, 5, 'v25'), (2, 6, 'v26'), (3, 4, 'v34'), (3, 5, 'v35'), (3, 6, 'v36');
What does mean this duplicated data entry in our case, it looks like a nonsense as for me.
But assume we have something like that and let's see what will happen.
Our crosstab
query will return result:
-- Result: -- +------+---+---+-------------+ -- |a_name|b1 |b2 |b3 | -- +------+---+---+-------------+ -- |a1 |v14|v15|v16 duplicate| -- |a2 |v24|v25|v26 | -- |a3 |v34|v35|v36 | -- +------+---+---+-------------+
Ouch! From two data entries for relation a1
-b3
only the latest is shown.
Why the latest? Why not the first? No idea, I don't know, it's on PostgreSQL implementation. Maybe it uses the latest value, maybe chooses randomly.
Anyway it sounds bad. You shouldn't have any duplicated entries in you table. Maybe entries should be aggregated somehow, or filtered.
That's all!
Links
Here is a link to the code used in this article:
Top comments (0)