Optimization of getting random rows out of a PostgreSQL in Django

Wednesday, Feb 23, 2011

⬅︎ Back to Optimization of getting random rows out of a PostgreSQL in Django

Comment

Peter Bengtsson February 23, 2011

Check the update I just added to the blog post. Selecting 100 random items only once takes the same amount of time.

How would one use triggers to solve this in a better way?

Parent comment

Ken Swift February 23, 2011

Aff, my mistake. But still let assume You want to get 100 rows in random order. With Your solution script would execute more then 100 queries to database. It would be slow. So why are so reluctant to use trigers wich guarantees you to get what You want with one query? Offtopic: I am not big fan of orms, and I am surprise that python/django community tries to solve database problems with weird python/django orm solutions rather then use database itself, which results in greater performance and readability.

Replies

Ken Swift February 23, 2011

CREATE OR REPLACE FUNCTION position_update() RETURNS TRIGGER AS $$
DECLARE
next_position INTEGER;
BEGIN
IF (TG_OP = 'INSERT') THEN
SELECT (COALESCE(MAX(position), 0) + 1) INTO next_position FROM table;
UPDATE table SET position = next_position WHERE id = NEW.id;
ELSEIF (TG_OP = 'DELETE') THEN
UPDATE table SET position = position - 1 WHERE position > OLD.position ;
END IF;
RETURN NULL;
END;
$$
LANGUAGE 'plpgsql' VOLATILE
COST 100;

DROP TRIGGER IF EXISTS position_trigger ON table;
CREATE TRIGGER position_trigger AFTER INSERT OR DELETE ON table FOR EACH ROW EXECUTE PROCEDURE position_update();

Now we can:
1. Select max(position) from table;
2. generate random numbers from 1 to max
3. fetch all rows in one query using where in

On my desktop time to fetch 100random rows from 100000row table is.... ~40ms

Its a difference? Isn't it ?

Peter Bengtsson February 23, 2011

I have no idea what that does. How do you use it to select random elements?

Ken Swift February 23, 2011

1. Create triger I've posted
2. Select maximum value of "position" column
3. GEnerate your random numbers varying from 1 to value form point 2.
4. Select rows from table using where predicate:
"where position in (generated numbers from point 3)"

I dont know django ORM so well so I can't translate it to python code :) But i assume u can use "where in" predicate in it.

Peter Bengtsson February 23, 2011

Again, I'm feeling really dim. What do you mean by "Select maximum value of "position" column"??

In python, to do `random.randint(1, SOME_BIG_NUMBER)` is very quick and that's not where the optimization win is.

Ken Swift February 23, 2011

I've meant: select max(position) from table;

But I see my solution is not clear so i will explain once more.

We have table with rows and id column as pk. If no rows were ever deleted then we would have id values varying from 1 to number of rows. So max(id) gives us number of rows. However if some rows were deleted then max(id) != rows number, hence we cant use random.randint(1, max_id_value_returned_from_database) because there are some gaps in id sequence. But if we add additional column say "position" like in my example and add trigger to that table we could maintain this column value in given fashion:

1. if row is added, select maximum existing value for column position. Then increment it by one and save it to the new record.
2. if row is deleted then update all rows that have "position" value bigger then deleted row. All updated rows will reduce "position" value by 1. That is how i enforce sequence with no gaps in it.

Now that we have sequence with no gaps we can get maximum value of this sequence, generate some random number in python, and retrieve all rows with one query. Also we could add index for that column, but since we are retrieving randoms rows random fetches for database are inevitable.

Peter Bengtsson February 23, 2011

Some caps are OK. Read my code again. If COUNT(id) is 10,000 and MAX(id) is 11,000 that means that you have about 10% blanks in there so the loop that keeps going till it has fetched 'TIMES' random entries will have to loop for about 11 (10 + 10% of 10) times.

If is the kind of application where deletes are very common, look at Simon's solution.

Peter Bengtsson February 23, 2011

Actually, your algorithm is broken. Suppose you add:
row1, row2, row3, row4.
The "position" value will be 4 (randint(1, 4))
Then you delete row3. The "position" value will be decrement to 3 (randint(1, 3)). How then will it ever randomly pick row4?

Ken Swift February 24, 2011

Read my post again. If you delete row3, value of "position" for row4 will be decremented by 1, so, after delete it will have value of 3. This leads us to place where everything is alright.

Peter Bengtsson February 24, 2011

I see! So the position is kept with the table.

But now you need an index on the position field too otherwise you can't pick from it quickly. Seems tight but extremely convoluted.

Ken Swift February 24, 2011

"On my desktop time to fetch 100random rows from 100000row table is.... ~40ms"

This was done without index and worked perfectly fine. Also, I'am glad that I finally make myself clear about this solution :)

Jay Wineinger February 24, 2011

So if I understand, you're gaining some optimization for the random SELECT but you're causing an UPDATE on a possibly large number of rows for every DELETE? Example: if you DELETE the first row in the table, you now have to UPDATE every other row. Seems like this would be a high cost to pay, especially if your app does frequent deletes.

Ken Swift February 24, 2011

You are correct. There never is golden solution for each case. All tables have theirs specifics and where on solutions is great it is a failure in another one.

Nathan Florea March 19, 2011

Rather than decrementing the position of every row above the deleted one, you should just give the highest one the position of the deleted one. So if you have rows 1, 2, 3, 4 and you delete 2, you have 1, 3, 2. That way you only have to make one update.

Ken Swift March 19, 2011

I dont think that would work, ie. we got 1,2,3,4. We delete 2
1,3,4 are left and we are missing 2. Flowing ur advice i can update last value (4) and get 1,3,2.

Now, I delete 1. 3,2 are left and updating last value wont give us a correct sequence

Nathan Florea March 19, 2011

No, you would replace the 3 with 1. That is the highest position value in the sequence 1,3,2. Not the *last* value in position, the *highest*.