Comment
Ah, my mistake. But still, let's assume you want to get 100 rows in random order. With your solution, the script would execute more than 100 queries against the database. It would be slow. So why are you so reluctant to use triggers, which guarantee that you get what you want with one query?
Off topic: I am not a big fan of ORMs, and I am surprised that the Python/Django community tries to solve database problems with weird Python/Django ORM solutions rather than using the database itself, which results in greater performance and readability.
Parent comment
It's not bugged. See what I wrote to Dougal. It re-tries if it doesn't exist. The assumption is that the count is quite similar to the max.
Replies
Check the update I just added to the blog post. Selecting 100 random items only once takes the same amount of time.
How would one use triggers to solve this in a better way?
-- Trigger function that keeps a gap-free "position" column up to date.
-- ("table" below is a placeholder; use your real table name.)
CREATE OR REPLACE FUNCTION position_update() RETURNS TRIGGER AS $$
DECLARE
    next_position INTEGER;
BEGIN
    IF (TG_OP = 'INSERT') THEN
        -- New row: give it the next free position (current max + 1).
        SELECT (COALESCE(MAX(position), 0) + 1) INTO next_position FROM table;
        UPDATE table SET position = next_position WHERE id = NEW.id;
    ELSIF (TG_OP = 'DELETE') THEN
        -- Deleted row: close the gap by shifting every higher position down by one.
        UPDATE table SET position = position - 1 WHERE position > OLD.position;
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql VOLATILE
COST 100;
DROP TRIGGER IF EXISTS position_trigger ON table;
CREATE TRIGGER position_trigger AFTER INSERT OR DELETE ON table FOR EACH ROW EXECUTE PROCEDURE position_update();
Now we can:
1. Select max(position) from table;
2. Generate random numbers from 1 to that max;
3. Fetch all the rows in one query using a WHERE ... IN predicate (a rough sketch follows below).
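A minimal sketch of those three steps in Python, assuming psycopg2 and a hypothetical table called mytable whose position column is maintained by the trigger above:

import random

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection settings
cur = conn.cursor()

# 1. The maximum position equals the number of rows, since there are no gaps.
cur.execute("SELECT MAX(position) FROM mytable")
max_position = cur.fetchone()[0]

# 2. Pick 100 distinct positions at random.
positions = random.sample(range(1, max_position + 1), 100)

# 3. Fetch all the matching rows in one query; psycopg2 adapts a Python tuple
#    into a parenthesised list suitable for the IN operator.
cur.execute("SELECT * FROM mytable WHERE position IN %s", (tuple(positions),))
rows = cur.fetchall()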
On my desktop, the time to fetch 100 random rows from a 100,000-row table is... ~40ms.
That's quite a difference, isn't it?
I have no idea what that does. How do you use it to select random elements?
1. Create the trigger I've posted.
2. Select the maximum value of the "position" column.
3. Generate your random numbers, varying from 1 to the value from point 2.
4. Select rows from the table using the WHERE predicate:
"where position in (generated numbers from point 3)"
I don't know the Django ORM well enough to translate it into Python code :) But I assume you can use a "where in" predicate in it; something like the sketch below.
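For what it's worth, the Django ORM does support that via the __in lookup. Roughly, assuming a hypothetical model MyModel with the trigger-maintained position field:

import random

from django.db.models import Max
from myapp.models import MyModel  # hypothetical model

# 2. Maximum value of the "position" column.
max_position = MyModel.objects.aggregate(Max("position"))["position__max"]

# 3. Random numbers from 1 to that maximum.
positions = random.sample(range(1, max_position + 1), 100)

# 4. One query with a "where position in (...)" predicate.
random_rows = MyModel.objects.filter(position__in=positions)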
Again, I'm feeling really dim. What do you mean by "Select maximum value of "position" column"??
In python, to do `random.randint(1, SOME_BIG_NUMBER)` is very quick and that's not where the optimization win is.
I meant: select max(position) from table;
But I see my solution is not clear, so I will explain it once more.
We have a table with rows and an id column as the primary key. If no rows were ever deleted, we would have id values ranging from 1 to the number of rows, so max(id) would give us the row count. However, if some rows were deleted, then max(id) != row count; hence we can't use random.randint(1, max_id_value_returned_from_database), because there are gaps in the id sequence. But if we add an additional column, say "position", like in my example, and add the trigger to that table, we can maintain that column's value in the following fashion:
1. If a row is added, select the maximum existing value of the "position" column, increment it by one, and save it to the new record.
2. If a row is deleted, update all rows whose "position" value is bigger than the deleted row's, decrementing their "position" by 1. That is how I enforce a sequence with no gaps in it.
Now that we have a sequence with no gaps, we can get the maximum value of that sequence, generate some random numbers in Python, and retrieve all the rows with one query. We could also add an index on that column, but since we are retrieving random rows, random fetches are inevitable for the database anyway.
Some gaps are OK. Read my code again. If COUNT(id) is 10,000 and MAX(id) is 11,000, that means you have about 10% blanks in there, so the loop that keeps going until it has fetched TIMES random entries only has to loop about 10% more than TIMES times, e.g. about 11 iterations (10 + 10% of 10) to fetch 10 entries.
If it is the kind of application where deletes are very common, look at Simon's solution.
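For reference, the retry loop being defended here looks roughly like this. This is only a sketch (the real code is in the blog post), and the model name is hypothetical:

import random

from django.db.models import Max
from myapp.models import MyModel  # hypothetical model

def random_entries(times=10):
    # Pick ids at random between 1 and MAX(id) and simply try again whenever
    # the pick falls into a gap left by a delete.
    max_id = MyModel.objects.aggregate(Max("id"))["id__max"]
    entries = []
    tried = set()
    while len(entries) < times:
        pk = random.randint(1, max_id)
        if pk in tried:
            continue
        tried.add(pk)
        try:
            entries.append(MyModel.objects.get(pk=pk))
        except MyModel.DoesNotExist:
            pass  # hit a gap; loop and pick another id
    return entries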
Actually, your algorithm is broken. Suppose you add:
row1, row2, row3, row4.
The maximum "position" value will be 4 (randint(1, 4)).
Then you delete row3. The maximum "position" value will be decremented to 3 (randint(1, 3)). How then will it ever randomly pick row4?
Read my post again. If you delete row3, the "position" value for row4 will be decremented by 1, so after the delete it will have a value of 3. Everything is in order again.
I see! So the position is kept with the table.
But now you need an index on the position field too, otherwise you can't pick from it quickly. Seems tight, but extremely convoluted.
"On my desktop time to fetch 100random rows from 100000row table is.... ~40ms"
This was done without an index and it worked perfectly fine. Also, I'm glad that I finally made myself clear about this solution :)
So if I understand, you're gaining some optimization for the random SELECT but you're causing an UPDATE on a possibly large number of rows for every DELETE? Example: if you DELETE the first row in the table, you now have to UPDATE every other row. Seems like this would be a high cost to pay, especially if your app does frequent deletes.
You are correct. There never is a golden solution for every case. All tables have their specifics, and where one solution is great, it is a failure in another.
Rather than decrementing the position of every row above the deleted one, you should just give the highest one the position of the deleted one. So if you have rows 1, 2, 3, 4 and you delete 2, you have 1, 3, 2. That way you only have to make one update.
I don't think that would work. I.e., we've got 1,2,3,4 and we delete 2;
1,3,4 are left and we are missing 2. Following your advice, I can update the last value (4) and get 1,3,2.
Now I delete 1. 3,2 are left, and updating the last value won't give us a correct sequence.
No, you would replace the 3 with 1. That is the highest position value in the sequence 1,3,2. Not the *last* value in position, the *highest*.
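If you wanted to try that, here is a sketch of the single-update delete, done at the application level with the ORM rather than in the trigger (MyModel is a hypothetical model with the position field):

from django.db import transaction
from myapp.models import MyModel  # hypothetical model

def delete_and_keep_positions_dense(obj):
    # Instead of decrementing every higher position, move the row with the
    # *highest* position into the slot that the deleted row frees up.
    # Only one extra UPDATE per delete.
    with transaction.atomic():
        freed = obj.position
        obj.delete()
        top = MyModel.objects.order_by("-position").first()
        if top is not None and top.position > freed:
            top.position = freed
            top.save(update_fields=["position"])

The trade-off is that the position order no longer reflects insertion order, which doesn't matter for random selection.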
@offtopic the reason for using an ORM is not that we are afraid of raw SQL (though you should sometimes...), but the power of the models and queryset classes. We can manipulate models and querysets in a pythonic way, and reuse all kinds of stuff on different models and queries. A queryset can be used to just get a list of instances, filter the change list in the admin, provide choices for a dropdown box, be used in all kinds of templatetags, etc.
Raw SQL queries have a place in django applications, but there are very good reasons to stick with the ORM most of the time...
I totally agree. Not depending on SQL means I can run unit tests in SQLite3 in memory and then manually test and deploy on Postgres.
Staying within the ORM usually means you're not complicating things for yourself. Instead of solving problems by abandoning the ORM and digging deep into the SQL perhaps instead you should walk around the problem and look at it in a different way.
Well, I know what an ORM is used for, but I often see people use it too much. What you are saying is all fine and dandy when you've got simple business logic, like getting a list of something, simple predicates and such. But when you need to join 10 tables in a sophisticated manner, ORMs are useless :)
I can't expect some code to do everything for you, and even if it does, don't expect it to be the best available way to achieve your goal.
Probably I am lucky that I never needed to join 10 tables in a single query. If these questions come up in IRC, most of the time, whoever asked it was thinking much much too complicated.
If however, I really had to join 10 tables, punching out a complex ORM queryset would be the least of my problems. But again, I still expect such cases to be a very small minority.
Most problems I ever encountered map nicely to the ORM if you put a little thought in it.
I guess my point was that, if you need to join 10 tables you probably have made the app too complex. To get out of it, an ORM certainly won't help.
By the way, a good ORM makes it possible to inject arbitrary raw SQL if there's a need for speed or a need for a hack.
Or I had to develop a big app with weird client requests :)
For the second part, about hacking the ORM: I don't know why using some fancy database features like CTEs, window functions, SQL functions, views, or database extensions counts as "hacking". Mixing raw SQL with the ORM is, IMHO, also out of taste; once again, what's the point? If I need data only for displaying (I do not want to change that data), why would I go through the pain of writing a complex ORM query when I can write a nice, easily readable SQL query that uses 100% of the database? Again, if I want to build an efficient application, limiting myself only to the features the ORM provides is wrong. Why would I forbid myself from using all the goodies Postgres has? Or if I were developing with MySQL, Oracle, etc., why should I limit myself only to standard SQL queries?
On the other hand, I have no grudge against ORMs. I use them every day, but for tasks they fit perfectly. I've seen many blog posts presenting pythonic solutions to problems, taking tens of lines, which could have been done in a 10-line trigger...
Why? Because it reduces complexity from a code-reading point of view.
Compare the readability of your trigger function with a snippet of plain python ORM code. A new developer joining would have to study the SQL functions, the triggers, the application code.
Large SQL statements are harder to read and harder to debug. You can't put in step-by-step debuggers, debugging print statements, monkeypatching or unit testing.
That's why.
By the way, I maintain a very large web app where the SQL is hand-coded (it started before ORMs became popular). Hundreds of .sql files. It's fast because the database gets to do so much of the processing. But for basic things (editing a user profile) it's tedious to have to type all the SQL, and the app now only works in Postgres, meaning I can't run my unit tests against an in-memory SQLite3 database.
Our discussion is going nowhere. No offence, but if SQL statements are hard to read and hard to debug, then don't use SQL at all if it's too hard to learn :) The only reason people are so afraid of using SQL is that they don't know it very well, so using an ORM is very cosy.
I know SQL extremely well but I still believe ORMs are generally better than hand-coding SQL and using triggers, functions and views. Especially if it's an app.