Comment

Peter Bengtsson

Count is slower than Max. Also, a count will yield a lower number than MAX(id), meaning the random range is reduced. With slicing, how would you make it random?

Parent comment

hutch

can't you just take the count of your query set, and then pick random rows via slicing?

Replies

hutch

example

import random

count = Model.objects.all().count()
random_row = Model.objects.all()[int(random.random() * count)]

the idea is that since query sets are lazy, only that one row will actually be retrieved.

the reason we want to use count() instead of max() is that the max pk may be significantly higher than the total number of rows available to slice, e.g. when rows have been deleted.

if you wanted 10 random rows, you could just generate 10 random numbers, making sure they're unique, and then iterate over the list of randoms, taking that row number from the query set and appending it to a list (rough sketch below).
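a rough sketch of that idea, using a hypothetical model and assuming no rows get deleted between the count and the slices:

import random

def random_rows(model, n):
    # one COUNT(*) query, then one LIMIT 1 OFFSET x query per unique offset
    count = model.objects.count()
    offsets = random.sample(range(count), min(n, count))
    return [model.objects.all()[offset] for offset in offsets]

each single-index slice is its own LIMIT 1 OFFSET query, so that's n+1 queries in total.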

i actually think i first saw this technique in the docs, as a way to avoid mysql's slow randomizing.

Peter Bengtsson

a) count() is much slower than max() or using the sequence currval()

b) if you have the IDs 1, 2, 4 the count will be 3, so you'd have to run randint(1,3). That way you're never picking up ID=4 (see the sketch below).

c) what does that slice translate to in SQL? Is it "LIMIT 1 OFFSET <python random number>"? If so, don't you have to order?
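For comparison, a minimal sketch of the max()-based approach that point b) is getting at, assuming a hypothetical Model, integer ids starting at 1, at least one row, and a Django version with aggregation support:

import random
from django.db.models import Max

def random_instance(model):
    # on Postgres, MAX(id) can be answered from the primary-key index,
    # whereas COUNT(*) has to scan
    max_id = model.objects.aggregate(Max('id'))['id__max']
    while True:
        try:
            # gaps left by deleted rows just miss and retry
            return model.objects.get(pk=random.randint(1, max_id))
        except model.DoesNotExist:
            pass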

hutch

i don't know about a) at all; it's something to test. i'm just throwing out an alternate technique to try.

b) it's zero-indexed, and you should be adding a check that int(random.random() * count) < count

c) you're right about OFFSET and LIMIT, but you don't need an ORDER BY because you're pulling a random one anyway, so it's irrelevant.
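one way to check c) without guessing: slice with a range instead of a single index so the query set stays lazy, then look at the SQL django builds (hypothetical Model again):

import random

count = Model.objects.count()
offset = random.randint(0, count - 1)
qs = Model.objects.all()[offset:offset + 1]  # still lazy, unlike [offset]
# prints roughly: SELECT ... FROM app_model LIMIT 1 OFFSET <offset>, with no ORDER BY
print(qs.query)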

Peter Bengtsson

a) trust me, it is. considerably

c) have you tried it? It's ultra-slow

hutch

on c), comparing taking a slice to get(pk=X):

i just tried it (this is on sqlite though, which just occurred to me); could someone with a real db test this?

from time import time

t0 = time()
blah = Model.objects.all()[3].pk  # slice: LIMIT 1 OFFSET 3
print '%f seconds' % (time() - t0)

t0 = time()
blah = Model.objects.get(pk=2).pk  # plain primary-key lookup
print '%f seconds' % (time() - t0)

which yielded these three trials (slice first, then get(pk)):

0.000898 seconds
0.001472 seconds

0.001804 seconds
0.001447 seconds

0.001948 seconds
0.001504 seconds

when i write out the function, similar to the using_max function, i get these results (first is using count and slices, second is the using_max function):

0.007093 seconds
0.010980 seconds

0.006881 seconds
0.011418 seconds

0.006947 seconds
0.011399 seconds

count = model.objects.all().count()
i = 0
while i < TIMES:
    try:
        # a single-index slice becomes LIMIT 1 OFFSET <n>
        yield model.objects.all()[random.randint(0, count - 1)].pk
        i += 1
    except IndexError:
        # an out-of-range slice raises IndexError, not DoesNotExist
        pass
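for reference, a guess at the shape of the max()-based counterpart being compared against; this is not the actual using_max code from the post, just something with the same generator shape (hypothetical model, integer pks starting at 1):

import random
from django.db.models import Max

def using_max_like(model, times):
    # one MAX(id) query up front; ids that hit a gap just retry
    max_id = model.objects.aggregate(Max('id'))['id__max']
    i = 0
    while i < times:
        try:
            yield model.objects.get(pk=random.randint(1, max_id)).pk
            i += 1
        except model.DoesNotExist:
            pass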

like i said though, this is on sqlite, with probably not a representative dataset.

Peter Bengtsson

Do it on a proper database like Postgres, copy and paste the SQL that Django generates, then run it like this:

mydatabase# EXPLAIN ANALYZE <the big sql statement>;

hutch

no need; now that i've filled up the database with fake data, we get this:

0.862994 seconds
0.104061 seconds

0.894336 seconds
0.114008 seconds

0.809722 seconds
0.102276 seconds

Peter Bengtsson

Are you sure you're doing that right? I just tried it myself and got this:

COUNT 84482
using_max() took 0.613966941833 seconds
using_max2() took 2.08254098892 seconds
using_count_and_slice() took 14.112842083 seconds

Code here: http://www.peterbe.com/plog/getting-random-rows-postgresql-django/get_random_ones.py

hutch

no, i think the numbers jibe, and you're right.

it seems to be about a 7x slowdown using the count/slice method i was talking about.

Peter Bengtsson

Don't think so; out of the 14 seconds, only about 0.4 seconds is spent getting the counts.