Postgres. Sample N random records

While working on a project, I needed to write something resembling a testing system. The task was formulated like this:

  • out of the N records in the database, choose m (3-5) random rows in a series of k samples (mostly k=2).

Now the same thing in plain human language: from a table we need to choose 3-5 random records. There must be no duplicates, and the selection must happen at random.

The first thing that comes to mind:

SELECT *
FROM data_set
WHERE id NOT IN (1, 2, 3, 4, 5)
ORDER BY random()
LIMIT 5;

And it will even work. But the price of this solution is steep: ORDER BY random() forces Postgres to read the entire table and sort all of it just to return five rows.
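To see where the time goes, you can inspect the plan. A minimal sketch with psycopg2 (the connection parameters here are illustrative placeholders):

import psycopg2

# illustrative connection; substitute your own credentials
conn = psycopg2.connect(dbname='test_db', user='user_test', host='localhost')
cur = conn.cursor()
cur.execute(
    "EXPLAIN ANALYZE "
    "SELECT * FROM data_set "
    "WHERE id NOT IN (1, 2, 3, 4, 5) "
    "ORDER BY random() LIMIT 5;"
)
for (line,) in cur.fetchall():
    print(line)  # expect a Seq Scan over the whole table feeding a top-N sort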

So I had to appeal to a "higher intelligence", which hinted at a solution.

WITH RECURSIVE r AS (
    WITH b AS (SELECT min(id), max(id) FROM table1)
    (
        SELECT id, min, max, array[]::integer[] AS a, 0 AS n
        FROM table1, b
        WHERE id > min + (max - min) * random()
        LIMIT 1
    ) UNION ALL (
        SELECT t.id, min, max, a || t.id, r.n + 1 AS n
        FROM table1 AS t, r
        WHERE
            t.id > min + (max - min) * random() AND
            t.id <> all(a) AND
            r.n + 1 < 10
        LIMIT 1
    )
)
SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;

The only thing is that this solution is... "slightly" confusing, and I was reluctant to drag mysterious constructs into the project. So I decided to take the proven route first and find an explanation of how recursive queries actually work.

Well, the general logic became more or less clear: N times in a row, pick one record at a random offset from the minimum value of the ID (the primary key). This way the query touches at most N rows instead of enumerating the entire contents of the table.
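The recursive CTE does all of this in a single statement; for intuition, here is the moral equivalent as a plain loop in application code. A minimal sketch with psycopg2 (table and column names follow the example above; sample_ids is a hypothetical helper):

import random

def sample_ids(conn, n=5):
    """Pick up to n distinct ids by probing at random offsets."""
    cur = conn.cursor()
    cur.execute("SELECT min(id), max(id) FROM table1;")
    lo, hi = cur.fetchone()
    picked = []
    for _ in range(n):
        threshold = lo + (hi - lo) * random.random()
        # take one id above the random threshold, skipping already-picked ids
        cur.execute(
            "SELECT id FROM table1"
            " WHERE id > %s AND id <> ALL(%s) LIMIT 1;",
            (threshold, picked),
        )
        row = cur.fetchone()
        if row is None:
            break  # nothing qualifies: the same failure mode discussed below
        picked.append(row[0])
    return picked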

But "it was smooth on paper". When you try to use "as is" (with the amendment in the names of tables/fields) surfaced surprises:

  1. The array in the anchor part is created empty. Because of this, the final sample may contain duplicates. For example, the anchor query returns id==5; then the query in the UNION ALL block fires and at some point returns the same id==5, because the previously picked id never got into the array "a", so the check "t.id <> all(a)" succeeds.
  2. The number of values in the output never exceeded the stated limit, but it could easily be smaller, even when the table was guaranteed to contain enough records. Sometimes an empty sample came back (zero values).

Dealing with the first "feature" was relatively easy: it was enough to change the line that initializes the array, like this:

SELECT id, min, max, array[]::integer[] || id AS a, 0 AS n 

But the second point took some serious thought. The catch turned out to be in the "heart" of the algorithm, the selection of an entry from the range. Indeed, the recursive query tries to select a row satisfying the condition:

id > min + (max - min) * random()

But when random() returns 1, this condition degenerates into:

id > max

Naturally, in this case the query finds nothing and stops. And if this happens on the very first pass, the output is empty, even though the database definitely contains suitable records.

The first instinct was to slightly modify the condition, roughly like this:

id >= min + (max - min) * random()

That allows the solution to produce at least one row of output, but it still does not guarantee the requested number of rows in the result. I had to think further.
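To see why, here is a small standalone simulation of the ">=" variant (a plain Python list stands in for the id column; the probe picks the lowest qualifying id, as an index scan typically would). Once the few largest ids have been picked, a threshold drawn near the top of the range finds no unpicked id above it, and the sample stops short:

import random

ids = list(range(1, 21))  # stand-in for the id column
lo, hi = min(ids), max(ids)

def draw(n=5):
    picked = []
    for _ in range(n):
        threshold = lo + (hi - lo) * random.random()
        candidates = [i for i in ids if i >= threshold and i not in picked]
        if not candidates:
            return picked  # stopped short of n rows
        picked.append(candidates[0])
    return picked

short = sum(1 for _ in range(10000) if len(draw()) < 5)
print(short, "of 10000 runs returned fewer than 5 rows")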

After careful reflection, many attempts, and liters of stimulants, this variant was born:

The query:
WITH RECURSIVE r AS (
    WITH b AS (
        SELECT
            min(t.id),
            (
                SELECT t.id
                FROM table1 AS t
                ORDER BY t.id DESC
                LIMIT 1
                OFFSET 9
            ) max
        FROM table1 AS t
    )
    (
        SELECT
            t.id, min, max, array[]::integer[] || t.id AS a, 0 AS n
        FROM table1 AS t, b
        WHERE
            t.id >= min + ((max - min) * random())::int
        LIMIT 1
    ) UNION ALL (
        SELECT t.id, min, max, a || t.id, r.n + 1 AS n
        FROM table1 AS t, r
        WHERE
            t.id >= min + ((max - min) * random())::int AND
            t.id <> all(a) AND
            r.n + 1 < 10
        LIMIT 1
    )
)
SELECT t.* FROM table1 AS t, r WHERE r.id = t.id

The whole trick is in the subquery that computes max. To guarantee N elements in the output, we have to start from the worst-case scenario: here, that random() returns 1 (or a value that the ::int cast rounds up to it) on all N passes in a row. So we need to know there are N-1 more values above the chosen max. How do we achieve that in a query? One option is to sort all records by ID in descending order and shift down by N rows; and since both conditions use ">=", the offset can be one row less, i.e. N-1.
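A quick sanity check of that worst case, as a standalone sketch over a plain list of ids (no database involved): if max is taken at OFFSET N-1 of the descending sort, then even a threshold landing exactly on max still leaves N candidate rows, one for each pass.

n = 5
ids = list(range(1, 101))  # stand-in for the id column
mx = sorted(ids, reverse=True)[n - 1]  # OFFSET n-1 in the descending sort
worst = [i for i in ids if i >= mx]  # candidates when the threshold equals max
assert len(worst) == n  # one unpicked candidate per pass, even in the worst case
print(mx, worst)  # 96 [96, 97, 98, 99, 100]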

That's good: the main difficulty is solved, and what remains is a "trifle": the query in its current form is of little use, because a real selection has to respect certain conditions. In the simplest case, we must exclude the IDs of records already selected at a previous stage. Nor can we rule out extra conditions imposed on the table rows (is_active = true, is_deleted = false, ...). After some thought it becomes clear that the conditions have to go into every essential part of the query (almost every subquery), as in the following template:

The query template:
WITH RECURSIVE r AS (
    WITH b AS (
        SELECT
            min(t.{pk}),
            (
                SELECT t.{pk}
                FROM {table} AS t
                WHERE {exclude} t.is_active
                ORDER BY t.{pk} DESC
                LIMIT 1
                OFFSET {limit} - 1
            ) max
        FROM {table} AS t
        WHERE {exclude} t.is_active
    )
    (
        SELECT
            t.{pk}, min, max, array[]::integer[] || t.{pk} AS a, 0 AS n
        FROM {table} AS t, b
        WHERE
            t.{pk} >= min + ((max - min) * random())::int AND
            {exclude}
            t.is_active
        LIMIT 1
    ) UNION ALL (
        SELECT t.{pk}, min, max, a || t.{pk}, r.n + 1 AS n
        FROM {table} AS t, r
        WHERE
            t.{pk} >= min + ((max - min) * random())::int AND
            t.{pk} <> all(a) AND
            r.n + 1 < {limit} AND
            {exclude}
            t.is_active
        LIMIT 1
    )
)
SELECT {fields} FROM {table} AS t, r WHERE r.{pk} = t.{pk}


Here, the named parameters in curly brackets are to be replaced:

    {table} — the name of the table;

    {pk} — the name of the primary-key field;

    {fields} — the list of fields to retrieve (you can specify "*");

    {exclude} — a condition (or set of conditions) excluding records from the sample, e.g. "t.id NOT IN (1,2,3,4) AND" (note the trailing AND: the template places {exclude} directly before t.is_active);

    {limit} — the number of records in the final sample.


And finally, the last but by no means least question: is the game worth the candle? What is the "profit" from this cumbersome construction? Let's check.

To begin with, let's create a table to run the experiments on.

DROP TABLE IF EXISTS ttbl;
CREATE TABLE IF NOT EXISTS ttbl
(
    id serial NOT NULL,
    is_active BOOL NOT NULL DEFAULT true,
    CONSTRAINT ttbl_pkey PRIMARY KEY (id)
)
WITH (OIDS=FALSE);

Now let's fill it with data, in such a way that the id values are not sequential but contain "holes", i.e. not "1, 2, 3, 4, 5..." but something like "1, 4, 5, 8...". For this, let's sketch an unpretentious little script.
The script:
import random

import psycopg2


DB_TABLE = 'ttbl'
PK_NAME = 'id'
DATABASES = {
    'NAME': 'test_db',
    'USER': 'user_test',
    'PASSWORD': 'R#Q9Jw*ZEKWOhBX+EP|3/xGkQn3',
    'HOST': 'localhost',
    'PORT': '',
}

# the connection setup was omitted in the original listing; restored here
conn = psycopg2.connect(
    dbname=DATABASES['NAME'],
    user=DATABASES['USER'],
    password=DATABASES['PASSWORD'],
    host=DATABASES['HOST'],
    port=DATABASES['PORT'] or None,
)
cur = conn.cursor()

TOTAL = 10000000
current = 0
step = 10000
dev = 8
while current <= TOTAL:
    data = set()
    for x in range(current, current + step, 10):
        data.add(x + int(random.random() * dev))

    # build one multi-row INSERT per batch of ids
    x = cur.execute(
        "INSERT INTO {t_table} VALUES {t_items};".format(
            t_table=DB_TABLE,
            t_items='(' + '), ('.join(map(str, data)) + ')'
        )
    )
    current += step
    print(x, current)

cur.execute('COMMIT;')



As you can see from the code, each INSERT adds about a thousand records at a time, with values increasing in steps of roughly 10. Roughly, because each value can deviate upward by a random integer in the range 0..dev-1 (here 0..7). I.e. for two successive x values of 340 and 350, the table can receive any values from the ranges 340-347 and 350-357 (342, 347, 340...; 351, 355, 357...).

In total, the table ended up with

select count(id) from ttbl;

1001000 records.

Quite a decent amount. Let's try making selections. Clearly a single sample proves nothing, so we will do a series of consecutive runs, for definiteness a series of 5 runs of each query type. We tabulate the results and compute the average.
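One possible timing harness, sketched in Python around the cur from the fill script above (time_query is a hypothetical helper):

import time

def time_query(cur, sql, runs=5):
    # Execute sql `runs` times and return each duration in milliseconds.
    timings = []
    for _ in range(runs):
        started = time.perf_counter()
        cur.execute(sql)
        cur.fetchall()
        timings.append(round((time.perf_counter() - started) * 1000))
    return timings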

First, the simple query:

select t.* from ttbl t where t.id not in (1, 3, 10, 89, 99, 22, 24, 25, 28, 30) and t.is_active order by random() limit 5;
Results:

Run    Time, ms
1      697
2      605
3      624
4      593
5      611

As you can see, the average query execution time is about* 600 ms.

And now, "the monster".

The monster query:
WITH RECURSIVE r AS (
    WITH b AS (
        SELECT
            min(t.id),
            (
                SELECT t.id
                FROM ttbl AS t
                WHERE
                    t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
                    AND t.is_active
                ORDER BY t.id DESC
                LIMIT 1
                OFFSET 5 - 1
            ) max
        FROM ttbl AS t
        WHERE
            t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30)
            AND t.is_active
    )
    (
        SELECT
            t.id, min, max, array[]::integer[] || t.id AS a, 0 AS n
        FROM ttbl AS t, b
        WHERE
            t.id >= min + ((max - min) * random())::int AND
            t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30) AND
            t.is_active
        LIMIT 1
    ) UNION ALL (
        SELECT t.id, min, max, a || t.id, r.n + 1 AS n
        FROM ttbl AS t, r
        WHERE
            t.id >= min + ((max - min) * random())::int AND
            t.id <> all(a) AND
            r.n + 1 < 5 AND
            t.id NOT IN (1, 3, 10, 89, 99, 22, 24, 25, 28, 30) AND
            t.is_active
        LIMIT 1
    )
)
SELECT t.* FROM ttbl AS t, r WHERE r.id = t.id


Results:

Run    Time, ms
1      15
2      17
3      8
4      12
5      12

The average time is about* 13 ms.

The total difference is about one and a half orders of magnitude (40-50 times). It was definitely worth it.

The queries were also tested on a "cold" start (after restarting the machine/daemon). Although the run times varied in absolute terms, the ratio stayed roughly constant, i.e. the recursive query was consistently a few dozen times faster than the head-on solution.

Notes

* "About", because the exact value is not that important here, given the deviations caused by the Postgres cache, the influence of the operating system, hardware, other software, etc.
Article based on information from habrahabr.ru
