Projection? No thanks

Under the cut is a small tip about applying a spatial index
based on the zcurve to index point data situated on the field. And bencmark for PostgreSQL and compare with the same (but very different)
index to R-tree.

The development of the theme (1, 2, 3, 4, 5, 6).
Actually, return back to the beginning the idea to index geographic coordinates, placing them on the surface of a sphere. Conventional indexing pairs of latitude/longitude leads to distortions away from the equator, working with projections is not universal. So the idea is to translate the geographic coordinates into three-dimensional space looks very elegant.

The idea itself is not new, similar works, such as extending PostgreSQL — PGSphere, which is used for indexing 3-dimensional R-tree. With him we will compare.

the

data Preparation.

PGSphere

the

To start, you will have to download, build and install the extension (the author used the current version 1.1.5)

the
```
gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config
sudo gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config install
```
Then download it (psql)

the
```
CREATE EXTENSION pg_sphere;
```
will Create a table for test data

the
```
CREATE TABLE spoint_data (sp spoint);
```
We need source of random data.
The first program parameter is the radius, and the second number of results.
The only trick is the data is uniformly distributed inside the sphere with the given radius, otherwise it will not turn a uniform distribution on the sphere

Random data passed through an awk script to transform the geographic coordinates

the

# --- gendata.awk ------
BEGIN{
pi=3.1415926535897932;
degra=pi/180.0;
rad=180.0/pi;
Grad = 1000000.;
}
{
x = $1; y = $2 z = $3;
r3 = sqrt(x*x + y*y + z*z);
x *= Grad / r3;
y *= Grad / r3;
z *= Grad / r3;

r2 = sqrt(x*x + y*y);
lat = atan2(z, r2) * rad;
lon = 180. + atan2(y, x) * rad;

printf ("(%14.10 fd %14.10 fd)\n", lon, lat);
}

the Actual creation of the data, here the radius is not important, it is important that pgsphere and zcurve received the same data. Sort it is highly desirable to accelerate the indexing.

the

./random 1000000 100000000 | gawk -f gendata.awk | sort > pgsphere.txt

Fill the data in our table

the

COPY spoint_data (sp) FROM '/home/.../pgsphere.txt';

Index

the

CREATE INDEX sp_idx ON spoint_data USING gist (sp);

ZORDER

the

To start, you will have to download, build and install extension

the

gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config
sudo gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config install

will Create a table for test data

the

create table test_points_3d (x integer,y integer, z integer);

We need the same source of random data.

Random data passed through an awk script to place them inside a cube with sides of 2 000 000

the

#--- gendata2.awk ------
BEGIN{
pi=3.1415926535897932;
degra=pi/180.0;
rad=180.0/pi;
Grad = 1000000.;
}
{
x = $1; y = $2 z = $3;
r3 = sqrt(x*x + y*y + z*z);
x *= Grad / r3;
y *= Grad / r3;
z *= Grad / r3;

ix = int(x+0.5+Grad);
iy = int(y+0.5+Grad);
iz = int(z+0.5+Grad);
print ix"\t"iy"\t";
}

the Actual creation of the data, here the radius is important. Sorting is not required.

the

./random 1000000 100000000 | gawk -f gendata2.awk > zcurve.txt

Fill the data in our table

the

COPY test_points_3d FROM '/home/.../zcurve.txt';

Index

the

create index zcurve_test_points_3d on test_points_3d(zcurve_num_from_xyz(x,y,z));

the

test Preparation

PGSphere

For testing you will need here is a awk script

the

#--- gentest.awk -------
BEGIN{
pi=3.1415926535897932;
degra=pi/180.0;
rad=180.0/pi;
Grad = 1000000.;
}
{
x = $1; y = $2 z = $3;
r3 = sqrt(x*x + y*y + z*z);
x *= Grad / r3;
y *= Grad / r3;
z *= Grad / r3;

r2 = sqrt(x*x + y*y);

lat = atan2(z, r2) * rad;
lon = 180. + atan2(y, x) * rad;

# EXPLAIN (ANALYZE,BUFFERS) 
printf ("select count(1) from spoint_data where sp @' < (%14.10 fd%14.10 fd).316d>'::laukitis;\n", lon, lat);
}

This script is quite symmetric to that by which we were preparing the data. You should pay attention to the number of .316 is the radius of the sphere with center at the computed random point in which we are looking for data

The preparation of a series of queries is done as follows:

the

./random 1000000 100 1023 | gawk -f gentest.awk >tests1.sql

Here 100 — test series 1023 — seed the randomizer.

ZCURVE

For testing also need awk script

the

#--- gentest2.awk -------
BEGIN{
pi=3.1415926535897932;
degra=pi/180.0;
rad=180.0/pi;
Grad = 1000000.;
}
{
x = $1; y = $2 z = $3;
r3 = sqrt(x*x + y*y + z*z);
x *= Grad / r3;
y *= Grad / r3;
z *= Grad / r3;

ix = int(x+0.5+Grad);
iy = int(y+0.5+Grad);
iz = int(z+0.5+Grad);
# EXPLAIN (ANALYZE,BUFFERS) 
lrad = int(0.5 + Grad * sin(.316 * degra));
print "select count(1) from zcurve_3d_lookup_tidonly('zcurve_test_points_3d', "ix-lrad","iy-lrad","iz-lrad","ix+lrad","iy+lrad","iz+lrad");";
}

This script is quite symmetric to that by which we were preparing the data. Again, pay attention to the number of .316, which is half the side of the cube with center at the computed random point in which we are looking for data.

The preparation of a series of queries is done as follows:

the

./random 1000000 100 1023 | gawk -f gentest2.awk >tests2.sql

Here 100 — test series 1023 — seed the randomizer, it's better if it coincides with the add-on from pgsphere.

the

Benchmark

As before, measurements were conducted on a modest virtual machine with two cores and 4 GB of RAM, so the times are not absolute values, but the numbers of pages read is still possible to trust.

The times shown in the second runs, on a heated server and virtual machine. The number of read buffers — into the fresh-raised server.
the the the the the the

Radius	AVG NPoints	Nreq	Type	Time(ms)	Reads	Hits
.01°	1.17 0.7631 (0.7615)	10 000	zcurve rtree	.37 .46	1.4397 2.1165	9.5647 3.087
.0316°	11.6 7.6392 (7.6045)	10 000	zcurve rtree	.39 .67	2.0466 3.0944	20.9707 2.7769
.1°	115.22 76.193 (76.15)	1 000	zcurve rtree	.44 2.75 *	4.4184 6.073	82.8572 2.469
.316°	1145.3 758.37 (760.45)	1 000	zcurve rtree	.59 18.3 *	15.2719 21.706	401.791 1.62
1.°	11310 7602 (7615)	100	zcurve rtree	7.2 94.5 *	74.9544 132.15	1651.45 1.12

where
Radius — the size of the search region in degrees
Npoints — the average number of points in the results, in parentheses are the theoretically expected number of
(in the field 41252.96 square degrees, 100 000 000 points ~2424 dots per square degree)
Nreq — the number of queries in the series
Type
‘zcurve’ — it is
’rtree’- PGSphere
Time(ms) — average time of query execution
Reads — the average number of readings by request
Hits — the number of accesses to the buffers

* at some point, the performance of R-tree begins abruptly
sink is connected with the fact, this tree read significantly more
pages and its working set ceases to fit in the cache (apparently).

Note that the zcurve finds more data, which is logical because he is looking inside the cube, not sphere PGSphere. Therefore, the required post-filtering in the spirit of HAVERSINE. But here we compare only the performance of the indexes.

Try to evaluate the difference. In General, the task is to find the area of the piece of the sphere that fall inside this cube is non-trivial. Try to make an assessment.

the

Assume we have a cube in average cuts out from the sphere of the same area as the sphere of equal volume
the Volume of the unit sphere is 1.33*3.14=4.19
the Volume of a cube of side 2 = 8.
Then the root of the third degree of 8/4.19 = 1.24 is the ratio of the radii of the imaginary sphere to the present
area ratio of the imaginary sphere to the present 1.24*1.24=1.54
are from the experimental data 1.17/0.7631= 1.5332
Bingo!

Article based on information from habrahabr.ru

Поиск по этому блогу

computer express