Projection? No thanks
Under the cut is a small tip about applying a spatial index
based on the zcurve to index point data situated on the field. And bencmark for PostgreSQL and compare with the same (but very different)
index to R-tree.
The development of the theme (1, 2, 3, 4, 5, 6).
Actually, return back to the beginning the idea to index geographic coordinates, placing them on the surface of a sphere. Conventional indexing pairs of latitude/longitude leads to distortions away from the equator, working with projections is not universal. So the idea is to translate the geographic coordinates into three-dimensional space looks very elegant.
The idea itself is not new, similar works, such as extending PostgreSQL — PGSphere, which is used for indexing 3-dimensional R-tree. With him we will compare.
the
data Preparation.
PGSphere
the
-
the
- To start, you will have to download, build and install the extension (the author used the current version 1.1.5)
thegmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config sudo gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config install
the - Then download it (psql)
theCREATE EXTENSION pg_sphere;
the - will Create a table for test data
theCREATE TABLE spoint_data (sp spoint);
the - We need source of random data.
The first program parameter is the radius, and the second number of results.
The only trick is the data is uniformly distributed inside the sphere with the given radius, otherwise it will not turn a uniform distribution on the sphere
the - Random data passed through an awk script to transform the geographic coordinates
the# --- gendata.awk ------ BEGIN{ pi=3.1415926535897932; degra=pi/180.0; rad=180.0/pi; Grad = 1000000.; } { x = $1; y = $2 z = $3; r3 = sqrt(x*x + y*y + z*z); x *= Grad / r3; y *= Grad / r3; z *= Grad / r3; r2 = sqrt(x*x + y*y); lat = atan2(z, r2) * rad; lon = 180. + atan2(y, x) * rad; printf ("(%14.10 fd %14.10 fd)\n", lon, lat); }
- Fill the data in our table
theCOPY spoint_data (sp) FROM '/home/.../pgsphere.txt';
the - Index
theCREATE INDEX sp_idx ON spoint_data USING gist (sp);
the Actual creation of the data, here the radius is not important, it is important that pgsphere and zcurve received the same data. Sort it is highly desirable to accelerate the indexing.
the
the ./random 1000000 100000000 | gawk -f gendata.awk | sort > pgsphere.txt
ZORDER
the
-
the
- To start, you will have to download, build and install extension
thegmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config sudo gmake USE_PGXS=1 PG_CONFIG=/usr/bin/pg_config install
the - will Create a table for test data
thecreate table test_points_3d (x integer,y integer, z integer);
the - We need the same source of random data. the
- Random data passed through an awk script to place them inside a cube with sides of 2 000 000
the#--- gendata2.awk ------ BEGIN{ pi=3.1415926535897932; degra=pi/180.0; rad=180.0/pi; Grad = 1000000.; } { x = $1; y = $2 z = $3; r3 = sqrt(x*x + y*y + z*z); x *= Grad / r3; y *= Grad / r3; z *= Grad / r3; ix = int(x+0.5+Grad); iy = int(y+0.5+Grad); iz = int(z+0.5+Grad); print ix"\t"iy"\t"; }
the Actual creation of the data, here the radius is important. Sorting is not required.
the
./random 1000000 100000000 | gawk -f gendata2.awk > zcurve.txt
the the
COPY test_points_3d FROM '/home/.../zcurve.txt';
the
create index zcurve_test_points_3d on test_points_3d(zcurve_num_from_xyz(x,y,z));
the
test Preparation
PGSphere
For testing you will need here is a awk script
the
#--- gentest.awk -------
BEGIN{
pi=3.1415926535897932;
degra=pi/180.0;
rad=180.0/pi;
Grad = 1000000.;
}
{
x = $1; y = $2 z = $3;
r3 = sqrt(x*x + y*y + z*z);
x *= Grad / r3;
y *= Grad / r3;
z *= Grad / r3;
r2 = sqrt(x*x + y*y);
lat = atan2(z, r2) * rad;
lon = 180. + atan2(y, x) * rad;
# EXPLAIN (ANALYZE,BUFFERS)
printf ("select count(1) from spoint_data where sp @' < (%14.10 fd%14.10 fd).316d>'::laukitis;\n", lon, lat);
}
This script is quite symmetric to that by which we were preparing the data. You should pay attention to the number of .316 is the radius of the sphere with center at the computed random point in which we are looking for data
The preparation of a series of queries is done as follows:
the
./random 1000000 100 1023 | gawk -f gentest.awk >tests1.sql
Here 100 — test series 1023 — seed the randomizer.
ZCURVE
For testing also need awk script
the
#--- gentest2.awk -------
BEGIN{
pi=3.1415926535897932;
degra=pi/180.0;
rad=180.0/pi;
Grad = 1000000.;
}
{
x = $1; y = $2 z = $3;
r3 = sqrt(x*x + y*y + z*z);
x *= Grad / r3;
y *= Grad / r3;
z *= Grad / r3;
ix = int(x+0.5+Grad);
iy = int(y+0.5+Grad);
iz = int(z+0.5+Grad);
# EXPLAIN (ANALYZE,BUFFERS)
lrad = int(0.5 + Grad * sin(.316 * degra));
print "select count(1) from zcurve_3d_lookup_tidonly('zcurve_test_points_3d', "ix-lrad","iy-lrad","iz-lrad","ix+lrad","iy+lrad","iz+lrad");";
}
This script is quite symmetric to that by which we were preparing the data. Again, pay attention to the number of .316, which is half the side of the cube with center at the computed random point in which we are looking for data.
The preparation of a series of queries is done as follows:
the
./random 1000000 100 1023 | gawk -f gentest2.awk >tests2.sql
Here 100 — test series 1023 — seed the randomizer, it's better if it coincides with the add-on from pgsphere.
the
Benchmark
As before, measurements were conducted on a modest virtual machine with two cores and 4 GB of RAM, so the times are not absolute values, but the numbers of pages read is still possible to trust.
The times shown in the second runs, on a heated server and virtual machine. The number of read buffers — into the fresh-raised server.
Radius | AVG NPoints | Nreq | Type | Time(ms) | Reads | Hits |
---|---|---|---|---|---|---|
.01° | 1.17 0.7631 (0.7615) |
10 000 | zcurve rtree |
.37 .46 |
1.4397 2.1165 |
9.5647 3.087 |
.0316° | 11.6 7.6392 (7.6045) |
10 000 | zcurve rtree |
.39 .67 |
2.0466 3.0944 |
20.9707 2.7769 |
.1° | 115.22 76.193 (76.15) |
1 000 | zcurve rtree |
.44 2.75 * |
4.4184 6.073 |
82.8572 2.469 |
.316° | 1145.3 758.37 (760.45) |
1 000 | zcurve rtree |
.59 18.3 * |
15.2719 21.706 |
401.791 1.62 |
1.° | 11310 7602 (7615) |
100 | zcurve rtree |
7.2 94.5 * |
74.9544 132.15 |
1651.45 1.12 |
Radius — the size of the search region in degrees
Npoints — the average number of points in the results, in parentheses are the theoretically expected number of
(in the field 41252.96 square degrees, 100 000 000 points ~2424 dots per square degree)
Nreq — the number of queries in the series
Type
‘zcurve’ — it is
’rtree’- PGSphere
Time(ms) — average time of query execution
Reads — the average number of readings by request
Hits — the number of accesses to the buffers
* at some point, the performance of R-tree begins abruptly
sink is connected with the fact, this tree read significantly more
pages and its working set ceases to fit in the cache (apparently).
Note that the zcurve finds more data, which is logical because he is looking inside the cube, not sphere PGSphere. Therefore, the required post-filtering in the spirit of HAVERSINE. But here we compare only the performance of the indexes.
Try to evaluate the difference. In General, the task is to find the area of the piece of the sphere that fall inside this cube is non-trivial. Try to make an assessment.
the
-
the
- Assume we have a cube in average cuts out from the sphere of the same area as the sphere of equal volume the
- the Volume of the unit sphere is 1.33*3.14=4.19 the
- the Volume of a cube of side 2 = 8. the
- Then the root of the third degree of 8/4.19 = 1.24 is the ratio of the radii of the imaginary sphere to the present the
- area ratio of the imaginary sphere to the present 1.24*1.24=1.54 the
- are from the experimental data 1.17/0.7631= 1.5332
Bingo!
Комментарии
Отправить комментарий