Parallelize long-running operations

I often deal with tasks that demand a lot from the database while processing large amounts of data. Today I will describe a very simple but effective technique that can help when the database no longer keeps pace with the amount of data that accumulates and needs to be processed. The method does not depend on the database engine, but since this blog usually covers PostgreSQL, the example will use it. Let's cut to the example.

Spherical cow


Suppose we are talking about a simple billing system (the method is clearly applicable not only to billing, but billing makes the example easy to follow). The table that accumulates data about subscribers' calls will, in our case, have this format:
CREATE TABLE billing.calls
(
    call_id       BIGINT,
    call_time     TIMESTAMP,
    subscriber_id INTEGER,
    duration      INTERVAL
);



That's all, quite simple. Call records pour into this table at an insane rate, and they must be billed within a reasonable time.

For billing we have a database function with this signature:
FUNCTION calculate(IN subscriber_id INTEGER, IN duration INTERVAL, OUT status_code TEXT)

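The body of calculate does not matter for this article. Purely for illustration, a minimal PL/pgSQL sketch might look like the following; the billing.accounts table and its rate_per_minute column are hypothetical and not part of the original schema:
CREATE OR REPLACE FUNCTION billing.calculate(
    IN  subscriber_id INTEGER,
    IN  duration      INTERVAL,
    OUT status_code   TEXT
) AS $$
BEGIN
    -- hypothetical logic: debit the subscriber's balance in proportion to call duration
    UPDATE billing.accounts a
       SET balance = balance - EXTRACT(EPOCH FROM duration) / 60 * a.rate_per_minute
     WHERE a.subscriber_id = calculate.subscriber_id;
    status_code := 'OK';
END;
$$ LANGUAGE plpgsql;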

We run billing every 5 minutes with this query:
SELECT calculate(subscriber_id, duration) FROM billing.calls;


And at some point we realize that this query simply cannot finish within 5 minutes. During that time even more data accumulates, and then more still, so we sit and wait for night, when the load eases a little and the backlog finally clears. I must say that we are not waiting alone: the other 3 (for example) cores of our server sit idle with us while a single core grinds through the query. PostgreSQL, unfortunately, cannot parallelize a query by itself, but in our case that is not even necessary. A very simple and obvious trick gives much better results. Create an index on the expression "subscriber_id modulo 4":
CREATE INDEX calls_subscriber_id_mod_idx ON billing.calls USING btree ((subscriber_id % 4));

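To make sure the planner can actually use this expression index for a query that filters on the same expression, you can check the plan; whether it is chosen depends on table statistics and selectivity, so treat this only as a sanity check:
EXPLAIN
SELECT calculate(subscriber_id, duration)
FROM billing.calls
WHERE subscriber_id % 4 = 0;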

Now we run the billing in four streams (for example, as four separate jobs):
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = @mod;


where @mod equals 0, 1, 2 or 3 (one value per stream).
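
Concretely, each of the four jobs runs its own copy of the query with the literal value substituted (a sketch only; how the jobs are scheduled, e.g. by cron or pgAgent, is up to you):
-- job 1
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 0;
-- job 2
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 1;
-- job 3
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 2;
-- job 4
SELECT calculate(subscriber_id, duration) FROM billing.calls WHERE subscriber_id % 4 = 3;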

Result


This technique also sidesteps the locking problems that could arise if two different threads picked up calls from the same subscriber: with the modulo split, each subscriber is always handled by exactly one thread. Moreover, these parallel jobs will finish faster than if we had relied on the database's own query parallelization (if we were running not Postgres but, say, Oracle).

The method is applicable to any database that supports functional (expression) indexes (Oracle, PostgreSQL). In MS SQL Server you can create a computed column and an index on it. MySQL does not support functional indexes, but as a workaround you can add an extra column, index it, and keep it filled with a trigger.
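
As a rough illustration of those two workarounds (sketches only; the table and column names are made up, and the exact syntax depends on your server version):
-- MS SQL Server: computed column plus an index on it
ALTER TABLE billing_calls ADD subscriber_id_mod AS (subscriber_id % 4);
CREATE INDEX calls_subscriber_id_mod_idx ON billing_calls (subscriber_id_mod);

-- MySQL: plain column kept up to date by a trigger
ALTER TABLE billing_calls ADD COLUMN subscriber_id_mod TINYINT;
CREATE INDEX calls_subscriber_id_mod_idx ON billing_calls (subscriber_id_mod);
CREATE TRIGGER billing_calls_set_mod BEFORE INSERT ON billing_calls
FOR EACH ROW SET NEW.subscriber_id_mod = NEW.subscriber_id % 4;
-- existing rows would need a one-time UPDATE, and an UPDATE trigger if subscriber_id can change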
