Suppose we have an application that stores two big lists of numbers in a relational database. These numbers represent money: both the number of money units in transactions, and the price of that money.
In response to a client query, we must multiply the big list of transaction values by the big list of prices, and sum a running total of those multiples. We assume we can’t pre-calculate and store the running values, and that we must supply the client with strings, perhaps because they are communicating with us via a web API. Suppose clients expect responses in real time.
Storing money requires absolute precision. We can’t use floats, for obvious reasons. So we would probably use an arbitrary precision type. If we were using PostgreSQL 9.6, as this article assumes, then we might choose the numeric type to store both values and prices.
Numerics give us arbitrary precision, but come with some baggage. Numerics are binary coded decimals, and CPUs can’t interact with them directly using hardware instructions. Whereas two 64-bit integers fit in CPU registers and are happily summed, two numerics need to go through a bunch of software shuffling to perform a calculation.
Therefore, if we need to perform lots of calculations quickly, numerics might slow us down when compared to using a 64-bit integer. In PostgreSQL, a 64-bit integer is called a bigint.
Hypothesis: Performing all calculations on integers of type bigint, rather than binary coded decimals of type numeric, will vastly reduce computation times when responding to our client’s queries. Let’s assume the following:
- A price is stored as a bigint with an implicit precision of four decimal places. For example, a price of £4.4352 is stored as 44352 (NB the absence of the decimal point).
- An amount is stored as a bigint with an implicit precision of two decimal places. This suits currencies with two decimal places of minor unit precision, such as the US Dollar (USD). For example, an amount of $12,045.42 is stored as 1204542.
The value of any transaction is therefore the price of the money (44352) times the number of money units (1204542), divided by the extra orders of magnitude introduced by the implicit decimal places.
In our example, the price has introduced four decimal places, and the amount has introduced two decimal places, for a total of six. Therefore, the calculation is as follows:
44352 * 1204542 / 10^6
This yields 53423.846784, which may be nicely formatted for client consumption as $53,423.85.
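Outside the database, the same fixed-point arithmetic can be sketched in a few lines of Python (the helper name and scale constants are ours, not part of the article’s schema):

```python
from decimal import Decimal

# Implied decimal places from the scheme above: 4 for prices, 2 for amounts.
PRICE_SCALE = 4
AMOUNT_SCALE = 2

def transaction_value(price_scaled: int, amount_scaled: int) -> Decimal:
    """Multiply two scaled integers, then restore the implied decimal point."""
    raw = price_scaled * amount_scaled          # pure integer multiplication
    return Decimal(raw) / Decimal(10 ** (PRICE_SCALE + AMOUNT_SCALE))

value = transaction_value(44352, 1204542)       # £4.4352 * $12,045.42
print(value)             # 53423.846784
print(f"${value:,.2f}")  # $53,423.85
```

Note that the multiplication itself stays in integer land; only the final presentation step needs an arbitrary precision type.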
So, to get to our true value with explicit decimal places, we need to perform division. In PostgreSQL we can’t simply divide two values of type bigint, or we will lose precision. For example, select (10::bigint / 3::bigint) yields 3. Because we are dealing with money, that is not acceptable.
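The same truncation is easy to reproduce outside the database; in this Python sketch, // plays the role of bigint division and decimal.Decimal stands in for numeric:

```python
from decimal import Decimal

# Integer division discards the fractional part, just like bigint / bigint.
print(10 // 3)                   # 3

# An arbitrary precision type keeps it, up to its configured precision.
print(Decimal(10) / Decimal(3))  # 3.3333... (28 significant digits by default)
```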
Numerics, as arbitrary precision types, can retain precision on division. We will cast to a numeric, and divide by another numeric of appropriate size for our implied decimal places.
Here’s the example above written in Postgres 9.6 SQL:
SELECT to_char(value::numeric(15,4) / 1000000::numeric(15,4), 'FM999,999,999,999.99') AS str_representation
FROM (
    SELECT amount * price AS value
    FROM (
        SELECT 1204542::bigint AS amount, 44352::bigint AS price
    ) AS random_data
) AS calculation
Is this really faster than using binary coded decimals? To answer that question, we’ll create a big set of random data that we can operate on.
We will assume a largish organisation that processes 250 transactions per day, every day, for 10 years: 912,500 transactions, rounded up to 1,000,000 for luck. We will assume these transaction amounts are in USD, with two decimal places of implied precision, and are sized somewhere between $0 and $200. Each amount is processed at a price that converts the USD to Pounds Sterling (GBP). Let’s assume (somewhat absurdly) that each GBP price is somewhere between £0 and £10, with four decimal places of implied precision.
SELECT generate_series(1, 1000000) AS row_id,
       (random() * 20000)::bigint AS amount,
       (random() * 100000)::bigint AS price
My machine features an i7-5557U CPU, and it takes ~670ms to generate these data, which look like this:
 row_id | amount | price
--------+--------+-------
      1 |  14179 | 48998
      2 |  16948 | 56369
      3 |  12760 | 16965
      4 |  17177 |   977
      5 |  11632 | 38872
      6 |  18370 | 44416
      7 |  14370 | 89625
Now, let’s perform the necessary calculations. For each of the million rows, we need to multiply the amount by the price, and then sum those multiples into a running balance column.
SELECT value, SUM(value) OVER (ORDER BY row_id) AS balance
FROM (
    SELECT row_id, amount * price AS value
    FROM (
        SELECT generate_series(1, 1000000) AS row_id,
               (random() * 20000)::bigint AS amount,
               (random() * 100000)::bigint AS price
    ) AS random_data
) AS raw_values
This takes ~1,550ms on my i7-5557U. Less the ~670ms to generate the data yields a calculation time of ~880ms. Computers are so cool. That’s just 880 nanoseconds per row! Here’s some sample output:
   value   |  balance
-----------+-----------
  15987240 |  15987240
   5282332 |  21269572
  32625180 |  53894752
 125860125 | 179754877
   3753301 | 183508178
  79953750 | 263461928
  74762297 | 338224225
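For intuition, the same million-row pipeline can be sketched in plain Python: scaled-integer amounts and prices, one integer multiply per row, and an integer running total (the seed and data are ours, so the figures will not match the SQL output):

```python
import random

random.seed(42)  # deterministic data for the sketch

# A million rows of scaled integers: amount in cents, price in 1/10000 GBP.
rows = [(int(random.random() * 20000), int(random.random() * 100000))
        for _ in range(1_000_000)]

balance = 0
balances = []
for amount, price in rows:
    value = amount * price   # plain integer multiply, 6 implied decimal places
    balance += value         # running total stays an integer
    balances.append(balance)
```

Because every amount and price here is non-negative, the running balance never decreases; the database version has the same shape, just computed by the window function.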
These huge integers, with their implicit decimal places, are not very useful. They are certainly not human readable. We need to convert them into decimal numbers with explicit decimal places. To do so, we will cast them to type numeric. Finally, recall that our clients require string output, so we will convert the numeric into a nicely formatted string with to_char():
SELECT to_char(value::numeric / 1000000::numeric, 'FM999,999,999,990.90') AS str_value,
       to_char(balance::numeric / 1000000::numeric, 'FM999,999,999,990.90') AS str_balance
FROM (
    SELECT value, SUM(value) OVER (ORDER BY row_id) AS balance
    FROM (
        SELECT row_id, amount * price AS value
        FROM (
            SELECT generate_series(1, 1000000) AS row_id,
                   (random() * 20000)::bigint AS amount,
                   (random() * 100000)::bigint AS price
        ) AS random_data
    ) AS raw_values
) AS balances
Bam! Approximately 2,700ms. Less generation (~670ms) and calculation (~880ms), that’s ~1,150ms of string processing. All that bigint::numeric type casting and numeric / numeric division adds a good chunk of change. Here is some sample output:
 str_value | str_balance
-----------+-------------
  1,029.03 |    1,029.03
     85.97 |    1,115.00
    291.78 |    1,406.78
    670.64 |    2,077.42
     62.33 |    2,139.75
    142.92 |    2,282.68
      2.41 |    2,285.09
     31.01 |    2,316.10
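The formatting step can be mimicked in Python; format_scaled below is our own helper, shifting the six implied decimal places and adding thousands separators, roughly what the 'FM999,999,999,990.90' pattern does in to_char():

```python
from decimal import Decimal

def format_scaled(raw: int, implied_decimals: int = 6) -> str:
    """Shift the implied decimal point, then format with separators
    and two decimal places (rounding, as to_char does)."""
    value = Decimal(raw) / Decimal(10 ** implied_decimals)
    return f"{value:,.2f}"

print(format_scaled(53423846784))   # 53,423.85
print(format_scaled(15987240))      # 15.99
```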
Alright, let’s try the same thing using numeric the whole way through. Then we will compare the difference. First, we’ll generate the random data. We’ll use a numeric with 19 digits of precision, to match the 19 positive number digits we get from a PostgreSQL bigint:
SELECT generate_series(1, 1000000) AS row_id,
       (random() * 200)::numeric(17,2) AS amount,
       (random() * 1)::numeric(15,4) AS price
Generating the random data takes ~2,200ms, and looks as follows:
 row_id | amount | price
--------+--------+--------
      1 |  84.33 | 0.8137
      2 | 178.67 | 0.0793
      3 | 115.89 | 0.1935
      4 |   5.61 | 0.9763
      5 |  32.45 | 0.8218
      6 | 154.49 | 0.4451
      7 | 127.43 | 0.3493
Now the basic balance calculation, wherein we multiply amount by price, and then sum a running total of those multiples:
SELECT value, SUM(value) OVER (ORDER BY row_id) AS balance
FROM (
    SELECT row_id, amount * price AS value
    FROM (
        SELECT generate_series(1, 1000000) AS row_id,
               (random() * 200)::numeric(17,2) AS amount,
               (random() * 1)::numeric(15,4) AS price
    ) AS random_data
) AS raw_values
    value    |  balance
-------------+------------
   32.893335 |  32.893335
    9.424890 |  42.318225
   46.160686 |  88.478911
  134.282533 | 222.761444
   19.844650 | 242.606094
    8.255778 | 250.861872
    2.859764 | 253.721636
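An equivalent Decimal-based sketch in Python shows why every product above carries exactly six decimal places: a 2-place amount times a 4-place price always yields a 6-place result (the helper and seed are ours, for illustration only):

```python
from decimal import Decimal
import random

random.seed(42)

def to_places(x: float, places: int) -> Decimal:
    """Quantize a float to a fixed number of decimal places."""
    return Decimal(x).quantize(Decimal(1).scaleb(-places))

# Amounts with 2 decimal places, prices with 4, as in the numeric version.
rows = [(to_places(random.random() * 200, 2), to_places(random.random() * 1, 4))
        for _ in range(1000)]

balance = Decimal(0)
for amount, price in rows:
    value = amount * price   # exponent is always -6: 2 + 4 decimal places
    balance += value
```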
Finally, the complete query. This time, we already have a list of values of type numeric, so we don’t need to perform the initial typecasting step we did in the bigint version. Instead, we just use to_char() to create pretty strings.
SELECT to_char(value, 'FM999,999,999,990.90') AS str_value,
       to_char(balance, 'FM999,999,999,990.90') AS str_balance
FROM (
    SELECT value, SUM(value) OVER (ORDER BY row_id) AS balance
    FROM (
        SELECT row_id, amount * price AS value
        FROM (
            SELECT generate_series(1, 1000000) AS row_id,
                   (random() * 200)::numeric(17,2) AS amount,
                   (random() * 1)::numeric(15,4) AS price
        ) AS random_data
    ) AS raw_values
) AS balances
The full query, in numeric form, takes ~4,200ms. Less generation (~2,200ms) and calculation (~1,350ms) time, the string processing takes ~650ms.
As expected, the natively numeric query is much faster at generating strings: ~650ms vs ~1,150ms, or 56% of the bigint query time.
However, the bigint query wins in the calculation stakes: ~880ms vs ~1,350ms, or 65% of the numeric time. That means that, excluding the random data generation time, the two queries run pretty much equal: 2,030ms for the bigint, and 2,000ms for the numeric.
Of course, we must consider the elephant in the room: maximum precision. Using the bigint technique, we lose a huge chunk of headroom every time we perform a calculation. With an amount featuring two implied decimals, and a price featuring four, we lose six digits. A bigint can store a maximum positive value of 9,223,372,036,854,775,807, or several quintillion units. Less the digits we used for implied decimals, that falls to around nine trillion units.
That puts the bigint calculation in a bind. Say we have a query that requires a further calculation step: perhaps price is calculated as the multiple of two numbers with four decimal digits, e.g. 2.4561 * 1.3258. That is a common case for currencies, and would cause our maximum balance value to fall to around nine hundred million.
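The headroom arithmetic is easy to check; this is a back-of-envelope sketch using the scales discussed above:

```python
BIGINT_MAX = 2**63 - 1             # 9,223,372,036,854,775,807

# Six implied decimals (2 amount + 4 price) in the basic scheme:
print(f"{BIGINT_MAX // 10**6:,}")  # 9,223,372,036,854 -> ~9.2 trillion units

# Ten implied decimals (2 amount + 4 + 4 for a two-step price):
print(f"{BIGINT_MAX // 10**10:,}") # 922,337,203 -> ~922 million units
```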
Based on the above analysis, we might reasonably assume that such a calculation-heavy query denominated in bigint might pull away from a numeric version. However, the loss of maximum balance value might preclude its use for clients with big balance numbers.
As is the way of these things, I’ve now got more questions than I did when I started. What I want to know now is:
- Does changing the numeric precision definition, e.g. numeric(15,4), affect PostgreSQL calculation performance?
- We ignored retrieval speed in this analysis. When the data is coming off the disk, do bigint and numeric cause materially different read times?
- Does adding an extra calculation step cause the bigint method to pull away from numeric?
- Could bigint and numeric be substituted depending on client needs, allowing small clients to benefit from speed while big ones benefit from arbitrary precision?
Questions for another day.