Bigint vs Numeric Performance in PostgreSQL 9.6

Suppose we have an application that stores two big lists of numbers in a relational database. These numbers represent money: both the amount of money units in transactions, and the price of that money.

In response to a client query, we must multiply the big list of transaction amounts by the big list of prices, and keep a running total of those products. We assume we can’t pre-calculate and store the running values, and that we must supply the client with strings, perhaps because they are communicating with us via a web API. We also assume clients expect responses in real time.

Storing money requires absolute precision. We can’t use floats, for obvious reasons. So we would probably use an arbitrary precision type. If we were using PostgreSQL 9.6, as this article assumes, then we might choose to use a numeric type to store both amounts and prices.

Numerics give us arbitrary precision, but come with some baggage. A numeric is a software-implemented decimal: PostgreSQL stores it as a variable-length array of digits rather than as a single machine word, so CPUs can’t operate on it directly using hardware instructions. Whereas two 64-bit integers will each fit in a CPU register and be happily summed, two numerics need to go through a bunch of software shuffling to perform a calculation.

Therefore, if we need to perform lots of calculations quickly, numerics might slow us down when compared to using a 64-bit integer. In PostgreSQL, a 64-bit integer is called a bigint.
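
As a rough illustration of that gap, you can time a simple aggregation both ways. This is just a sanity check of my own, not the benchmark used below:

-- Sum a million values once as plain integers and once after casting to numeric.
-- The numeric version goes through software arithmetic and should be noticeably slower.
EXPLAIN ANALYZE SELECT sum(g)          FROM generate_series(1, 1000000) AS g;
EXPLAIN ANALYZE SELECT sum(g::numeric) FROM generate_series(1, 1000000) AS g;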

Hypothesis: Performing all calculations on integers of type bigint, rather than on arbitrary-precision decimals of type numeric, will vastly reduce computation times when responding to our clients’ queries. Let’s assume the following:

  • A price is stored as a bigint with an implicit precision of four decimal places. For example, a price of £4.4352 is stored as 44352 (NB the absence of the decimal point).
  • An amount is stored as a bigint with an implicit precision of two decimal places. This suits currencies with two decimal places of minor unit precision, such as the US Dollar (USD). For example, an amount of $12,045.42 is stored as 1204542. (Both conversions are sketched just below.)
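
Here’s that round trip written as a quick query (illustrative only; the literals are the two examples from the list above):

-- Convert display values to their implied-decimal bigint form, and back again.
SELECT (4.4352::numeric * 10000)::bigint AS stored_price,    -- 44352 (four implied decimals)
       1204542::bigint::numeric / 100    AS display_amount;  -- 12045.42 (two implied decimals)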

The value of any transaction is therefore the price of the money (44352) times the amount of money units (1204542) divided by the extra orders of magnitude introduced by the implicit decimal places.

In our example, the price has introduced four decimal places, and the amount has introduced two decimal places, for a total of six. Therefore, the calculation is as follows:

44352 * 1204542 / 10^6

This yields 53423.846784, which may be nicely formatted for client consumption as £53,423.85.

So, to get back to our true value with explicit decimal places, we need to perform division. In PostgreSQL, dividing one bigint by another performs integer division, so we lose precision: select (10::bigint / 3::bigint) yields 3. Because we are dealing with money, that is not acceptable.
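
Here’s that truncation side by side with the numeric behaviour we’ll rely on (a quick illustrative query, not part of the benchmark; the exact scale PostgreSQL picks for the numeric result doesn’t matter here):

-- Integer division truncates; numeric division keeps the fractional part.
SELECT 10::bigint / 3::bigint   AS bigint_division,   -- 3
       10::numeric / 3::numeric AS numeric_division;  -- 3.3333..., scale chosen by PostgreSQL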

Numerics, as arbitrary precision types, can retain precision on division. We will cast to a numeric, and divide by another numeric of appropriate size for our implied decimal places.

Here’s the example above written in Postgres 9.6 SQL:

SELECT 
    to_char(value::numeric(15,4) / 1000000::numeric(15,4), 'FM999,999,999,999.99') AS str_representation
FROM (
    SELECT
        amount * price AS value
    FROM (
        SELECT
            1204542::bigint AS amount,
            44352::bigint AS price
    ) AS random_data
) AS calculation

Is this really faster than using numerics? To answer that question, we’ll create a big set of random data that we can operate on.

We will assume a largish organisation that processes 250 transactions per day, every day, for 10 years: 912,500 transactions, rounded up to 1,000,000 for luck. We will assume these transaction amounts are in USD, with two decimal places of implied precision, and sized somewhere between $0 and $200. Each amount is converted into Pounds Sterling (GBP) at some price. Let’s assume (somewhat absurdly) that each GBP price is somewhere between £0 and £10, with four decimal places of implied precision.

SELECT
    generate_series (1, 1000000) as row_id,
    (random() * 20000)::bigint AS amount,
    (random() * 100000)::bigint AS price

My machine features an i7-5557U CPU, and it takes ~670ms to generate these data, which look like this:

row_id   | amount | price  
---------+--------+--------
       1 |  14179 |  48998
       2 |  16948 |  56369
       3 |  12760 |  16965
       4 |  17177 |    977
       5 |  11632 |  38872
       6 |  18370 |  44416
       7 |  14370 |  89625
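
If you want to reproduce the ~ms figures quoted throughout, the simplest approach is psql’s \timing toggle, which prints the client-side elapsed time after each statement; wrapping a query in EXPLAIN ANALYZE gives server-side execution time instead. Your numbers will differ with hardware, but the ratios should hold:

-- psql session sketch: enable client-side timing, then run the generation query.
\timing on
SELECT
    generate_series(1, 1000000) AS row_id,
    (random() * 20000)::bigint  AS amount,
    (random() * 100000)::bigint AS price;
-- psql then prints something like: Time: 670.123 ms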

Now, let’s perform the necessary calculations. For each of the million rows, we need to multiply the amount by the price, and then sum those products into a running balance column.

SELECT 
    value, 
    SUM(value) OVER (ORDER BY row_id) AS balance
FROM (
    SELECT 
        row_id, amount * price AS value
    FROM (
        SELECT 
            generate_series (1, 1000000) as row_id,
            (random() * 20000)::bigint AS amount, 
            (random() * 100000)::bigint AS price
    ) AS random_data
) AS raw_values

This takes ~1,550ms on my i7-5557U. Less the ~670ms to generate the data, that’s a calculation time of ~880ms, or just 880 nanoseconds per row. Computers are so cool. Here’s some sample output:

   value   |    balance     
-----------+----------------
  15987240 |       15987240
   5282332 |       21269572
  32625180 |       53894752
 125860125 |      179754877
   3753301 |      183508178
  79953750 |      263461928
  74762297 |      338224225

These huge integers, with their implicit decimal places, are not very useful. They are certainly not human readable. We need to convert them into decimal numbers with explicit decimal places. To do so, we will cast them to type numeric. Finally, recall that our clients require string output, so we will convert the numeric into a nicely formatted string using the to_char() function.
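
A quick aside on the format mask used below, since to_char() patterns are a little cryptic: FM suppresses the padding that to_char() normally adds, 9 prints a digit only where one exists, and 0 forces a digit to be printed, so small values come out as 0.05 rather than .05. For example:

-- The interim value from earlier, divided down and formatted with the same mask.
SELECT to_char(53423846784::numeric / 1000000, 'FM999,999,999,990.90');  -- 53,423.85

With that in hand, here’s the complete bigint query: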

SELECT
    to_char(value::numeric / 1000000::numeric, 'FM999,999,999,990.90') as str_value,
    to_char(balance::numeric / 1000000::numeric, 'FM999,999,999,990.90') as str_balance
FROM (
    SELECT 
        value, 
        SUM(value) OVER (ORDER BY row_id) AS balance
    FROM (
        SELECT 
            row_id, amount * price AS value
        FROM (
            SELECT
                generate_series (1, 1000000) as row_id,
                (random() * 20000)::bigint AS amount, 
                (random() * 100000)::bigint AS price
        ) AS random_data
    ) AS raw_values
) AS balances

Bam! Approximately 2,700ms. Less generation (~670ms) and calculation (~880ms), that’s ~1,150ms of string processing. All that bigint::numeric type casting and numeric / numeric division adds a good chunk of change. Here is some sample output:

 str_value    | str_balance 
--------------+----------------
 1,029.03     | 1,029.03
 85.97        | 1,115.00
 291.78       | 1,406.78
 670.64       | 2,077.42
 62.33        | 2,139.75
 142.92       | 2,282.68
 2.41         | 2,285.09
 31.01        | 2,316.10

Alright, let’s try the same thing using numeric the whole way through, and then compare the difference. First, we’ll generate the random data. We’ll give the numeric columns precisions in the same ballpark as the 19 digits a PostgreSQL bigint can hold.

SELECT
    generate_series (1, 1000000) as row_id,
    (random() * 200)::numeric(17,2) AS amount, 
    (random() * 1)::numeric(15,4) AS price

Generating the random data takes ~2,200ms, and the output looks as follows:

 row_id  | amount | price  
---------+--------+--------
       1 |  84.33 | 0.8137
       2 | 178.67 | 0.0793
       3 | 115.89 | 0.1935
       4 |   5.61 | 0.9763
       5 |  32.45 | 0.8218
       6 | 154.49 | 0.4451
       7 | 127.43 | 0.3493

Now the basic balance calculation, wherein we multiply amount by price, and then sum a running total of those products:

SELECT 
    value, 
    SUM(value) OVER (ORDER BY row_id) AS balance
FROM (
    SELECT 
        row_id, amount * price AS value
    FROM (
        SELECT
            generate_series (1, 1000000) as row_id,
            (random() * 200)::numeric(17,2) AS amount, 
            (random() * 1)::numeric(15,4) AS price
    ) AS random_data
) AS raw_values

That takes ~3,550ms. Less generation time (~2,200ms) yields a calculation time of ~1,350ms. The output looks like this:

   value    |     balance     
------------+-----------------
  32.893335 |       32.893335
   9.424890 |       42.318225
  46.160686 |       88.478911
 134.282533 |      222.761444
  19.844650 |      242.606094
   8.255778 |      250.861872
   2.859764 |      253.721636

Finally, the complete query. This time, we already have a list of values of type numeric, so we don’t need to perform the initial typecasting step we did in the bigint version. Instead, we just use to_char() to create pretty strings.

SELECT
    to_char(value, 'FM999,999,999,990.90') as str_value,
    to_char(balance, 'FM999,999,999,990.90') as str_balance
FROM (
    SELECT 
        value, 
        SUM(value) OVER (ORDER BY row_id) AS balance
    FROM (
        SELECT 
            row_id, amount * price AS value
        FROM (
            SELECT
                generate_series (1, 1000000) as row_id,
                (random() * 200)::numeric(17,2) AS amount, 
                (random() * 1)::numeric(15,4) AS price
        ) AS random_data
    ) AS raw_values
) AS balances

The full query, in numeric form, takes ~4,200ms. Less generation (2,200ms) and calculation (~1,350ms) time, the string processing takes ~650ms.

As expected, the natively numeric query is much faster at generating strings: ~650ms vs ~1,150ms, or 56% of the time the bigint version spent on string processing.

However, the bigint query wins in the calculation stakes: ~880ms vs ~1,350ms, or 65% of the numeric time. That means that, excluding the random data generation time, the two queries run pretty much equal: ~2,030ms for the bigint, and ~2,000ms for the numeric.

Of course, we must consider the elephant in the room: maximum values. Using the bigint technique, we give up a big chunk of headroom every time we perform a calculation. With an amount carrying two implied decimals, and a price carrying four, we lose six digits of range.

A PostgreSQL bigint can store a maximum positive value of 9,223,372,036,854,775,807, or several quintillion units. Less the six digits we used for implied decimals, that falls to roughly nine trillion money units.

That puts the bigint calculation in a bind. Say we have a query that requires a further calculation step: perhaps the price is itself the product of two numbers with four decimal places each, e.g. 2.4561 * 1.3258. That is a common case for currencies, and the resulting ten implied decimal places would cause our maximum balance value to fall to around nine hundred million.

Based on the above analysis, we might reasonably expect such a calculation-heavy query denominated in bigint to pull away from a numeric version. However, the loss of maximum balance value might preclude its use for clients with big balances.
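
To make that headroom concrete, here’s a quick sketch using the bounds discussed above (the last statement is expected to fail, which is rather the point):

-- The bigint ceiling expressed in money units, for six and ten implied decimal places.
SELECT 9223372036854775807::numeric / 1000000::numeric     AS max_balance_6dp,   -- ~9.2 trillion
       9223372036854775807::numeric / 10000000000::numeric AS max_balance_10dp;  -- ~922 million

-- And the failure mode itself: this product exceeds the bigint range.
SELECT 4000000000::bigint * 3000000000::bigint;  -- ERROR: bigint out of range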

As is the way of these things, I’ve now got more questions than I did when I started. What I want to know now is:

  • Does changing the numeric precision definition, e.g. numeric(9,4) instead of numeric(15,4), affect PostgreSQL calculation performance? (One way to test this is sketched after this list.)
  • We ignored retrieval speed in this analysis. When the data is coming off the disk, do numeric and bigint cause materially different read times?
  • Does adding an extra calculation step cause the bigint method to pull away from numeric?
  • Could bigint and numeric be substituted depending on client needs, allowing small clients to benefit from speed while big ones benefit from arbitrary precision?
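
For that first question, here’s a minimal sketch of a test (no timings claimed; the typmods are just the two from the question):

-- Same aggregation, two different numeric typmods; compare the reported execution times.
EXPLAIN ANALYZE
SELECT sum((random() * 10)::numeric(9,4)) FROM generate_series(1, 1000000);

EXPLAIN ANALYZE
SELECT sum((random() * 10)::numeric(15,4)) FROM generate_series(1, 1000000);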

Questions for another day.

Chat to me about this article anytime on Twitter.