Introduction

This is a small translation of my homework for an MIPT course, where I describe how PostgreSQL internals work with references to the source code.

All source code references are based on the REL_17_5 PostgreSQL git tag. For building PostgreSQL, you can refer to the dockerbuild directory.

Throughout this series, I assume that you have the following tables:

CREATE TABLE t1(a int, b1 int);
CREATE TABLE t2(a int, b2 int);
CREATE TABLE t3(a int, b3 int);

INSERT INTO t1 SELECT i, (i * 5) % 7 from generate_series(1, 10) i;
INSERT INTO t2 SELECT i * 2, (i * 5) % 7 from generate_series(1, 10) i;
INSERT INTO t3 SELECT i * 3, (i * 5) % 7 from generate_series(1, 10) i;

VACUUM (ANALYZE, FULL);

Why do we need a planner at all?

Well, SQL is a declarative (not imperative) language: it specifies what we want from the database but not how to get it. So because we can execute only imperative statements, we need to transform these wishful queries into executable plans — and like any good developer, we want this transformation to be optimal.

And that’s where the planner kicks in:

  1. It can apply general optimizations, such as inlining subqueries into the main plan (much like a compiler inlines functions)
  2. It can choose algorithms optimized for the specific use case, for example a particular scan operator (index scan / seq scan) or join operator (hash / merge)
  3. And … drumroll … one of the biggest problems: the join order. Theoretically, (A ⋈ B) ⋈ C and A ⋈ (B ⋈ C) produce the same result, but in practice the chosen order has a huge impact on performance, as explained later.

Transformation in general

Let’s discuss what the planner actually tries to build. As hinted in the post’s description, PostgreSQL is essentially a compiler that tries to compile your query (and later execute it). So, just as with a compiler, the intermediate representation of your query is an AST, which you can view using the EXPLAIN command:

postgres=# explain select * from t1 order by a;
                       QUERY PLAN
---------------------------------------------------------
 Sort  (cost=1.27..1.29 rows=10 width=8)
   Sort Key: a
   ->  Seq Scan on t1  (cost=0.00..1.10 rows=10 width=8)
(3 rows)

As you can see, the QUERY PLAN is just an AST with two nodes: a Sort and an underlying Seq Scan. So we can say that the planner’s job is to build this AST, which is actually a fairly trivial task until you try to optimize it.

Optimizations

Generally speaking, SQL is kinda limited and can’t naturally express all the capabilities that the underlying database (such as PostgreSQL) has. And that’s actually cool, because it lets you write database-independent queries — at least in theory.

But as a database developer, you don’t want to settle for a generic plan, do you? You want to squeeze out every possible micro-optimization and leverage all the capabilities of your execution engine. And Postgres’ developers thought the same, so many optimizations happen during the planning stage.

You can track these optimizations in the planner’s entry function, subquery_planner (code):

  • As mentioned earlier, PostgreSQL tries to inline subqueries into the main plan when possible, as you can see in the source code (a small example follows this list)
  • As a safeguard against PEBCAK, it also attempts to replace outer joins with inner joins (which can be reordered later) (source).
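
To see the first of these on our toy tables, here is a small sketch (the exact plan depends on your data and settings, but for a simple pull-up like this PostgreSQL normally produces a single Seq Scan with both filters merged, i.e. the same plan as the flattened query):

-- A subquery that the planner can pull up into the main query
EXPLAIN SELECT * FROM (SELECT a, b1 FROM t1 WHERE b1 = 3) sub WHERE sub.a > 2;

-- Typically yields the same plan as writing the query flat
EXPLAIN SELECT a, b1 FROM t1 WHERE b1 = 3 AND a > 2;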

In general, to understand a specific optimization and where and why it applies, you can usually read the comment above its function. For example, further down the call stack, in grouping_planner, the remove_useless_groupby_columns optimization is well described by its comment (source):

/* [...]
 * Since some other DBMSes do not allow references to ungrouped columns, it's
 * not unusual to find all columns listed in GROUP BY even though listing the
 * primary-key columns would be sufficient.  Deleting such excess columns
 * avoids redundant sorting work, so it's worth doing.
 * [...]
 * Currently, we only make use of pkey constraints for this, however, we may
 * wish to take this further in the future and also use unique constraints
 * which have NOT NULL columns.
 * [...]
 */
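
As an illustration, consider a hypothetical table (not part of our setup) with a primary key; since customer is functionally dependent on id, the planner can drop it from the grouping:

-- Hypothetical table, used only to illustrate the optimization
CREATE TABLE orders(id int PRIMARY KEY, customer text, total numeric);

-- customer is redundant in GROUP BY because id is the primary key,
-- so the planner can remove it and group by id alone
EXPLAIN SELECT id, customer, sum(total)
FROM orders
GROUP BY id, customer;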

Cost estimation and algorithm selection

Like any heavily optimized piece of software, PostgreSQL implements many specialized algorithms and must choose between them.

For example, there are three main join algorithms:

  1. Hash Join: Builds a hash table from one input and probes it with rows from the other

    Pros: Usually the fastest option when the hash table fits in memory

    Cons: If the hashed input is too large to fit in memory, the hash table spills to disk, significantly reducing performance

  2. Merge Join: Sorts both tables and combines them via 'two pointers'

    Pros: Consistent speed even when both tables are big

    Cons: Typically must sort all input before emitting rows, making it suboptimal for queries with a LIMIT clause

  3. Nested Loop: Runs two nested loops and performs n × m iterations

    Pros: Simplicity and zero overhead

    Cons: O(n × m) iterations
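
You can see these alternatives for yourself by nudging the planner with the enable_* settings. A small sketch (these settings only discourage a method by heavily penalizing it, they do not strictly forbid it):

-- Discourage hash joins and nested loops to get a merge join plan
SET enable_hashjoin = off;
SET enable_nestloop = off;
EXPLAIN SELECT * FROM t1 JOIN t2 USING (a);

-- Restore the defaults afterwards
RESET enable_hashjoin;
RESET enable_nestloop;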

So how does the planner decide which algorithm suits a given query best?

Well, there is no perfect way to solve this problem, and it remains an active research area in databases. PostgreSQL (like many other systems) uses a cost-based optimizer (CBO) that tries to predict the cost of a plan in abstract units. You can see these estimates, next to the actual execution numbers, via the EXPLAIN ANALYZE command:

explain analyze
select
    MIN(s.title),
    COUNT(ep.id) as episode_n
from movie s
    join movie ep on ep.episode_of_id = s.id
group by s.id
order by episode_n desc;

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
# Sorting rows per ORDER BY clause
 Sort  (cost=808223.55..814544.33 rows=2528312 width=44) (actual time=1988.917..1992.647 rows=51481 loops=1)
   Sort Key: (count(ep.id)) DESC
   Sort Method: quicksort  Memory: 3950kB

   # Aggregation for MIN/COUNT calculation
   ->  HashAggregate  (cost=323936.94..383786.82 rows=2528312 width=44) (actual time=1929.682..1983.244 rows=51481 loops=1)
         Group Key: s.id
         Planned Partitions: 128  Batches: 129  Memory Usage: 8209kB  Disk Usage: 8136kB
        
         # Uses nested loops because it is cheaper than hashing/sorting overhead
         ->  Nested Loop  (cost=0.44..116141.29 rows=2528312 width=25) (actual time=139.306..1647.542 rows=1543264 loops=1)
              
               # Outer loop scans whole table row by row and tries to match every row with other loop
               ->  Seq Scan on movie ep  (cost=0.00..45949.12 rows=2528312 width=8) (actual time=2.562..163.561 rows=2528312 loops=1)
              
               # Caches underlying index scan for faster iterations
               ->  Memoize  (cost=0.44..0.50 rows=1 width=21) (actual time=0.000..0.000 rows=1 loops=2528312)
                     Cache Key: ep.episode_of_id
                     Cache Mode: logical
                     Hits: 2476830  Misses: 51482  Evictions: 0  Overflows: 0  Memory Usage: 6197kB
                    
                     # rows=1 loops=51482 means that Index Scan was called 51482 times and returned 1 row each time. As we can see, 51482
                     # matches number of cache misses in Memoize node
                     ->  Index Scan using title_pkey on movie s  (cost=0.43..0.49 rows=1 width=21) (actual time=0.014..0.014 rows=1 loops=51482)
                           Index Cond: (id = ep.episode_of_id)

First, let’s explain the output format:

cost=[estimated startup cost, i.e. the cost incurred before the first row can be returned]..[estimated total cost of retrieving all rows] rows=[estimated number of returned rows] width=[estimated average row width in bytes] (actual time=[actual startup time in ms]..[actual total time in ms] rows=[actual number of rows] loops=[how many times the node was executed])

(As mentioned earlier, costs are measured in abstract units and cannot be directly compared to real time.)

Cost estimation relies on table statistics: PostgreSQL periodically computes these metrics (via sampling) and exposes them through system catalogs such as pg_class (row counts) and the pg_stats view (per-column statistics); a sample query follows this list:

  • Number of rows
  • Percentage of NULL rows
  • Approximately how sorted the rows are (the correlation between physical and logical row order)
  • Approximate value distribution (the most common values with their frequencies, plus a histogram of the remaining values)
  • Correlation between different columns (when extended statistics have been created with CREATE STATISTICS)
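
For example, here is a quick way to peek at some of these per-column statistics for t1 (the column names below come from the real pg_stats view; the exact values depend on when ANALYZE last ran):

SELECT attname, null_frac, n_distinct, most_common_vals, correlation
FROM pg_stats
WHERE tablename = 't1';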

Next, let’s examine how PostgreSQL estimates the cost of each node. Sadly, a concise summary is nearly impossible for most nodes — the calculations span many functions with non-obvious control flow, because they account for niche effects such as disk locality.

As a simple example, we can review how cost_seqscan (code) calculates sequential scan costs:

/* I/O cost: pages × sequential page cost (cache-friendly) */
disk_run_cost = spc_seq_page_cost * baserel->pages;

/* Filter cost: comparisons, function calls, etc. */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);

/* For the first row we need just to setup filter */
startup_cost = qpqual_cost.startup;

/* For each tuple we have internal computations (cpu_tuple_cost, configurable)
   and filtering computations (qpqual_cost.per_tuple) */
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;

/* Total CPU cost: per-tuple cost × expected rows */
cpu_run_cost = cpu_per_tuple * baserel->tuples;
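
We can connect this with the earlier EXPLAIN output for t1, which showed Seq Scan on t1 (cost=0.00..1.10). Assuming the default seq_page_cost = 1.0 and cpu_tuple_cost = 0.01 and no filter (so qpqual_cost is zero), the arithmetic works out by hand:

-- Page and tuple counts gathered by ANALYZE; for our 10-row table
-- this should report relpages = 1 and reltuples = 10
SELECT relpages, reltuples FROM pg_class WHERE relname = 't1';

-- total cost = disk cost + CPU cost
--            = relpages * seq_page_cost + reltuples * cpu_tuple_cost
--            = 1 * 1.0 + 10 * 0.01 = 1.10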

Similarly, PostgreSQL predicts row counts by estimating filter selectivity (percentage of accepted rows) and propagating it through the plan. For details, see this article.
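
As a minimal sketch of what that means for our data: for an equality filter, the planner multiplies the table's row count by the estimated frequency of the value, taken from the most-common-values list or histogram in pg_stats. In t1, b1 = 3 occurs in 2 of the 10 rows, so with fresh statistics the estimate should be roughly 10 × 0.2 = 2 rows:

EXPLAIN SELECT * FROM t1 WHERE b1 = 3;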

Join order

Theoretical note:

There are two common shapes of join trees: bushy trees and left-deep trees. A bushy tree can join 4 tables as ((A ⋈ B) ⋈ (C ⋈ D)), while a left-deep tree adds one base table at a time: (((A ⋈ B) ⋈ C) ⋈ D). Many systems restrict their search to left-deep trees; PostgreSQL’s standard join search also considers bushy plans.

Consider three tables: A(x), B(x), and C(x), where A and B each contain 1 million rows with x=10, while C has a single row with x=20. Then the query

select * from A join B using (x) join C using(x);

can be computed in two different ways:

  • A ⋈ (B ⋈ C): Joins B and C first. The empty result skips A entirely.
  • (A ⋈ B) ⋈ C: Joins A and B first, producing 10^12 temporary rows (all discarded when joined with C).
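
If you want to reproduce this, here is a sketch of the setup (the table names are just for illustration); with accurate statistics, PostgreSQL should pick the first order and join C early:

CREATE TABLE a AS SELECT 10 AS x FROM generate_series(1, 1000000);
CREATE TABLE b AS SELECT 10 AS x FROM generate_series(1, 1000000);
CREATE TABLE c AS SELECT 20 AS x;
ANALYZE a; ANALYZE b; ANALYZE c;

-- The plan should join c with one of the big tables first,
-- so the million-by-million intermediate result is never built
EXPLAIN SELECT * FROM a JOIN b USING (x) JOIN c USING (x);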

While this example is synthetic, cardinality estimation inaccuracies in real workloads can lead to highly suboptimal join orders. Researchers documented this problem on the IMDB dataset in their article, which led to the still-relevant Join Order Benchmark.

PostgreSQL solves this problem adaptively: for a small number of tables (<12, configurable) it uses a dynamic programming algorithm “on bitmasks”, while for larger queries it employs a genetic algorithm, as implemented in make_rel_from_joinlist (code):

else if (enable_geqo && levels_needed >= geqo_threshold)
    return geqo(root, levels_needed, initial_rels);
else
    return standard_join_search(root, levels_needed, initial_rels);

where standard_join_search is described in its comment as:

/*
 * We employ a simple "dynamic programming" algorithm: we first find all
 * ways to build joins of two jointree items, then all ways to build joins
 * of three items (from two-item joins and single items), then four-item
 * joins, and so on until we have considered all ways to join all the
 * items into one rel.
 *
 * root->join_rel_level[j] is a list of all the j-item rels.  Initially we
 * set root->join_rel_level[1] to represent all the single-jointree-item
 * relations.
 */
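
You do not need a twelve-table query to watch the switch between the two strategies. A small sketch using the real geqo_threshold setting, lowering the threshold so that even our three-table join is planned by the genetic optimizer:

SHOW geqo_threshold;     -- 12 by default

SET geqo_threshold = 2;  -- the minimum allowed value
EXPLAIN SELECT * FROM t1 JOIN t2 USING (a) JOIN t3 USING (a);

RESET geqo_threshold;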

Conclusion

As we’ve seen, the PostgreSQL query planner is a complex system with numerous optimizations and subtle details — and we’ve only scratched the surface. While planning accounts for a small fraction of total execution time, its decisions have an enormous impact on performance, which makes the planner one of the largest and most heavily optimized parts of PostgreSQL — alongside the executor and the buffer manager, which we’ll explore in future posts.


Next part: Executor