Understanding PostgreSQL internals, part 2: Executor (with JIT)

In the previous part, we discussed how PostgreSQL transforms declarative queries into imperative steps. Therefore, it seems logical to discuss the execution part next.

Overview

A basic understanding of how PostgreSQL executes trees can be obtained from the following source code comment:

/*
 *	 NOTES
 *		This used to be three files.  It is now all combined into
 *		one file so that it is easier to keep the dispatch routines
 *		in sync when new nodes are added.
 *
 *	 EXAMPLE
 *		Suppose we want the age of the manager of the shoe department and
 *		the number of employees in that department.  So we have the query:
 *
 *				select DEPT.no_emps, EMP.age
 *				from DEPT, EMP
 *				where EMP.name = DEPT.mgr and
 *					  DEPT.name = "shoe"
 *
 *		Suppose the planner gives us the following plan:
 *
 *						Nest Loop (DEPT.mgr = EMP.name)
 *						/		\
 *					   /		 \
 *				   Seq Scan		Seq Scan
 *					DEPT		  EMP
 *				(name = "shoe")
 *
 *		ExecutorStart() is called first.
 *		It calls InitPlan() which calls ExecInitNode() on
 *		the root of the plan -- the nest loop node.
 *
 *	  * ExecInitNode() notices that it is looking at a nest loop and
 *		as the code below demonstrates, it calls ExecInitNestLoop().
 *		Eventually this calls ExecInitNode() on the right and left subplans
 *		and so forth until the entire plan is initialized.  The result
 *		of ExecInitNode() is a plan state tree built with the same structure
 *		as the underlying plan tree.
 *
 *	  * Then when ExecutorRun() is called, it calls ExecutePlan() which calls
 *		ExecProcNode() repeatedly on the top node of the plan state tree.
 *		Each time this happens, ExecProcNode() will end up calling
 *		ExecNestLoop(), which calls ExecProcNode() on its subplans.
 *		Each of these subplans is a sequential scan so ExecSeqScan() is
 *		called.  The slots returned by ExecSeqScan() may contain
 *		tuples which contain the attributes ExecNestLoop() uses to
 *		form the tuples it returns.
 *
 *	  * Eventually ExecSeqScan() stops returning tuples and the nest
 *		loop join ends.  Lastly, ExecutorEnd() calls ExecEndNode() which
 *		calls ExecEndNestLoop() which in turn calls ExecEndNode() on
 *		its subplans which result in ExecEndSeqScan().
 *
 *		This should show how the executor works by having
 *		ExecInitNode(), ExecProcNode() and ExecEndNode() dispatch
 *		their work to the appropriate node support routines which may
 *		in turn call these routines themselves on their subplans.
 */

In PostgreSQL, trees execute with a ‘pull strategy’: the plan is executed from top to bottom, and each node pulls content row by row from child nodes. As an alternative approach, we could try to execute queries from bottom to top, but among other cons, this approach lacks laziness. For example, select * from very_big_table limit 1 would scan the whole table and then filter everything out instead of pulling just one row. You might argue that this can be optimized by pushing the limit operator into the scan node, but in reality it won’t work - consider a limit above a join. In that case, we can’t predict how many rows we’ll need from each table because we can’t predict how successful the row joins will be. Therefore, the pull strategy is the only way to implement laziness, and its overhead can be optimized via batching (pulling multiple rows at once, though this is currently not implemented in PostgreSQL).

Obviously, not every operator can work in a ‘pull one row -> return one row’ strategy: for example, it’s impossible to sort a table without pulling all rows. In such cases, the operator pulls the entire child iterator into temporary storage, as you can see in the Sort node source code:

static TupleTableSlot *
ExecSort(PlanState *pstate)
{
    /* ... snip ... */

    /* Check if we haven't sorted yet */
	if (!node->sort_Done)
	{
        /* ... snip ... */

        /* Pull rows one by one, put them into sort structure (quick sort / heap sort).
         * Heap sort can be used for order by + limit queries, where we need only top k results
        */
			for (;;)
			{
                /* Pull row */
				slot = ExecProcNode(outerNode);

                /* Null row means iterator end */
				if (TupIsNull(slot))
					break;

                /* Put row into sorting struct */
				tuplesort_puttupleslot(tuplesortstate, slot);
			}

		/*
		 *  Perform the final sort operation
		 */
		tuplesort_performsort(tuplesortstate);

        /* ... snip ... */
        node->sort_Done = true;
	}

    /* ... snip ... */


    /* Return next sorted row */
    (void) tuplesort_gettupleslot(tuplesortstate,
                                    ScanDirectionIsForward(dir),
                                    false, slot, NULL);

	return slot;
}

Thus, the Sort operator performs heavy lifting (pulling all rows and sorting them) only on the first call, and on subsequent calls it just returns rows from its internal storage. Also, this difference between ‘one in -> one out’ and ‘all in -> one by one out’ nodes explains why PostgreSQL needs startup cost evaluation at all.

Virtual calls

Let’s examine how the child node is called in the source code shown earlier:

slot = ExecProcNode(outerNode);

ExecProcNode(node) is just a wrapper for node->ExecProcNode(node), where ExecProcNode is a struct field containing a function pointer. This implements virtual functions similar to what C++ does under the hood, allowing PostgreSQL to abstract specific operators into a general interface of ‘some node that can return rows’. So, for example, when executing next SQL query, you can set a breakpoint on ExecProcNode and step into ExecSeqScan (nodeSeqscan.c), which then calls generalized function ExecScan (execScan.c):

// QUERY: select * from (select * from t1 order by b1) join t2 using(a);

static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
	SeqScanState *node = castNode(SeqScanState, pstate);

	return ExecScan(&node->ss,
					(ExecScanAccessMtd) SeqNext,
					(ExecScanRecheckMtd) SeqRecheck);
}

Here, SeqNext is a function pointer that abstracts how Seq Scan gets the next row. This design allows ExecScan to be reused for Seq Scan, Index (Only) Scan, Subquery Scan, Function Scan, and other scan types.

Let’s review how ExecScan works:

TupleTableSlot *
ExecScan(ScanState *node,
		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
		 ExecScanRecheckMtd recheckMtd)
{
    /* ... snip ... */
    /* Optimization: If no filtering, just return first row
    */
	if (!qual && !projInfo)
	{
		ResetExprContext(econtext);
		return ExecScanFetch(node, accessMtd, recheckMtd);
	}

    /* ... snip ... */

	for (;;)
	{
        /* Pull row */
		slot = ExecScanFetch(node, accessMtd, recheckMtd);

        /* End of underlying iterator */
		if (TupIsNull(slot))
		{
            /* proj means projection — selecting specific columns rather than all */
			if (projInfo)
				return ExecClearTuple(projInfo->pi_state.resultslot);
			else
				return slot;
		}

		/* ... snip ... */

        /* if no filters or row passes them, return it */
		if (qual == NULL || ExecQual(qual, econtext))
		{
            /* Same about projection */
			if (projInfo)
			{
				return ExecProject(projInfo);
			}
			else
			{
				return slot;
			}
		}

        /* ... snip ... */
    }
}

Filtering

Let’s examine the query select 2 * a + 3 * b1 from t1, specifically focusing on the 2 * a + 3 * b1 expression. During the parsing stage, PostgreSQL converts this expression into bytecode that it later interprets for each row. You can verify this by placing a breakpoint in ExecInterpExpr (execExprInterp.c) and observing that it executes 10 times - matching the number of rows in t1.

If you track how it executes, you will see that the bytecode contains 7 instructions:

EEOP_SCAN_FETCHSOME — unpacking row into internal structs
EEOP_SCAN_VAR — extracts column a from the row
EEOP_FUNCEXPR_STRICT — (virtually) calls int4mul (int4 means type and mul means operation), that calculates 2 * a
EEOP_SCAN_VAR — extracts column b1 from the row
EEOP_FUNCEXPR_STRICT — (virtually) calls int4mul, that calculates 3 * b1
EEOP_FUNCEXPR_STRICT — (virtually) calls int4pl that sums this values
EEOP_ASSIGN_TMP — saves the result and returns it

And this is how PostgreSQL evaluates any expression in your query: filter conditions, ORDER BY clauses, join predicates, or aggregate functions. With all the overhead of virtual calls and bytecode interpretation, expression evaluation can consume a significant portion of query execution time.

JIT

To mitigate this performance issue, PostgreSQL 11 introduced optional JIT (Just-In-Time) compilation based on LLVM. JIT is only used for plans deemed ’expensive enough’, as there’s no benefit in spending 100ms compiling expression for a 3ms execution plan. This cost-benefit analysis also results in different thresholds for generic JIT versus optimized JIT with function inlining.

According to the documentation. right now PostgreSQL compiles next parts:

‘Tuple deforming’ from compact on disk format into inmemory structures, that are optimized for usability
Expression evaluation (as described in the Filtering section above)

PostgreSQL’s approach to inlining built-in functions (like int4mul) deserves special mention. While inlining provides significant performance benefits, maintaining separate LLVM IR implementations would be impractical, so when PostgreSQL builds, clang transforms the source code into LLVM bytecode, which is later loaded for JIT purposes. This solution, along with other JIT integration challenges (like field name translation to LLVM IR), is well documented in the developer readme.

The JIT code generation logic mirrors ExecInterpExpr: it uses the same switch-case structure over bytecode in its llvm_compile_expr function, but generates LLVM bytecode instead of interpreting instructions directly. Notably, the implementation prioritizes simularity over optimization - for the EEOP_FUNCEXPR_STRICT opcode, it generates individual blocks for each argument’s null check (rather than translating consolidated loop) and chains them together, despite the relative simplicity of implementing loop codegen.

In the end JIT produces LLVM module with evalexpr function, that is virtually called instead of ExecInterpExpr in execution stage.

Sadly, right now JIT is a part of execution stage in PostgreSQL and therefore: — For SQL functions, it compiles code (where approved by cost model) for each invocation — For prepared statements, compilation occurs for each execution — [Other similar limitations exist] But PostgreSQL developers are working on a JIT cache for compiled expressions that could partially address these issues.

Small note about other JITs

Sadly PostgreSQL’s JIT is far from perfect. For example, in ClickHouse it: — Compiles entire execution plans — JIT compiles every query that executes more than 5 times (rather than only ‘heavy’ queries), caching the generated code. This approach better suits modern applications where queries are frequently repeated.

In theory, PostgreSQL aims to implement similar functionality through its planned JIT cache and JITting at planning stage, though these are not yet available. However, given that prepared statements aren’t widely used, PostgreSQL may not achieve such an elegant JIT solution for the foreseeable future.

JIT (advanced)

You can examine PostgreSQL’s JIT compilation output by following these steps:

Tune GUC parameters

set jit = on;
-- Set cost threshold to zero (-1 disables) [default = 100000]
set jit_above_cost = 0;
-- Dump generated bytecode into data directory [default = false]
set jit_dump_bitcode = true;
-- Set optimization threshold to 0 (-1 disables) [default = 500000]
set jit_optimize_above_cost = 0;
-- Set inlining threshold to 0 (-1 disables) [default = 500000]
set jit_inline_above_cost = -1;
-- Disable tuple deforming codegen for cleaner code [default = on]
set jit_tuple_deforming = off;
-- Enable JITting expressions [default = on]
set jit_expressions = on;

These settings are not realistic for production but are ideal for demonstration purposes, as they generate simpler IR code.

Run explain analyze select 2 * a + 3 * b1 from t1 query, it will display what and how long compiler did:

JIT:
  Functions: 1
  Options: Inlining false, Optimization true, Expressions true, Deforming false
  Timing: Generation 0.202 ms (Deform 0.000 ms), Inlining 0.000 ms, Optimization 0.231 ms, Emission 1.839 ms, Total 2.272 ms

In your PostgreSQL data folder (typically /var/lib/postgres/data), you’ll find files named like 1055.0.bc and 1055.0.optimized.bc, where 1055 represents the server process PID and 0 is the JIT invocation sequence number. Note that PostgreSQL doesn’t automatically clean these files, so you should remove them manually and remember to disable jit_dump_bitcode afterward to prevent bloating your data directory.
Transform bytecode into readable LLVM IR by llvm-dis <.bc filename> -o <ot file.ll>

In unoptimized version of the bytecode for same query as in Filtering section before, guided by int4mul and int4pl calls, you can clearly see simularity with interpreter code:

; `evalexpr` function will be called instead of intrpreter

define noundef i64 @evalexpr_4_0(ptr nocapture noundef writeonly %0, ptr %1, ptr %2) #0 {
entry:
; ... [skipped] loading arguments & global variables...
  br label %b.op.0.start

; Operator 0: compiled version of EEOP_SCAN_FETCHSOME (row loading)

b.op.0.start:
  %20 = getelementptr inbounds %struct.TupleTableSlot, ptr %v_scanslot, i32 0, i32 2
  %21 = load i16, ptr %20, align 2
  %22 = icmp uge i16 %21, 2
  br i1 %22, label %b.op.1.start, label %op.0.fetch

; Calls slot_getsomeattrs_int just like an interpreter
;   This handles tuple deforming - enabling jit_tuple_deforming would generate
;   more complex code here, because it will inline `slot_getsomeattrs_int` here

op.0.fetch:
  call void @slot_getsomeattrs_int(ptr %v_scanslot, i32 2)
  br label %b.op.1.start

; Operator 1: compiled version of EEOP_SCAN_VAR (extracts column `a`)

b.op.1.start:
  %23 = getelementptr i64, ptr %v_scanvalues, i32 0
  %24 = load i64, ptr %23, align 8
  %25 = getelementptr i8, ptr %v_scannulls, i32 0
  %26 = load i8, ptr %25, align 1
  store i64 %24, ptr inttoptr (i64 94008167322488 to ptr), align 8
  store i8 %26, ptr inttoptr (i64 94008167322496 to ptr), align 1
  br label %b.op.2.start

;  Operator 2: Compiled version of EEOP_FUNCEXPR_STRICT (calls `int4mul`)

b.op.2.start:
  store i8 1, ptr inttoptr (i64 94008167322320 to ptr), align 1
  br label %b.2.isnull.0

; As noted earlier, PostgreSQL generates separate null check blocks (b.2.isnull.0/.1)
; rather than a loop, maintaining consistency with interpreter structure

b.2.isnull.0:
  %27 = load i8, ptr getelementptr inbounds (%struct.NullableDatum, ptr getelementptr inbounds (%struct.FunctionCallInfoBaseData, ptr inttoptr (i64 94008167322440 to ptr), i32 0, i32 6), i32 0, i32 1), align 1
  %28 = icmp eq i8 %27, 1
  br i1 %28, label %b.op.3.start, label %b.2.isnull.1

b.2.isnull.1:
  %29 = load i8, ptr getelementptr inbounds (%struct.NullableDatum, ptr getelementptr inbounds ([0 x %struct.NullableDatum], ptr getelementptr inbounds (%struct.FunctionCallInfoBaseData, ptr inttoptr (i64 94008167322440 to ptr), i32 0, i32 6), i32 0, i32 1), i32 0, i32 1), align 1
  %30 = icmp eq i8 %29, 1
  br i1 %30, label %b.op.3.start, label %b.2.no-null-args

; int4mul call

b.2.no-null-args:
  store i8 0, ptr getelementptr inbounds (%struct.FunctionCallInfoBaseData, ptr inttoptr (i64 94008167322440 to ptr), i32 0, i32 4), align 1
  %funccall = call i64 @int4mul(ptr inttoptr (i64 94008167322440 to ptr))
  %31 = load i8, ptr getelementptr inbounds (%struct.FunctionCallInfoBaseData, ptr inttoptr (i64 94008167322440 to ptr), i32 0, i32 4), align 1
  call void @llvm.lifetime.end.p0(i64 32, ptr inttoptr (i64 94008167322472 to ptr))
  call void @llvm.lifetime.end.p0(i64 1, ptr inttoptr (i64 94008167322468 to ptr))
  store i64 %funccall, ptr inttoptr (i64 94008167322312 to ptr), align 8
  store i8 %31, ptr inttoptr (i64 94008167322320 to ptr), align 1
  br label %b.op.3.start

; ... [skipped] same loading `b` column, calling int4mul and int4pl ...

; Block6: compiled version of EEOP_ASSIGN_TMP (stores computed result)
b.op.6.start:
  %46 = load i64, ptr %3, align 8
  %47 = load i8, ptr %4, align 1
  %48 = getelementptr i64, ptr %v_resultvalues, i32 0
  %49 = getelementptr i8, ptr %v_resultnulls, i32 0
  store i8 %47, ptr %49, align 1
  store i64 %46, ptr %48, align 8
  br label %b.op.7.start

; Returns value

b.op.7.start:
  %50 = load i64, ptr %3, align 8
  %51 = load i8, ptr %4, align 1
  store i8 %51, ptr %2, align 1
  ret i64 %50
}

; External function declarations. With inlining enabled, these would be
; replaced with implementations from clang-generated bytecode.

declare void @slot_getsomeattrs_int(ptr noundef, i32 noundef) #1
declare i64 @int4mul(ptr)
declare i64 @int4pl(ptr)

Additional materials

LLVM JIT in PostgreSQL: https://youtube.com/watch?v=k5PQq9a4YqA