Command ID Visibility in PostgreSQL Extensions: What Apache AGE Taught Us

Two bugs in Apache AGE — the graph database extension for PostgreSQL — exposed a subtle interaction between PostgreSQL's Multi-Version Concurrency Control (MVCC) visibility rules and how extensions manage Command IDs inside custom executor nodes. This post walks through the internals involved, explains both bugs and their fixes, and extracts a general checklist for extension authors.

The material covers how PostgreSQL tracks tuple visibility within a single transaction — MVCC, snapshots, and Command IDs (the per-statement counter PostgreSQL uses to determine which rows a statement can see within its own transaction). It explains how the Volcano executor model works, and how custom scan nodes can break visibility guarantees if they don't keep the global Command ID counter and the snapshot's local copy in sync. If you work with graph databases or PostgreSQL extensions and want to understand what's happening under the hood, this is for you.

If you're already comfortable with PostgreSQL's MVCC model, snapshots, and the Volcano executor, skip ahead to The Bug: Command ID Drift.

PostgreSQL Internals Primer

MVCC: Why Old Rows Stick Around

PostgreSQL never overwrites a row in place. When you UPDATE a row, PostgreSQL inserts a new version of that row and marks the old version as dead. When you DELETE, it just marks the existing version. Old versions hang around until VACUUM cleans them up.

This is Multi-Version Concurrency Control (MVCC). It means readers never block writers and writers never block readers — every transaction sees a consistent snapshot of the database, even while other transactions are making changes. The tradeoff is that PostgreSQL needs a way to decide which versions of a row are visible to which transactions.

Tuples and Their Hidden Columns

PostgreSQL calls each row version a tuple. Every tuple on disk carries hidden system columns that control visibility:

Column Meaning
xmin The transaction ID that created this tuple
xmax The transaction ID that deleted/updated this tuple (0 if still live)
Cmin The command ID within the inserting transaction
Cmax The command ID within the deleting transaction

xmin/xmax handle visibility between transactions. Cmin/Cmax handle visibility within a single transaction — and that's where our bugs live.

Snapshots: The Visibility Rulebook

When PostgreSQL runs a query, it takes a snapshot — a frozen view of which transactions and commands have completed. The snapshot determines what each scan can see.

For our purposes, the critical field is curcid — the snapshot's current Command ID. The rule is simple:

A tuple inserted by the CURRENT transaction is visible when: Cmin < curcid
A tuple inserted by the CURRENT transaction is invisible when: Cmin >= curcid

If a transaction inserts a row at command 2, a scan running at curcid = 3 can see it. A scan running at curcid = 1 cannot. This is how PostgreSQL prevents a single SQL statement from seeing its own partial mutations — the scan's curcid is set before the statement begins.

Command IDs: Counting Mutations

Each SQL statement within a transaction gets an incrementing Command ID. The first INSERT is command 0, the next is command 1, and so on. Extensions advance this counter by calling CommandCounterIncrement() after performing mutations. This lets subsequent operations within the same transaction see what was just inserted.

There are two Command ID (CID) values that matter:

  • Global CommandCounter — the process-wide counter, advanced by CommandCounterIncrement(). Retrieved via GetCurrentCommandId().
  • snapshot->curcid — the snapshot's local copy, used during tuple visibility checks. This is what HeapTupleSatisfiesMVCC() actually reads.

In normal SQL execution, these stay synchronized. In extensions with custom executors, they can drift apart. That's the bug.

The Volcano Executor

PostgreSQL executes queries using the Volcano model (also called the iterator model). A query plan is a tree of nodes. Each node has three methods: Init, Exec (returns the next tuple), and End. The top node pulls one tuple at a time from its children, which pull from their children, all the way down. This is lazy evaluation — no node does work until someone asks it for a row.

All executor nodes in a plan tree share an EState (Executor State), which holds the transaction snapshot, output CID, and other shared context.

Custom Scan Nodes

PostgreSQL lets extensions register custom scan nodes — executor nodes that plug into the plan tree just like built-in nodes (SeqScan, IndexScan, etc.) but run extension-defined logic. This is how AGE implements Cypher: each clause (MATCH, CREATE, MERGE) compiles to a custom scan node in the plan tree.

The key point: custom scan nodes have full access to EState, including es_snapshot. They can read and modify curcid. And they are responsible for keeping it consistent with any mutations they perform.

How AGE Uses This

Apache AGE adds graph processing to PostgreSQL via these custom scan nodes. A Cypher query like:

MATCH (a:A)-[:R]->(b:B)
CREATE (c:C {name: a.name})
WITH a, b, c
MERGE (a)-[:LINK]->(c)
RETURN a.name, c.name

compiles into a plan tree where MATCH, CREATE, and MERGE are each custom executor nodes, pulling tuples from each other in Volcano style. They all share the same EState and its snapshot.

The Bug: Command ID Drift

AGE's CREATE executor calls CommandCounterIncrement() after each batch of inserts. For a 3-row MATCH feeding into CREATE, the global CID advances: 0 → 1 → 2 → 3.

But AGE's executor nodes use a macro pair to isolate visibility during child scans:

Decrement_Estate_CommandId(estate);  // before child scan
slot = ExecProcNode(child);          // scan sees original CID
Increment_Estate_CommandId(estate);  // restore after

This keeps es_snapshot->curcid at a stable value during the scan. The problem: while curcid is held stable, the global CommandCounter advances with each CommandCounterIncrement(). After processing 3 rows:

  • Global CID: 3
  • es_snapshot->curcid: still ~0 (from the Decrement/Increment bracketing)
  • Newly inserted vertices have Cmin = 1, 2, 3

When MERGE then calls entity_exists() to verify that a vertex created by CREATE still exists, it scans with the snapshot — which has curcid = 0. The MVCC check says: Cmin(2) >= curcid(0)invisible. AGE concludes the vertex was deleted and throws an error.

The key insight: the global CommandCounter and the snapshot's curcid can drift apart when custom executors bracket child scans with CID manipulation macros.

The Fix: Temporary Command ID Advancement

The fix for entity_exists() (PR #2343) is surgical:

bool entity_exists(EState *estate, Oid graph_oid, graphid id)
{
    CommandId saved_curcid;

    saved_curcid = estate->es_snapshot->curcid;
    estate->es_snapshot->curcid = GetCurrentCommandId(false);

    /* scan with current global CID — sees all recent inserts */
    rel = table_open(label->relation, RowExclusiveLock);
    scan_desc = table_beginscan(rel, estate->es_snapshot, 1, scan_keys);
    tuple = heap_getnext(scan_desc, ForwardScanDirection);
    /* ... */

    estate->es_snapshot->curcid = saved_curcid;  /* restore */
    return result;
}

Save, advance, scan, restore. The function now sees what the global counter says should be visible, without disrupting the executor's CID isolation for other operations.

Interestingly, AGE's DELETE executor already did this correctly — it updates curcid after CommandCounterIncrement(). CREATE and MERGE didn't. The inconsistency is how bugs hide: the correct pattern existed in the same codebase.

The Second Bug: Eager vs. Lazy Execution

A related issue (PR #2344) exposed another CID visibility problem in chained MERGE:

MATCH (x)
MERGE (x)-[:r]->(:t)
MERGE (:C)-[:r]->(:t)
RETURN x

The first MERGE creates :t nodes. The second MERGE should see them. But PostgreSQL's executor processes rows lazily — the second MERGE's lateral join materializes its hash table on the first row, before the first MERGE has finished creating all its :t nodes.

The fix: eager buffering for non-terminal MERGE nodes. Instead of returning rows one at a time (lazy), a non-terminal MERGE processes all input rows first, buffers the results, then emits them. This ensures downstream nodes see the complete set of mutations.

if (!terminal && !css->eager_buffer_filled)
{
    /* process ALL input rows first */
    while (true)
    {
        slot = ExecProcNode(child);
        if (TupIsNull(slot)) break;
        /* process merge, buffer result */
        css->eager_tuples = lappend(css->eager_tuples, ExecCopySlotHeapTuple(slot));
    }
    css->eager_buffer_filled = true;
}
/* then return buffered rows one at a time */

This is the same tension between lazy and eager evaluation that shows up across database systems. PostgreSQL's Volcano-model executor is fundamentally pull-based (lazy). When mutation visibility depends on processing order, lazy evaluation breaks — you need to force eager materialization at the right points.

The Pattern for Extension Authors

If you're writing a PostgreSQL extension with custom executor nodes that perform mutations:

After CommandCounterIncrement(), update curcid if any subsequent scan in the same plan needs to see the new tuples: c CommandCounterIncrement(); estate->es_snapshot->curcid = GetCurrentCommandId(false);

If you bracket child scans with CID manipulation, be aware that the global counter still advances inside the child. Your restored curcid may be stale relative to mutations that happened during the child scan.

Non-terminal mutation nodes in a plan tree may need eager buffering to ensure downstream nodes see the complete set of mutations, not a partial set that depends on row-by-row processing order.

Test with >1 input row. CID drift bugs often don't manifest with small inputs because the CID gap isn't large enough to cross the visibility threshold. Our regression tests use 3+ rows specifically for this reason.

Check for consistency. If DELETE updates curcid but CREATE doesn't, there's a bug. Extensions grow organically and visibility handling is easy to miss in less-exercised code paths.

What This Means

These aren't AGE-specific problems. They're consequences of PostgreSQL's MVCC model interacting with custom plan nodes that perform mid-execution mutations. Any extension that adds DML-capable custom scan nodes — foreign data wrappers with write-back, procedural language executors, custom merge operations — faces the same CID visibility constraints.

PostgreSQL's visibility model is well-documented for regular SQL operations. But when you write an extension that introduces new executor node types, you're operating in territory where the rules are less explicit. The snapshot API works correctly — the question is whether your extension keeps curcid synchronized with the mutations it performs.

Both fixes shipped in PRs #2343 and #2344 to Apache AGE. The regression tests that accompany them (issues #1954 and #1446) serve as reproducible demonstrations of the bug patterns described here.

Subscribe to OOXO

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe