Chapter 19: CXL and the Disaggregated Address Space

19.1 Introduction: When Physical Memory Leaves the Socket

Every chapter in this book has rested on an assumption so fundamental it was never stated: when the MMU finishes a page table walk and produces a physical address, that address names a location in DRAM attached directly to the same die or package by a local memory controller. The address traverses a few centimetres of silicon interconnect and reaches its destination in roughly 80 nanoseconds. This coupling between the CPU and its memory has defined processor architecture since the 1960s, and every mechanism described in the preceding eighteen chapters — TLB design, page table structures, PTW microarchitecture, OS page allocation, TLB shootdown protocols — was designed with it as bedrock.

Compute Express Link (CXL) dissolves that assumption. CXL is a cache-coherent interconnect, built on the PCIe physical layer, that extends the CPU's physical address space to memory devices that may be on a separate add-in card, a separate tray, or a separate chassis. The CPU MMU produces a physical address, exactly as it always has, and that address may refer to a DRAM module sitting 200 to 500 nanoseconds away over a PCIe fabric. The page table entry that produced this address is indistinguishable from any other PTE; there is no CXL bit, no far-memory flag, no hint to the hardware that anything unusual is happening. The TLB caches the translation with exactly the same priority as it would cache a translation for local DRAM. The page walker traverses exactly the same four-level hierarchy. The only difference is what happens after the physical address is produced: instead of a ~80 ns round trip to the local memory controller, the request must traverse the CPU's PCIe root complex, the CXL host bridge, possibly a CXL switch, and a CXL memory controller before the data returns.

Three things change under CXL; one thing does not.

What changes: the access latency of a TLB-miss resolution (a page walk followed by a data access can now cost roughly 500–700 ns instead of ~400 ns, as Section 19.4 derives); the correct placement of page table pages (they must be pinned in local DRAM or every page walk will cost ~2,000 ns); and the direction of TLB coherence in CXL 3.0 Shared Memory configurations (the CXL device can now initiate a shootdown toward the CPU, the first device-initiated TLB invalidation protocol described in this book).

What does not change: the virtual-to-physical address translation mechanism itself. The same CR3-chained four-level walk on x86-64, the same TTBR0/TTBR1 mechanism on ARM64, the same satp register on RISC-V — all are unchanged. CXL operates entirely below the level at which the MMU functions, in the memory bus layer that the CPU's translation hardware has always treated as a black box. This distinction is the chapter's central theme: CXL is a revolution in memory topology but a non-event for translation architecture, and understanding both sides of that sentence is required to deploy CXL correctly.

Readers with strong familiarity with the book's earlier chapters will recognise several connections. The two-stage address translation model introduced for virtualisation in Chapter 3 reappears in CXL 3.0 Shared Memory, where multiple hosts each have their own virtual-to-physical mapping but must ultimately resolve to a unified CXL device physical address. The NUMA-aware page table placement discussed in Chapter 9 becomes critically important under CXL tiering, because a page table page that migrates to the CXL tier inflicts the same 6× PTW slowdown regardless of whether the application intended it. The PTW latency model developed in Chapter 17 requires a new row for CXL-resident memory. And the TLB shootdown protocols from Chapters 4, 9, and 12 — all CPU-initiated — now have an inversion introduced by CXL 3.0 Back-Invalidation that this chapter documents for the first time.

RISC-V is explicitly out of scope for this chapter. As of 2025, no production RISC-V CPU implements a CXL host bridge; the RISC-V open-ISA philosophy and software-managed TLB design make CXL integration a future research direction rather than a current deployment consideration. The chapter covers x86-64 (Intel Sapphire Rapids and successors) and ARM64 (NVIDIA Grace Hopper), where production CXL support exists and can be measured.

[Figure 19.1 diagram. Panels: CXL Sub-Protocols (CXL.io, CXL.cache, CXL.mem); Device Types (Type 1 Accelerator, Type 2 Accelerator + Memory, Type 3 Memory Expander); CXL Specification Evolution, Memory Management Perspective (CXL 1.0 through CXL 4.0).]
Figure 19.1: CXL Protocol Family and Device Taxonomy. Three sub-protocols (.io, .cache, .mem) combine to define three device types. Type 3 memory expanders — the focus of this chapter — use CXL.io for device management and CXL.mem for load/store access to device-attached DRAM that appears in the host physical address space as Host-managed Device Memory (HDM).

19.2 CXL Protocol Architecture: .io, .cache, .mem

CXL is not a single protocol but a layered stack of three sub-protocols, each serving a different purpose, built on top of the PCIe physical and link layers. Understanding which sub-protocol handles which function is prerequisite to understanding how CXL devices interact with the CPU's address translation machinery.

CXL.io is a superset of PCIe's transaction layer protocol. It handles all device management tasks: device discovery during system boot, configuration space access, Base Address Register (BAR) mapping through which the OS enumerates the device's memory ranges, and interrupt delivery. CXL.io carries no cache coherence semantics; it is the administrative channel. Every CXL device uses CXL.io regardless of type.

CXL.cache enables a CXL device to cache host memory. A device that uses CXL.cache implements a Device Translation Look-aside Buffer (DTLB), structurally analogous to the CPU's TLB. When the device accesses host memory using a virtual address, it uses the Address Translation Services (ATS) mechanism — defined originally in the PCIe ATS specification and extended by CXL — to request a virtual-to-physical translation from the host MMU. The host fulfils this request, the device caches the result in its DTLB, and subsequent accesses with the same virtual address are served from the DTLB without further host involvement. The DTLB is non-coherent with the host's TLBs, so when the host OS modifies a PTE, it must send an ATS INVALIDATE message to the device and receive an acknowledgement before proceeding. This is a straightforward extension of the standard TLB shootdown protocol described in Chapter 4: the device DTLB is simply one more invalidation target alongside the CPU cores. CXL 3.0 introduces Back-Invalidation, which inverts this protocol entirely; Section 19.7 addresses that mechanism in detail.
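The translate, cache, invalidate, and acknowledge sequence described above can be sketched as a toy model. This is an illustration of the ordering only, not the ATS wire protocol: the class names `Host` and `Device`, the dictionary standing in for a page table, and every method name are hypothetical.

```python
class Device:
    """Toy CXL.cache device with a device-side translation cache (DTLB)."""

    def __init__(self, host):
        self.host = host
        self.dtlb = {}          # virtual page number -> physical frame number
        self.ats_requests = 0   # translations fetched from the host

    def translate(self, vpn):
        # DTLB hit: served locally, no host involvement.
        if vpn in self.dtlb:
            return self.dtlb[vpn]
        # DTLB miss: ask the host (stands in for an ATS translation request).
        self.ats_requests += 1
        pfn = self.host.page_table[vpn]
        self.dtlb[vpn] = pfn
        return pfn

    def invalidate(self, vpn):
        # Drop the cached translation, then acknowledge to the host.
        self.dtlb.pop(vpn, None)
        return True


class Host:
    """Toy host that owns the page table and invalidates device DTLBs."""

    def __init__(self):
        self.page_table = {}
        self.devices = []

    def remap(self, vpn, new_pfn):
        # Update the PTE, then wait for every device to acknowledge.
        self.page_table[vpn] = new_pfn
        acks = [dev.invalidate(vpn) for dev in self.devices]
        if not all(acks):
            raise RuntimeError("device failed to acknowledge invalidation")


host = Host()
host.page_table[0x10] = 0xAAA
dev = Device(host)
host.devices.append(dev)

first = dev.translate(0x10)    # miss: fetched from the host
second = dev.translate(0x10)   # hit: served from the DTLB
host.remap(0x10, 0xBBB)        # PTE change forces a DTLB invalidation
third = dev.translate(0x10)    # miss again: observes the new mapping
```

The point of the sketch is that the device is simply one more invalidation target: the host cannot consider the remap complete until every cached copy of the old translation is gone.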

CXL.mem provides direct load/store semantics for device-attached memory. A CXL.mem device exposes its DRAM to the host as Host-managed Device Memory (HDM): a range of host physical addresses that, when accessed via load or store instructions, cause the request to traverse the CXL link to the device's memory controller and return data across the same link. From the CPU's perspective, HDM is identical to ordinary DRAM: it lives in the physical address map, it is page-allocatable by the kernel, and its contents are accessible through standard memory operations. The translation mechanism that produced the physical address has no knowledge of what happens next.

These three sub-protocols combine to define the three device types that CXL specifies. Type 1 devices — SmartNICs, DPUs, FPGAs used as accelerators — implement CXL.io and CXL.cache but have no device-attached memory; they cache the host's memory in their own on-device SRAM. Type 2 devices implement all three protocols simultaneously: they are accelerators with device-attached memory that both caches host data (CXL.cache) and exposes additional capacity (CXL.mem). Type 3 devices — the central subject of this chapter — implement only CXL.io and CXL.mem. They are pure memory expanders: no compute, no caching of host memory, only DRAM exposed as HDM.

The CXL specification has evolved through four generations with progressively richer memory management implications. CXL 1.0 (2019) and its 1.1 revision defined the protocol family and all three device types, but each device attached directly to a single host with no switching or pooling. CXL 2.0 (2020) added switching and, critically, introduced memory pooling via a CXL switch: a single Type 3 device could serve multiple hosts, but only one host could own the device at a time, with no concurrent access across hosts. CXL 3.0 (2022) made the architectural leap to Shared Memory, allowing multiple hosts to simultaneously map the same physical CXL memory with hardware-managed MESI coherence; it also introduced Back-Invalidation. CXL 4.0, published by the CXL Consortium in November 2025, doubles the link bandwidth from 64 GT/s to 128 GT/s (matching PCIe 7.0), introduces bundled port aggregation for multi-rack topologies, and enhances memory reliability, availability, and serviceability (RAS) features, all without changing the fundamental address translation model.

On x86-64, Intel's Sapphire Rapids processor family (2023) was the first production CPU to ship with CXL 1.1 host bridge support. Subsequent Xeon generations extend this to CXL 2.0 and CXL 3.0. On ARM64, NVIDIA's Grace Hopper Superchip incorporates CXL support through the Grace CPU chiplet; Grace Hopper systems can combine CXL-attached memory with NVLink-attached GPU HBM in a single unified address space. AMD's MI300X AI accelerator includes CXL through its CPU chiplet, supplementing the 192 GB on-package HBM. SK Hynix and Samsung have shipped Type 3 CXL DRAM expander modules in volume since 2023.

[Figure 19.2 diagram. Blocks: CPU Package (cores, caches, MMU/TLB, page table walker, memory controller), Local DRAM, PCIe Root Complex and CXL Host Bridge (HDM ranges enumerated via ACPI SRAT/HMAT), CXL Switch, CXL Type 3 Memory Expanders A and B (128 GB DDR5 each). Host physical address space: local DRAM (~80 ns), HDM Expander A (~300 ns), HDM Expander B (~300 ns). Linux model: DRAM as NUMA node 0, CXL as a cpuless NUMA node.]
Figure 19.2: CXL Physical Address Space Map. CXL Type 3 memory expanders are enumerated via ACPI SRAT/HMAT tables and appear as Host-managed Device Memory (HDM) ranges in the host's physical address space. The CPU MMU translates virtual addresses to these physical addresses using the same page table mechanism as local DRAM — the latency difference (80 ns vs 300+ ns) is invisible to the translation hardware and only manifests after the physical address is issued to the memory bus.
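A small sketch makes the "invisible to the translation hardware" point concrete. The lookup below models the decision that the platform's address decoders make after translation has finished; nothing in a PTE participates in it. The base addresses are assumptions chosen for illustration (only the 128 GB sizes come from the example system), and `tier_of` is a hypothetical helper name.

```python
from bisect import bisect_right

GIB = 1 << 30

# Hypothetical host physical address map: (start, end_exclusive, tier name).
# Sizes follow the 128 GB examples in the text; base addresses are assumed.
ADDRESS_MAP = [
    (0 * GIB, 128 * GIB, "local DRAM"),
    (128 * GIB, 256 * GIB, "HDM expander A"),
    (256 * GIB, 384 * GIB, "HDM expander B"),
]
_STARTS = [start for start, _, _ in ADDRESS_MAP]


def tier_of(phys_addr):
    """Return the tier that decodes phys_addr, or None if it is unmapped."""
    idx = bisect_right(_STARTS, phys_addr) - 1
    if idx < 0:
        return None
    start, end, name = ADDRESS_MAP[idx]
    return name if start <= phys_addr < end else None


print(tier_of(0x1000))        # an address in local DRAM
print(tier_of(130 * GIB))     # an address decoded by expander A
```

The MMU hands over a physical address in either case; only this range lookup, which lives below the translation layer, determines whether the request stays on the local memory controller or crosses the CXL link.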

19.3 CXL as a NUMA Node: How Linux Sees Disaggregated Memory

When a Linux system boots with CXL Type 3 memory attached, the firmware enumerates the device's HDM ranges in the ACPI SRAT (System Resource Affinity Table) and HMAT (Heterogeneous Memory Attribute Table). The SRAT declares which physical address ranges belong to which memory domain; the HMAT quantifies the access latency and bandwidth of each range relative to each CPU. From this information, the Linux kernel constructs its NUMA topology model, and CXL memory appears as a new NUMA node — one with no CPU cores, only memory. The kernel community calls this a "cpuless" or "memory-only" NUMA node.

[Figure 19.3 diagram. Phase 1, firmware: ACPI SRAT and HMAT enumeration of the CXL Type 3 device's HDM. Phase 2, Linux kernel: NUMA topology construction (nodes 0 and 1 with CPUs and local DRAM; node 2 as a cpuless CXL node). Phase 3: scheduler and allocator awareness.]
Figure 19.3: CXL Type 3 memory as a Linux cpuless NUMA node. Firmware describes CXL HDM ranges in ACPI SRAT (physical address range to NUMA domain) and HMAT (per-pair latency and bandwidth). Linux constructs a NUMA topology with a memory-only node (node 2 in this example) with no CPUs. The CXL NUMA distance is typically 40 (vs. 10 local / 20 cross-socket), reflecting the additional 70–90 ns link latency measured in Azure Pond at 8–16 socket pool scale.

The practical consequence is visible immediately:

$ numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0-55
node 0 size: 512000 MB
node 0 free: 340221 MB
node 1 cpus: 56-111
node 1 size: 512000 MB
node 1 free: 298413 MB
node 2 cpus: (none)           # CXL node — no CPUs
node 2 size: 131072 MB
node 2 free: 130048 MB
node distances:
node   0   1   2
  0:  10  20  40             # CXL node 2 is "farther" than NUMA node 1
  1:  20  10  40
  2:  40  40  10

The NUMA distance of 40 for the CXL node (compared to 10 for local access and 20 for cross-socket) reflects the additional latency measured and reported in the HMAT. Applications launched under numactl --membind=2, or that call mbind() or set_mempolicy() with MPOL_BIND targeting NUMA node 2, will receive CXL-backed allocations. Applications that set no explicit NUMA policy will, by default, allocate from the local DRAM node of the CPU they run on (node 0 or 1) unless memory pressure forces overflow.
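Detecting the memory-only node programmatically is a matter of finding the node whose CPU list is empty. The sketch below parses text in the format of the listing above; `cpuless_nodes` is a hypothetical helper, and it accepts both the "(none)" marker shown in this chapter and an empty CPU field, since the exact rendering may vary between numactl versions.

```python
import re


def cpuless_nodes(numactl_output):
    """Return sorted node IDs whose 'cpus:' line lists no CPUs."""
    nodes = []
    for line in numactl_output.splitlines():
        match = re.match(r"node (\d+) cpus:(.*)$", line.strip())
        if not match:
            continue
        # Ignore trailing '#' annotations such as those in the listing above.
        cpu_field = match.group(2).split("#")[0].strip()
        if cpu_field in ("", "(none)"):
            nodes.append(int(match.group(1)))
    return sorted(nodes)


SAMPLE = """\
available: 3 nodes (0-2)
node 0 cpus: 0-55
node 0 size: 512000 MB
node 1 cpus: 56-111
node 1 size: 512000 MB
node 2 cpus: (none)           # CXL node
node 2 size: 131072 MB
"""

print(cpuless_nodes(SAMPLE))
```

A cpuless node is a strong hint, not proof, that the memory is CXL-attached; other memory-only devices can produce the same signature, so the HMAT latency figures remain the authoritative source.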

The Linux CXL subsystem, introduced in kernel 5.12 and substantially expanded in 5.14 through 6.x, lives under drivers/cxl/ and provides two operating modes for CXL memory. In system-ram mode, the HDM region is hot-added to the kernel's memory management as a fully allocatable memory region belonging to the cpuless NUMA node; this is the standard deployment mode for memory tiering. In DAX mode (direct access), the region is exposed as a character device (/dev/daxN.M) or as a DAX-capable filesystem, bypassing the page cache entirely and allowing applications to map CXL memory directly via mmap without OS page management overhead — useful for persistent memory semantics or when the application manages its own memory layout.

The interaction between CXL's cpuless NUMA model and the kernel's AutoNUMA (Automatic NUMA Balancing) mechanism creates a well-documented pitfall. AutoNUMA was designed to migrate pages toward the NUMA node where they are most frequently accessed, reducing cross-socket latency. With CXL as NUMA node 2, AutoNUMA can identify a page being accessed primarily by CPUs on node 0, observe that the page is currently on the CXL node, and migrate it to node 0 — which is exactly the desired behaviour. However, the same mechanism can work in reverse: a page on node 0 that is accessed from both node 0 and node 1 may be classified as a candidate for migration to the "neutral" CXL node 2, which is the wrong outcome. More critically, without explicit CXL-tier awareness, AutoNUMA has no way to distinguish the CXL node from a standard NUMA node, and may freely migrate pages in both directions without regard to the 3–5× latency differential.

Meta's Transparent Page Placement (TPP), presented at ASPLOS 2023, addressed this by introducing explicit CXL tier classification into the kernel's page migration logic. TPP uses hardware performance counter sampling (Intel PEBS or AMD IBS) to identify hot and cold pages, classifies the CXL node as an explicit "slow tier," and ensures that only pages classified as cold are eligible for migration to CXL. Hot pages are migrated toward local DRAM even if they are currently on the CXL tier. Linux 5.18 introduced kernel.numa_balancing=2 (the sysctl lives under kernel, not vm), which activates the CXL-aware tiering mode derived from TPP, as opposed to kernel.numa_balancing=1 (original AutoNUMA behaviour). Operators deploying CXL-tiered systems must set this flag explicitly; the default kernel.numa_balancing=1 will misclassify the CXL tier.
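A deployment check for the balancing mode can be sketched as follows. On mainline kernels the setting is exposed at /proc/sys/kernel/numa_balancing and is interpreted as bit flags (1 for classic balancing, 2 for memory tiering). The function names are hypothetical, and the reader function returns None where the file is absent, for example on a kernel built without NUMA balancing.

```python
from pathlib import Path

NUMA_BALANCING_NORMAL = 0x1           # classic AutoNUMA behaviour
NUMA_BALANCING_MEMORY_TIERING = 0x2   # tier-aware page placement


def describe_numa_balancing(raw):
    """Decode the sysctl's text value into a list of active modes."""
    value = int(raw.strip())
    modes = []
    if value & NUMA_BALANCING_NORMAL:
        modes.append("normal")
    if value & NUMA_BALANCING_MEMORY_TIERING:
        modes.append("memory-tiering")
    return modes or ["disabled"]


def current_numa_balancing(path="/proc/sys/kernel/numa_balancing"):
    """Read and decode the live setting, or return None if unavailable."""
    sysctl = Path(path)
    if not sysctl.exists():
        return None
    return describe_numa_balancing(sysctl.read_text())


print(describe_numa_balancing("2\n"))
print(current_numa_balancing())
```

On a CXL-tiered host, a result that lacks "memory-tiering" indicates the misclassification risk described above.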

The kswapd memory reclaim daemon also requires configuration. Its default behaviour is to scan and reclaim pages from the zone with highest pressure, starting with local zones. With CXL attached, the desired policy is to reclaim (push to CXL or swap) cold pages from local DRAM while preserving hot pages in local DRAM, and to treat CXL as a reclaim destination rather than a swap device. The kernel control vm.zone_reclaim_mode must be configured to express this preference. Setting it to 0 (the default) causes reclaim to occur across NUMA nodes without bias; setting it to 1 biases reclaim toward the local node first, which is the correct behaviour when CXL is present.

19.4 The Unchanged Mechanism, the Changed Cost

The central insight of this chapter, worth stating plainly before examining its implications: from the perspective of the CPU's translation hardware — the MMU, the TLB, the page table walker — a physical address that points into CXL-attached memory is absolutely indistinguishable from a physical address that points into local DRAM. There is no CXL bit in the PTE. There is no tiered-memory flag in the TLB entry. The page walker does not change its behaviour based on the destination of the physical address it has computed. The TLB entry that results from a successful page walk has exactly the same structure, exactly the same lifetime, and exactly the same priority as any other TLB entry. All of this is by design: CXL's integration into the physical address space is transparent to software and hardware above the memory bus layer.

The cost implications of this transparency are severe unless the system is tuned carefully. Consider the path of a memory access that results in a TLB miss in a correctly configured CXL-tiered system where page tables are pinned in local DRAM:

  1. The CPU issues a virtual address load. The TLB is consulted and misses.
  2. The hardware page table walker begins. It reads the PML4 entry from DRAM: ~80 ns.
  3. It reads the PDPT entry from DRAM: ~80 ns.
  4. It reads the PD entry from DRAM: ~80 ns.
  5. It reads the PT entry from DRAM: ~80 ns. Total PTW: ~320 ns.
  6. The physical address is produced. It names a location in CXL-attached memory.
  7. The CPU's memory controller forwards the request via the PCIe root complex to the CXL host bridge.
  8. The CXL host bridge forwards the request to the CXL switch, which routes it to the appropriate Type 3 device.
  9. The device's DDR5 memory controller services the request and returns the data.
  10. Data traverses the CXL switch, the host bridge, and the PCIe fabric back to the CPU: +70–300 ns depending on pool topology.

The PTW itself (steps 2–5) is unchanged: ~320 ns, identical to Chapter 17's baseline. The difference appears entirely in steps 6–10. A data access that would have cost ~80 ns in local DRAM now costs 150–380 ns, depending on whether the system uses an 8–16 socket pool (Pond's measured +70–90 ns over NUMA-local DRAM) or rack-scale pooling (180+ ns additional latency). The total TLB-miss resolution cost rises from ~400 ns (PTW + data) to ~500–700 ns.
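The arithmetic behind these figures is simple enough to capture in a few lines. The sketch below uses only the latencies quoted in this section (80 ns per local DRAM access, four walk levels, +70 to +300 ns of CXL link latency); `tlb_miss_cost_ns` is a hypothetical helper, and the model deliberately ignores page-walk caches and overlapping accesses.

```python
DRAM_NS = 80       # local DRAM access latency used throughout the chapter
WALK_LEVELS = 4    # four-level x86-64 page table walk


def tlb_miss_cost_ns(data_extra_ns=0, table_access_ns=DRAM_NS):
    """Walk cost plus the data access that follows it.

    data_extra_ns:   added latency when the data page is on CXL
    table_access_ns: latency of each page-table read (local DRAM by default)
    """
    walk = WALK_LEVELS * table_access_ns
    data = DRAM_NS + data_extra_ns
    return walk + data


local = tlb_miss_cost_ns()          # 320 + 80  = 400
pool_low = tlb_miss_cost_ns(70)     # 320 + 150 = 470
rack_high = tlb_miss_cost_ns(300)   # 320 + 380 = 700

print(local, pool_low, rack_high)
```

The walk term is constant across all three calls, which is the section's claim in numeric form: with page tables pinned locally, only the final data access pays for the CXL link.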

This matters most for workloads with high TLB miss rates. Chapter 11 documented that LLM attention computation over large KV caches can drive DRAM TLB miss rates above 40%. Chapter 12 showed that multi-GPU workloads spend measurable fractions of their time in TLB shootdowns. When such workloads are deployed on CXL-tiered systems, each TLB miss resolution is 25–75% more expensive than it was with local DRAM alone. For a workload spending 20% of execution time on TLB misses (a high but plausible figure for attention-heavy LLM inference), migrating data to CXL adds 5–15% to total execution time even if the data would otherwise have hit in the DRAM page cache — because the miss penalty has grown.
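The 5–15% estimate follows from scaling only the miss-handling share of runtime. The sketch below states that model explicitly; `added_runtime_fraction` and its `cxl_share` parameter (the fraction of misses whose data lives on CXL, assumed to be 1.0 in the text's estimate) are hypothetical names introduced for illustration.

```python
def added_runtime_fraction(miss_time_fraction, local_miss_ns, cxl_miss_ns,
                           cxl_share=1.0):
    """Extra runtime as a fraction of the original runtime.

    miss_time_fraction: share of original runtime spent resolving TLB misses
    cxl_share:          assumed fraction of misses whose data is on CXL
    """
    per_miss_increase = cxl_miss_ns / local_miss_ns - 1.0
    return miss_time_fraction * cxl_share * per_miss_increase


low = added_runtime_fraction(0.20, 400, 500)    # about 0.05
high = added_runtime_fraction(0.20, 400, 700)   # about 0.15
half = added_runtime_fraction(0.20, 400, 700, cxl_share=0.5)

print(f"{low:.3f} {high:.3f} {half:.3f}")
```

Lowering `cxl_share` shows why tiering policy matters: if only half of the missing accesses touch CXL-resident data, the worst-case penalty halves as well.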

[Figure 19.4 chart. Vertical axis: latency in nanoseconds, log scale. Bars: L1 cache ~1 ns, L2 cache ~4 ns, L3 cache ~12 ns, local DRAM ~80 ns, NUMA 120–200 ns, CXL 8-socket pool 150–170 ns, CXL rack-scale 260–280 ns. Annotation: Pond (ASPLOS 2023) reports 43% of Azure workloads within 5% of baseline performance at +64 ns.]
Figure 19.4: Memory Access Latency Hierarchy Extended with CXL Tiers. Local DRAM latency (~80 ns) is the baseline established in Chapter 1. CXL attached memory at pool scale adds 70–90 ns (measured in Azure Pond, ASPLOS 2023), and rack-scale CXL adds over 180 ns. The translation mechanism — TLB lookup and page table walk — is identical for all tiers; the latency difference appears entirely after the physical address is produced.

The implication for TLB architecture decisions is also significant. Chapter 16 analysed huge page usage as the most important TLB optimisation: a 2 MB huge page reduces the TLB miss rate by up to a factor of 512 relative to 4 KB pages, because a single TLB entry then covers the 2 MB that would otherwise need 512 entries (512 × 4 KB = 2 MB). Under local DRAM, each TLB miss costs ~400 ns; a single sequential pass over 2 MB of data incurs up to 512 misses with 4 KB pages (512 × 400 ns = 204,800 ns) but only one with a huge page, a saving of roughly 204,400 ns. Under CXL, each TLB miss costs ~600 ns; the same pass costs 512 × 600 ns = 307,200 ns with 4 KB pages, so the huge page saves roughly 306,600 ns, 50% more per huge page than with local DRAM. Put differently, TLB miss overhead is a more severe problem on CXL memory than on local DRAM, which makes huge page allocation on CXL tiers correspondingly more important. Yet CXL memory fragmentation, arising from the pool allocation model, makes huge page assembly harder than it is for local DRAM, creating a tension that production deployments must resolve through carefully preallocated huge page pools on CXL nodes.
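The huge-page saving can be checked with a worst-case model of one sequential pass, in which every page touched costs one TLB miss. The helper names are hypothetical. The sketch counts the single miss that the huge page itself still incurs, so it reports 511 avoided misses rather than a rounded 512, and it assumes the same per-miss cost for both page sizes.

```python
SMALL_PAGE = 4 * 1024
HUGE_PAGE = 2 * 1024 * 1024


def pass_miss_cost_ns(region_bytes, page_bytes, miss_cost_ns):
    """Worst-case miss cost for one sequential pass: one miss per page."""
    pages = -(-region_bytes // page_bytes)   # ceiling division
    return pages * miss_cost_ns


def huge_page_saving_ns(miss_cost_ns):
    """Miss cost avoided by mapping 2 MB with one huge page."""
    small = pass_miss_cost_ns(HUGE_PAGE, SMALL_PAGE, miss_cost_ns)
    huge = pass_miss_cost_ns(HUGE_PAGE, HUGE_PAGE, miss_cost_ns)
    return small - huge


local_saving = huge_page_saving_ns(400)   # 511 * 400 = 204,400 ns
cxl_saving = huge_page_saving_ns(600)     # 511 * 600 = 306,600 ns

print(local_saving, cxl_saving, cxl_saving / local_saving)
```

The ratio of the two savings is exactly the ratio of the miss costs, which is why the benefit of a huge page grows in step with the latency of the tier behind it.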

The TLB miss rate model also applies to the page table walker itself, in a manner that Section 19.5 examines in detail. The key observation here is that the PTW model from Chapter 17 — which assumes DRAM-resident page tables as its baseline — remains accurate as long as page tables are correctly pinned in local DRAM. If that invariant is violated, the PTW latency explodes in a manner with no precedent in prior chapters, and it does so silently: the hardware makes no complaint, the OS receives no error, and the only observable symptom is performance degradation that may be misdiagnosed as network congestion, disk latency, or application logic overhead rather than its true cause — page table pages in CXL-attached memory.

19.5 The Page Table Pinning Constraint

The most operationally critical constraint introduced by CXL memory tiering is invisible to most documentation, absent from most deployment guides, and devastating when violated: page table pages must never be allowed to migrate to the CXL tier. This section explains why the constraint exists, how Linux's memory tiering infrastructure can violate it silently, and what operators must do to enforce it.

The reason is arithmetic. As Chapter 17 established, a four-level page table walk on x86-64 under normal conditions (page tables DRAM-resident) costs approximately 320 ns: four sequential memory accesses of ~80 ns each. Now suppose the kernel's memory tiering daemon has, during a period of memory pressure, demoted the page table pages for a particular process to the CXL tier. The page table walker has no awareness of this; it simply issues memory requests at the physical addresses it computes at each level. Each of those memory requests now traverses the PCIe fabric to the CXL device and back:

PML4 access: ~500 ns   (CXL-attached DDR5, upper end of the 200–500 ns range)
PDPT access: ~500 ns
PD access:   ~500 ns
PT access:   ~500 ns
─────────────────────
Total PTW:  ~2,000 ns  (6.25× worse than 320 ns baseline)

For a workload running on a 128-core server with a 40% TLB miss rate (entirely plausible for LLM attention computation, as Chapter 11 documented), having page tables in CXL memory means that 40% of all memory accesses incur a ~2,000 ns PTW instead of a ~320 ns PTW. Overall throughput does not fall by 6.25×, because the other 60% of accesses still hit in the TLB, but the page-walk component of execution time does grow by 6.25×. A workload that spent 20% of its time on PTW activity now spends 0.20 × 6.25 = 1.25 times its original total runtime on page walks alone; adding the unchanged 80% gives roughly 2.05 times the original runtime. Execution time slightly more than doubles solely from misplaced page tables, with no change in workload behaviour.
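The effect of misplaced page tables on total runtime can be worked through with a two-term model: the share of time spent in page walks is scaled by the walk slowdown, and everything else is left unchanged. The function names are hypothetical, and the model assumes walks do not overlap with other work.

```python
def walk_ns(level_latency_ns, levels=4):
    """Latency of a sequential page table walk."""
    return levels * level_latency_ns


def runtime_multiplier(walk_time_fraction, walk_slowdown):
    """New runtime divided by old runtime when only walk time is scaled."""
    return (1.0 - walk_time_fraction) + walk_time_fraction * walk_slowdown


dram_walk = walk_ns(80)       # 320 ns with page tables in local DRAM
cxl_walk = walk_ns(500)       # 2,000 ns with page tables on CXL
slowdown = cxl_walk / dram_walk                    # 6.25
multiplier = runtime_multiplier(0.20, slowdown)    # 0.80 + 1.25 = 2.05

print(dram_walk, cxl_walk, slowdown, round(multiplier, 2))
```

Even a workload that spends only 5% of its time walking page tables would run about 26% longer under this model, which is why the pinning constraint applies well beyond TLB-bound workloads.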

The danger is that Linux's memory tiering infrastructure does not, by default, distinguish page table pages from application data pages when deciding what to demote to the slow tier. The kernel allocates page table pages through alloc_page() with the GFP_KERNEL flag, which requests memory from ZONE_NORMAL, the zone that spans local DRAM. Under ordinary conditions this is safe. However, when memory tiering is enabled with CXL as a NUMA node, the kernel's demotion path in kswapd and the explicit migration path triggered by migrate_pages() can move any page classified as cold, including pages in ZONE_NORMAL, to the CXL-backed NUMA node. Page table pages are not exempt from this classification unless the kernel is explicitly configured to protect them.

[Figure 19.5 diagram. Panels: Baseline, page tables in DRAM (4 × ~80 ns, total PTW ~320 ns); Pinned page tables with CXL data (PTW ~320 ns, followed by the CXL data access); Page tables in CXL (4 × ~500 ns, total PTW ~2,000 ns, 6.25× baseline).]
Figure 19.5: Page Table Walk Latency Under Three Configurations. The baseline (page tables in local DRAM) matches Chapter 17's PTW latency model at ~320 ns. The correct CXL deployment pins page tables in local DRAM while allowing application data in CXL — PTW cost is unchanged. If page table pages migrate to CXL (a danger when memory tiering is misconfigured), every TLB miss costs ~2,000 ns — 6.25× worse than baseline and potentially catastrophic for workloads with high TLB miss rates.

The fix is explicit page table pinning. Linux provides several mechanisms at different levels of the software stack:

At the kernel configuration level, the vm.zone_reclaim_mode sysctl controls whether reclaim is bounded to local zones. Setting vm.zone_reclaim_mode=1 prevents reclaim from crossing NUMA node boundaries, which ensures that page table pages allocated on node 0 are not reclaimed to CXL node 2. This is the blunt-instrument approach: it prevents all inter-tier reclaim, not just page table migration, which limits the effectiveness of CXL tiering for data pages as well.

More precise control would come from a per-range no-migrate hint, such as a madvise(MADV_DONTMIGRATE) flag applied to the virtual address ranges that back page tables. Mainline Linux does not define such a madvise flag, and in any case the approach requires kernel-internal access to page table pages' virtual addresses, which user space does not have. Kernel module developers can use set_memory_4k() and the vm_flags of struct vm_area_struct to prevent specific pages from migrating, but this is not a standard operator lever.

The production-grade approach, used by Azure and Meta, is to classify the CXL tier explicitly as a slow tier in the HMAT and configure the kernel's tiering daemon to exclude page table pages from demotion eligibility. In Linux 6.6+, the tiering daemon (memory_tier subsystem) can be configured with tier preference policies that pin pages based on allocation flags; page table pages carry the PG_table flag, which the tiering daemon can use as an exclusion criterion. Operators should verify this configuration explicitly. The debugfs and sysctl paths below are illustrative; their names and availability depend on the kernel build:

$ grep -r PageTable /sys/kernel/debug/memory_tiering/ 2>/dev/null
# should show page table pages excluded from demotion candidate list
$ cat /proc/sys/vm/demote_page_tables
0  # 0 = never demote page table pages to slow tier (correct)

The ARM64 situation is architecturally identical. ARM64's TTBR0/TTBR1-rooted page table walk traverses up to four levels (with 4 KB granule), and each level's access cost follows the same arithmetic: if the page tables are in CXL-attached memory, each level costs ~500 ns instead of ~80 ns. The constraint — pin page tables in local DRAM — applies equally on ARM64. NVIDIA's Grace Hopper implementation, which attaches CXL memory to the Grace CPU chiplet, enforces this constraint at the hardware level by reserving a fixed partition of the Grace package's local LPDDR5X memory for kernel structures including page tables, preventing the CXL-attached capacity from being reached by page table walks.

19.6 Memory Tiering: The OS Page Allocator Under CXL

Operating a CXL-tiered memory system correctly requires the OS page allocator to make informed decisions about which tier each page should occupy. This is not automatic: left to default configuration, the allocator will treat all memory as equivalent and distribute allocations arbitrarily across local DRAM and CXL, resulting in either underutilisation of CXL capacity (everything stays in DRAM) or degraded performance (latency-sensitive data ends up in CXL). The correct policy — hot data in local DRAM, cold data in CXL, page tables always in DRAM — must be enforced through explicit kernel configuration and, in high-performance deployments, hardware performance counter feedback.

Hot/cold classification is the foundation of the tiering policy. Three mechanisms are used in production systems, each with different accuracy and overhead tradeoffs.

The highest-accuracy mechanism is hardware performance counter sampling. Intel's Processor Event-Based Sampling (PEBS) and AMD's Instruction-Based Sampling (IBS) can record the virtual or physical address of sampled memory loads. By counting how frequently each page is hit in the sample stream, the kernel builds a per-page access frequency estimate. Pages with low frequency are cold candidates for CXL demotion; pages with high frequency are hot candidates for DRAM promotion. Meta's TPP uses PEBS for exactly this purpose. The overhead is approximately 1–2% CPU utilisation for the sampling daemon, which is acceptable for production deployments.
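A minimal sketch of the counting step, assuming the sampling hardware has already delivered a list of sampled load addresses. The function name, the threshold, and the sample values are hypothetical; a real classifier would also age counts over time and treat never-sampled pages as cold.

```python
from collections import Counter

PAGE_SHIFT = 12  # 4 KB pages


def classify_pages(sampled_addrs, hot_threshold):
    """Bucket sampled addresses by page number and split pages into
    hot and cold sets by sample count. Pages that never appear in the
    sample stream are not listed here (they would be treated as cold)."""
    counts = Counter(addr >> PAGE_SHIFT for addr in sampled_addrs)
    hot = {page for page, n in counts.items() if n >= hot_threshold}
    cold = set(counts) - hot
    return hot, cold


# Hypothetical sample stream: four hits on page 0x1, one each on 0x2 and 0x5.
samples = [0x1000, 0x1008, 0x1FF0, 0x2000, 0x5000, 0x1010]
hot, cold = classify_pages(samples, hot_threshold=3)
print("hot pages:", sorted(hot), "cold pages:", sorted(cold))
```

The threshold is the tuning knob: set too low, marginally warm pages are promoted; set too high, genuinely hot pages stay in the slow tier.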

The second mechanism is the PTE accessed bit. The hardware sets the accessed bit in a PTE when it walks the page table to load that translation into the TLB; an access that hits an existing TLB entry does not touch the PTE again. The kernel periodically scans page tables, records which bits are set, and clears them. Pages whose accessed bits are not re-set between scans are cold candidates. This is the mechanism used by AutoNUMA (vm.numa_balancing). Its accuracy is lower than PEBS because a single scan interval may be too long for rapidly changing access patterns and because a set bit records only that a page was touched, not how often. Its overhead is also lower, and it requires no special hardware support.

The third mechanism is working-set estimation. The kernel tracks memory access patterns at the granularity of memory zones and NUMA nodes using counters in struct zone_reclaim_stat. When the allocator places a new page, it consults these statistics to determine whether the destination zone is under pressure, and adjusts placement accordingly. This is a coarser instrument than per-page sampling but has essentially zero additional overhead.

Migration between tiers involves three distinct operations: copying the page's contents from source to destination, updating the PTE in the page table to point to the new physical address, and broadcasting a TLB shootdown to all CPU cores so that stale TLB entries for the old physical address are invalidated. The shootdown is the dominant cost for small-page (4 KB) migrations on systems with many cores. On a 128-core server, the shootdown requires sending an IPI to 127 cores, each of which executes an INVLPG instruction or equivalent, and the source core must wait for all 127 acknowledgements before the migration is complete. The measured cost is 10–50 µs per page, depending on system load and IPI handling latency.

The TLB shootdown cost during migration creates an important threshold effect: migrating a 4 KB page to save ~400 ns of future CXL access latency is only profitable if the page will be accessed at least 25–125 more times (10–50 µs migration cost ÷ 400 ns per-access saving). Pages that are accessed once or twice and then become cold are cheaper to leave in CXL than to promote to DRAM. This threshold makes accurate hot/cold classification crucial: aggressive promotion of marginally warm pages wastes CPU time on migrations whose savings are never realised.
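The break-even point can be worked through directly. This sketch uses the 10–50 µs migration cost and the ~400 ns per-access saving quoted in this section; the function name is hypothetical.

```python
import math


def break_even_accesses(migration_cost_ns: int, saving_per_access_ns: int) -> int:
    """Number of future accesses needed before a promotion to local DRAM
    has repaid its one-off migration cost."""
    return math.ceil(migration_cost_ns / saving_per_access_ns)


low = break_even_accesses(10_000, 400)   # 10 us migration
high = break_even_accesses(50_000, 400)  # 50 us migration
print(f"promotion pays off after {low} to {high} future accesses")
```

A tiering policy could compare a page's estimated future access count against this bound before queuing it for promotion.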

Transparent Huge Page demotion is a subtle interaction that becomes significant at scale. Linux Transparent Huge Pages (THP) allocate 2 MB contiguous pages in DRAM to reduce TLB miss rates. When memory pressure forces demotion of THP-backed regions to CXL, the kernel must first split the 2 MB huge page into 512 × 4 KB pages, update 512 PTEs, issue a shootdown for all 512 entries, and only then migrate the 4 KB fragments to CXL. If every fragment pays the full per-page cost, the total is 512× that cost, or up to about 26 ms of stall time (512 × 50 µs) for one 2 MB demotion. In production deployments with many gigabytes of THP-backed memory under pressure, this can cause observable latency spikes. The mitigation is to keep CXL-bound regions from being backed by huge pages in the first place: applying madvise(MADV_NOHUGEPAGE) to regions identified as CXL-bound before they are populated means no emergency split is needed when memory pressure later forces demotion.
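The 512× multiplier can be made concrete with a short calculation. This is a sketch under the pessimistic assumption that each 4 KB fragment pays the full 10–50 µs per-page cost with no shootdown batching; a kernel that batches invalidations would pay less. The function name is hypothetical.

```python
HUGE_PAGE_BYTES = 2 * 1024 * 1024
BASE_PAGE_BYTES = 4 * 1024
SUBPAGES_PER_THP = HUGE_PAGE_BYTES // BASE_PAGE_BYTES


def thp_demotion_cost_us(per_page_cost_us: int) -> int:
    """Total stall for demoting one split huge page, assuming each
    fragment is migrated and shot down individually (no batching)."""
    return SUBPAGES_PER_THP * per_page_cost_us


best_us = thp_demotion_cost_us(10)
worst_us = thp_demotion_cost_us(50)
print(f"{SUBPAGES_PER_THP} fragments: {best_us / 1000:.2f} to {worst_us / 1000:.2f} ms")
```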

[Figure 19.6 diagram. States: HOT (local DRAM, ~80 ns, ZONE_NORMAL, page tables always here); WARM (demotion candidate, accessed bit cleared, kswapd monitoring, PEBS sampling active); COLD (CXL memory tier, ~260–400 ns, cpuless NUMA node). Transitions: demote under kswapd pressure; migrate to CXL if not re-accessed; promote to DRAM when access is detected (PEBS or AutoNUMA). Migration cost: copy, PTE update, TLB shootdown, ~10–50 µs per 4 KB page. Invariant: page table pages never leave local DRAM, otherwise page walk latency rises to ~2,000 ns (6.25× degradation).]
Figure 19.6: Memory Tiering State Machine Under CXL. Pages transition between three tiers based on access frequency. Hot pages remain in local DRAM (80 ns). Warm pages awaiting classification sit in a monitoring state. Cold pages migrate to the CXL tier (260–400 ns). The invariant — page table pages must always remain in local DRAM — is shown explicitly. Each migration incurs a TLB shootdown across all CPU cores.

Microsoft's Pond system (ASPLOS 2023) provides the most thorough empirical characterisation of workload sensitivity to CXL tiering. Across 158 production Azure workloads, Pond found that at a CXL latency overhead of +64 ns over NUMA-local DRAM, 43% of workloads ran within 5% of their local-DRAM performance. At +140 ns overhead, 37% remained within 5%. However, more than 21% of workloads suffered a performance loss exceeding 25% even at the smallest latency differential — these workloads have tightly latency-coupled memory access patterns that CXL's added latency disrupts severely. The critical operational insight is that CXL tiering is not universally beneficial: deployment requires workload characterisation ahead of time. Workloads with streaming or batch-oriented memory access patterns (database analytics, model checkpointing, log aggregation) tolerate CXL well. Workloads with pointer-chasing, random access, or latency-sensitive feedback loops (interactive serving, online transaction processing, low-latency inference) often do not.

Pond's system-level response to this variance is a machine learning model trained on hardware counter profiles to classify each VM workload as CXL-tolerant or CXL-sensitive before allocation. CXL-tolerant VMs receive a mixture of local DRAM and CXL pool memory; CXL-sensitive VMs receive only local DRAM. The ML model uses 200 CPU performance counters as features and achieves sufficient precision to hold the fraction of VMs experiencing >5% degradation below a configurable threshold (set to 2% of VM population in Azure's deployment).

19.7 TLB Shootdown Inverted: CXL Back-Invalidation

Every TLB shootdown described in this book — in Chapter 4 (the fundamental mechanism), Chapter 9 (OS-level page management), Chapter 10 (device DTLB via ATS), and Chapter 12 (multi-GPU scale) — follows the same directional model: the OS modifies a page table entry, and the OS initiates the invalidation of all cached translations for that entry across all CPU cores and any device DTLBs that may have cached it. The initiator is always the OS. The direction is always from software to hardware.

CXL 3.0 introduces a protocol that inverts this direction entirely. In the Back-Invalidation mechanism, the CXL Fabric Manager — a software component running on the CXL switch hardware — initiates a TLB invalidation request that travels from the CXL device toward the host CPU. The host CPU is the recipient of an invalidation request it did not initiate. This is the first device-to-CPU TLB invalidation protocol described in this book, and understanding when and why it is needed requires understanding the CXL 3.0 Shared Memory model.

In CXL 2.0, a Type 3 device is owned exclusively by one host at a time. The host's physical address space maps a region of the device's HDM, and the host can issue load/store operations to that region freely. There is no concurrent access by another host; there is no need for cross-host coherence. The only TLB-related operation is the straightforward ATS INVALIDATE when the host OS unmaps a page in the HDM range.

CXL 3.0 Shared Memory changes this fundamentally. Multiple hosts simultaneously map regions of the same Type 3 device's HDM. Each host has its own host-physical address mapping for whatever portion of the CXL device's memory it currently holds. The CXL Fabric Manager controls which host has access to which portion of device memory, and it can dynamically reassign these regions — for example, to rebalance memory across a pool of servers, to reclaim memory from a host that has released its allocation, or to handle a hardware fault that requires migrating a memory region to a different device.

When the Fabric Manager decides to revoke a host's access to a CXL memory region in order to reassign that region to another host, it must ensure that the host losing access has no valid TLB entries, no outstanding memory requests, and no cached data referencing the revoked region. That host's OS has not initiated this operation; it may have active translations for the region cached in its TLBs across all 128 cores. The Fabric Manager cannot simply reassign the memory without first clearing those stale translations, or the old owner could issue a read to what is now another host's memory, which would be a serious coherence violation.

The Back-Invalidation protocol resolves this. The sequence is:

  1. The CXL Fabric Manager identifies a memory region on the Type 3 device that it wishes to revoke from Host A and reassign to Host B.
  2. The Fabric Manager issues a Back-Invalidate request to Host A's CXL host bridge, specifying the host physical address (HPA) range to be invalidated.
  3. The Host A CXL host bridge raises an interrupt (or writes to a memory-mapped notification register) to alert the Host A OS.
  4. The back-invalidate handler that the Host A OS registered during CXL initialisation is invoked. It sends IPIs to all CPU cores, each of which executes the appropriate TLB invalidation instruction (INVLPG addr on x86-64, TLBI on ARM64) for the virtual addresses that map the affected range.
  5. Once all CPU cores have acknowledged the invalidation, the Host A OS issues a Back-Invalidate Completion response to the CXL host bridge.
  6. The CXL host bridge forwards the Completion to the Fabric Manager.
  7. The Fabric Manager, now certain that Host A holds no valid translations for the revoked region, proceeds with the reassignment to Host B.
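
The seven steps above can be sketched as a small simulation. This is a toy model, not the CXL protocol or any kernel interface: all class and method names are hypothetical, TLB entries are represented only by the physical frame numbers they point at, and interrupts and IPIs are modelled as ordinary function calls.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    tlb: set = field(default_factory=set)  # physical frame numbers with cached translations

    def invalidate(self, start: int, end: int) -> bool:
        """Drop every cached translation that targets [start, end)."""
        self.tlb = {pfn for pfn in self.tlb if not (start <= pfn < end)}
        return True  # acknowledgement back to the initiating core


@dataclass
class Host:
    name: str
    cores: list

    def handle_back_invalidate(self, start: int, end: int) -> bool:
        """Steps 3-5: handler runs, every core invalidates, and completion
        is reported only once all acknowledgements have arrived."""
        return all([core.invalidate(start, end) for core in self.cores])


class FabricManager:
    def __init__(self):
        self.owner = {}  # (start, end) frame range -> owning host name

    def reassign(self, region, old_host: Host, new_host: Host) -> None:
        """Steps 1-2 and 6-7: request invalidation, wait for completion,
        then hand the region to the new host."""
        start, end = region
        if not old_host.handle_back_invalidate(start, end):
            raise RuntimeError("no completion received; region not reassigned")
        self.owner[region] = new_host.name


host_a = Host("A", [Core({0x80000, 0x80001, 0x10}) for _ in range(4)])
host_b = Host("B", [Core() for _ in range(4)])
fm = FabricManager()
region = (0x80000, 0x80002)
fm.owner[region] = "A"
fm.reassign(region, host_a, host_b)
print(fm.owner[region], [sorted(c.tlb) for c in host_a.cores])
```

The ordering is the point of the sketch: ownership changes only after every core on the old host has acknowledged, which mirrors the all-or-nothing requirement of the sequence.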

The latency of this sequence is substantially larger than a simple OS-initiated shootdown, because it includes the Fabric Manager decision time, the interrupt delivery to the host OS, and the completion handshake traversing the CXL fabric. On a 128-core server, the measured range is 50–200 µs for the complete back-invalidation sequence — compared to ~10–50 µs for a standard OS-initiated shootdown of the same address range. This overhead is acceptable for the use cases that require Back-Invalidation (memory region reallocation across hosts), which occur on timescales of milliseconds to seconds rather than microseconds.

[Figure 19.7 diagram, two panels. Left, standard OS-initiated shootdown: (1) OS kernel modifies the PTE, (2) IPI to CPU cores, which execute INVLPG, (3) ATS INVALIDATE to device DTLBs, (4) acknowledgements return and the OS proceeds; direction OS → CPU cores → devices; latency ~50 µs on a 128-core system. Right, CXL 3.0 Back-Invalidation: (1) Fabric Manager triggers reallocation and sends a Back-Invalidate request, (2) interrupt raised on the host, (3) handler sends IPIs to all CPUs, (4) cores invalidate and the completion is returned, (5) reallocation proceeds; direction device → host CPU; latency 50–200 µs; new in CXL 3.0.]
Figure 19.7: TLB Shootdown Directions: Standard OS-Initiated vs CXL 3.0 Back-Invalidation. The left panel shows the standard shootdown protocol described in Chapters 4, 9, 10, and 12: the OS modifies a PTE, sends IPIs to all CPU cores, and issues an ATS INVALIDATE to device DTLBs. The right panel shows CXL 3.0 Back-Invalidation, where the CXL Fabric Manager initiates the sequence. It is the first device-to-CPU TLB invalidation protocol described in this book.

For x86-64, the hardware implementation of Back-Invalidation support lives in the CPU's CXL host bridge registers. The host bridge exposes a doorbell or memory-mapped I/O register that the Fabric Manager can write to trigger the interrupt. The CPU architecture itself requires no changes; the existing INVLPG instruction handles the TLB flush, and the existing IPI mechanism handles cross-core coordination. The CXL host bridge and its driver are responsible for translating the Fabric Manager's host physical address range into the host OS's virtual address ranges and invoking the appropriate kernel handlers.

The ARM64 implementation follows a similar pattern. The CXL host bridge triggers an interrupt, and the ARM64 TLBI (TLB Invalidate) instruction family performs the actual TLB invalidation. ARM64's TLBI instruction set is richer than x86-64's INVLPG (it can invalidate by address, by ASID, by VMID, or globally), which makes the ARM64 back-invalidation handler somewhat simpler to implement correctly for large address ranges.

Device DTLBs on Type 1 and Type 2 devices that use CXL.cache also participate in Back-Invalidation. When the Fabric Manager initiates a back-invalidation on a host that has a Type 1 or Type 2 CXL device attached, the CXL host bridge must not only clear the host CPU's TLBs but also issue ATS INVALIDATE messages to any device DTLBs that may have cached translations for the affected address range, and wait for their acknowledgements before issuing the Back-Invalidate Completion. The complete fan-out — Fabric Manager → Host CPU TLBs → Device DTLBs → Completion — adds further latency but preserves the all-or-nothing semantic that the coherence protocol requires.

19.8 CXL 3.0 Shared Memory: Two-Stage Translation Returns

Chapter 3 introduced two-stage address translation in the context of CPU virtualisation. When a hypervisor runs guest operating systems, the guest OS manages its own page tables mapping Guest Virtual Addresses (GVA) to Guest Physical Addresses (GPA), but the hypervisor must additionally translate GPA to Host Physical Address (HPA), the address that actually identifies a physical memory location. Intel's Extended Page Tables (EPT), AMD's Nested Page Tables (NPT), ARM's Stage-2 page tables, and RISC-V's G-stage all solve the same structural problem: two independent software entities each maintain their own address space, and hardware must efficiently compose their translations. Chapter 8 explored the performance implications of this composition, and the page walk caches that hardware uses to avoid re-traversing both sets of page tables on every TLB miss.

CXL 3.0 Shared Memory introduces a structurally identical two-stage translation problem, but in a domain that has nothing to do with virtualisation. The two stages are not hypervisor and guest; they are two independent physical servers that both map regions of the same physical CXL device. The mechanism is not EPT or NPT, but the mathematical structure is isomorphic.

In CXL 2.0 and earlier, this problem does not arise because a Type 3 device is owned exclusively by one host at a time. Host A's physical address map has a region — say, 0x8000_0000 through 0xBFFF_FFFF — that names the CXL device's first 1 GB. When Host A issues a load to physical address 0x8000_0000, the CXL host bridge translates this to CXL Device Physical Address 0x0000_0000 and forwards the request. This is a trivial linear mapping — effectively a constant offset — and it is managed entirely by the CXL host bridge hardware without software page tables.

CXL 3.0 Shared Memory introduces concurrency. Host A and Host B are both running, and both have live mappings into the same CXL Type 3 device's memory. Host A's physical address 0x8000_0000 and Host B's physical address 0xC000_0000 both correspond to CXL Device Physical Address 0x0000_0000 — the same physical memory cell. Host A and Host B each have their own virtual address spaces, their own page tables, their own TLBs, and their own operating systems. Neither OS has direct knowledge of the other's mapping.

[Figure 19.8 diagram. Host A (Server 1): VA 0x7F00_0000 → Host-A-PA 0x8000_0000 via Host A page tables, then Host-A-PA 0x8000_0000 → CXL-Device-PA 0x0000_0000 via the CXL host bridge mapping. Host B (Server 2): VA 0xAB00_0000 → Host-B-PA 0xC000_0000 via Host B page tables, then Host-B-PA 0xC000_0000 → CXL-Device-PA 0x0000_0000, the same physical CXL location. CXL switch Snoop Filter: tracks which hosts cache which CXL-PA; CXL-PA 0x0000_0000 is held by Host A and Host B in Shared state. Fabric Manager: controls access permissions and initiates Back-Invalidation. CXL Type 3 device: unified physical address space backed by DDR5.]
Figure 19.8: CXL 3.0 Shared Memory: Two-Stage Address Translation. Host A and Host B each have independent virtual address spaces and page tables that map to their own host physical address ranges. A second translation stage — implemented by the CXL host bridge and Fabric Manager — maps each host's physical address range to a shared CXL device physical address space. The Snoop Filter at the CXL switch tracks which hosts have cached which CXL physical addresses, enabling MESI coherence across the fabric. This is structurally analogous to EPT/NPT nested page tables described in Chapter 3, but applied to disaggregated memory rather than virtual machine isolation.

The two stages of address translation in this model are:

Stage 1 (host page tables): Each host's OS manages page tables that map the host's virtual address space to the host's physical address space, exactly as described in Chapter 3. Host A's VA 0x7F00_0000 translates to Host-A-PA 0x8000_0000. Host B's VA 0xAB00_0000 translates to Host-B-PA 0xC000_0000. Each host's TLB caches these stage-1 translations. Each host's page table walker traverses these stage-1 page tables in the event of a TLB miss. The OS manages these page tables using the same mechanisms described in Chapters 3, 7, and 9.

Stage 2 (CXL host bridge mapping): The CXL host bridge hardware maintains a mapping from each host's physical address range to the CXL device's physical address space. Host-A-PA 0x8000_0000 maps to CXL-Device-PA 0x0000_0000. Host-B-PA 0xC000_0000 maps to CXL-Device-PA 0x0000_0000. These mappings are established during boot when the Fabric Manager assigns memory regions to hosts, and updated when regions are reassigned (requiring Back-Invalidation as described in Section 19.7). Unlike EPT/NPT, this stage-2 mapping is not a full page table hierarchy — it is a hardware register-level address range translation, more similar to a BAR mapping than a walk-able data structure.
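
A minimal sketch of the two stages, using the addresses from this section. Stage 1 is reduced to a one-entry dictionary standing in for each host's page tables, and stage 2 is a base-plus-offset window. The names `Window`, `stage1`, and `bridge_translate` are hypothetical, and the 1 GB size of Host B's window is an assumption (the text gives only its base address).

```python
from typing import NamedTuple

PAGE = 0x1000  # 4 KB pages


class Window(NamedTuple):
    host_base: int    # start of the host physical range
    size: int         # length of the range in bytes
    device_base: int  # matching start in CXL device physical space


def stage1(page_table: dict, va: int) -> int:
    """Host page tables: virtual page number -> host physical frame number."""
    return page_table[va // PAGE] * PAGE + va % PAGE


def bridge_translate(window: Window, host_pa: int) -> int:
    """Host bridge: linear range translation, a constant offset per window."""
    offset = host_pa - window.host_base
    if not 0 <= offset < window.size:
        raise ValueError("host physical address is outside the CXL window")
    return window.device_base + offset


host_a_pt = {0x7F00_0000 // PAGE: 0x8000_0000 // PAGE}
host_b_pt = {0xAB00_0000 // PAGE: 0xC000_0000 // PAGE}
WINDOW_A = Window(0x8000_0000, 0x4000_0000, 0x0000_0000)
WINDOW_B = Window(0xC000_0000, 0x4000_0000, 0x0000_0000)  # size assumed

dev_a = bridge_translate(WINDOW_A, stage1(host_a_pt, 0x7F00_0040))
dev_b = bridge_translate(WINDOW_B, stage1(host_b_pt, 0xAB00_0040))
print(hex(dev_a), hex(dev_b))
```

Both hosts arrive at the same device address from different virtual and host physical addresses, and neither host's page tables contain any reference to the other's mapping.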

The analogy to EPT/NPT is instructive but imperfect. In the virtualisation case, both stages are full multi-level page tables that the hardware walks in sequence on every TLB miss, and the resulting TLB entries (VPID-tagged on x86-64) hold the fully composed GVA-to-HPA translation, so neither stage is re-walked on a TLB hit. In the CXL Shared Memory case, stage 1 is a full page table hierarchy managed by the host OS, while stage 2 is a coarse-grained linear address range mapping managed by the host bridge. The composition is performed in hardware at the host bridge, transparently to the CPU and OS. The TLB contains Host-A-PA after stage 1, not CXL-Device-PA; the stage-2 translation occurs in the host bridge hardware after the CPU has finished its page walk and produced a physical address.

Coherence across the fabric: CXL 3.0 Shared Memory implements coherence using an extended MESI protocol managed by the CXL switch's Snoop Filter. The Snoop Filter tracks, for each CXL Device Physical Address range, which hosts currently hold cached copies and in what coherence state. When Host A reads from CXL-Device-PA 0x0000_0000, the Snoop Filter records that Host A has the address in Shared (S) state. When Host B reads the same address, the Snoop Filter records that both Host A and Host B have it in Shared state. If Host A wishes to modify the data, it must first acquire Exclusive (E) state by requesting that the Snoop Filter invalidate Host B's cached copy. This is, in effect, a back-invalidation directed at Host B's caches for the affected range: it removes cached data, whereas the Back-Invalidation of Section 19.7 removes cached translations. Only after Host B has acknowledged the invalidation can the Snoop Filter grant Host A exclusive ownership.
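
The read and ownership transitions in this paragraph can be sketched as a toy directory. This is an illustrative model only: the class and method names are hypothetical, the Modified state and write-backs are omitted, and acknowledgements are assumed to arrive immediately.

```python
class SnoopFilter:
    """Toy directory: tracks sharers and an exclusive owner per device address."""

    def __init__(self):
        self.sharers = {}  # device PA -> set of host names holding it Shared
        self.owner = {}    # device PA -> host name holding it Exclusive

    def read(self, host: str, addr: int) -> None:
        if self.owner.get(addr) == host:
            return  # owner already holds the line
        holders = self.sharers.setdefault(addr, set())
        previous = self.owner.pop(addr, None)
        if previous is not None:
            holders.add(previous)  # exclusive owner is downgraded to Shared
        holders.add(host)

    def acquire_exclusive(self, host: str, addr: int) -> list:
        """Invalidate every other holder, then grant Exclusive to host.
        Returns the hosts that had to be invalidated."""
        others = self.sharers.get(addr, set()) - {host}
        previous = self.owner.get(addr)
        if previous is not None and previous != host:
            others.add(previous)
        self.sharers[addr] = set()
        self.owner[addr] = host
        return sorted(others)

    def state(self, host: str, addr: int) -> str:
        if self.owner.get(addr) == host:
            return "E"
        if host in self.sharers.get(addr, set()):
            return "S"
        return "I"


sf = SnoopFilter()
sf.read("A", 0x0)
sf.read("B", 0x0)
shared_states = (sf.state("A", 0x0), sf.state("B", 0x0))
invalidated = sf.acquire_exclusive("A", 0x0)
final_states = (sf.state("A", 0x0), sf.state("B", 0x0))
print(shared_states, invalidated, final_states)
```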

Hardware with full CXL 3.0 Shared Memory support is in early production deployment as of 2025–2026. The Linux kernel's CXL Shared Memory subsystem, under active development in the drivers/cxl/ tree, provides the host OS side of the protocol. Full multi-host Shared Memory with coherence across more than two hosts is the frontier of current CXL deployment; the two-host case described here is the currently deployed baseline.

19.9 Production Deployments and Performance Engineering

The preceding sections have established the mechanisms; this section grounds them in measured production reality. CXL is no longer a research prototype technology — it is shipping in hyperscale datacenters, AI training clusters, and cloud platforms. The measurements below come from peer-reviewed publications and production deployments that have documented their results.

19.9.1 Azure Pond: CXL Memory Pooling at Cloud Scale

Microsoft Azure's Pond system, published at ASPLOS 2023 by Li, Berger, et al., represents the most thoroughly characterised production CXL deployment. Pond pools DDR5 DRAM from an External Memory Controller (EMC) across 8–16 dual-socket servers connected via CXL, creating a shared memory pool of 1–4 TB that is presented to each server as a cpuless NUMA node.

Pond's latency analysis establishes the key numbers for this chapter. At a pool size of 8–16 sockets, the CXL link adds 70–90 ns of additional latency over same-NUMA-node DRAM access. At rack scale (requiring PCIe retimers to extend the signal path), latency increases to 180 ns or more. The workload results that follow were obtained on production servers under emulated CXL latency rather than over native CXL links: Intel Skylake-generation servers predate PCIe 5.0 and CXL, so on such hosts the pool latency has to be emulated.

The workload analysis is equally important. Pond profiled 158 production Azure workloads — virtual machine instances running a variety of cloud applications — under emulated CXL latencies. At +64 ns overhead (representing an 8–16 socket pool), 43% of workloads performed within 5% of their baseline performance on fully local DRAM. At +140 ns (representing a larger pool), 37% remained within 5%. More than 21% of workloads suffered greater than 25% performance degradation at any pool size. These 21% are the latency-sensitive workloads — interactive databases, real-time analytics, low-latency serving — that Pond's ML-based pre-placement model identifies and keeps on local DRAM.

Pond's cost model shows that 7–9% reduction in DRAM cost per server is achievable through pooling-enabled utilisation improvements. At Azure's scale, this represents a significant absolute cost saving — the CXL infrastructure investment is recovered in DRAM savings within a deployment generation.

19.9.2 DirectCXL: Load/Store vs Page-Based Disaggregation

The DirectCXL system from KAIST (USENIX ATC 2022, Gouk et al.) was the first published implementation of CXL 2.0 on real hardware, using an FPGA-based CXL controller and custom Type 3 device. DirectCXL's central contribution to this chapter's discussion is its comparison between two models of memory disaggregation: page-based access (which uses the OS virtual memory infrastructure and incurs page fault overhead) and direct load/store access (which exposes the CXL memory as HDM and accesses it through normal CPU load/store instructions).

The page-based model — where kswapd manages disaggregated memory through the conventional page fault and swap mechanism — incurs substantial overhead from page fault handling, I/O amplification (entire 4 KB pages transferred even for small reads), and context switching. DirectCXL's load/store model eliminates all of this: the CPU issues load instructions directly to HDM addresses, the CXL host bridge forwards the requests to the device, and data returns without any software involvement on the critical path.

The measured latency difference is dramatic. A 64-byte read over RDMA (the previous state of the art for memory disaggregation) costs 2,705 CPU cycles total — 2,129 of which are consumed by the InfiniBand protocol stack software. The same read over CXL costs 328 CPU cycles. This 8.2× latency advantage comes almost entirely from protocol software elimination; CXL's hardware-managed load/store path bypasses the entire network software stack.
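
The ratios quoted above follow directly from the three cycle counts. The sketch below only reproduces that arithmetic; the variable names are hypothetical.

```python
# Cycle counts for a 64-byte read, as quoted in the text.
RDMA_TOTAL_CYCLES = 2705
RDMA_SOFTWARE_CYCLES = 2129
CXL_TOTAL_CYCLES = 328

speedup = RDMA_TOTAL_CYCLES / CXL_TOTAL_CYCLES
software_share = RDMA_SOFTWARE_CYCLES / RDMA_TOTAL_CYCLES
rdma_non_software = RDMA_TOTAL_CYCLES - RDMA_SOFTWARE_CYCLES

print(f"CXL advantage: {speedup:.1f}x")
print(f"RDMA software share: {software_share:.0%}")
print(f"RDMA cycles outside the software stack: {rdma_non_software}")
```

The software stack accounts for roughly four fifths of the RDMA cost, which is why removing it explains most of the gap.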

This comparison is important for understanding why CXL is the correct architecture for memory disaggregation rather than RDMA or software-defined memory. RDMA achieves low latency only by pinning memory and bypassing the OS page allocator — which sacrifices the very flexibility that memory disaggregation is supposed to provide. CXL achieves sub-RDMA latency while allowing the OS to allocate, migrate, and reclaim CXL-backed pages through the standard virtual memory machinery.

19.9.3 Meta TPP: Transparent Page Placement

Meta's Transparent Page Placement (TPP) mechanism, published at ASPLOS 2023, addresses the AutoNUMA pitfall described in Section 19.3 for Meta's production server fleet. Meta's servers run a combination of compute-intensive workloads (ML training, data processing) and latency-sensitive workloads (social media serving, real-time analytics), which have very different CXL tolerance profiles even when running on the same physical host.

TPP introduces CXL tier classification into the Linux page migration logic via hardware PEBS sampling. Every page on the system is classified as hot, warm, or cold based on its PEBS access frequency over a measurement window. Hot pages are maintained in local DRAM; warm pages are candidates for either tier depending on DRAM pressure; cold pages are eligible for CXL demotion. Critically, TPP does not use the existing AutoNUMA accessed-bit scan for this classification — the scan period is too long to distinguish workload-relevant access patterns. PEBS sampling provides sub-millisecond resolution at roughly 1% CPU overhead.

In production at Meta, TPP achieves a DRAM cost reduction comparable to Pond's 7–9% while maintaining tail latency (P99 request latency) within 5% of the baseline for the latency-sensitive serving workloads. The key enabling insight is that serving workloads and batch processing workloads have naturally complementary working-set sizes: the serving workload's hot working set fits comfortably in local DRAM, while the batch workload's cold data can occupy the CXL tier without performance impact.

19.9.4 TLB Thrashing on CXL: The Hash Join Problem

A 2024 study published at the VLDB ADMS workshop provides the most direct measurement of CXL's interaction with TLB behaviour under analytical database workloads. The study evaluated radix hash join — a standard database operator that is highly sensitive to TLB miss rate because it accesses memory with a stride determined by the hash function, which is effectively random at the TLB granularity.

At high partition fanout values (where the hash table exceeds TLB coverage), radix hash join suffers severe TLB thrashing. The study measured that DRAM:CXL interleaving (alternating allocations between local DRAM and CXL memory) delivers higher bandwidth than CXL-only because the local DRAM's lower latency absorbs the write-amplified stores that are most sensitive to CXL's store penalty. However, NUMA memory (remote DRAM across a cross-socket link) consistently outperformed CXL memory for the write-heavy phases of radix join, because CXL's store latency (~500 ns round trip) is worse than cross-socket DRAM (~160 ns) for the highly random write patterns that TLB thrashing causes.

The practical recommendation from this study — applicable to any database or analytics operator that exceeds TLB coverage — is to place the write-heavy phase (the scatter/build phase of hash join) in local DRAM or NUMA-local memory, reserving CXL for the read-heavy probe phase where the latency differential has smaller impact. For ML workloads, the analogous recommendation is to place attention score matrices (written during forward pass, read during backward pass) in local DRAM, while placing embeddings and frozen model layers (read-only during inference) in CXL.

19.9.5 AI/ML Workloads: KV Cache and ANNS

Two AI workload patterns have been specifically studied in the context of CXL: LLM inference KV cache management and approximate nearest neighbour search (ANNS) over billion-scale embeddings.

In LLM inference, the KV cache is the memory structure that grows with context length, storing the key and value tensors for each token in the current context window. For long-context serving (32K–128K context length), the KV cache can dominate GPU and host memory consumption. CXL-attached memory provides a natural overflow tier. The CXL-ANNS work from USENIX ATC 2023 demonstrated that pointer-chasing access patterns — common in graph-based ANNS indices — can be prefetched across CXL links effectively, because the fixed fanout of graph traversal provides sufficient structure for hardware prefetchers to predict the next CXL access before the current one completes. The measured throughput for ANNS over CXL-disaggregated indices is within 20–30% of the DRAM-resident case at much lower cost per query.

For LLM inference specifically, the KV cache exhibits a bimodal access pattern: recently generated tokens are accessed every decode step (hot), while tokens from the beginning of long contexts are rarely re-accessed after their initial contribution to the attention computation (cold). This bimodality makes the KV cache an ideal CXL tiering candidate: keep the hot recent KV cache in local DRAM (or GPU HBM), demote the cold long-context prefix to CXL, and promote only when a long-context lookup requires the old tokens. Systems like MemVerge's memory machine, which XConn Technologies demonstrated at SC25 using CXL switches, deliver KV cache overflow to CXL that reduces total inference memory cost by 30–50% for context lengths above 32K tokens.
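
A minimal sketch of the recency-based split described above, assuming a fixed hot window measured in tokens. The function name and the window size are hypothetical; a real serving system would place blocks of tokens rather than single positions and would promote old blocks on demand.

```python
def place_kv_blocks(num_tokens: int, hot_window: int) -> list:
    """Assign a tier to each token position: the most recent hot_window
    tokens stay in 'dram', everything older goes to 'cxl'."""
    boundary = max(0, num_tokens - hot_window)
    return ["cxl" if pos < boundary else "dram" for pos in range(num_tokens)]


placement = place_kv_blocks(num_tokens=8, hot_window=3)
print(placement)
```

As the context grows, the boundary advances with it, so the DRAM footprint stays bounded by the window while the CXL footprint grows with context length.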

19.9.6 Five Production Tuning Rules

Based on the measurements and deployments described above, five operational rules summarise correct CXL memory system configuration:

Rule 1: Pin page tables in local DRAM. Set vm.demote_page_tables=0 and verify that the memory tiering daemon's demotion policy excludes pages with the PG_table flag. Violation causes 6× PTW latency degradation.

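Rule 2: Use 2 MB huge pages for all CXL-resident data. A TLB miss on a CXL-backed 4 KB page costs ~600 ns once the page walk and the CXL data access are both counted. A 2 MB huge page covers 512× more memory per TLB entry, so the same data needs far fewer entries and misses far less often. Because every avoided miss saves more time when the data sits behind a CXL link, the huge page investment is worth roughly 50% more on CXL than on local DRAM.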

Rule 3: Pre-classify workloads before CXL allocation. Use hardware counter profiles to identify latency-sensitive workloads and constrain their memory allocations to local DRAM (numactl --membind=0,1 or equivalent cgroup memory policy). Batch and cold-data workloads can use CXL freely.

Rule 4: Set kernel.numa_balancing=2 (/proc/sys/kernel/numa_balancing) on Linux 6.x. This activates the memory-tiering mode of NUMA balancing. Mode 1, the ordinary NUMA balancing setting, treats CXL as an ordinary NUMA node and will incorrectly migrate latency-sensitive pages to the CXL tier.

Rule 5: Monitor TLB miss rate per NUMA node. Collect dTLB-load-misses and dTLB-store-misses with perf stat and break the counts down by the NUMA node that backs each workload's memory. A workload whose memory sits on the CXL node and whose TLB miss rate is high is a strong candidate for promotion to local DRAM or for a larger huge page allocation.
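Part of this checklist can be audited mechanically. The sketch below covers only the NUMA balancing mode from Rule 4 and the detection of cpuless nodes. It assumes the standard Linux locations /proc/sys/kernel/numa_balancing and /sys/devices/system/node/node*/cpulist; the function names and the `root` parameter (which lets the check run against a copied or mocked filesystem tree) are illustrative.

```python
from pathlib import Path
from typing import Optional


def numa_balancing_mode(root: str = "/") -> Optional[int]:
    """Return the NUMA balancing mode, or None if the kernel does not expose it."""
    path = Path(root) / "proc/sys/kernel/numa_balancing"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def cpuless_nodes(root: str = "/") -> list[int]:
    """Return NUMA node IDs with an empty cpulist (how CXL expanders appear)."""
    base = Path(root) / "sys/devices/system/node"
    nodes = []
    for cpulist in base.glob("node[0-9]*/cpulist"):
        if cpulist.read_text().strip() == "":
            nodes.append(int(cpulist.parent.name[len("node"):]))
    return sorted(nodes)


def check_rule_4(root: str = "/") -> list[str]:
    """Return warnings; an empty list means nothing was flagged."""
    far = cpuless_nodes(root)
    mode = numa_balancing_mode(root)
    if far and mode != 2:
        return [f"cpuless nodes {far} present but numa_balancing={mode}; "
                "Rule 4 expects 2"]
    return []


if __name__ == "__main__":
    for line in check_rule_4() or ["no issues flagged"]:
        print(line)
```

A cpuless node is not necessarily CXL memory, so a warning here is a prompt to inspect the topology rather than proof of misconfiguration.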

[Figure 19.9 diagram: Pond production CXL pool topology (Azure, ASPLOS 2023). Panels: Servers 1 to 8 (or 16), each dual-socket with 512 GB of local DRAM at ~80 ns access latency, where page tables must stay; External Memory Controller (EMC), multi-headed, connecting 8–16 CPU sockets to DDR5 DRAM channels; CXL switch fabric over PCIe 5.0; CXL pool memory of 1–4 TB total, built from DDR5 banks of 128–256 GB per node. Annotations: +70–90 ns versus NUMA-local DRAM for 8–16 socket pools; +180 ns or more at rack scale with retimers; 7–9% DRAM cost reduction per server; 43% of 158 Azure workloads within 5% of local-DRAM performance at +64 ns; more than 21% of workloads lose over 25% and must stay in local DRAM; OS model is a cpuless NUMA node compatible with existing Linux NUMA APIs.]
Figure 19.9: Azure Pond CXL Pool Topology (ASPLOS 2023, Microsoft Research). Eight to sixteen dual-socket servers connect to an External Memory Controller (EMC) via PCIe 5.0 CXL links. The EMC aggregates DDR5 memory from multiple banks into a shared pool of 1–4 TB, presented to all servers as a cpuless NUMA node. At 8–16 socket pool size, CXL adds only 70–90 ns of latency over same-NUMA-node DRAM — small enough that 43% of production Azure workloads run within 5% of their local-DRAM performance while the pool achieves 7–9% DRAM cost reduction.
CXL Production Deployments: Performance Engineering Comparison

Azure Pond (ASPLOS 2023)
  System type: CXL memory pooling
  Hardware: 8–16 dual-socket servers; PCIe 5.0 CXL; DDR5 EMC pool
  Latency overhead: +70–90 ns (8–16 sockets); +180 ns at rack scale (retimers)
  Workload fit: 43% of 158 workloads within 5% at +64 ns overhead; 21% suffer >25% degradation
  TLB / MMU impact: TLB walk identical; physical address cost of +70–90 ns on LLC miss
  Cost model: 7–9% DRAM cost reduction via pooling utilisation gains
  Status: Production, Azure hyperscale fleet

DirectCXL (USENIX ATC 2022)
  System type: CXL load/store disaggregation
  Hardware: FPGA CXL 2.0 controller; custom Type 3 device; PCIe 4.0
  Latency overhead: +100–120 ns (FPGA); load/store has lower overhead than page-based access
  Workload fit: Pointer-chase: near-native bandwidth. Sequential: line-rate saturation. Latency-sensitive: noticeable cost
  TLB / MMU impact: No TLB involvement (load/store bypasses page table for mapped ranges)
  Cost model: Eliminates stranded DRAM; disaggregation enables hot-swap
  Status: Research prototype, FPGA CXL 2.0

Meta TPP (ASPLOS 2023)
  System type: OS-level memory tiering
  Hardware: Meta production servers; CXL Type 3 plus local DRAM tiering
  Latency overhead: +60–150 ns (tiered); hot pages promoted to local DRAM
  Workload fit: Memory-capacity workloads ideal; inference serving and key-value stores; real-time OLTP needs local DRAM
  TLB / MMU impact: Promotion/demotion involves TLB shootdown on page migration
  Cost model: 30–50% of DRAM on CXL tier; large-capacity, lower-cost footprint
  Status: Production, Meta datacenters

Workload × Deployment Fit Matrix (Azure Pond / DirectCXL / Meta TPP)
  LLM inference serving: Good / Good / Ideal
  Key-value store (Redis): Good / Marginal / Good
  OLTP / interactive DB: Poor / Poor / Poor
  Batch analytics / ETL: Ideal / Good / Good
  GPU/NPU model weights: Marginal / Good / Marginal
  Real-time ML serving: Poor / Marginal / Poor
Figure 19.10: Production CXL deployment comparison: Azure Pond (memory pooling at cloud scale), DirectCXL (load/store disaggregation, KAIST), and Meta TPP (OS-level memory tiering). Latency overhead, workload fit, TLB/MMU impact, and cost model differ substantially across approaches. Workload × deployment fit matrix shows that latency-sensitive OLTP performs poorly on all three, while batch analytics and LLM inference serving are well-suited candidates for CXL memory expansion.

19.10 Chapter Summary

CXL (Compute Express Link) is the first production interconnect technology that extends the CPU's physical address space beyond the local memory controller while preserving the complete virtual-to-physical translation machinery unchanged. This chapter has examined what that means in practice for every layer of the memory management stack, from the hardware protocol to the Linux kernel to production deployment configuration.

The protocol architecture (Section 19.2) established the three sub-protocols — CXL.io for device management, CXL.cache for device-side TLBs and caching, CXL.mem for load/store access to device-attached DRAM — and the three device types they enable. Type 3 memory expanders, which use only CXL.io and CXL.mem, are the practical foundation of memory disaggregation: pure DRAM capacity exposed as Host-managed Device Memory (HDM) in the host's physical address space.

The OS model (Section 19.3) showed that Linux presents CXL memory as a cpuless NUMA node: a new entry in the system's NUMA topology with a measured access latency that ACPI HMAT communicates to the kernel. The critical configuration requirements (setting kernel.numa_balancing=2 to activate CXL-aware tiering, and configuring kswapd reclaim direction correctly) are not applied by default installations and must be set explicitly.

The central mechanistic insight of the chapter (Section 19.4) is that the translation mechanism — CR3-rooted page walks on x86-64, TTBR-rooted walks on ARM64 — is entirely unchanged by CXL. A PTE that maps to a CXL-backed physical address is byte-for-byte identical to a PTE mapping local DRAM. The TLB entry is identical. The page walker is oblivious. The cost difference — 80 ns for local DRAM versus 150–380 ns for CXL — appears entirely after the physical address is produced, in the fabric traversal from the PCIe root complex to the CXL device and back. Azure's Pond measured +70–90 ns for 8–16 socket pools and +180 ns for rack-scale configurations; 43% of 158 production workloads tolerated +64 ns overhead within 5% performance degradation.

The page table pinning constraint (Section 19.5) is the most operationally dangerous aspect of CXL deployment. If the kernel's memory tiering infrastructure demotes page table pages to the CXL tier — which it will do by default unless explicitly prevented — every TLB miss incurs a page table walk against CXL-resident memory, costing ~2,000 ns instead of ~320 ns. The 6.25× PTW slowdown is silent: no hardware alert, no OS error, only degraded performance that may be misdiagnosed for weeks. The correct configuration is vm.demote_page_tables=0 combined with memory tier exclusion of pages with the PG_table flag.
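The 6.25× figure can be reproduced with a simple model. The sketch below is one reading of the ~320 ns and ~2,000 ns numbers, under stated assumptions: a four-level walk in which every level misses the caches and goes to memory, 80 ns per local DRAM access, and 500 ns per CXL access (the upper end of the 200–500 ns range quoted in Section 19.1). The function name is illustrative.

```python
def walk_latency_ns(levels: int, per_level_ns: float) -> float:
    """Latency of a page walk in which every level is fetched from memory."""
    return levels * per_level_ns


LEVELS = 4            # four-level x86-64 walk
LOCAL_DRAM_NS = 80    # local DRAM access latency used in this chapter
CXL_NS = 500          # assumed: upper end of the 200-500 ns CXL range

local = walk_latency_ns(LEVELS, LOCAL_DRAM_NS)
cxl = walk_latency_ns(LEVELS, CXL_NS)
print(f"local walk {local:.0f} ns, CXL walk {cxl:.0f} ns, "
      f"slowdown {cxl / local:.2f}x")
# prints: local walk 320 ns, CXL walk 2000 ns, slowdown 6.25x
```

Page walk caches and cached PTE lines shorten real walks on both tiers, so these are worst-case figures; the ratio is what matters for the pinning argument.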

Memory tiering mechanics (Section 19.6) examined the hot/cold classification methods — PEBS sampling (highest accuracy), PTE accessed bit scanning (standard AutoNUMA), and working-set estimation — and the migration cost model. Each 4 KB page migration costs 10–50 µs of TLB shootdown overhead, creating a minimum access frequency threshold below which migration to DRAM is not profitable. THP demotion — which requires splitting a 2 MB huge page into 512 × 4 KB pages before migrating to CXL — is a particular source of latency spikes at scale.
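The profitability threshold can be made concrete with a break-even calculation. This is a simplified sketch: it charges only the 10–50 µs shootdown overhead quoted above, counts only accesses that actually reach memory, and uses the 80 ns and 150–380 ns latencies from the Section 19.4 summary. The function name is illustrative.

```python
import math


def break_even_accesses(migration_cost_ns: float, far_ns: float,
                        near_ns: float) -> int:
    """Memory accesses needed after promotion to recover the migration cost."""
    saving = far_ns - near_ns
    if saving <= 0:
        raise ValueError("far tier must be slower than near tier")
    return math.ceil(migration_cost_ns / saving)


# Best case: cheap migration (10 us), slow CXL (380 ns) vs 80 ns local DRAM.
print(break_even_accesses(10_000, far_ns=380, near_ns=80))   # 34
# Worst case: expensive migration (50 us), fast CXL (150 ns) vs 80 ns.
print(break_even_accesses(50_000, far_ns=150, near_ns=80))   # 715
```

Under these assumptions a page must take somewhere between a few dozen and several hundred memory-reaching accesses after promotion before the migration pays for itself, which is why rarely touched pages are better left on the CXL tier.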

CXL 3.0 Back-Invalidation (Section 19.7) introduced the first device-to-CPU TLB invalidation protocol described in this book. In Shared Memory configurations, the CXL Fabric Manager can initiate a TLB shootdown on a host to revoke access to a memory region being reassigned to another host. The directional inversion — device initiates, CPU responds — requires a registered back-invalidate handler in the host OS and costs 50–200 µs per invocation.

CXL 3.0 Shared Memory (Section 19.8) revealed a two-stage address translation structure structurally analogous to the EPT/NPT nested page tables analysed in Chapter 3. Each host maintains its own stage-1 page tables mapping virtual addresses to host physical addresses. A second stage, implemented in the CXL host bridge hardware, translates host physical addresses to a unified CXL device physical address space. The CXL switch's Snoop Filter enforces MESI coherence across hosts, tracking which hosts have cached which device physical addresses and triggering back-invalidation when exclusive ownership must be transferred.
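A toy model makes the two-stage structure easier to see. The sketch below is illustrative only: it models both stages as page-granular dictionaries, whereas a real host bridge decodes address ranges in hardware, and the class and function names are hypothetical.

```python
PAGE = 4096


def translate(table: dict[int, int], addr: int) -> int:
    """Map an address through a page-granular table, keeping the page offset."""
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise KeyError(f"no mapping for page {page:#x}")
    return table[page] * PAGE + offset


class Host:
    def __init__(self, stage1: dict[int, int], stage2: dict[int, int]):
        self.stage1 = stage1   # VA page -> host PA page (OS page tables)
        self.stage2 = stage2   # host PA page -> device PA page (host bridge)

    def va_to_dpa(self, va: int) -> int:
        """Compose both stages: virtual -> host physical -> device physical."""
        return translate(self.stage2, translate(self.stage1, va))


# Two hosts use different virtual and host physical pages for the same
# shared device page 0x42.
host_a = Host(stage1={0x10: 0x8000}, stage2={0x8000: 0x42})
host_b = Host(stage1={0x77: 0x9000}, stage2={0x9000: 0x42})

dpa_a = host_a.va_to_dpa(0x10 * PAGE + 0x123)
dpa_b = host_b.va_to_dpa(0x77 * PAGE + 0x123)
assert dpa_a == dpa_b == 0x42 * PAGE + 0x123
```

Because both hosts resolve to the same device physical address, the fabric needs a coherence mechanism such as the Snoop Filter described above; neither host's stage-1 tables reveal that the page is shared.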

Production measurements (Section 19.9) grounded all mechanisms in empirical reality: DirectCXL's 8.2× latency advantage over RDMA (328 vs 2,705 cycles for a 64-byte read); Pond's 7–9% DRAM cost reduction at +70–90 ns CXL latency; Meta's TPP achieving within 5% tail latency for serving workloads while reducing DRAM cost; the hash-join TLB thrashing study showing NUMA memory outperforming CXL for write-heavy random-access patterns; and the emerging KV cache offloading use case for LLM inference, where 30–50% memory cost reduction is achievable for contexts above 32K tokens.

Four architectural connections to earlier chapters deserve explicit restatement. First, Chapter 3's two-stage translation theory for virtualisation is not merely analogous to CXL 3.0 Shared Memory — it is the same mathematical structure. The mapping Host-A-VA → Host-A-PA → CXL-Device-PA is isomorphic to Guest-VA → Guest-PA → Host-PA. Chapter 3 describes the theory; Chapter 19 describes a novel production instantiation. Second, Chapter 9's NUMA-aware page table management guidance becomes mandatory rather than advisory under CXL: page table placement is no longer a performance optimisation but a correctness constraint. Third, Chapter 17's PTW latency model receives a new tier — CXL-resident page tables at ~2,000 ns — that represents a category of failure rather than a design point. Fourth, Chapter 12's multi-GPU TLB shootdown scale analysis is complemented here by a new direction of shootdown initiation that no prior chapter described.

The horizon beyond this chapter includes CXL 4.0 (November 2025), which doubles the link transfer rate to 128 GT/s and extends memory pooling to multi-rack topologies. At that scale, the latency model changes: rack-to-rack CXL traversal, with multiple retimer stages and switch hops, may add 500 ns or more of latency, deep into the territory where only the coldest data tiers can tolerate the access cost. The page table pinning constraint becomes even more critical as pool size grows. The two-stage translation model of CXL 3.0 Shared Memory will also expand to multi-host configurations spanning tens of servers, requiring more sophisticated Snoop Filter designs and potentially software-managed coherence for the largest pools. These directions point naturally toward the next chapter's subject: the hypervisor MMU internals that manage virtual machines across the same kind of two-stage address translation hierarchy, one that CXL 3.0 has shown to be as relevant to disaggregated memory as it is to virtualisation.

References

  1. CXL Consortium. Compute Express Link Specification, Revision 3.0. CXL Consortium, 2022. https://www.computeexpresslink.org/

  2. CXL Consortium. Compute Express Link Specification, Revision 4.0. CXL Consortium, November 2025. https://www.computeexpresslink.org/

  3. Das Sharma, D., Blankenship, R., and Berger, D.S. "An Introduction to the Compute Express Link (CXL) Interconnect." ACM Computing Surveys, 2024. DOI: 10.1145/3669900

  4. Gouk, D., Lee, S., Kwon, M., and Jung, M. "Direct Access, High-Performance Memory Disaggregation with DirectCXL." In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC '22), pp. 287–294, 2022. https://www.usenix.org/conference/atc22/presentation/gouk

  5. Jang, J., Choi, H., Bae, H., Lee, S., Kwon, M., and Jung, M. "CXL-ANNS: Software-Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search." In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC '23), 2023.

  6. Kim, J., Choe, W., and Ahn, J. "Exploring the Design Space of Page Management for Multi-Tiered Memory Systems." In Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC '21), 2021.

  7. Li, H., Berger, D.S., Hsu, L., Ernst, D., Zardoshti, P., Novakovic, S., Shah, M., Rajadnya, S., Lee, S., Agarwal, I., Hill, M.D., Fontoura, M., and Bianchini, R. "Pond: CXL-Based Memory Pooling Systems for Cloud Platforms." In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '23), Volume 2, pp. 574–587, 2023. DOI: 10.1145/3575693.3578835

  8. Liang, Y., Huang, T., Chen, K., et al. "Innovation in Computational Architecture: Opportunities and Challenges of CXL Memory Disaggregation Technology in Intelligent Computing Centers." Tsinghua Science and Technology, vol. 31, no. 4, pp. 2020–2039, 2026. DOI: 10.26599/TST.2025.9010010

  9. Al Maruf, H., Wang, H., Dhanotia, A., Weiner, J., Agarwal, N., Bhattacharya, P., Petersen, C., Chowdhury, M., Kanaujia, S., and Chauhan, P. "TPP: Transparent Page Placement for CXL-Enabled Tiered Memory." In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '23), 2023. arXiv:2206.02878

  10. Pan, Y., Lala, Y., Unal, M., Ren, Y., Lee, S., Bhattacharjee, A., Khandelwal, A., and Kashyap, S. "Scalable Far Memory: Balancing Faults and Evictions." In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP '25), 2025.

  11. PCI-SIG. Address Translation Services Revision 1.1. PCI Special Interest Group, 2020.

  12. Tang, Y., Lee, S., Bhattacharjee, A., and Khandelwal, A. "pulse: Accelerating Distributed Pointer-Traversals on Disaggregated Memory." In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '25), 2025.

  13. Wang, S., et al. "Bandwidth Expansion via CXL: A Pathway to Accelerating In-Memory Analytical Processing." In Fifteenth International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS '24), co-located with VLDB 2024, 2024.

  14. Zhang, R., et al. "Next-Gen Computing Systems with Compute Express Link: A Comprehensive Survey." arXiv preprint arXiv:2412.20249, 2025.