Chapter 17: Page Table Walker Microarchitecture

17.1 Introduction: The Hidden Machine Inside the MMU

Every TLB miss sets a hardware machine in motion. That machine, the page table walker (PTW), traverses a multi-level radix tree stored in physical memory, extracts the physical frame number and permission bits from the leaf entry, and installs the result into the TLB. From the software perspective this happens invisibly and automatically. From the microarchitecture perspective, it is one of the most performance-critical circuits on a modern processor. A single cold page table walk on a DDR5 system can consume 720 to 1,000 cycles, equivalent to stalling a 4 GHz core for up to a quarter of a microsecond (180 to 250 ns) while waiting for four sequential DRAM fetches. On a workload with a TLB miss rate of one miss per hundred instructions executing at four instructions per cycle, the PTW consumes twelve percent of all available cycles. That is a heavier performance tax than L2 cache misses impose on most production server workloads.

Chapters 3 and 4 of this book explain what the page table walker does: which bits of the virtual address index each level of the page table tree, how page walk caches (PWC) eliminate redundant upper-level fetches, and what the resulting latency looks like from the software perspective. This chapter asks a fundamentally different question: how is the PTW physically implemented in hardware, and what are the microarchitectural properties that determine its throughput, its vulnerability surface, and the performance cliffs that application developers need to understand?

The answer matters across three dimensions. First, the number of Miss-Status Holding Registers (MSHRs) dedicated to concurrent page table walks sets a hard ceiling on PTW throughput that no software tuning can exceed — a fact that explains the catastrophic TLB miss penalties observed in large-memory AI/ML workloads and in-memory databases examined in Chapters 11 and 14. Second, the speculative execution properties of the PTW — specifically, whether the walker follows a P=0 (not-present) PTE speculatively before the permission check fires — are the root cause of the Foreshadow and L1 Terminal Fault (L1TF) vulnerabilities that required urgent microcode patches across the entire Intel installed base in 2018, and continue to influence SGX deployment decisions. Third, the design split between x86-64's microcode-assisted finite-state machine and ARM64's pure-hardware translation table walker has practical implications for how security patches are deployed: Intel can distribute mitigation through a microcode update; ARM implementors require either hardware revision or software workaround.

Understanding the PTW at this level of detail has become more urgent as memory footprints have grown faster than TLB capacity. In the x86-64 architecture, the DTLB held 64 entries in the original Pentium 4, still held 64 entries in Nehalem (2008) and Haswell (2013), and reaches 96 to 128 entries in current Raptor Lake (2023): a factor of approximately 2× over 20 years. In the same period, DRAM capacity per server socket grew from 4 GB to 6 TB, a factor of roughly 1,500×. TLB coverage, measured as the fraction of a workload's working set that can be mapped in the TLB simultaneously, has therefore collapsed by nearly three orders of magnitude for large-memory workloads. The consequence is that the PTW, which once handled a rare exception case, is now in the critical path for a substantial fraction of all memory accesses in cloud and AI/ML workloads. This shift has driven the addition of four-walker designs (AMD Zen 4), larger STLB (second-level TLB) arrays, and dedicated Page Walk Cache capacity in every major server processor family introduced since 2020.

The performance stakes are asymmetric across workload classes. Web services and application servers, which operate predominantly within their TLB reach, rarely see PTW-related overhead above two percent. Databases operating at the boundary of TLB reach (a common condition for in-memory databases on servers with 384 GB to 2 TB of RAM) can see PTW overhead of 10 to 20 percent, consistent with the 15 to 25 percent overhead reduction observed by Basu et al. when switching from 4 KB to 2 MB pages on graph analytics workloads. Large language model inference with batch sizes that require materialising multi-hundred-gigabyte KV caches represents the most extreme case: the address translation overhead for a 70-billion parameter model can exceed 30 percent of wall clock time on systems without huge page support, a finding that has driven the adoption of the transparent huge page (THP) "always" kernel setting in production LLM inference clusters.

This chapter is structured as follows. Section 17.2 dissects the PTW pipeline stages for an x86-64 four-level walk, establishing a concrete latency model for each stage. Section 17.3 covers page walk caches — the structures that eliminate upper-level fetches and are the single largest performance lever available to the PTW. Section 17.4 analyses Miss-Status Holding Registers and the concurrent walker design, including the coalescing behaviour that allows multiple TLB misses to the same page to share a single walk. Section 17.5 examines speculative page table walks and their security implications in depth. Section 17.6 extends the analysis to PTW parallelism and throughput limits under workload pressure. Sections 17.7, 17.8, and 17.9 cover the x86-64, ARM64, and RISC-V implementations respectively, highlighting the design choices that distinguish them. Section 17.10 concludes with a synthesis of the key design trade-offs across x86-64, ARM64, and RISC-V, and a diagnostic framework for identifying and addressing PTW-related performance bottlenecks in production systems.

17.2 PTW Pipeline Stages and Latency

A hardware page table walk for a 4-level x86-64 virtual address proceeds through a fixed sequence of stages. Each fetch stage issues a memory read to fetch one level of the page table tree, checks the resulting entry for validity and permission bits, and computes the physical base address for the next level. The full pipeline, shown in Figure 17.1, has seven distinct stages, numbered 0 through 6.

Stage 0 — TLB miss detection

The load or instruction fetch unit detects that the virtual address has no matching entry in the TLB. In an out-of-order processor the TLB is accessed speculatively during the issue stage, so the miss may be detected before the load reaches the head of the retirement queue. The pipeline marks dependent instructions as blocked and raises an internal miss signal to the page table walker logic. Detection consumes one to four cycles depending on pipeline depth and whether the miss is in the instruction TLB (ITLB) or data TLB (DTLB).

Stage 1 — CR3 load

The walker reads CR3 (Control Register 3), which holds the physical base address of the PML4 table for the current address space. On x86-64, CR3 also carries the PCID (Process-Context Identifier) in bits 11:0 when CR4.PCIDE is set. The PML4 base address is computed as CR3[51:12] << 12. Because CR3 changes only on context switches, processors maintain a microarchitectural buffer that caches the current CR3 value and the current PML4 base, eliminating the latency of a register read in the common case.
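As a minimal sketch, the CR3 field split described above can be modelled in Python. The function name decode_cr3 and the example register value are hypothetical; only the bit positions come from the text.

```python
def decode_cr3(cr3: int, pcide: bool):
    """Return (pml4_base, pcid); pcid is None when CR4.PCIDE is clear."""
    frame = (cr3 >> 12) & ((1 << 40) - 1)    # CR3[51:12], a 40-bit field
    pml4_base = frame << 12                  # CR3[51:12] << 12
    pcid = (cr3 & 0xFFF) if pcide else None  # CR3[11:0] carries the PCID only with PCIDE set
    return pml4_base, pcid

# Example value chosen purely for illustration.
base, pcid = decode_cr3(0x0000_0001_2345_6007, pcide=True)
print(hex(base), pcid)
```

Masking to 40 bits also discards bit 63, which is an instruction-encoding flag for MOV to CR3 rather than part of the table base.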

Stages 2–5 — PML4E, PDPTE, PDE, PTE fetches

Each of the four page table levels requires exactly one 8-byte aligned load from physical memory. The physical address of each entry is computed by combining the base address from the previous level with the appropriate 9-bit index extracted from the virtual address. The four index fields are VA[47:39] for PML4, VA[38:30] for PDPT, VA[29:21] for PD, and VA[20:12] for PT. Each fetch is a cache-coherent read that traverses the full L1D/L2/LLC/DRAM hierarchy. Stages proceed sequentially within a single walk — the walker cannot issue the PDPTE fetch until it has the physical address from the PML4E fetch.
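The index arithmetic above can be sketched in a few lines of Python. This is an illustrative model only: the function names and the example address are hypothetical, and real hardware performs the same shifts and masks in wiring rather than software.

```python
ENTRY_BYTES = 8      # each page table entry is 8 bytes
INDEX_MASK = 0x1FF   # 9-bit index per level

def split_va(va: int) -> dict:
    """Split a 48-bit virtual address into four 9-bit indices and a page offset."""
    return {
        "pml4":   (va >> 39) & INDEX_MASK,  # VA[47:39]
        "pdpt":   (va >> 30) & INDEX_MASK,  # VA[38:30]
        "pd":     (va >> 21) & INDEX_MASK,  # VA[29:21]
        "pt":     (va >> 12) & INDEX_MASK,  # VA[20:12]
        "offset": va & 0xFFF,               # VA[11:0]
    }

def entry_address(table_base: int, index: int) -> int:
    """Physical address of the 8-byte entry at `index` in the table at `table_base`."""
    return table_base + index * ENTRY_BYTES

# Hypothetical user-space address, for illustration only.
fields = split_va(0x0000_7F12_3456_789A)
print({name: hex(value) for name, value in fields.items()})
```

Because each table is 4 KB aligned and holds 512 entries, the addition in entry_address never carries out of the low 12 bits of the base.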

After each fetch, the walker checks the Present bit (bit 0). If P=0, the walk terminates with a page fault. If the Reserved bits are non-zero, the walker generates a #PF with error code bit 3 set. The Accessed bit (bit 5) in non-leaf entries must be set by the walker if it is clear — this requires a locked read-modify-write cycle that introduces additional latency on the first access to a new mapping.
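The per-entry checks can be sketched as follows. This is a simplified model under stated assumptions: check_entry and reserved_mask are hypothetical names, the set of reserved bits is supplied by the caller because it depends on the level and the processor's physical address width, and only the reserved-bit portion of the fault error code is modelled.

```python
P_BIT = 1 << 0        # Present
A_BIT = 1 << 5        # Accessed
PF_ERR_RSVD = 1 << 3  # page-fault error code bit 3 (reserved-bit violation)

def check_entry(entry: int, reserved_mask: int):
    """Return (status, entry_after, pf_error_bits) for one fetched entry."""
    if not entry & P_BIT:
        return "page_fault", entry, 0            # P=0: the walk terminates
    if entry & reserved_mask:
        return "page_fault", entry, PF_ERR_RSVD  # a reserved bit is set
    # Real hardware performs this update as a locked read-modify-write.
    return "ok", entry | A_BIT, 0

print(check_entry(0x1001, reserved_mask=0))
```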

Stage 6 — TLB fill

Once the leaf PTE is validated, the walker installs the physical frame number and combined permission bits (U/S, R/W, NX, global) into the TLB. On Intel processors, the TLB fill takes one to two cycles. On AMD, the fill is similarly fast but may interlock with concurrent TLB invalidation instructions such as INVLPG from other threads sharing the same physical core.

Figure 17.1: x86-64 hardware PTW pipeline stages and per-stage latency table. Dashed red arrow shows the PWC shortcut path that skips upper-level fetches when page walk cache entries are present. All cycle counts at 4 GHz with DDR5 memory.

Latency model and huge page impact

The dominant cost of a TLB miss is the sum of cache hierarchy latencies across all four PxE fetches. Table 1 in Figure 17.1 shows representative cycle counts. The best case — all four PxEs resident in L1D — takes approximately 20 cycles. The worst case — all four PxEs missing the LLC — takes 720 to 1,000 cycles, entirely determined by DRAM access latency. The typical production case falls between: L2/LLC-hot page tables yielding 150 to 200 cycles.

Huge pages (2 MB pages in x86-64 terminology) reduce the walk to three stages by terminating at the PDE level. This eliminates the PTE fetch, saving one cache access. Under DRAM-miss conditions this saves 180 to 250 cycles per walk — a 25% reduction in worst-case latency. The PWC interaction with huge pages is analysed in Section 17.3.
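The latency model reduces to a sum of per-fetch costs, which the sketch below makes explicit. The DRAM range of 180 to 250 cycles per fetch follows the text; the 5-cycle L1D figure is an assumption derived from dividing the quoted 20-cycle best case across four fetches, and walk_cycles is a hypothetical helper name.

```python
# Per-fetch latency in cycles at 4 GHz.
L1D_FETCH = 5            # assumption: ~20 cycles / 4 fetches
DRAM_FETCH = (180, 250)  # range quoted in the text

def walk_cycles(fetches: int, per_fetch: int) -> int:
    """Latency of a walk whose fetches all resolve at the same level of the hierarchy."""
    return fetches * per_fetch

best_4k = walk_cycles(4, L1D_FETCH)                     # all four PxEs in L1D
cold_4k = tuple(walk_cycles(4, c) for c in DRAM_FETCH)  # four DRAM fetches
cold_2m = tuple(walk_cycles(3, c) for c in DRAM_FETCH)  # walk ends at the PDE
saving = 1 - cold_2m[0] / cold_4k[0]                    # fraction of cold latency removed

print(best_4k, cold_4k, cold_2m, saving)
```

A mixed case, such as upper levels hitting in L2 and the PTE missing to DRAM, would sum different per-fetch costs rather than multiplying one value.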

The five-level paging extension (PML5, enabling 57-bit virtual addresses) adds a fifth fetch stage for the PML5E lookup. This increases worst-case cold walk latency by a further 180 to 250 cycles, making the decision to enable LA57 a non-trivial performance trade-off for applications with large virtual address spaces but modest working sets. Intel documents LA57 support from Ice Lake (10th generation Core) onwards; AMD supports LA57 from Zen 4.

Measuring walk latency in practice

PTW latency can be measured directly on x86-64 using hardware performance counters (exact event names vary by microarchitecture generation). Intel's DTLB_LOAD_MISSES.WALK_DURATION counter accumulates the total number of cycles spent in active page table walks for the DTLB, and DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK counts the DTLB load misses that initiate a walk. Dividing the former by the latter gives the average walk latency in cycles, which can be compared against the model in Figure 17.1 to estimate the proportion of walks that are LLC-resident versus DRAM-resident. AMD provides equivalent counters in its Processor Programming Reference. Linux perf stat exposes miss counts through the generic dTLB-load-misses and iTLB-load-misses events; the walk-duration and more detailed walk-stage counters are available as raw PMU events. This empirical measurement is the starting point for any systematic PTW performance investigation and should precede any application-level optimisation attempt.
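The division described above is simple enough to sketch directly. The counter values below are invented for illustration, and the bucket boundaries in classify are rough assumptions loosely based on the latency ranges quoted in this section, not vendor guidance.

```python
def average_walk_cycles(walk_duration_cycles: int, walks: int) -> float:
    """Average cycles per walk from the two counter totals; 0.0 if no walks occurred."""
    return walk_duration_cycles / walks if walks else 0.0

def classify(avg_cycles: float) -> str:
    """Map an average walk latency to a coarse residency bucket (assumed thresholds)."""
    if avg_cycles < 50:
        return "mostly L1D/L2-resident"
    if avg_cycles < 300:
        return "mostly L2/LLC-resident"
    return "significant DRAM-resident share"

# Hypothetical counter readings from one measurement interval.
avg = average_walk_cycles(walk_duration_cycles=1_800_000, walks=10_000)
print(avg, classify(avg))
```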

17.3 Page Walk Caches

The page walk cache (PWC) is the single most impactful performance optimisation in the PTW. Rather than repeating all four PxE fetches for every TLB miss, the PWC caches the intermediate physical addresses computed during previous walks. A hit in the PWC eliminates one or more PxE fetches, converting a cold walk latency of 720–1,000 cycles into a warm walk latency that may be as low as 180–250 cycles (a single PTE fetch).

Figure 17.2: Three-level Page Walk Cache (PWC) structure mapped to the VA bit fields it indexes. Hit rates and walk-reduction factors are representative of server-class workloads; ML training shows lower hit rates due to PDPT-level pressure from large tensor allocations.

PWC structure and tagging

The PWC is typically organised as three independently-addressed arrays, one per non-leaf page table level. The L1 PWC caches PML4E lookups, tagged by CR3 concatenated with VA[47:39]. A hit delivers the PDPT base address, eliminating the PML4E fetch. The L2 PWC caches PDPTE lookups, tagged by CR3 and VA[47:30]. A hit delivers the PD base address, eliminating both the PML4E and PDPTE fetches. The L3 PWC caches PDE lookups, tagged by CR3 and VA[47:21]. A hit delivers the PT base address, leaving only the PTE fetch remaining.
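A small model of the three tag formats and the deepest-hit rule may help make the structure concrete. It is a sketch only: the PWC is represented as plain Python sets of tags, which ignores associativity and replacement, and the names pwc_tags and remaining_fetches are hypothetical.

```python
def pwc_tags(cr3: int, va: int) -> dict:
    """Tag for each PWC level: CR3 concatenated with the relevant upper VA bits."""
    return {
        "L1": (cr3, (va >> 39) & 0x1FF),      # VA[47:39], 9 bits
        "L2": (cr3, (va >> 30) & 0x3FFFF),    # VA[47:30], 18 bits
        "L3": (cr3, (va >> 21) & 0x7FFFFFF),  # VA[47:21], 27 bits
    }

# Fetches still required after a hit at each level (None means a full miss).
FETCHES_AFTER_HIT = {None: 4, "L1": 3, "L2": 2, "L3": 1}

def remaining_fetches(pwc: dict, cr3: int, va: int) -> int:
    """pwc maps a level name to a set of cached tags; the deepest hit wins."""
    tags = pwc_tags(cr3, va)
    for level in ("L3", "L2", "L1"):
        if tags[level] in pwc.get(level, set()):
            return FETCHES_AFTER_HIT[level]
    return FETCHES_AFTER_HIT[None]

cr3, va = 0x1000, 0x7F12_3456_789A               # illustrative values
pwc = {"L2": {pwc_tags(cr3, va)["L2"]}}          # only the PDPTE lookup is cached
print(remaining_fetches(pwc, cr3, va))
```

Two addresses in the same 2 MB region produce the same L3 tag, which is why a single PDE-level entry serves all 512 constituent 4 KB pages.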

The tag widths reflect a design trade-off between coverage and array size. Including the full CR3 value (the 40-bit page table base held in CR3[51:12] plus 12 bits of PCID) allows the PWC to hold entries for multiple address spaces simultaneously without invalidation on every context switch. When PCID is in use, context switches that change CR3 but preserve entries in the PWC (the PCID non-flushing path, activated by setting bit 63 of CR3 in the MOV-to-CR3 encoding) avoid invalidating any PWC entries, a significant benefit for short-lived system calls and kernel threads that share large portions of the kernel virtual address space mapping.

PWC invalidation

PWC entries are invalidated by the same operations that flush TLB entries. INVLPG flushes all PWC entries associated with the target virtual address. With CR4.PCIDE clear, MOV to CR3 (the traditional flush) invalidates all entries in all three PWC levels. With CR4.PCIDE set, MOV to CR3 with bit 63 clear invalidates only the entries tagged with the PCID being loaded, while MOV to CR3 with bit 63 set (the PCID-preserving path) is not required to invalidate any PWC entries. On ARM64, TLBI instructions such as TLBI VAE1 and TLBI ASIDE1 invalidate the corresponding TTW cache entries; the TTW cache, ARM's equivalent of the PWC, is not architecturally specified but is widely implemented.

INVLPG (and the ARM equivalent TLBI) must invalidate not just the leaf TLB entry for the target VA, but also any PWC entries that contributed to walks for that VA. Processors that fail to do this correctly produce page table walk results that do not reflect the current state of the page table tree — a correctness bug rather than merely a performance issue. Intel and AMD document that their implementations correctly invalidate all three PWC levels on INVLPG.

PWC hit rates in practice

PWC effectiveness varies sharply across workload classes. For database workloads with hot kernel mappings and a moderate working set, hit rates of 85 to 95 percent at the PDE level (the L3 PWC, where a hit leaves only the PTE fetch) are typical, reducing the average walk to approximately 1.2 PxE fetches rather than 4.0, a 3.3× reduction in walk cost. Web servers and network-intensive workloads with moderate context-switch rates show L2 hit rates of 50 to 70 percent.
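The relationship between hit rates and average fetches per walk is a weighted sum over the deepest PWC level that hits. The sketch below assumes the level naming used earlier in this section (a hit at the PDE-level L3 PWC leaves one fetch, L2 leaves two, L1 leaves three); the probabilities are invented to illustrate a database-like case and are not measurements.

```python
def expected_fetches(p_l3: float, p_l2: float, p_l1: float) -> float:
    """Average PxE fetches per walk.

    Each argument is the probability that the deepest PWC hit for a walk is at
    that level; the remainder is the probability of missing all three levels.
    """
    p_miss = 1.0 - (p_l3 + p_l2 + p_l1)
    return p_l3 * 1 + p_l2 * 2 + p_l1 * 3 + p_miss * 4

# Hypothetical distribution: 90% of walks hit the PDE-level cache.
avg = expected_fetches(p_l3=0.90, p_l2=0.05, p_l1=0.03)
print(round(avg, 2), round(4 / avg, 1))
```

With these assumed inputs the average lands close to the 1.2 fetches quoted above; lowering p_l3 toward 0.2, as in the thrashing case, pushes the average back toward four.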

Machine learning training workloads present a stress case for the PWC. Large tensor allocations backed by huge pages populate the PDE level of the page table; each 1 GB of tensor address space occupies one PDPTE and 512 PDEs. With a 6 to 16 entry L2 PWC, a training loop accessing dozens of distinct tensor objects rapidly thrashes the PDPT cache, driving hit rates below 20 percent and forcing near-full four-level walks for the majority of TLB misses. This is one of the microarchitectural mechanisms behind the higher-than-expected TLB miss penalties observed in large-model training discussed in Chapter 11.

The PWC interacts non-trivially with context switches. On x86-64 with PCID disabled, a context switch invalidates all TLB entries and all PWC entries simultaneously, creating a cold-start penalty for the first dozen to hundred instructions in each new address space. With PCID enabled, the PWC entries for the outgoing address space are retained (tagged with its PCID), and the incoming address space can use its previously-cached PWC entries immediately — effectively eliminating the PTW cold-start overhead for processes that share a physical core in round-robin scheduling. This is one of the two primary motivations for PCID support in Linux (the other being TLB entry retention), and it is why the KPTI patch for Meltdown, which forces CR3 switches on every user/kernel boundary, is a performance regression even on processors with PCID support: the increased CR3 switch rate thrashes the PWC for both the user and kernel page table entries.

The PWC also interacts with transparent huge page promotion in the Linux kernel. When khugepaged promotes a set of 4 KB pages to a 2 MB huge page, it must invalidate the PTE-level TLB entries for all 512 constituent pages using INVLPG. It must also invalidate the PDE-level PWC entry for the 2 MB region, because the newly-promoted PDE (with the PageSize bit set) replaces the former chain of PDE → page table → PTEs with a single leaf PDE. Failure to invalidate the PDE-level PWC entry would allow subsequent walks to use the stale PDE pointer to a now-freed page table, a use-after-free at the hardware level. Linux issues flush_tlb_mm_range on the full 2 MB range after promotion, which on x86-64 translates to INVLPG on every 4 KB page and consequently flushes the PDE PWC entry. This correctness requirement is one reason huge page promotion is conservative in frequency: the full-range INVLPG sequence is expensive and must be executed with the page table lock held, creating a latency spike for threads accessing the promoted region.

RISC-V processors with software-managed TLBs lack a PWC by definition — the software handler re-traverses the full page table tree on every miss. Hardware walker extensions introduced by SiFive (U74 core) and T-Head (XuanTie C910) implement a two-level PWC covering the L1 and L2 levels, bringing their warm walk performance within a factor of two of equivalent ARM64 implementations.

17.4 Miss-Status Holding Registers and Concurrent Walks

The PTW does not stall the entire processor on a TLB miss. Instead, like the data cache, it uses Miss-Status Holding Registers (MSHRs) to track outstanding walks and allow independent instructions to continue executing. The number of PTW-MSHR entries is the hard ceiling on PTW throughput: it determines how many TLB misses can be serviced concurrently, and when it is exhausted, every subsequent TLB miss stalls the load queue until a slot is freed.

Figure 17.3: PTW MSHR coalescing and parallel walker operation. Two loads to the same 4 KB page share MSHR[0] (one walk, one TLB fill). Walker 0 and Walker 1 progress independently on different pages simultaneously. A third miss stalls until a walker slot is freed.

MSHR allocation and coalescing

When a TLB miss occurs, the processor checks whether an MSHR entry already exists for the same 4 KB page (identified by the virtual page number, CR3, and PCID/ASID). If a matching entry exists, the new miss is coalesced onto that entry: it joins the waiter list without triggering a second walk. When the walk completes and the TLB entry is installed, all waiters are woken simultaneously. Coalescing is the reason that a scatter access pattern across N distinct cache lines within a single 4 KB page triggers a single page table walk rather than N walks, provided the misses arrive while the first walk is still in flight. This fact has significant implications for data structure layout decisions in systems programming.

If no matching MSHR entry exists, the processor allocates a new PTW-MSHR entry, assigns a hardware walker to begin the walk, and adds the triggering load to the waiter list. If all MSHR entries are occupied, the load and all subsequent loads that miss in the TLB are stalled until one walk completes and its MSHR is freed. This stall is the PTW bottleneck under large working set conditions.
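The allocate, coalesce, and stall decisions can be captured in a toy model. This is a behavioural sketch, not a description of any vendor's implementation: the class name PtwMshrFile, its methods, and the use of a (virtual page number, CR3, PCID) tuple as the key are illustrative assumptions.

```python
class PtwMshrFile:
    """Toy model of a PTW-MSHR file with a fixed number of walker slots."""

    def __init__(self, slots: int):
        self.slots = slots
        self.active = {}  # page key -> list of waiting load ids

    def miss(self, key, load_id) -> str:
        """Handle a TLB miss; returns 'coalesced', 'allocated', or 'stalled'."""
        if key in self.active:
            self.active[key].append(load_id)  # join the existing walk
            return "coalesced"
        if len(self.active) >= self.slots:
            return "stalled"                  # no free MSHR: the load must retry
        self.active[key] = [load_id]          # start a new walk
        return "allocated"

    def complete(self, key) -> list:
        """Finish the walk for `key`, free its MSHR, and return the woken waiters."""
        return self.active.pop(key)

mshr = PtwMshrFile(slots=2)
page_a, page_b, page_c = (0x10, 0x1000, 1), (0x20, 0x1000, 1), (0x30, 0x1000, 1)
for load_id, page in enumerate([page_a, page_a, page_b, page_c]):
    print(load_id, mshr.miss(page, load_id))
print(mshr.complete(page_a))
```

The example reproduces the scenario in Figure 17.3: two loads share one entry, a second page takes the other walker, and a third page stalls until a slot is freed.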

Concurrent walker capacity by microarchitecture

Intel x86-64 processors from Haswell (2013) through current Raptor Lake implement two concurrent PTW-MSHR entries. Intel Sapphire Rapids (Xeon 4th Gen, 2023) maintains two walkers but improves internal walker pipeline depth, reducing average walk latency by approximately 10 to 15 percent under LLC-hot conditions. AMD Zen 4 (EPYC Genoa, 2022) increases the concurrent walker count to four, providing a 2× improvement in PTW throughput under MSHR-bound workloads compared to Intel's two-walker design. This is a deliberate server-market differentiation: data centre workloads with large memory footprints (in-memory databases, virtualised environments, large-model inference) are far more likely to be PTW-bound than client workloads.

ARM Cortex-A710 (ARMv9, 2021) implements two hardware TTWs per core. AWS Graviton3 (Neoverse V1 core, 2021) implements two TTWs per core; Graviton4 (Neoverse V2, 2023) documents improved TTW throughput without specifying the walker count, consistent with either a deeper pipeline or additional coalescing logic. The ARM Architecture Reference Manual deliberately does not mandate a minimum TTW count, leaving the implementation choice to silicon designers.

Coalescing window and temporal locality

The effectiveness of MSHR coalescing depends on the temporal locality of TLB misses to the same page. Two loads to the same page are coalesced only if the second miss arrives while the first walk's MSHR entry is still active — that is, before the walk completes and the MSHR is freed. In practice, the coalescing window is determined by walk latency: a 200-cycle walk provides a 200-cycle window during which any subsequent miss to the same virtual page is coalesced for free. This temporal requirement has an important implication for data structure layout: sequential access patterns within a page (e.g., iterating a 4 KB array) naturally fall within the coalescing window even at high IPC, while non-sequential scatter-gather patterns that revisit the same page after visiting many other pages may miss the coalescing window and incur multiple walks to the same physical page in succession.
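The coalescing window can be illustrated with a short function that counts walks for a stream of misses to one page. It assumes every listed time is a genuine TLB miss (that is, any entry installed by an earlier walk has already been evicted), which is the scatter-gather situation described above; count_walks and the timestamps are hypothetical.

```python
def count_walks(miss_times, walk_latency: int) -> int:
    """Number of walks needed for misses to one page, given times in cycles.

    A miss that arrives while a walk is still in flight is coalesced;
    a miss that arrives after the walk has completed starts a new walk.
    """
    walks = 0
    busy_until = None
    for t in sorted(miss_times):
        if busy_until is None or t >= busy_until:
            walks += 1
            busy_until = t + walk_latency
    return walks

print(count_walks([0, 50, 199], walk_latency=200))   # all inside one window
print(count_walks([0, 50, 5000], walk_latency=200))  # the late revisit walks again
```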

Software prefetch instructions (PREFETCHT0, PREFETCHT1) can trigger TLB lookups that warm the TLB for subsequent accesses. On processors where prefetch instructions cause TLB fills (x86-64 Skylake and later, ARM64 Cortex-A710 and later), a prefetch that TLB-misses will initiate a walk and install the TLB entry before the demand load arrives, converting a TLB-miss penalty into a background operation overlapped with useful computation. This "TLB prefetching" effect is exploited explicitly by some HPC libraries and is a latent benefit of software prefetch intrinsics that is rarely discussed in prefetch tuning guides.

MSHR starvation and the PTW bottleneck

The theoretical analysis of MSHR starvation was formalised by Barr et al. (ISCA 2010). With two walker slots and a cold walk latency of 200 ns (800 cycles at 4 GHz), maximum sustained PTW throughput is 2 ÷ 800 cycles = 0.0025 walks per cycle. At a TLB miss rate of 0.01 misses per instruction and an IPC of 3, the processor needs 0.03 walks per cycle — twelve times more than two concurrent walkers can supply. In this regime, IPC collapses to a fraction of peak: nearly every load that touches a page not in the TLB stalls immediately after issue.

Basu et al. (ISCA 2013) measured this effect empirically using graph analytics workloads on a real system: PTW-bound execution reduced effective IPC from 2.8 to 0.4 — a 7× degradation attributable entirely to TLB miss stalls with no cache miss involvement. Their proposed mitigation, direct segments (large flat virtual-to-physical mappings that bypass the page table tree entirely), is discussed in Chapter 16. A complementary hardware approach — increasing walker count and MSHR depth — is the path taken by AMD Zen 4.

Huge pages interact with MSHR occupancy in a second important way: because a 2 MB walk terminates at the PDE stage (three levels, not four), the walk completes faster and frees the MSHR sooner. A walk that is 25 percent shorter raises sustainable walk throughput under MSHR-bound conditions by approximately 33 percent (1/0.75), independent of the TLB coverage benefit. This second-order effect (faster walk, shorter MSHR occupancy, higher throughput) is separate from and additive to the primary TLB reach benefit of huge pages.

17.5 Speculative Page Table Walks

Modern out-of-order processors do not wait for a load to reach the head of the retirement queue before initiating a TLB lookup. The TLB is accessed speculatively, during the issue stage, on the basis of an address that may have been computed from instructions that have not yet retired. If the address later turns out to be incorrect — due to a mispredicted branch, a cancelled instruction, or a preceding store that forwards a different value — the TLB lookup result is discarded and the pipeline is redirected. This speculative behaviour is architecturally transparent in the common case.

Figure 17.4: Speculative page table walk paths for P=1 vs P=0 PTEs, and the L1 Terminal Fault (L1TF/Foreshadow) exploit mechanism. Pre-2018 Intel CPUs speculatively fill L1D from the physical address in a not-present PTE, enabling cache side-channel exfiltration.

The security-critical case: speculative walks on P=0 PTEs

The question that becomes security-critical is what happens when a speculative TLB access misses, and the processor speculatively initiates a page table walk for a virtual address that the retiring instruction will never be permitted to access. For a PTE with the Present bit (P) set to 1 and permissions that allow the access, a speculative walk installs a TLB entry that the eventual permission check at retirement may or may not reject. If the check fails, the TLB entry is not used and the #PF is delivered. This is architecturally correct behaviour and has no persistent side-effects beyond the transient TLB entry.

For a PTE with P=0 — a not-present page — the architectural expectation is that the walk terminates immediately with a page-fault exception. The processor is expected to generate the fault and deliver it when the load retires; the physical address field of the PTE (bits 51:12) has no architectural meaning for a not-present entry and should not be acted upon. Pre-2018 Intel processors violated this expectation in a security-critical way: on encountering a P=0 PTE during a speculative walk, the hardware read the physical address from bits 51:12 of the not-present PTE and issued a speculative L1D cache fill for that physical address before the permission check fired.

L1 Terminal Fault (Foreshadow) — CVE-2018-3615/3620/3646

This behaviour is the root cause of the Foreshadow vulnerability family (Van Bulck et al., 2018), also referred to by Intel as L1 Terminal Fault (L1TF). An attacker who can control the physical address field of a not-present PTE — a capability available to the host hypervisor, a co-resident VM, or in the SGX variant (Foreshadow-SGX, CVE-2018-3615), an untrusted enclave — can cause the speculative walk to fill L1D with data from an arbitrary physical frame. The L1D contents can then be exfiltrated using a Flush+Reload or Prime+Probe side-channel without the speculative load ever retiring.

Three distinct attack surfaces were disclosed simultaneously. CVE-2018-3615 (Foreshadow-SGX) targets Intel SGX enclaves: the hypervisor maps the enclave's memory with P=0, places the target physical address in the PTE, and triggers speculative walks from a co-resident untrusted enclave. CVE-2018-3620 targets OS and System Management Mode (SMM) memory: kernel-mode attackers construct not-present PTEs pointing at protected physical frames. CVE-2018-3646 targets virtual machine memory: a guest VM can cause speculative L1D fills from host physical memory by crafting not-present PTEs in guest page tables, then exfiltrate data through a shared L1D on SMT (Hyper-Threading) cores.

Mitigations and residual risk

Intel's primary mitigation is a microcode update that causes the processor to suppress speculative L1D fills when the PTE.P bit is zero. This is implemented in microcode because the fill-suppression logic requires modifying the PTW's behaviour after the PTE has been fetched but before the speculative memory access is issued, a stage accessible only to microcode on Intel's FSM-plus-microcode architecture. The microcode patch was distributed through operating system and firmware updates beginning in August 2018 and is now universally deployed on affected Intel platforms.

For SGX deployments, Intel additionally introduced a hardware-enforced mitigation in newer silicon: the SGX L1D flush on enclave exit ensures that L1D does not contain enclave data when context switches to untrusted code. Hypervisors deploy L1D flush on VM exit as a software mitigation on patched but not hardware-hardened systems.

ARM64's Translation Table Walker is a pure hardware FSM with no microcode involvement. ARM does not publicly document whether its implementations issue speculative fills from P=0 TTEs; no equivalent to L1TF has been publicly demonstrated on any ARM64 silicon. RISC-V software-managed TLB designs are structurally immune: the trap handler is architectural software that explicitly checks the PTE before acting on the physical address field, and no speculation pathway exists for not-present pages in the trap handler's execution context.

The speculative walk interacts with Spectre variant 1 (bounds-check bypass) in a subtle way that is less widely understood than L1TF. A Spectre v1 gadget can speculatively compute an out-of-bounds virtual address and trigger a TLB lookup (and potentially a PTW) for that address before the bounds check retires. If the speculative walk installs a TLB entry for a kernel virtual address (because the kernel mappings are present in the page table during user-mode execution, as was the case before KPTI), the TLB entry's presence can be detected through cache timing. KPTI prevents this by removing kernel mappings from the user-mode page table entirely — ensuring that speculative PTW on kernel addresses fails at the PML4 level rather than installing a usable TLB entry. This interaction between Spectre v1 mitigations and PTW behaviour illustrates why page table isolation must be understood in conjunction with speculative execution, not as an independent design choice.

The broader class of microarchitectural data sampling (MDS) attacks — including the RIDL, Fallout, and ZombieLoad variants — exploit related speculative execution properties of the store buffer, load buffer, and fill buffer to extend the attack surface beyond the PTW itself, and inform the ongoing design of trusted execution environments across all three ISAs examined in this chapter.

17.6 PTW Parallelism and Throughput

The throughput of the page table walker — measured in walks completed per unit time — is determined by the interaction of three factors: the number of concurrent walker slots (MSHR capacity), the latency of each walk (determined by cache hierarchy state), and the degree to which walks can be pipelined. Understanding these interactions is essential for diagnosing performance degradation in large-memory workloads.

Figure 17.5: PTW throughput ceiling model and walker occupancy timelines. W/L ceiling in walks per cycle ×1000. Demand line at 30 represents a workload with TLB miss rate 0.01/instruction at IPC=3. AMD Zen 4's four-walker design comes closest to LLC-hot demand (a ceiling of about 27 against a demand of 30); neither design meets demand under DRAM-cold conditions.

Throughput ceiling analysis

Given W concurrent walkers and an average walk latency of L cycles, the maximum PTW throughput is W/L walks per cycle. For Intel's two-walker design with a worst-case cold walk of 800 cycles, the ceiling is 0.0025 walks per cycle. For AMD Zen 4's four-walker design with the same latency, the ceiling is 0.005 walks per cycle. LLC-hot walks (150 cycles) raise both ceilings by 5.3× to 0.013 and 0.027 walks per cycle respectively.

These ceilings must be compared to the walk demand generated by the workload. Walk demand equals the TLB miss rate (misses per instruction) multiplied by the instruction issue rate (IPC). A workload with a miss rate of 0.01 misses per instruction and an IPC of 3 generates 0.03 walks per cycle, more than twice Intel's LLC-hot ceiling of 0.013 and twelve times its DRAM-miss ceiling. AMD's four-walker design comes within about 11 percent of the LLC-hot demand at this miss rate (0.027 against 0.03 walks per cycle), but both architectures remain heavily bottlenecked under DRAM-miss conditions.
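The comparison of ceilings against demand can be tabulated with a few lines of Python. The walker counts, walk latencies, miss rate, and IPC are the figures quoted in this section; the function names are hypothetical.

```python
def ceiling(walkers: int, walk_cycles: float) -> float:
    """Maximum sustained PTW throughput in walks per cycle (W / L)."""
    return walkers / walk_cycles

def demand(misses_per_instruction: float, ipc: float) -> float:
    """Walks per cycle that the workload requires."""
    return misses_per_instruction * ipc

need = demand(0.01, 3)
for name, walkers in (("two walkers", 2), ("four walkers", 4)):
    for state, cycles in (("DRAM-cold", 800), ("LLC-hot", 150)):
        c = ceiling(walkers, cycles)
        print(f"{name:12s} {state:9s} ceiling={c:.4f}  demand/ceiling={need / c:.2f}")
```

A demand-to-ceiling ratio above 1 means the walkers cannot keep up in that configuration, and the excess misses queue behind occupied MSHRs.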

Pipelining between walks

Within a single walk, stages are strictly sequential — the walker cannot issue the PDPTE fetch until the PML4E fetch has returned a physical address. However, two independent walkers (for two different virtual pages) can interleave their stage fetches through the cache hierarchy. Walker 0 at the PTE stage (waiting for an L2 cache response) and Walker 1 at the PDE stage (waiting for an LLC response) simultaneously occupy different cache banks and generate independent cache requests. This is the microarchitectural parallelism that makes concurrent walkers effective even when each individual walk is serialised across stages.

Interaction with huge pages

Huge pages (2 MB) improve PTW throughput through two independent mechanisms. The direct mechanism is the 25% reduction in walk stages (three levels vs four), which reduces average walk latency. The indirect mechanism is the reduction in MSHR occupancy time: a shorter walk frees the MSHR sooner, allowing it to be reallocated to a new miss more quickly. Under MSHR-bound conditions where all walker slots are perpetually occupied, the indirect mechanism can contribute as much throughput improvement as the direct mechanism.

1 GB huge pages (a PDPTE with the PageSize bit set) reduce the walk to two levels, a PML4E fetch followed by a single PDPTE fetch, providing a further 33% reduction in walk stages relative to 2 MB pages. However, 1 GB page mappings are rarely used for data in practice because of fragmentation pressure: a single 1 GB allocation requires a physically contiguous gigabyte of DRAM, which is increasingly difficult to satisfy in a long-running system with fragmented physical memory.
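To see how walk depth feeds into the throughput ceiling, the sketch below assumes that every stage costs the same latency. The 200 cycles per stage is a hypothetical DRAM-miss figure, chosen only so that a four-level walk matches the 800-cycle cold walk used earlier. Real walks mix cache hits and misses across stages, so this is a first-order illustration rather than a model of any specific processor.

```python
# Page table fetches per walk for each x86-64 page size.
WALK_LEVELS = {"4 KB": 4, "2 MB": 3, "1 GB": 2}


def stage_reduction(levels_before: int, levels_after: int) -> float:
    """Fractional reduction in walk stages when moving to a larger page size."""
    return 1.0 - levels_after / levels_before


def ceiling(walkers: int, levels: int, cycles_per_stage: float) -> float:
    """W / L ceiling with L approximated as levels x per-stage latency."""
    return walkers / (levels * cycles_per_stage)


CYCLES_PER_STAGE = 200  # hypothetical uniform DRAM-miss stage latency
WALKERS = 2             # two-walker design from the text

for page_size, levels in WALK_LEVELS.items():
    print(f"{page_size}: {levels} stages, "
          f"ceiling={ceiling(WALKERS, levels, CYCLES_PER_STAGE):.5f} walks/cycle")

print(f"4 KB -> 2 MB stage reduction: {stage_reduction(4, 3):.0%}")
print(f"2 MB -> 1 GB stage reduction: {stage_reduction(3, 2):.0%}")
```

Because occupancy time per walker slot shrinks in proportion to walk length, the same walker count sustains proportionally more walks per cycle as the page size grows.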

NUMA effects on PTW latency

On multi-socket NUMA systems, the physical location of page table entries relative to the executing core significantly affects PTW latency. When a process is migrated from socket 0 to socket 1 after its page tables were allocated in socket 0's local DRAM, all four PxE fetches during a cold walk must cross the inter-socket interconnect (UPI on Intel, Infinity Fabric on AMD), incurring a NUMA latency penalty of 1.5× to 2.5× relative to local access. For a cold four-level walk that would take 720 cycles with local page tables, a cross-socket walk can require 1,100 to 1,800 cycles, placing it firmly in the regime where PTW stalls dominate execution. The mitigation is NUMA-local page table allocation. Linux normally allocates page table pages on the node where the faulting thread is running, so pinning threads to a socket before first touch and running under the numactl --localalloc policy keeps new page table allocations NUMA-local. Note that move_pages() migrates a process's data pages rather than its page table pages, and MADV_DONTFORK only controls inheritance across fork(), so neither relocates page tables that already live on a remote node. Production HPC and ML training frameworks that pin threads to sockets and use process migration consistently show 15 to 40 percent PTW latency reductions from NUMA-local page table allocation, a benefit that dwarfs the effect of many application-level memory access pattern optimisations.

Practical throughput optimisation strategies

Given the throughput ceiling analysis, the available levers for PTW-bound workloads are limited but impactful. First, increasing huge page coverage reduces both walk depth and MSHR occupancy time. Linux's transparent_hugepage=always boot setting enables automatic 2 MB promotion for anonymous memory; explicitly using madvise(MADV_HUGEPAGE) on large allocations provides finer control for latency-critical code paths. Second, NUMA-local allocation policies reduce DRAM latency for page table fetches: a page table allocated on a remote NUMA node incurs a NUMA latency multiplier (typically 1.5–2×) on every DRAM-miss PTW stage, compounding the cold walk penalty. Third, pre-faulting large memory regions with mmap(MAP_POPULATE) or mlock() populates the page tables up front and eliminates first-access fault overhead, at the cost of upfront kernel time. Fourth, on AMD Zen 4 systems, the four-walker design means that workloads which saturate Intel's two-walker ceiling may achieve significantly better performance without any software changes.

Workload characterisation and the TLB-to-PTW handoff

Not all workloads with high TLB miss rates are PTW-bound. A workload that generates a high TLB miss rate but accesses page tables that are entirely LLC-resident (a common case for databases with hot buffer pool metadata) will have high walk frequency but low per-walk latency. The PTW-MSHR bottleneck manifests specifically when walk latency is high, that is, when page tables are DRAM-resident. This occurs when the total size of the page table tree exceeds the LLC capacity. For a workload using 4 KB pages, that point is reached at approximately LLC_size × 512 bytes of mapped data: each 8-byte leaf PTE maps 4 KB of data, so the leaf-level page tables, which dominate the size of the tree, occupy 1/512th of the data size.
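The 1/512 ratio behind the LLC-capacity threshold can be checked with a few lines of arithmetic. The sketch below counts only leaf-level PTEs (8 bytes per 4 KB page) and ignores both the upper levels of the tree and the fact that page tables compete with ordinary data for LLC space; the 32 MiB LLC size is a hypothetical example value.

```python
PTE_BYTES = 8        # size of one x86-64 page table entry
PAGE_BYTES = 4096    # base page size


def leaf_page_table_bytes(mapped_bytes: int, page_bytes: int = PAGE_BYTES) -> int:
    """Bytes of leaf-level page table needed to map `mapped_bytes` of data."""
    return (mapped_bytes // page_bytes) * PTE_BYTES


def llc_coverage_bytes(llc_bytes: int, page_bytes: int = PAGE_BYTES) -> int:
    """Data footprint whose leaf PTEs would exactly fill an LLC of `llc_bytes`."""
    return (llc_bytes // PTE_BYTES) * page_bytes


GIB = 2 ** 30
MIB = 2 ** 20
llc = 32 * MIB  # hypothetical LLC capacity

print(f"1 GiB of data needs {leaf_page_table_bytes(GIB) // MIB} MiB of leaf PTEs")
print(f"A {llc // MIB} MiB LLC holds leaf PTEs for at most "
      f"{llc_coverage_bytes(llc) // GIB} GiB of 4 KB-mapped data")
```

In practice the crossover arrives earlier than this upper bound suggests, because the page tables never have the LLC to themselves.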

17.7 x86-64 PTW Microarchitecture

The x86-64 page table walker is implemented as a combination of a hardware finite-state machine and microcode routines stored in the processor's on-die microcode ROM. This dual implementation reflects the ISA's history: the x86 page table format was designed in the 1980s with complex exception semantics (reserved bit faults, A/D bit maintenance, NX enforcement) that are difficult to implement efficiently in pure hardware without significant silicon area. The microcode-assisted approach allows the common case (present pages, no permission violations, A bit already set) to be handled by the fast hardware FSM, while delegating exception cases to microcode sequences that can construct and deliver precise exceptions without adding hardware complexity.

Figure 17.6: PTW design comparison across x86-64, ARM64, and RISC-V across six properties. RISC-V base is structurally immune to speculative-walk attacks; Intel's microcode patch path allows L1TF mitigations without silicon re-spin.

FSM structure and microcode handoff

Intel's hardware PTW FSM handles the walk stages described in Section 17.2. The FSM reads PxE entries from the cache hierarchy, validates the P bit and permission fields, maintains the A bit (and D bit in the leaf PTE), and computes the next stage's physical address. When the walk encounters a condition that requires exception delivery — P=0 PTE, reserved bit set, privilege violation, NX violation — the FSM transfers control to a microcode sequence. The microcode sequence constructs the page fault error code (indicating the type of violation), sets CR2 to the faulting virtual address, and initiates #PF delivery through the normal exception mechanism.

The microcode-assisted design provides a critical operational advantage: security mitigations that require changes to PTW behaviour can be deployed through microcode updates without silicon re-spin. Intel's response to Spectre, Meltdown, L1TF, and subsequent vulnerabilities has relied extensively on microcode updates that modify the FSM's speculative behaviour. This capability is unavailable to ARM and RISC-V implementors with pure-hardware TTW designs.

PCID interaction and context-switch optimisation

Process-Context Identifiers (PCIDs), introduced in Intel Westmere, and ARM64's equivalent Address Space Identifiers (ASIDs) allow the TLB and PWC to hold entries for multiple address spaces simultaneously. In the x86-64 PTW, the 12-bit PCID is included in the PWC tag, so PWC entries are not invalidated on a PCID-preserving context switch (MOV to CR3 with bit 63 of the source operand set). This avoids the costly full PWC flush on every context switch, which was a significant source of overhead before PCID support was added to Linux (around the time of the KPTI patches, 2017–2018).
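The layout of the value written to CR3 on a PCID-preserving switch can be sketched as follows. This is only a bit-level illustration in Python, assuming CR4.PCIDE=1; the helper names make_cr3 and split_cr3 are hypothetical and do not correspond to any kernel API.

```python
CR3_NOFLUSH = 1 << 63                     # bit 63: keep entries tagged with this PCID
PCID_MASK = 0xFFF                         # bits 11:0 hold the 12-bit PCID
BASE_MASK = ((1 << 52) - 1) & ~PCID_MASK  # bits 51:12 hold the PML4 base address


def make_cr3(pml4_phys: int, pcid: int, preserve: bool) -> int:
    """Compose the operand for MOV to CR3 (illustrative helper)."""
    if pml4_phys & ~BASE_MASK:
        raise ValueError("PML4 base must be 4 KB aligned and below 2**52")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = pml4_phys | pcid
    if preserve:
        value |= CR3_NOFLUSH
    return value


def split_cr3(value: int):
    """Recover (PML4 base, PCID, preserve flag) from a composed operand."""
    return value & BASE_MASK, value & PCID_MASK, bool(value & CR3_NOFLUSH)


cr3 = make_cr3(0x1234000, pcid=5, preserve=True)
print(hex(cr3), split_cr3(cr3))
```

Writing the same base and PCID with preserve=False would ask the processor to invalidate the cached translations tagged with that PCID, which is the behaviour the no-flush bit exists to avoid.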

The KPTI (Kernel Page-Table Isolation) mitigation for Meltdown adds a page table switch to every user-to-kernel transition: user mode runs on a page table that contains only the minimal kernel mappings needed for the interrupt/syscall entry path, and kernel entry switches CR3 to the full kernel page table. KPTI assigns separate PCID values to the user and kernel page tables of each process, allowing the PTW and TLB to hold both mapping sets simultaneously and avoid cold-start penalties on the common system call path. The overhead of this approach is approximately 5 to 10% on system-call-intensive workloads and negligible on compute-bound workloads.

Microcode update mechanics for PTW behaviour

Intel's microcode update mechanism is worth understanding in depth because it is the primary delivery vehicle for PTW-related security mitigations. Intel processors contain a microcode ROM that is fixed at manufacturing and a smaller RAM-based microcode patch area that is written by the BIOS or the operating system during boot. An update is triggered by a WRMSR to MSR 0x79 (IA32_BIOS_UPDT_TRIG): the BIOS loads the primary update, the OS can then deliver a newer one, and subsequent late microcode loads can be applied per-core via the same mechanism. The PTW's finite-state machine can be redirected to execute different microcode sequences for specific PTE conditions, including the L1TF mitigation, which inserts a check on PTE.P before initiating the speculative L1D fill, and the Accessed bit atomicity improvements in SGX-capable processors. The patch area is per-core and volatile: its contents are lost at reset and must be reloaded at each boot by the BIOS or the kernel's microcode loader. Once the patch is loaded, the effective PTW behaviour differs from factory silicon in measurable ways: the L1TF-mitigated PTW adds 1 to 3 cycles per walk for the additional PTE.P validation step, an overhead visible in DTLB_LOAD_MISSES.WALK_DURATION counter comparisons between pre- and post-patch measurements.

Accessed and Dirty bit maintenance

The Accessed (A) and Dirty (D) bits in page table entries must be maintained by the hardware walker on x86-64. If the A bit in any entry along the walk is clear when the walker traverses it, the walker must set it atomically (using a locked CMPXCHG or equivalent microcode sequence) before proceeding to the next level. If the Dirty bit in the leaf PTE is clear on a write access, the walker must set it. These atomic RMW operations introduce additional latency when page table entries are cold: a first access to a freshly-mapped page incurs not just four sequential cache reads but also up to four atomic read-modify-write operations if the A bits are clear throughout the tree. On NUMA systems, this can cause cross-socket bus traffic for page table entries allocated on a different NUMA domain, an effect that is particularly pronounced in first-touch allocation policies where the page table itself may live far from the core executing the first access.

5-level paging (LA57)

Intel Ice Lake (2019) and AMD Zen 4 (2022) introduced 5-level paging support through the LA57 CPU feature flag. When enabled via CR4.LA57=1, the PTW inserts an additional PML5E fetch stage before the PML4E stage, extending virtual addresses to 57 bits (a 128 PiB virtual address space, of which the lower 64 PiB half is available to user space). The PTW FSM is extended with a fifth stage, and the PWC gains a fourth level that caches PML5E lookups. The performance impact of LA57 is approximately one additional memory access per TLB miss on a cold walk (a 15 to 25 percent latency increase), and a larger PWC is needed to maintain equivalent hit rates under the wider virtual address space.
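The extra PML5E stage simply consumes nine more virtual address bits above the PML4 index. A minimal sketch of the index extraction for both paging modes is shown below; the function name split_va and the table labels are illustrative, and the code models only the bit slicing, not canonical-address checks.

```python
LEVEL_NAMES = ["PML5", "PML4", "PDPT", "PD", "PT"]


def split_va(va: int, levels: int = 4):
    """Split a virtual address into per-level table indices and a page offset.

    levels=4 models 48-bit paging; levels=5 models LA57 (57-bit) paging.
    Each level consumes 9 bits; the low 12 bits are the 4 KB page offset.
    """
    if levels not in (4, 5):
        raise ValueError("levels must be 4 or 5")
    names = LEVEL_NAMES[-levels:]
    indices = {}
    for depth, name in enumerate(names):
        shift = 12 + 9 * (levels - 1 - depth)
        indices[name] = (va >> shift) & 0x1FF
    return indices, va & 0xFFF


# Construct an address with a distinct index at every level.
va = (1 << 48) | (2 << 39) | (3 << 30) | (4 << 21) | (5 << 12) | 0x678
print(split_va(va, levels=5))
print(split_va(va, levels=4))  # bits above 47 are not used as an index
```

In four-level mode the bits that LA57 would use for the PML5 index play no part in table selection, which is why enabling LA57 adds exactly one fetch to a cold walk.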

17.8 ARM64 Translation Table Walk

ARM64's Translation Table Walker (TTW) is a pure-hardware finite-state machine with no microcode involvement. The TTW is specified in the ARM Architecture Reference Manual (ARM DDI 0487) and implemented independently by each licensee — resulting in varying walker counts, pipeline depths, and speculative behaviour across implementations from Apple, Qualcomm, Amazon, Ampere, and others.

Figure 17.7: ARM64 Translation Table Walk. TTBR0_EL1 maps user-space (VA[63]=0); TTBR1_EL1 maps kernel space (VA[63]=1), eliminating full page-table swap at context switch. Walk depth depends on granule and VA width: 3 levels (64 KB granule), 4 levels (4 KB, 48-bit VA), or 5 levels (4 KB, 52-bit LPA2). Walker behaviour — count, pipeline depth, and speculative pre-fetch — is licensee-defined.

TTW structure and TTBR organisation

ARM64 maintains two translation table base registers for the EL1/EL0 translation regime: TTBR0_EL1 for user-space mappings (VA[63]=0) and TTBR1_EL1 for kernel mappings (VA[63]=1). This split removes the need to switch the kernel's page table tree at context switch: only TTBR0_EL1 must be updated when switching between user processes, while TTBR1_EL1 (the kernel mapping) remains constant. Each TTBR holds the base physical address of the top-level table (L0 in the four-level configuration) and the ASID (Address Space Identifier), a tag of up to 16 bits that identifies the address space for TLB and TTW cache entries.

The ARM64 page table walk depth depends on the configured translation granule (4 KB, 16 KB, or 64 KB) and the VA width. With the common 4 KB granule and 48-bit VAs (TCR_EL1.T0SZ=16), the walk proceeds through four levels: L0 (VA[47:39]), L1 (VA[38:30]), L2 (VA[29:21]), and L3 (VA[20:12]). With 52-bit VAs (LPA2 extension) and 4 KB granule, a five-level walk is used. With 64 KB granule and 48-bit VAs, a three-level walk suffices because each table entry covers a larger VA range.
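The dependence of walk depth on granule and VA width follows from how many index bits one table can resolve: a table occupies one granule and holds 8-byte descriptors, so each level resolves log2(granule) − 3 bits. The sketch below applies that rule; it is a simplified model with an illustrative function name, and it does not capture implementation options such as concatenated tables.

```python
def walk_levels(va_bits: int, granule_bytes: int) -> int:
    """Number of translation table levels needed for a given VA width and granule."""
    page_shift = granule_bytes.bit_length() - 1  # log2 of the granule size
    bits_per_level = page_shift - 3              # 8-byte descriptors per table
    index_bits = va_bits - page_shift            # bits left after the page offset
    return -(-index_bits // bits_per_level)      # ceiling division


for va_bits, granule in ((48, 4096), (52, 4096), (48, 16384), (48, 65536)):
    print(f"{granule // 1024:>2} KB granule, {va_bits}-bit VA: "
          f"{walk_levels(va_bits, granule)} levels")
```

The 4 KB and 64 KB results match the depths given in the text; the 16 KB case is included only to show that the same rule covers the third granule size.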

Stage-2 translation and nested walks

ARM64's hardware virtualisation support (the EL2 exception level of ARMv8-A) provides two-stage address translation: Stage 1 translates the guest virtual address to the intermediate physical address (IPA, the guest's view of physical memory), and Stage 2 translates the IPA to the host physical address (HPA). The TTW performs both stages in sequence, meaning that each of the N Stage-1 PxE fetches must itself be translated through Stage 2 before it can be issued, and the final output IPA requires one more Stage-2 walk. For a four-level Stage-1 walk with a four-level Stage-2 walk, the total number of memory accesses reaches 4 × (4 + 1) + 4 = 24 in the cold case, a catastrophic latency of several thousand cycles if all accesses miss the LLC.
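The cold-case access count for a nested walk can be derived by counting explicitly: every Stage-1 descriptor address is an IPA that needs a full Stage-2 walk before the descriptor itself can be fetched, and the final output IPA needs one further Stage-2 walk. The sketch below encodes that count under the assumption that no TLB or walk cache hits occur; the function name is illustrative.

```python
def nested_walk_accesses(stage1_levels: int, stage2_levels: int) -> int:
    """Memory accesses for a fully cold two-stage translation.

    Each Stage-1 level costs one Stage-2 walk (to translate the descriptor's
    IPA) plus the descriptor fetch itself; the resulting data IPA then needs
    one more Stage-2 walk.
    """
    per_stage1_level = stage2_levels + 1
    final_translation = stage2_levels
    return stage1_levels * per_stage1_level + final_translation


for s1, s2 in ((3, 3), (4, 4), (5, 5)):
    print(f"{s1}-level Stage 1 over {s2}-level Stage 2: "
          f"{nested_walk_accesses(s1, s2)} accesses")
```

For four levels at each stage this count comes to 24, and it grows roughly with the product of the two depths, which is why caching Stage-2 results matters so much for virtualised workloads.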

This nested walk penalty is mitigated in practice by the Stage-2 TLB (IPA-to-HPA cache), which caches the most recently used IPA-to-HPA translations and eliminates the Stage-2 re-walk for hot mappings. ARM processors also maintain a combined Stage-1/Stage-2 TLB that caches the fully-resolved guest-VA-to-HPA translation, bypassing both stages on a hit. The nested walk cost is therefore primarily visible during working set expansion or after VMID-tagged TLB flushes, not on steady-state access patterns.

Hardware Dirty Bit Management (HDBM)

ARM64's optional Hardware Dirty Bit Management (HDBM) extension, introduced in ARMv8.1-A, allows the hardware TTW to record dirty state in page table entries itself rather than requiring a software page-fault handler for write-fault-on-clean-page flows. Without HDBM, a write to a clean page that is mapped read-only causes a permission fault; the OS handler marks the page dirty, makes the mapping writable, and retries. With HDBM enabled, the OS sets the Dirty Bit Modifier (DBM) flag in the PTEs of pages that may be written; when a write hits such an entry, the TTW detects the flag and atomically upgrades the mapping to writable without invoking software, eliminating one round-trip through the fault handler per initially-clean writable page. This is architecturally equivalent to x86-64's hardware A/D bit maintenance but is optional in ARM64 (the Cortex-A710 and Apple M-series both implement HDBM). Operating systems that exploit HDBM (Linux since 4.14 on ARMv8.1+ systems) see measurable reductions in page-fault overhead for copy-on-write workloads.

Implementation survey: Apple M-series and AWS Graviton

The ARM64 TTW specification's implementation freedom has produced meaningfully different designs across major licensees. Apple's M-series processors (M1 through M4) are notable for their extremely high L1 DTLB capacity (192 entries in the M1 Firestorm cores versus 64 in contemporary ARM Cortex-A cores), combined with a large STLB (second-level TLB) shared between cores in each cluster. The practical effect is that M-series chips reach TLB saturation at far larger working sets than Cortex-A equivalents, deferring the PTW bottleneck to workloads that would be firmly PTW-bound on server-class ARM designs. Apple does not publish the M-series TTW walker count, but microbenchmarks by Dougall Johnson and others suggest a single hardware walker per performance core, compensated by the large TLB arrays that reduce miss frequency dramatically. The M-series also implements the HDBM extension, which Apple exploits in macOS's copy-on-write page fault path to reduce fault-handler overhead.

AWS Graviton3 (Neoverse V1) and Graviton4 (Neoverse V2) are designed specifically for cloud workloads where memory footprint and TLB pressure are primary concerns. Relative to the Neoverse N1 core used in Graviton2, Graviton3 increased the L1 DTLB from 48 to 64 entries and expanded the STLB, consistent with AWS's observation that many cloud workloads are TLB-miss-intensive at scale. Graviton4 is documented as having "improved" address translation throughput without specific public disclosure of walker count or STLB capacity. Both designs implement 16-bit ASIDs (as permitted by ARMv8.1-A), allowing simultaneous TLB residency for up to 65,536 distinct address spaces, a significant advantage for container-dense cloud workloads that would otherwise cause ASID exhaustion and forced TLB flushes.

TLBI instruction granularity

ARM64 provides a rich set of TLB invalidation instructions (TLBI) that allow fine-grained invalidation without full TLB flushes. TLBI VAE1IS invalidates the entry for a specific VA in the inner-shareable domain. TLBI ASIDE1IS invalidates all entries tagged with a specific ASID. TLBI VMALLE1IS invalidates all entries for the current VMID. The TTW cache (PWC equivalent) is invalidated by the same TLBI instructions, ensuring that subsequent walks use the updated page table entries. The granularity of ARM64 TLBI instructions exceeds that of x86-64's INVLPG in two important respects: TLBI can target specific ASID-tagged entries without invalidating entries for other address spaces, and TLBI operations are broadcast to all cores in the inner-shareable domain in a single instruction, eliminating the need for the explicit IPI-based TLB shootdown mechanism required on x86-64.

17.9 RISC-V Software-Managed PTW

The base RISC-V privileged architecture specification (The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, version 1.12) deliberately omits a hardware page table walker. This is a conscious design choice: the RISC-V philosophy favours a minimal, composable base specification with optional extensions, rather than mandating microarchitectural choices that may not suit all implementation targets. For embedded and real-time systems where TLB miss rate is low and code size is at a premium, a software-managed TLB is perfectly adequate. For server and cloud deployments, hardware walker extensions are available from multiple vendors.

Figure 17.8: RISC-V software-managed page table walk. TLB miss raises a page-fault exception; the supervisor handler at stvec walks the page table in software using ordinary load instructions, installs the translation, issues SFENCE.VMA, and returns via sret. Walk depth scales with satp.MODE: 3 levels (Sv39), 4 levels (Sv48), 5 levels (Sv57). Handler must pin its own code pages to avoid recursive faults.

Base RISC-V TLB miss handling

When a TLB miss occurs in a RISC-V processor with satp.MODE set to Sv39, Sv48, or Sv57, the hardware raises a page-fault exception and transfers control to the supervisor exception handler at the address in stvec. The faulting virtual address is recorded in stval. The handler is responsible for walking the page table tree in software: loading satp to obtain the root PPN, computing the physical address of each page table entry using ordinary load instructions, validating the chain of entries, and installing the result into the TLB using a privileged instruction sequence followed by SFENCE.VMA.

The software handler must be extremely fast — any instruction that faults during the handler's execution while executing at supervisor privilege may recursively invoke the same handler (a double fault scenario). For this reason, RISC-V operating systems pin the TLB miss handler and the page table structures it accesses in the TLB, typically using a dedicated wired TLB entry for the handler code and a locked physical region for the root page table. Linux's RISC-V port uses a global kernel direct mapping that is pre-populated in the TLB to avoid this recursion.

Sv39, Sv48, Sv57, and Sv64 modes

RISC-V supports four virtual address widths controlled by satp.MODE: Sv39 (39-bit VA, three levels), Sv48 (48-bit VA, four levels), Sv57 (57-bit VA, five levels), and the draft Sv64 (64-bit VA, six levels). The walk depth determines the number of load instructions the handler must execute and the number of TLB entries that must be wired. Sv48 is the most commonly implemented mode in current RISC-V server processors (SiFive P670, SpacemiT X60). Sv57 is available on the XuanTie C910 and has been announced for several upcoming server designs. Sv64 remains in draft status as of 2024.

Each page table level for Sv39 and Sv48 uses a 512-entry array of 8-byte PTEs, identical in structure to the x86-64 format but with different bit assignments. The page table base address is computed from satp[43:0] (the PPN field) shifted left by 12 bits. The per-level index is extracted from the VA at the same 9-bit boundaries as x86-64, enabling direct comparisons of walk depth and latency.
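The address arithmetic a software handler performs at each step can be sketched as below. This is a Python illustration of the bit manipulation only, not handler code; it assumes the RV64 satp layout (MODE in bits 63:60, ASID in bits 59:44, PPN in bits 43:0), and the function names and the example satp value are hypothetical.

```python
# satp.MODE encodings on RV64 and the walk depth each implies.
SATP_MODES = {8: ("Sv39", 3), 9: ("Sv48", 4), 10: ("Sv57", 5)}


def decode_satp(satp: int):
    """Return (mode name, levels, ASID, root table physical address)."""
    mode = (satp >> 60) & 0xF
    asid = (satp >> 44) & 0xFFFF
    ppn = satp & ((1 << 44) - 1)
    name, levels = SATP_MODES[mode]
    return name, levels, asid, ppn << 12


def vpn_fields(va: int, levels: int):
    """Per-level 9-bit indices, root level first."""
    return [(va >> (12 + 9 * i)) & 0x1FF for i in reversed(range(levels))]


def pte_address(table_base: int, vpn: int) -> int:
    """Physical address of the 8-byte PTE selected by `vpn` in a table."""
    return table_base + vpn * 8


satp = (8 << 60) | (0x12 << 44) | 0x80200          # hypothetical Sv39 value
name, levels, asid, root = decode_satp(satp)
va = (1 << 30) | (2 << 21) | (3 << 12) | 0x10
print(name, levels, hex(asid), hex(root))
print(vpn_fields(va, levels), hex(pte_address(root, vpn_fields(va, levels)[0])))
```

A real handler would load the PTE at each computed address, check its valid and permission bits, and use its PPN field as the base of the next table until it reaches a leaf.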

Hardware walker extensions

Multiple RISC-V processor vendors have implemented optional hardware page table walkers. SiFive's U74 core (used in the HiFive Unmatched development board and RISC-V Linux reference platforms) implements a hardware walker for Sv39 and Sv48 with a two-level PWC. The walker reduces TLB miss latency from 100–500 cycles (software handler round-trip) to 20–200 cycles (hardware walk), bringing performance within a factor of two of equivalent ARM64 implementations. T-Head's XuanTie C910 (used in the Alibaba Xuantie series) implements hardware Sv39/Sv48/Sv57 walking with a similar two-level PWC structure.

The RISC-V H extension (hypervisor extension, ratified in version 1.0) adds support for two-stage address translation (VS-stage and G-stage) equivalent to ARM64's Stage-1/Stage-2 mechanism. Hardware walker implementations that support the H extension must perform nested walks using the same combinatorial approach as ARM64, incurring equivalent nested walk penalties for guest VA-to-HPA translation. The most demanding supported configuration pairs Sv57 VS-stage translation with Sv57x4 G-stage translation (both five levels deep; the x4 suffix denotes a guest physical address space widened by two bits, with a 16 KiB root table), requiring up to 5 × (5 + 1) + 5 = 35 memory accesses in the fully cold case.

Impact on OS design

The software-managed TLB has a profound influence on RISC-V OS design. Because the TLB miss handler executes as ordinary supervisor code, it is fully observable and modifiable: OS developers can instrument miss handlers, add custom caching layers, or implement hardware-accelerated walks in firmware without any silicon involvement. The Linux RISC-V port leverages this flexibility in its hugetlb implementation: the supervisor-mode miss handler explicitly checks for 2 MB and 1 GB huge page PTEs during the walk, enabling huge page support without any hardware changes to the TLB fill mechanism. This contrasts with x86-64 and ARM64, where huge page handling is implemented in hardware FSMs that cannot be altered without a silicon revision.

The tradeoff is portability of timing assumptions. Software miss handlers introduce variability that hardware FSM implementations eliminate: a cache miss or branch misprediction in the miss handler code path can increase handler latency by 50 to 200 cycles, a variance that is invisible to the hardware-walker model and must be accounted for in real-time system design on RISC-V platforms. SiFive's hardware walker extensions address this by providing deterministic, firmware-transparent walk latency for time-critical embedded applications while preserving the software handler as a fallback for exceptional cases.

Cost model and software-hardware tradeoff

The cost of a software-managed TLB miss on a RISC-V processor without a hardware walker includes: trap entry (save registers, switch to S-mode privilege: 20–40 cycles); satp read and root PPN extraction (2–4 cycles); per-level load and PTE validation (4–50 cycles per level when the PTE hits somewhere in the cache hierarchy); SFENCE.VMA and trap return (10–20 cycles). Total: 60–200 cycles for an L1D-hot walk, rising to 600–2,000 cycles for a walk where all PxEs miss the LLC. The software overhead relative to a hardware walk at equivalent cache state is 40–100 cycles, which is significant for high-miss-rate workloads but negligible for workloads where the TLB miss rate is below 0.001 misses per instruction.
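The itemised components can be combined into a small additive model, sketched below with the component ranges quoted above as inputs. The function and variable names are illustrative. The itemised ranges do not sum exactly to the totals quoted in the text, so both should be read as rough estimates rather than as a calibrated model.

```python
def software_miss_cost(levels: int, per_level: int, trap_entry: int,
                       satp_read: int, fence_and_return: int) -> int:
    """Additive estimate of one software-handled TLB miss, in cycles."""
    return trap_entry + satp_read + levels * per_level + fence_and_return


# Lower and upper bounds of the fixed (non-per-level) components from the text.
LOW = dict(trap_entry=20, satp_read=2, fence_and_return=10)
HIGH = dict(trap_entry=40, satp_read=4, fence_and_return=20)

for mode, levels in (("Sv39", 3), ("Sv48", 4), ("Sv57", 5)):
    best = software_miss_cost(levels, per_level=4, **LOW)     # every PTE L1D-hot
    worst = software_miss_cost(levels, per_level=50, **HIGH)  # slowest cached case
    print(f"{mode}: {best} to {worst} cycles with all PTEs cached")
```

The fixed components (trap entry, satp read, fence and return) contribute the same 32 to 64 cycles regardless of walk depth, which is the part of the cost that a hardware walker avoids entirely.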

17.10 Chapter Summary

The hardware page table walker sits at the intersection of performance, security, and ISA design — a circuit that was once an implementation detail but has become a first-order constraint on system scalability as memory capacities have outpaced TLB coverage by three orders of magnitude over the past two decades. This chapter has built a complete microarchitectural model of PTW behaviour, from the six pipeline stages of a cold x86-64 walk to the ISA-level design choices that determine whether a vulnerability like L1TF can be patched in software or requires a silicon revision.

The hardware page table walker is the circuit that converts a TLB miss into a TLB entry, and its microarchitectural properties determine the performance floor for any workload whose working set exceeds the TLB's coverage. This chapter has examined those properties in depth: from the pipeline stages of an individual walk through the concurrency limits imposed by MSHR capacity, to the security-critical speculative behaviour that produced the Foreshadow vulnerability class and its continuing influence on processor design.

The x86-64 PTW proceeds through six sequential stages (TLB miss detection, CR3 load, and four page table level fetches), with each stage issuing a cache-coherent memory read through the L1D/L2/LLC/DRAM hierarchy. Walk latency ranges from 20 cycles when all PxEs are L1D-resident to 720–1,000 cycles for a fully cold walk in which all four levels miss to DRAM. The page walk cache eliminates upper-level fetches for temporally local access patterns and is the single largest PTW performance lever available within the existing architecture: an L2 PWC hit reduces a four-level walk to a single PTE fetch, reducing worst-case walk latency by 75 percent. Machine learning training and large-scale graph workloads are the primary stress cases for PWC effectiveness, as their irregular access patterns exhaust the small PDPT-level PWC arrays.

Miss-Status Holding Registers set a hard ceiling on PTW throughput that no software optimisation can exceed. Intel's two-walker design has remained constant since Haswell (2013); AMD Zen 4 doubled this to four walkers, a deliberate server-market differentiation for large-memory workloads. Coalescing — the sharing of a single walk across multiple loads to the same 4 KB page — is an underappreciated feature that makes access patterns within a page effectively free from a TLB perspective, and has implications for data structure alignment decisions. Under MSHR-bound conditions, huge pages improve throughput through two independent mechanisms: reduced walk depth (fewer PxE fetches) and reduced MSHR occupancy time (faster walk completion frees the slot sooner).

Speculative page table walks are the microarchitectural feature that connects this chapter directly to security. Pre-2018 Intel processors speculatively issued L1D cache fills from the physical address encoded in not-present PTEs, enabling the L1 Terminal Fault (L1TF/Foreshadow) family of attacks. The critical design difference between x86-64 (microcode-assisted FSM) and ARM64 (pure hardware TTW) is not merely a performance consideration: it determines the patch surface. Intel can distribute L1TF mitigations through microcode updates; ARM implementors require silicon revision or software mitigation. RISC-V software-managed TLBs are structurally immune to this class of attack because the trap handler explicitly validates the PTE before acting on any physical address.

ARM64's Translation Table Walker introduces two additional complexities not present in x86-64: the ASID-tagged dual-TTBR organisation eliminates kernel page table context switches at the cost of a split virtual address space, and the two-stage translation mechanism for virtualisation (Stage 1 + Stage 2) multiplies walk depth dramatically in the cold case, making the Stage-2 TLB critical for virtualisation performance. RISC-V's extensible design allows software-managed TLBs for resource-constrained deployments and hardware walkers (SiFive U74, XuanTie C910) for server deployments, with the H extension providing hardware-accelerated two-stage translation for hypervisor workloads.

The three-ISA comparison crystallises a fundamental design tension in computer architecture. x86-64's microcode-assisted FSM prioritises flexibility and patchability: the L1TF mitigation and the SGX shadow paging mechanism required only microcode updates, delivered without silicon replacement to billions of deployed processors, while architectural additions such as PCID and LA57 arrived with new processor generations (Westmere and Ice Lake respectively). The cost of this flexibility is complexity: the FSM-plus-microcode architecture requires careful validation, and microcode bugs have produced security vulnerabilities in their own right (e.g., the Reptar vulnerability disclosed in November 2023, which involved incorrect handling of a redundant REX prefix on REP MOVSB in the instruction decode path). ARM64's pure hardware TTW prioritises determinism and energy efficiency: the absence of a microcode ROM eliminates a category of microcode-specific bugs and reduces silicon area, but forces security mitigations into either new silicon tapeouts or software workarounds. RISC-V's software-managed base design maximises composability at the cost of miss-handler latency variance, with hardware walker extensions providing an opt-in performance path that avoids mandating area costs on resource-constrained implementations.

For systems architects and performance engineers, the practical synthesis of this chapter is a decision framework for diagnosing and addressing PTW-related bottlenecks. The first diagnostic is measurement: DTLB_LOAD_MISSES.WALK_DURATION divided by DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK gives average walk latency in cycles, immediately revealing whether the bottleneck is MSHR saturation (high walk count, low per-walk latency) or LLC/DRAM miss rate (low walk count, high per-walk latency). The second diagnostic is PWC hit rate: a walk duration consistently at the PTE-only level (~40 cycles LLC-hit) indicates excellent PWC effectiveness; a duration near the full four-fetch latency indicates PWC thrashing. The third is NUMA locality: walk latencies above 400 cycles on a system with LLC-resident page tables almost always indicate cross-socket PTW. Each diagnostic points to a distinct intervention: MSHR saturation → huge pages and working set reduction; high per-walk latency → NUMA-local page table allocation and LLC pressure reduction; PWC thrashing → huge pages to reduce PDPTE-level pressure.

TLB consistency and shootdown overhead (the complementary side of TLB management, which determines the cost of page table modification operations rather than page table walk operations) represent a distinct and equally important half of the full TLB management picture. Chapter 18 builds directly on the speculative PTW analysis in §17.5 to examine Meltdown (U/S bypass), L1TF/Foreshadow (P=0 PTE speculation), Spectre v2 (BTB poisoning of PTW microcode branches), and MDS (fill buffer sharing) with full CVE coverage and a quantified mitigation cost model.
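The first diagnostic in the framework is a simple ratio of two counters, and the rough thresholds mentioned for the other two can be applied to the result. The sketch below assumes the counter values have already been collected (for example with a profiling tool); the function names, the classification labels, and the sample counter values are all hypothetical, and the 40-cycle and 400-cycle thresholds are the approximate figures from the text rather than universal constants.

```python
def average_walk_latency(walk_duration_cycles: int, walks: int) -> float:
    """WALK_DURATION / MISS_CAUSES_A_WALK, in cycles per walk."""
    if walks == 0:
        return 0.0
    return walk_duration_cycles / walks


def classify_walks(avg_cycles: float,
                   pte_only_cycles: float = 40.0,
                   numa_suspect_cycles: float = 400.0) -> str:
    """Map an average walk latency onto the diagnostic categories in the text."""
    if avg_cycles <= pte_only_cycles:
        return "PWC effective: walks are near PTE-only latency"
    if avg_cycles > numa_suspect_cycles:
        return "high per-walk latency: check NUMA locality and LLC pressure"
    return "intermediate: check PWC hit rate and huge page coverage"


# Hypothetical counter readings over one measurement interval.
walk_duration = 9_000_000   # DTLB_LOAD_MISSES.WALK_DURATION
walks = 20_000              # DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK

avg = average_walk_latency(walk_duration, walks)
print(f"average walk latency: {avg:.0f} cycles -> {classify_walks(avg)}")
```

The walk count itself should be read alongside the average: a low average with a very high walk count points at walker saturation rather than at slow individual walks.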

The Foreshadow and broader microarchitectural data sampling (MDS) vulnerability class — including RIDL, Fallout, and ZombieLoad — extend the attack surface explored in Section 17.5 beyond the PTW to the speculative execution properties of the load buffer, store buffer, and fill buffer, and continue to inform the design of trusted execution environments across all major processor families.

References

Barr, T. W., Cox, A. L., and Rixner, S. (2010). Translation caching: Skip, don't walk (the page table). Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), 48–59. https://doi.org/10.1145/1815961.1815970
Basu, A., Gandhi, J., Chang, J., Hill, M. D., and Swift, M. M. (2013). Efficient virtual memory for big memory servers. Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), 237–248. https://doi.org/10.1145/2485922.2485949
Van Bulck, J., Minkin, M., Weisse, O., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Wenisch, T. F., Yarom, Y., and Strackx, R. (2018). Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution. Proceedings of the 27th USENIX Security Symposium, 991–1008.
Intel Corporation. (2022). Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. Chapter 4: Paging. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Arm Limited. (2023). ARM Architecture Reference Manual for A-profile Architecture (DDI 0487J.a). Chapter D8: The AArch64 Virtual Memory System Architecture. https://developer.arm.com/documentation/ddi0487/
RISC-V International. (2021). The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Version 1.12. Chapter 4: Supervisor-Level ISA. https://github.com/riscv/riscv-isa-manual
Gandhi, J., Hill, M. D., and Swift, M. M. (2016). Agile paging: Exceeding the best of nested and shadow paging. Proceedings of the 43rd Annual International Symposium on Computer Architecture (ISCA), 707–718. https://doi.org/10.1109/ISCA.2016.67
Gandhi, J., Basu, A., Hill, M. D., and Swift, M. M. (2014). Efficient memory virtualization: Reducing dimensionality of nested page walks. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 442–453. https://doi.org/10.1109/MICRO.2014.37
Karakostas, V., Gandhi, J., Ayar, F., Cristal, A., Hill, M. D., McKinley, K. S., Nemirovsky, M., Swift, M. M., and Ünsal, O. (2015). Redundant memory mappings for fast access to large memories. Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), 66–78. https://doi.org/10.1145/2749469.2750471
Bhatt, D., Rajwar, R., Ganesan, N., and Marathe, A. (2020). TLB-Shootdown overhead reduction using hardware-assisted invalidation. IEEE Transactions on Computers, 69(3), 374–387. https://doi.org/10.1109/TC.2019.2951282
Weisse, O., Van Bulck, J., Minkin, M., Genkin, D., Kasikci, B., Piessens, F., Silberstein, M., Strackx, R., Wenisch, T. F., and Yarom, Y. (2018). Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. Technical Report. https://foreshadowattack.eu/foreshadow-NG.pdf
SiFive Inc. (2021). SiFive U74 Core Complex Manual. Chapter 7: Memory Management Unit. https://www.sifive.com/documentation
Yaniv, I., and Tsafrir, D. (2016). Hash, don't cache (the page table). Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, 337–350. https://doi.org/10.1145/2901318.2901333