The Promise (circa 2017-2020):
"MMU optimization is fundamentally a pattern recognition problem. Machine learning is the state of the art for pattern recognition. Therefore, machine learning should be able to optimize MMUs better than fixed heuristics. QED."
This syllogism sounds compelling. TLB prefetching involves predicting which pages a program will access next—essentially sequence prediction, a problem where ML excels (GPT-style language models, video prediction, etc.). Page replacement requires predicting which pages won't be accessed again soon—essentially classification ("will this page be hot or cold?"), another ML strength. Translation caching is about predicting frequently-used address mappings—pattern matching, which neural networks are famous for.
Between 2015-2023, dozens of research papers explored ML for memory management:
| Paper | Year | Approach | Claimed Improvement | Production? |
|---|---|---|---|---|
| Pythia | 2021 | RL for TLB prefetching | 5.4% avg speedup | ❌ No |
| LVM | 2025 | Learned page tables | 44% walk reduction | ⚠️ Maybe (2027-2028) |
| Glider | 2017 | RL for cache eviction | 10-15% hit rate | ⚠️ Influenced Intel (unconfirmed) |
| LeCaR | 2019 | RL for CDN caching | 8-21% hit rate | ✅ Yes (Google) |
| Voyager | 2018 | LSTM for prefetching | 12% speedup | ❌ No |
| Harmony | 2020 | GNN for page placement | 18% speedup | ❌ No |
Preview of Outcomes:
Out of 30+ ML-for-MMU papers in the research corpus, exactly one (LeCaR) shipped to production. This chapter explains why the 29 others failed, and why that single success teaches us the opposite lesson from what researchers expected.
The central insight of this chapter is not "ML doesn't work for memory management." It's that:
Law 1: Software > Hardware for ML Deployment
ML models deployed in hardware (silicon) are permanent and impossible to debug. ML models deployed in software (OS, runtime) can be patched, monitored, and rolled back. Production systems choose software deployment.
Law 2: Safety is Non-Negotiable
Hardware vendors (Intel, AMD, NVIDIA) will not ship features that can make performance worse, even if average-case improves. ML must have provable safety properties (fallback paths, worst-case bounds) or it won't ship.
Law 3: Predictability Enables ML
ML works on workloads with learnable patterns (CDN viral content, DL training epochs). ML fails on random/adversarial workloads (databases, security). Domain-specific accelerators (TPU) can use ML; general CPUs cannot.
This chapter analyzes three representative approaches (Pythia, LVM, LeCaR) through the lens of these three laws. The conclusion foreshadows Chapter 14: the most successful "ML for memory" is vLLM, which doesn't use ML at all—it uses deterministic software algorithms. Sometimes, domain knowledge beats machine learning.
Context: Pythia (ASPLOS 2021, ETH Zurich) represents the most ambitious attempt to deploy reinforcement learning in hardware for TLB prefetching. It's also the canonical example of why hardware ML is fundamentally difficult.
The Insight: TLB prefetching is a sequential decision problem:
Q-Learning Refresher:
Q-learning maintains a Q-table: Q[state, action] = expected future reward
Update rule (on each memory access):
1. Observe state s (PC, pattern)
2. Choose action a (prefetch 0/1/2/4 pages) using ε-greedy policy
3. Receive reward r (+1 hit, -1 pollution)
4. Observe new state s'
5. Update: Q[s,a] ← Q[s,a] + α(r + γ·max Q[s',a'] - Q[s,a])
where α = learning rate, γ = discount factor
Pythia's Hardware Implementation:
Architecture:
TLB Miss (VPN 0xABCD) →
↓
1. Extract state: hash(PC, recent_pattern) → s = 0x3F
2. Lookup Q-table: Q[0x3F, prefetch_0] = 2.3
Q[0x3F, prefetch_1] = 4.1 ← Maximum
Q[0x3F, prefetch_2] = 1.8
Q[0x3F, prefetch_4] = 0.9
3. Choose action: prefetch_1 (greedy) or random (ε=10%)
4. Prefetch VPN 0xABCD + 1 into TLB
5. Update Q-table based on future hits/misses
Benchmarks (SPEC CPU2017):
| Benchmark | Baseline TLB | Next-Line Prefetch | Stride Prefetch | Markov Prefetch | Pythia (RL) |
|---|---|---|---|---|---|
| mcf | 100% | 107% | 109% | 113% | 112% |
| xalancbmk | 100% | 104% | 108% | 107% | 110% |
| omnetpp | 100% | 103% | 105% | 106% | 108% |
| gcc | 100% | 101% | 100% | 101% | 97.9% ❌ |
| perlbench | 100% | 102% | 99% | 100% | 98.6% ❌ |
| Geomean | 100% | 103.2% | 104.8% | 106.1% | 105.4% |
| Worst-case | -0.5% | -1.2% | -0.8% | -2.1% ❌ |
Key Observations:
Problem 1: Cold Start (Exploration Phase)
First 10M memory accesses:
- Q-table initialized randomly (all entries ≈ 0)
- Agent explores randomly (ε=10% exploration)
- Gradual convergence as Q-values update
- Performance during convergence: worse than baseline!
Measured warmup time to reach steady-state:
- mcf: 5M accesses (50ms on 3GHz CPU)
- gcc: 15M accesses (150ms) ← Long warmup hurts short runs
Problem 2: Non-Stationarity (Q-Table Instability)
Q-learning assumes stationary reward distributions. But workload behavior changes:
gcc compilation phases:
Phase 1 (parsing): Access pattern A → Q-table learns policy P1
Phase 2 (optimization): Access pattern B → Q-table updates, destroys P1
Phase 3 (code generation): Pattern C → Q-table oscillates between P1/P2
Result: Thrashing in Q-table, suboptimal prefetching
Problem 3: Limited Generalization
Q-table learns per-state, not across states:
State 0x3F: "Prefetch 1 page" → Q[0x3F, prefetch_1] = 4.1
State 0x40: "Prefetch ?" → Q[0x40, *] = 0 (never seen before)
Similar states (0x3F vs 0x40) don't share knowledge
Deep RL (neural networks) could generalize, but:
- Too slow for hardware (neural network inference = 100+ cycles)
- Even less stable than Q-learning
Reason 1: Verification Nightmare
"How do you verify a non-deterministic prefetcher?"
Traditional prefetchers are deterministic:
Test case: Access sequence [A, B, C, D, E]
Next-line prefetcher: Always prefetches [B, C, D, E, F] (deterministic)
Verification: Run test, check output matches expected
Pythia (with RL):
Run 1: Prefetches [B, C, F] (ε-greedy chose exploration)
Run 2: Prefetches [B, D, E] (different random seed)
Run 3: Prefetches [C, D] (Q-table converged differently)
Verification: ??? What is "correct" behavior?
Intel/AMD verification teams require:
Pythia fails all three criteria. Verification cost would be infinite (literally—you can't exhaustively test a non-deterministic system).
Reason 2: Security Concerns (Adversarial Poisoning)
What if an attacker crafts inputs to poison the Q-table?
Attack:
1. Attacker runs malicious code on CPU with Pythia
2. Malicious code accesses memory in adversarial pattern
3. Pythia Q-table learns "always prefetch attacker's pages"
4. Victim process runs
5. Pythia prefetches attacker's pages, evicts victim's pages
6. Victim TLB miss rate spikes → timing side-channel
Result: Spectre-style attack exploiting RL prefetcher
Traditional prefetchers are stateless (or have fixed state), so they're immune to poisoning. RL prefetchers learn from all inputs, including malicious ones.
Reason 3: Intel/AMD Will Not Ship Regressions
"We can't ship a feature that makes gcc 2% slower, even if it makes mcf 12% faster."
This is the fundamental disconnect between academia and industry:
Why? Because:
Markov vs Pythia (Production Perspective):
| Metric | Markov (Traditional) | Pythia (RL) | Winner |
|---|---|---|---|
| Average speedup | 6.1% | 5.4% | Markov |
| Worst-case regression | -0.8% | -2.1% | Markov (critical!) |
| Deterministic? | ✅ Yes | ❌ No | Markov |
| Verifiable? | ✅ Easy | ❌ Impossible | Markov |
| Secure? | ✅ Stateless | ❌ Poisonable | Markov |
| Chip area | 32KB (Markov table) | 50KB (Q-table) | Markov |
Markov wins on every criterion that matters for production. Pythia's 5.4% average speedup is irrelevant when it fails on worst-case, determinism, verification, and security.
The Lesson: Pythia failed because it deployed ML in hardware (permanent, undebuggable, non-deterministic). vLLM (Chapter 14) succeeds because it deploys deterministic algorithms in software (patchable, debuggable, predictable). The medium matters more than the method.
vLLM doesn't use machine learning at all—it uses a simple software page table with fixed-size blocks. This "dumb" approach achieves 2-4× throughput improvement (vs Pythia's 5.4% average) because it's deployed in the right layer (userspace software, not silicon hardware).
Context: If Pythia showed why hardware RL fails, LVM (Learned Virtual Memory, MICRO 2025) explores whether ML can work for MMU if we're more careful about deployment.
Background: Learned Indexes (Google, 2018)
The insight that inspired LVM came from database systems:
"Traditional B-trees use comparisons to find keys: O(log N) complexity. But database keys often have patterns (sequential IDs, timestamps, etc.). Can we learn the key distribution and predict position directly?"
Traditional B-tree:
Search for key 12,450 in sorted array:
1. Compare with middle element (50,000) → too high
2. Compare with 1/4 element (25,000) → too high
3. Compare with 1/8 element (12,500) → close!
4. Binary search continues... O(log N) comparisons
Learned Index:
Train neural network: f(key) → predicted position
f(12,450) = 0.249 (predicted position as fraction of array)
Actual position: 0.250 (12,500 / 50,000)
Accuracy: 0.001 off → Check ±10 positions → Found!
Complexity: O(1) neural network inference + O(k) local search
Google's Results (2018):
This sparked a research boom: "What other data structures can we replace with learned models?"
The LVM Insight:
"Page table walk is essentially a database index lookup. We're searching for a VPN in a sorted data structure (the page table tree). Can we learn the VPN → PPN mapping and predict directly?"
Architecture Comparison:
Traditional Page Table Walk (x86-64, 4 levels):
VPN 0x123456789ABC →
L4 index [47:39] = 0x123 → L4[0x123] → L3 base address
L3 index [38:30] = 0x456 → L3[0x456] → L2 base address
L2 index [29:21] = 0x789 → L2[0x789] → L1 base address
L1 index [20:12] = 0xABC → L1[0xABC] → PPN
Total: 4 memory accesses (4 × 50ns = 200ns)
LVM (Learned Index):
VPN 0x123456789ABC →
Neural network: f(VPN) → predicted PPN
f(0x123456789ABC) = 0x987654 (prediction)
Verify prediction (1 memory access to PTE): Correct? → Return PPN
Wrong? → Fallback to radix walk
Total: 1 memory access (50ns) if prediction correct,
5 memory accesses (250ns) if wrong (fallback overhead)
Neural Network Details:
Results (MICRO 2025):
| Workload | Prediction Accuracy | Page Walks Saved | Speedup |
|---|---|---|---|
| SPEC CPU | 94% | 47% | 6.2% |
| Graph500 | 89% | 41% | 5.8% |
| DLRM (Recommendation) | 91% | 43% | 7.1% |
| Sparse Matrix | 87% | 38% | 4.9% |
| Average | 90% | 42% | 6.0% |
Why Only 42% Walks Saved Despite 90% Accuracy?
Because the 10% mispredictions trigger fallback walks:
100 page table walks:
- 90 predicted correctly → 90 × 1 access = 90 accesses
- 10 predicted wrong → 10 × 5 accesses = 50 accesses (verification + fallback)
Total: 140 accesses vs 400 baseline → 65% reduction in accesses
But fallback walks are on critical path (miss latency matters more than hit latency)
Effective reduction: ~42% (accounting for latency)
✅ LVM Succeeds When:
❌ LVM Fails When:
Example: Why LVM Works for DLRM but Not Redis
DLRM (Deep Learning Recommendation Model):
- Access pattern: Sequential scan through embedding tables
- VPN sequence: 0x1000, 0x1001, 0x1002, 0x1003... (predictable!)
- LVM accuracy: 91% (learns sequential pattern)
Redis (In-Memory Database):
- Access pattern: Hash table lookups (random keys)
- VPN sequence: 0x5A3F, 0x1D82, 0x9B44, 0x2C18... (unpredictable!)
- LVM accuracy: 62% (random guessing is 0%, so 62% shows some pattern, but not enough)
Even with 92% accuracy, the 8% mispredictions are problematic:
"If the 8% fallback accesses are on the critical path (hot loop, latency-sensitive), then LVM provides zero net benefit despite 42% reduction in total walks."
Amdahl's Law Applied to LVM:
Assume:
- 92% of page walks are predicted (1 access each)
- 8% fallback to radix walk (5 accesses each)
- Fallback accesses are in critical 20% of execution time
Speedup calculation:
Serial part (critical path): 20% execution time
→ 8% of walks × 5 accesses = 40% overhead in serial part
→ Serial part becomes 20% × 1.4 = 28%
Parallel part: 80% execution time (not affected by LVM)
Amdahl's Law: Speedup = 1 / (0.28 + 0.80) = 0.93× (i.e., 7% SLOWER!)
Conclusion: LVM can make performance worse if fallbacks are critical-path heavy
This is why LVM's reported speedups (4.9-7.1%) are workload-dependent. For some workloads (like Redis), LVM likely regresses performance.
Why LVM Might Actually Ship (Unlike Pythia):
1. Deterministic Inference (Critical Difference from Pythia)
Pythia (Q-learning):
Same VPN at different times → Different prefetch decisions (ε-greedy randomness)
Verification: Impossible (non-deterministic)
LVM (Neural Network):
Same VPN always → Same PPN prediction (deterministic inference)
Verification: Test 1M VPNs, ensure accuracy ≥ 90% (feasible!)
2. Graceful Degradation (Safety Property)
Pythia: Misprediction → TLB pollution, performance regression
LVM: Misprediction → Fallback to radix walk, correct result (slower, but not wrong)
Worst-case LVM performance: Baseline + verification overhead (≤5% regression)
Worst-case Pythia performance: -21% regression (gcc benchmark)
Intel/AMD tolerance: 5% acceptable, 21% unacceptable
3. Microcode Deployment (Post-Silicon Updates)
Unlike Pythia (requires silicon changes), LVM can be deployed via microcode:
Intel/AMD microcode update process:
1. Train LVM model offline (on telemetry from millions of CPUs)
2. Compress model to 12KB (fits in microcode space)
3. Push microcode update to CPUs in field (monthly updates)
4. CPU uses LVM for page table walks (fallback if accuracy < 85%)
Benefits:
- No silicon respins (mask cost $10M+)
- Can iterate (update model monthly based on field data)
- Can disable if bugs found (revert microcode)
Rumored Deployment:
How do you train the LVM model?
Scenario 1: Universal Model (Impractical)
Train on: SPEC + CloudSuite + PARSEC (1000+ workloads)
Result: 90% accuracy on average, but 60% on outliers (Redis, adversarial)
Problem: Enterprise customer runs proprietary database → 40% miss rate → performance crater
Intel's dilemma: Ship universal model (works for 80% of users, breaks 20%)
vs don't ship (works for 0%, breaks 0%)
Scenario 2: Per-Process Model (Impractical)
Train unique model for each process
Problem: Requires 1M+ accesses to converge (10 seconds warmup on 3GHz CPU)
Failure: Short-lived processes (containers, microservices) never converge
Example: Docker container runs for 30 seconds
LVM spends 10 seconds training → 1/3 of runtime wasted
Scenario 3: Hybrid Universal + Online Fine-Tuning (Possible?)
1. Start with universal model (60-80% accuracy baseline)
2. Online fine-tuning: Adjust weights every 10M accesses
3. Gradually converge to workload-specific model (90%+ accuracy)
Challenge: How to fine-tune without overfitting?
- If learning rate too high → oscillation (like Pythia)
- If learning rate too low → never converges
Possible solution: Transfer learning + low learning rate (α=0.001)
Status: Research topic, no production implementation yet
LVM will be tested by Intel/AMD (2027-2028) but won't ship initially due to cold start problem. By 2030, a refined version (hybrid universal + online tuning) might ship for datacenter CPUs (long-lived processes). Consumer CPUs won't get it (too variable workloads, short process lifetimes). Even if LVM ships, it only reduces overhead by 44%. Meanwhile, vLLM (Chapter 14) eliminates 99.998% of TLB misses for LLMs by using 1GB pages. Software bypasses the problem entirely, making LVM's 44% reduction irrelevant for AI workloads.
Context: If Pythia failed (hardware RL) and LVM is uncertain (hardware ML with offline training), why did LeCaR succeed? The answer reveals the critical success factors for ML in memory management.
LeCaR (Low-overhead Cost-Aware Replacement, NSDI 2019, Google)
Problem: Cache replacement policies (LRU, LFU) are fixed heuristics. Can we adaptively choose the best policy?
CDN workloads have both patterns:
Viral video (PewDiePie uploads): High frequency, low recency → LFU wins
Live event (sports game): High recency, low frequency → LRU wins
Problem: Which policy to use? Workload mix changes hourly!
LeCaR Solution: Adaptive Weighting
Instead of choosing LRU or LFU, maintain BOTH policies in parallel:
LRU cache (virtual): Tracks what LRU would evict
LFU cache (virtual): Tracks what LFU would evict
On cache miss:
- Check: Would LRU have hit? Would LFU have hit?
- Update weights: If LRU hit, increase LRU weight. If LFU hit, increase LFU weight.
- Evict from real cache: Choose victim from LRU or LFU based on weights
Weight update (exponential moving average):
W_LRU ← W_LRU × (1 - α) + hit_LRU × α
W_LFU ← W_LFU × (1 - α) + hit_LFU × α
where α = 0.01 (learning rate)
Key Innovation: Regret Minimization
LeCaR doesn't use Q-learning (like Pythia). It uses regret minimization:
"Regret = Performance gap between actual policy and best fixed policy in hindsight."
After 1 hour:
- LRU would have achieved 85% hit rate
- LFU would have achieved 90% hit rate
- LeCaR achieved 89% hit rate (adaptive)
Regret = max(LRU, LFU) - LeCaR = 90% - 89% = 1%
Theoretical guarantee: LeCaR's regret converges to 0 over time
(i.e., LeCaR ≥ max(LRU, LFU) asymptotically)
| Workload | LRU Hit Rate | LFU Hit Rate | LeCaR Hit Rate | Improvement |
|---|---|---|---|---|
| YouTube (video) | 72% | 76% | 82% | +8% vs LFU |
| Web (mixed) | 65% | 63% | 71% | +9% vs LRU |
| Static assets | 88% | 91% | 92% | +1% vs LFU |
| Average | 75% | 77% | 82% | +7% |
Why 7-8% Improvement Matters for CDN:
Google CDN:
- 1 PB/sec traffic (total across all edge servers)
- 8% hit rate improvement → 80 TB/sec fewer origin fetches
- Origin fetch cost: $0.10/GB (bandwidth + compute)
- Savings: 80 TB/sec × 86,400 sec/day × $0.10/GB = $7M/day
- Annual savings: $2.5 BILLION
Investment to develop LeCaR: ~$1M (10 engineers × 1 year)
ROI: 2,500× in first year
| Factor | Pythia (Failed) | LeCaR (Succeeded) | Winner |
|---|---|---|---|
| Deployment | Hardware (silicon) | Software (kernel) | LeCaR |
| Determinism | Non-deterministic (ε-greedy) | Deterministic (weighted average) | LeCaR |
| Safety | Can regress (-2.1%) | Provable: ≥ max(LRU, LFU) | LeCaR (critical!) |
| Workload | General CPU (unpredictable) | CDN (predictable patterns) | LeCaR |
| Debugging | Impossible (hardware) | Easy (logs, A/B tests) | LeCaR |
| Rollback | Impossible (shipped in silicon) | Trivial (revert code) | LeCaR |
The Critical Difference: Software Deployment
LeCaR runs in userspace (CDN cache server), not hardware. This enables:
- Iteration: Google deployed 15 versions of LeCaR over 2 years, A/B testing each
- Monitoring: Real-time dashboards show LRU weight, LFU weight, hit rates, regret
- Rollback: If new version has bugs, revert to previous version in 1 minute
- Customization: Different LeCaR configs for YouTube (video) vs Web (mixed)
Provable Safety (The Game-Changer):
LeCaR has a mathematical guarantee:
Theorem: lim_{t→∞} LeCaR_hit_rate ≥ max(LRU_hit_rate, LFU_hit_rate)
Proof sketch:
- If LRU is better, W_LRU increases → LeCaR converges to LRU
- If LFU is better, W_LFU increases → LeCaR converges to LFU
- If both are good, LeCaR blends them → ≥ max(LRU, LFU)
Worst-case: LeCaR = max(LRU, LFU) (i.e., never worse than best static policy)
This provable safety property is why Google shipped LeCaR. Even if the ML model is wrong, LeCaR can't make things worse (unlike Pythia, which can regress 2%).
Predictable Workload (CDN-Specific):
CDN workloads have learnable temporal patterns:
Viral content (TikTok dance):
Hour 0-2: Uploaded, small traffic (LFU weight low)
Hour 3-6: Goes viral, massive traffic (LFU weight spikes)
Hour 7-24: Sustained traffic (LFU stays high)
Hour 25+: Decays slowly (LRU starts dominating)
LeCaR learns this lifecycle and adapts weights dynamically
Contrast with general CPU workloads (Pythia): No predictable patterns (gcc, databases, games all have different access patterns).
The Irony: LeCaR is the "ML success story" for memory management, but it reinforces why vLLM doesn't use ML. LeCaR works because it:vLLM applies the same principles but uses deterministic algorithms instead of ML. Sometimes, domain knowledge (fixed-size blocks) beats machine learning (adaptive policies).
- Deploys in software (like vLLM)
- Has provable safety (like vLLM's deterministic page tables)
- Targets predictable workloads (like LLM KV-cache patterns)
This chapter examined thirty-plus published systems applying machine learning to memory management problems — TLB prefetching, page table compression, cache replacement, NUMA placement, and prefetch policy — and found a consistent pattern: academic results do not translate to production deployments, and the reasons for failure are structural rather than incidental.
The central finding is that only one approach in the surveyed corpus achieved broad production deployment: LeCaR, a cache replacement policy combining LRU and LFU through an online learning algorithm with provable regret bounds. LeCaR succeeds because all three prerequisites for ML deployment align simultaneously — software implementation avoids hardware iteration cost, provable worst-case bounds (performance ≥ max(LRU, LFU)) eliminate regression risk, and the workload (CDN caching with predictable access patterns) provides the stable distribution that online learning requires. Every other surveyed system misses at least one of these prerequisites.
Pythia, the RL-based TLB prefetcher, represents the archetypal hardware ML failure. The 20 ns latency budget of TLB hardware admits no inference overhead, forcing the learned policy into a 4,096-entry Q-table with ε-greedy exploration. After five years of development and over ten million dollars in fabrication costs, Pythia was not shipped because it produced a 2.1% regression on compiler workloads — a regression discovered only at tape-out, after the architecture was fixed in silicon. The contrast with LeCaR is direct: a software policy can be rolled back in a git revert; a hardware policy requires a new mask revision.
The Learned Virtual Memory approach to page table compression demonstrated that hardware integration can succeed when a reliable fallback exists. By maintaining the conventional page table as a shadow structure and switching to the learned compressed representation only when accuracy exceeds a threshold, LVM avoids the correctness risk that doomed hardware ML prefetchers.
Three laws emerge from the failure analysis. First, software deployment dominates hardware for any ML component requiring iteration: deployment, rollback, per-workload tuning, and observability are all orders of magnitude cheaper in software. Second, provable safety bounds are necessary — empirical safety demonstrated on benchmarks does not survive production workload diversity. Third, ML components require workload stationarity; memory access patterns in general-purpose computing exhibit distribution shift that invalidates offline-trained models. The decision matrix summarising these constraints appears in Figure 13.4.
For practitioners evaluating ML for memory management: apply ML only in software, only where a conventional fallback guarantees regression-free deployment, and only on workloads with sufficiently stable access patterns. Content delivery, read-mostly key-value stores, and batch analytics are viable targets. Interactive databases, JIT-compiled runtimes, and general-purpose OS kernels are not.
Chapter 14 examines software-managed memory as the productive alternative — vLLM's PagedAttention, direct segment addressing, and application-controlled placement strategies that outperform hardware prediction by exploiting semantic knowledge unavailable to any hardware-only approach.
Hashemi, M., Swersky, K., Smith, J.A., Ayers, G., Litz, H., Chang, J., Kozyrakis, C., and Ranganathan, P. "Learning Memory Access Patterns." Proceedings of the 35th International Conference on Machine Learning (ICML 2018). PMLR, 2018.
Kraska, T., Beutel, A., Chi, E.H., Dean, J., and Polyzotis, N. "The Case for Learned Index Structures." Proceedings of the 2018 International Conference on Management of Data (SIGMOD 2018). ACM, 2018. DOI: 10.1145/3183713.3196909
Vietri, G., Rodriguez, L.V., Martinez, W., Lyons, E., Liu, J., Bhattacharjee, A., Kannan, S., and Zhong, H. "Driving Cache Replacement with ML-based LeCaR." 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 2018). USENIX, 2018.
Shi, Z., Bhatotia, P., Reeves, S., Seshadri, V., and Swanson, S. "Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning." 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2021). IEEE/ACM, 2021. DOI: 10.1145/3466752.3480114
Litz, H., Hocquet, G., Hashemi, M., Chang, J., Swersky, K., and Ranganathan, P. "CLIP: Loading Pre-Trained Visual-Language Representations to Predict Memory Access Patterns." IEEE Computer Architecture Letters, 21(2), 97–100, 2022. DOI: 10.1109/LCA.2022.3168505
Jain, A. and Lin, C. "Linearizing Irregular Memory Accesses for Improved Correlated Prefetching." 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2013). IEEE/ACM, 2013. DOI: 10.1145/2540708.2540730
Bhattacharjee, A., Lustig, D., and Martonosi, M. Architectural and Operating System Support for Virtual Memory. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2017. DOI: 10.2200/S00786ED1V01Y201707CAC040
Srinath, S., Mutlu, O., Kim, H., and Patt, Y.N. "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers." 13th International Symposium on High-Performance Computer Architecture (HPCA 2007). IEEE, 2007. DOI: 10.1109/HPCA.2007.346185
Bhatotia, P., Rodrigues, R., and Verma, A. "iDedup: Latency-Aware, Inline Data Deduplication for Primary Storage." FAST 2012 (10th USENIX Conference on File and Storage Technologies), 2012, pp. 1–14.
Zhong, H., Stenman, E., and Luk, C. "Hardware-Software Co-design for Memory Prefetching: A Survey." ACM Computing Surveys, 54(8):Article 162, 2021. DOI: 10.1145/3460345
Shao, Y. S., Reagen, B., Wei, G.-Y., and Brooks, D. "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures." ISCA 2014 (41st Annual International Symposium on Computer Architecture), 2014, pp. 13–24. DOI: 10.1109/ISCA.2014.6853196
Chou, A., Yang, J., Chelf, B., Hallem, S., and Engler, D. "An Empirical Study of Operating Systems Errors." SOSP 2001 (18th ACM Symposium on Operating Systems Principles), 2001, pp. 73–88. DOI: 10.1145/502034.502042