The Swapping Cliff: Mitigating Latency Spikes in vLLM High-Concurrency Workloads
Introduction: The Utilization Trap
In production ML infrastructure, high GPU utilization is often viewed as the primary success metric. If your H100s are running at 95% utilization, the assumption is that you are maximizing ROI. However, in the context of vLLM and PagedAttention, maximizing memory utilization without understanding the underlying eviction mechanics invites a catastrophic failure mode known as the Swapping Cliff.
The situation is deceptive. You deploy a Llama 3 70B model using vLLM, tuning for maximum concurrency. Throughput looks stable. Then, a specific mix of long-context requests hits the server. Suddenly, P99 latency explodes from 200ms to 8 seconds. Throughput flatlines. The GPUs are still pegged at 100%, but no tokens are being generated.
This is not a network issue; it is a scheduling collapse.
In vLLM V1 architecture, the mechanism for handling memory pressure has fundamentally shifted from Swapping to Recomputation. Understanding this shift—and the “Cliff” it creates—is mandatory for any architect building high-concurrency inference systems. This article dissects the physics of vLLM preemption, the fallacy of block size tuning, and the architectural patterns required to survive the “Death Spiral.”
1. The Anatomy of a Stall: Recompute vs. Swap
Legacy serving engines handled memory exhaustion by offloading Key-Value (KV) cache blocks from GPU VRAM to CPU RAM via PCIe (Swapping). While logical, this approach introduced significant latency variance due to PCIe bandwidth constraints and memory pinning overhead.
vLLM V1 takes a more aggressive approach. For standard autoregressive decoding, swapping is effectively deprecated. The default behavior is governed by PreemptionMode.RECOMPUTE.
The Preemption Mechanics
When the vLLM block manager exhausts available num_gpu_blocks, it cannot allocate space for the next token in the running batch. To resolve this deadlock, the scheduler must free up space.
Under PreemptionMode.RECOMPUTE, vLLM does not gracefully move data to the CPU. Instead, it performs a hard eviction:
- Drop: The KV blocks for the lowest-priority request(s) are deleted from VRAM.
- Queue: The request is moved back to the waiting queue.
- Stall: The request sits idle until sufficient blocks become available.
- Recompute: Once rescheduled, the engine must perform a fresh Prefill phase on the entire sequence (prompt + all tokens generated prior to eviction).
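The drop/queue/recompute flow can be sketched in a few lines. This is an illustrative toy, not vLLM's actual scheduler code; the class and field names are invented for the example:

```python
from collections import deque

class Req:
    """Toy request tracking its prompt and generated token counts."""
    def __init__(self, rid, prompt_len):
        self.rid = rid
        self.prompt_len = prompt_len
        self.generated = 0   # tokens decoded so far
        self.blocks = 0      # KV blocks currently held on GPU

class ToyBlockManager:
    """Illustrative recompute-style preemption; not vLLM internals."""
    def __init__(self, num_gpu_blocks, block_size=16):
        self.free = num_gpu_blocks
        self.block_size = block_size
        self.waiting = deque()
        self.recompute_tokens = 0   # prefill work that must be redone

    def blocks_for(self, req):
        # ceil((prompt + generated) / block_size)
        toks = req.prompt_len + req.generated
        return (toks + self.block_size - 1) // self.block_size

    def preempt(self, req):
        # Drop: KV blocks are deleted from VRAM, not moved to CPU.
        self.free += req.blocks
        req.blocks = 0
        # Queue: the request goes back to the waiting queue.
        self.waiting.append(req)
        # Recompute (later): the entire sequence is prefilled again.
        self.recompute_tokens += req.prompt_len + req.generated
```

Note that `recompute_tokens` counts the full sequence, not just the prompt: every token generated before eviction must pass through prefill again.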
The Latency Penalty
The “Cliff” is defined by the Time-To-First-Token (TTFT) of the accumulated sequence length.
If a request has generated 2,048 tokens before eviction, the latency penalty for the next token is not the 15ms decode time; it is the ~250ms required to re-process the prompt and those 2,048 tokens through the FlashAttention kernels.
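A back-of-envelope way to size that penalty. The prefill rate below is an assumed H100-class figure, not a vLLM constant; measure your own deployment before relying on it:

```python
def recompute_penalty(prompt_tokens, generated_tokens,
                      prefill_tok_per_s=80_000, decode_ms=15.0):
    """Estimate the post-eviction TTFT hit vs. a normal decode step.

    Returns (prefill_ms, slowdown_factor). prefill_tok_per_s is an
    assumed aggregate prefill throughput.
    """
    seq_len = prompt_tokens + generated_tokens
    prefill_ms = seq_len / prefill_tok_per_s * 1000.0
    return prefill_ms, prefill_ms / decode_ms
```

The slowdown factor grows linearly with accumulated sequence length, which is why long-running requests are the ones that fall off the cliff hardest.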
Why Recompute? The PCIe Bottleneck
Why would vLLM choose to destroy data rather than move it? The decision is driven by the disparity between compute speed and interconnect bandwidth on modern hardware (H100/A100).
PCIe Latency Floor: For small block sizes (16 or 32 tokens), the OS overhead of memory management, page pinning, and PCIe transaction setup often exceeds the execution time of a FlashAttention forward pass.
Kernel Efficiency: Modern FlashAttention-3 kernels are compute-bound but extremely fast. Recomputing 4k tokens is often faster than transferring ~10GB of KV cache back and forth over a shared PCIe Gen5 bus. Exception: Beam Search. Because Beam Search involves multiple sequences sharing a history, recomputation is algorithmically complex and currently unsupported. If your workload relies on Beam Search, vLLM forces PreemptionMode.SWAP.
2. The Block Size Illusion: Hard Limits & Trade-offs
A common “optimization” attempting to mitigate fragmentation is tuning the block_size parameter. Engineers often increase this from the default 16 to 128, assuming larger blocks mean less tracking overhead.
For Llama 3 on NVIDIA GPUs, this is usually a mistake.
The CUDA Hard Limit
Standard PagedAttention CUDA kernels have a hard block-size limit of 32 tokens. If you set block_size=64 or 128 in your config for a standard Llama 3 deployment, vLLM will likely revert to a fallback kernel or fail at startup, unless you are using a specific backend designed for larger blocks.
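A launch sketch pinning the safe default. The parameter names match recent vLLM releases, but verify them against your installed version; this fragment requires a multi-GPU host and is shown for configuration only:

```python
from vllm import LLM

# Keep the default block size: the standard PagedAttention CUDA
# kernels cap out at 32; 64/128 are only for MLA-style backends.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    block_size=16,                # default; do not raise past 32 here
    gpu_memory_utilization=0.90,
)
```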
The MLA Exception (DeepSeek)
The only scenario where large block sizes are mandatory is when using architectures with Multi-Head Latent Attention (MLA), such as DeepSeek-V3. Backends like FlashMLA or the CUTLASS MLA kernels require block sizes of 64 or 128 tokens. In these cases, vLLM automatically overrides the configuration to enforce the requirement.
The Mathematics of Fragmentation
For standard models, increasing block size increases Internal Fragmentation. The wasted memory (slots allocated but not yet filled) follows this formula:
AverageWaste = ((block_size − 1) / 2) × Size_token
Comparative Impact:
- Block Size 16: wastes ~7.5 tokens per sequence.
- Block Size 128: wastes ~63.5 tokens per sequence.
In a high-concurrency batch (e.g., 256 requests), block_size=128 wastes approximately 16,000 token slots. On an 80GB H100, this is only ~1-2% of memory, which seems negligible. However, this “tail waste” prevents those blocks from being freed until the entire request finishes, creating artificial scarcity that triggers preemptions earlier than necessary.
Engineering Verdict: Stick to block_size=16. There is no hidden throughput unlock at 128 for Llama 3, only increased risk of fragmentation.
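The waste arithmetic above can be checked directly with a trivial sketch:

```python
def avg_waste_tokens(block_size: int) -> float:
    """Expected partially-filled-block waste per sequence: (block_size - 1) / 2."""
    return (block_size - 1) / 2

def batch_waste_tokens(block_size: int, num_seqs: int) -> float:
    """Total wasted token slots across a concurrent batch."""
    return avg_waste_tokens(block_size) * num_seqs

# block_size=16  -> 7.5 wasted tokens per sequence
# block_size=128 -> 63.5 per sequence; ~16k slots across 256 requests
```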
3. Theoretical Engineering Model: The Physics of Preemption
To understand when swapping would be viable (and why it usually isn’t), we must look at the inequality governing the trade-off. Swapping is beneficial only when the time to move data is less than the time to regenerate it.
Variables:
B_PCIe: PCIe Gen5 x16 effective bandwidth ≈ 50 GB/s.
S_KV: KV cache size per token. For Llama 3 70B (FP16, TP=1), this is ≈ 2.5 MB. With Tensor Parallelism (TP=8), it is ≈ 0.3 MB per GPU.
T_prefill(L): Time to prefill a sequence of length L.
The Inequality:
(2 × L × S_KV) / B_PCIe < T_prefill(L)
(The factor of 2 accounts for the round trip: GPU → CPU and CPU → GPU.)
Case Study: Llama 3 70B (TP=8) on H100
Let’s analyze a request with 8,000 tokens.
- Total KV Size: 8,000 × 2.5 MB = 20 GB.
- Per-GPU KV Size (TP=8): 2.5 GB.
Swap Time: With TP=8, the transfer is parallelized across 8 PCIe buses.
(2 × 2.5 GB) / 50 GB/s = 0.1 seconds
Recompute Time: A prefill of 8k tokens on H100 usually clocks between 0.2s – 0.4s depending on quantization.
The Conclusion: Mathematically, for Long Context (>8k) combined with High Tensor Parallelism, Swapping is theoretically superior (0.1s vs 0.3s).
However, vLLM defaults to Recompute because the theoretical max bandwidth of PCIe is rarely achieved in practice due to “convoy effects”—where KV cache transfers block weight loading or input transfers—and the complexity of CPU memory synchronization. Recompute is a “shared nothing” operation that is easier to schedule deterministically.
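The inequality and the case-study numbers can be packaged into a small checkable sketch. The bandwidth and prefill figures are the assumed values from the model above, not measurements:

```python
def swap_time_s(seq_len: int, kv_mb_per_token: float,
                pcie_gbps: float = 50.0) -> float:
    """Round-trip PCIe transfer time: (2 * L * S_KV) / B_PCIe."""
    kv_gb = seq_len * kv_mb_per_token / 1000.0   # MB -> GB (decimal)
    return 2.0 * kv_gb / pcie_gbps

def swapping_wins(seq_len: int, kv_mb_per_token: float,
                  prefill_time_s: float, pcie_gbps: float = 50.0) -> bool:
    """True when moving the KV cache is cheaper than regenerating it."""
    return swap_time_s(seq_len, kv_mb_per_token, pcie_gbps) < prefill_time_s

# Case study: 8k tokens at TP=8 (~0.3125 MB/token per GPU), prefill ~0.3 s
# -> swap time of ~0.1 s, so swapping wins on paper but loses in practice.
```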
4. Monitoring Strategy: Detecting "Silent" Eviction Cascades
The most dangerous failure mode in vLLM is the Eviction Cascade, or “Death Spiral.” This occurs when a few long-context requests (e.g., RAG analysis) force the eviction of many short-context requests (e.g., Chat).
Because the short requests are evicted, they are re-queued. When they run again, they must recompute. This recomputation burns GPU cycles without producing new tokens. If new requests keep arriving, the system enters a state where it is doing nothing but prefilling (recomputing) preempted requests.
The Dashboard Signals
You must instrument your Prometheus/Grafana dashboards to detect this immediately.
| Metric | Signature | Diagnosis |
|---|---|---|
| vllm:gpu_cache_usage_perc | Pinned > 95% | Saturation. Normal under load, but dangerous if static and correlated with latency spikes. |
| vllm:num_preemptions | Rising Rate | The Red Flag. The engine is actively killing requests. Any sustained rate > 0 requires intervention. |
| vllm:num_requests_running | Sawtooth Pattern | Requests are being demoted to waiting faster than they complete. |
| throughput (tokens/sec) | Flat / Dropping | The GPU is 100% busy, but output is stalling because cycles are wasted on recompute. |
If num_preemptions is rising while throughput is flat, you are in a death spiral.
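That condition can be encoded as a heuristic over scraped samples of the two metrics. The tolerance threshold is illustrative; tune it for your traffic:

```python
def in_death_spiral(preemptions_per_s: float,
                    throughput_samples: list,
                    flat_tolerance: float = 0.05) -> bool:
    """Flag the spiral: sustained preemptions + flat/falling throughput.

    preemptions_per_s: rate over the vllm:num_preemptions counter.
    throughput_samples: recent tokens/sec readings, oldest first.
    """
    if preemptions_per_s <= 0 or len(throughput_samples) < 2:
        return False
    first, last = throughput_samples[0], throughput_samples[-1]
    # Throughput that fails to grow while requests are being killed
    # means cycles are going to recompute, not new tokens.
    return last <= first * (1 + flat_tolerance)
```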
5. Architectural Solutions
Increasing VRAM is the brute-force solution, but it is rarely cost-efficient. The engineering solution involves routing and configuration.
A. The Split-Router Pattern
Do not mix short-context (Chat) and long-context (RAG/Document Analysis) workloads on the same vLLM instance. The memory dynamics are incompatible.
Implementation: Use an edge gateway (NGINX/Envoy) or a control plane (Ray Serve) to route requests based on len(prompt).
Pool A (Chat): len(prompt) < 4096. Optimized for high concurrency, smaller block cache.
Pool B (Long Context): len(prompt) > 4096. Lower concurrency limit, massive KV cache reservation.
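A minimal router sketch for this pattern. The pool endpoints are hypothetical placeholders, and the 4,096-token cutoff should be tuned to your workload mix:

```python
SHORT_PROMPT_LIMIT = 4096   # tokens; tune to your traffic

POOLS = {
    "chat": "http://vllm-chat.internal:8000",          # hypothetical
    "long_context": "http://vllm-longctx.internal:8000",
}

def route(prompt_tokens: int) -> str:
    """Send short prompts to the high-concurrency pool and long ones
    to the long-context pool, so RAG traffic can never evict chat KV."""
    if prompt_tokens < SHORT_PROMPT_LIMIT:
        return POOLS["chat"]
    return POOLS["long_context"]
```

In production this decision would live in the gateway (NGINX/Envoy) or control plane (Ray Serve); the Python form just makes the routing rule explicit.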
B. Capacity Planning Formula
To avoid the cliff, you must strictly limit concurrency based on your worst-case sequence length. Use the following formula to set your autoscaling triggers or max_num_seqs:
MaxConcurrency = (VRAM_total × 0.9 − VRAM_weights) / (S_KV × AvgLength)
VRAM_weights: Size of the model weights (e.g., 140 GB for 70B FP16).
S_KV: KV cache size per token per sequence.
0.9: Safety buffer for activation memory.
If your traffic exceeds this concurrency, the system will hit the Recomputation Wall. You must scale horizontally, not vertically.
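The formula as a sketch, evaluated with the article's Llama 3 70B figures (an 8×80 GB node, FP16 weights, the full 2.5 MB/token KV size):

```python
def max_concurrency(vram_total_gb: float, weights_gb: float,
                    kv_gb_per_token: float, avg_len: int,
                    safety: float = 0.9) -> int:
    """(VRAM_total * safety - VRAM_weights) / (S_KV * AvgLength)"""
    usable_gb = vram_total_gb * safety - weights_gb
    return int(usable_gb // (kv_gb_per_token * avg_len))

# 8 x 80 GB node, 140 GB of weights, 2.5 MB/token, 4k average length:
# max_concurrency(640, 140, 0.0025, 4096) -> 42 concurrent sequences
```

Anything above that ceiling is a request the scheduler can only admit by preempting another, which is exactly the cliff this formula exists to avoid.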
C. Configuration Tuning
If you must run mixed workloads, increase max_num_batched_tokens. This allows larger chunks of recomputation to happen per step. It increases the latency of individual steps slightly but reduces the total wall-clock penalty of a preemption stall by clearing the recompute backlog faster.
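One way to express that in the launch config. The value is a starting point, not a recommendation, and both parameters exist in recent vLLM releases but should be checked against your version; this fragment is configuration only:

```python
from vllm import LLM

# A larger batched-token budget lets a preempted request's recompute
# prefill clear in fewer scheduler steps, at slightly higher per-step latency.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    max_num_batched_tokens=16384,   # raise from the default; tune empirically
    enable_chunked_prefill=True,    # assumption: available in your build
)
```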
Performance Audit & Optimization
At Azguards Technolabs, we specialize in the “Hard Parts” of ML systems engineering. The difference between a profitable inference endpoint and a latency nightmare often lies in the invisible interaction between the scheduler, the kernel, and the cache.
We partner with enterprise engineering teams to perform Deep-Dive Performance Audits of vLLM and Triton-based infrastructure. We don’t just recommend “more GPUs”—we analyze your memory access patterns, tune block allocation strategies, and implement custom routing layers to eliminate eviction cascades.
If your inference metrics are showing the “Sawtooth” of death, or if you are struggling to stabilize throughput at scale, contact our engineering team for a specialized architectural review.
The “Swapping Cliff” in vLLM is not a bug; it is a design choice optimized for modern hardware constraints. vLLM V1 bets on compute being cheaper than bandwidth. For 90% of cases, this is true. However, for the 10% of high-load, mixed-context scenarios, this behavior can silently destroy your SLAs.
As systems architects, we cannot rely on default configurations. We must respect the physics of the hardware, isolate workloads by memory profile, and monitor preemption rates as a critical health metric. The era of “fire and forget” LLM serving is over; the era of memory-aware scheduling has begun.