Speculative Decoding for vLLM Inference Services
TOC
- Introduction
- Before You Decide
- Methods Validated in This Guide on Alauda AI
- Recommended Starting Points
  - Internal Validation Snapshot — N-gram
  - Internal Validation Snapshot — EAGLE-3
- Prerequisites
- Configuration Surface
- Providing Model Artifacts on Alauda AI
  - Single-artifact pattern (N-gram)
  - Two-artifact pattern (EAGLE-3 and similar)
  - Option A — KServe storageUris (preferred when available)
  - Option B — Single OCI Modelcar containing both artifacts
  - Option C — Pre-staged on a shared PVC
  - Picking between A / B / C
- End-to-End Examples
  - Example 1 — N-gram
  - Example 2 — EAGLE-3 with target + draft on a shared PVC
- Verify and Measure the Impact
  - 1. Confirm the configuration was applied
  - 2. Confirm speculative decoding is actually running
  - 3. Measure end-to-end impact
  - 4. How to report or compare numbers
- Rollback
- Troubleshooting
- Caveats and Known Limitations
- References

Introduction
Speculative decoding lets a vLLM server propose several tokens per decode step and verify them with a single forward pass of the target model, lowering per-token latency on interactive workloads without changing the output distribution.
This page focuses on how to enable, configure, verify, and roll back speculative decoding for an InferenceService running on Alauda AI. For the upstream technique itself and the full list of methods supported by vLLM, see the vLLM speculative decoding documentation.
Speculative decoding involves runtime-version-sensitive flags. The exact --speculative-config JSON keys, supported method values, and the metric names referenced below depend on the vLLM version inside your runtime image. Treat all snippets here as starting points and confirm against the vLLM version you ship.
Before You Decide
Speculative decoding helps when the per-request decode loop dominates end-to-end latency and the proposed tokens are accepted often enough to amortize the proposal overhead.
It tends to help on:
- Interactive chat / agent loops with relatively predictable continuations.
- Summarization, RAG answers, and code completion, where output overlaps the prompt.
It can hurt or be neutral on:
- High-temperature sampling, where acceptance rate collapses.
- High-QPS / batch-saturated services, where decode capacity is no longer idle. The vLLM team's 2024 V0-engine benchmarks reported 1.4×–1.8× slowdowns on the same datasets at high QPS. The V1 engine schedules differently, so the magnitude may differ on your runtime, but the direction of the risk is the same.
- Very small target models, where the verification step is already cheap.
Run a representative workload before committing speculative decoding as a default. See Verify and Measure the Impact.
Methods Validated in This Guide on Alauda AI
The two methods below are the ones this guide covers and that have been exercised end-to-end on Alauda AI. vLLM upstream supports additional methods (for example MTP for models that ship multi-token-prediction heads, Medusa, MLP Speculator, Suffix, Draft Model), and those methods may also be usable on Alauda AI through the same --speculative-config flag. They are out of scope for this page, so refer to the upstream documentation and validate on your own setup before promoting to production.
Notes:
- vLLM upstream describes N-gram as "effective for use cases like summarization and question-answering, where there is a significant overlap between the prompt and the answer".
- vLLM upstream describes EAGLE-3 as "the current SOTA for speculative decoding algorithms" (snapshot from the latest features page; revisit per release).
Recommended Starting Points
There is no single best method for every workload. The following are conservative starting points to reduce trial cost. Always validate against your own traffic before promoting to production.
Internal Validation Snapshot — N-gram
The starting points above are guidance, not guarantees. The measurement below is one concrete data point from Alauda AI's internal lab, intended to help calibrate expectations on similar single-GPU serving setups. Your own model, GPU, runtime version, and traffic will produce different numbers — always benchmark before promoting to production.
- Hardware: NVIDIA A30 24 GB × 1
- Model: Qwen3-8B (BF16, HuggingFace `Qwen/Qwen3-8B`)
- Runtime: vLLM 0.19.1 (V1 engine)
- Request parameters: `temperature=0`, `seed=42`, `max_tokens=1024`, `enable_thinking=false`, single concurrent request, 1 warmup discarded + 3 timed runs (median reported)
Baseline command (no spec decode):
N-gram command (only differs by --speculative-config):
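A representative pair looks like the following sketch. The ancillary flags (`--port`, `--gpu-memory-utilization`) are illustrative assumptions, not the exact lab invocation; only the `--speculative-config` line differs between the two, and the `prompt_lookup_*` key names should be checked against your vLLM version:

```shell
# Baseline — no speculative decoding (ancillary flags illustrative)
vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --gpu-memory-utilization 0.9

# N-gram — identical except for the speculative config
vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'
```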
Workloads:
- code refactor (high prompt-output overlap): ask the model to add docstrings and type annotations to a 30-line Python class and return the full updated class
- general chat (no prompt-output overlap): ask the model to explain a concept in ≥800 words
Results:
Interpretation:
- On this single-GPU 8B setup, N-gram registered as a slight regression on the code-refactor workload and a clear ~15% regression on chat. The proposer's CPU work, the verification of five candidate tokens per step, and the fact that vLLM disables async scheduling under N-gram together cost more than the accepted tokens save.
- The acceptance rate for the high-overlap code workload is healthy (mean acceptance length ≈ 3 in earlier informal probes), but acceptance rate alone does not predict end-to-end speedup — the per-step overhead must be amortized against actual decode time of the target model. On a small target model on a single GPU, decode is already cheap and there is little room to amortize.
- The chat result confirms the Caveats about workloads without prompt-output overlap.
The same method on a larger target model (where each verify step costs more), with multi-GPU tensor parallelism, or under higher concurrency may behave very differently. Treat this snapshot as a reminder to measure, not as a verdict on N-gram itself.
Internal Validation Snapshot — EAGLE-3
The starting points above are guidance, not guarantees. The measurement below is one concrete data point from Alauda AI's internal lab, intended to help calibrate expectations on similar single-GPU EAGLE-3 setups. Your own model, GPU, runtime version, and traffic will produce different numbers — always benchmark before promoting to production.
- Hardware: NVIDIA A30 24 GB × 1
- Model: Meta-Llama-3.1-8B-Instruct (BF16, HuggingFace `meta-llama/Meta-Llama-3.1-8B-Instruct`) with EAGLE-3 draft `yuhuili/EAGLE3-LLaMA3.1-Instruct-8B`
- Runtime: vLLM 0.19.1 (V1 engine)
- Request parameters: `temperature=0`, `seed=42`, `max_tokens=1024`, single concurrent request, 1 warmup discarded + 3 timed runs (median reported)
Baseline command (no spec decode):
EAGLE-3 command (only differs by --speculative-config):
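A representative pair looks like the following sketch. The ancillary flags are illustrative assumptions, not the exact lab invocation; only the `--speculative-config` line differs between the two:

```shell
# Baseline — no speculative decoding (ancillary flags illustrative)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.8

# EAGLE-3 — identical except for the speculative config
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.8 \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}'
```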
Workloads:
- code refactor (high prompt-output overlap): ask the model to add docstrings and type annotations to a 30-line Python class and return the full updated class
- general chat (no prompt-output overlap): ask the model to explain a concept in ≥800 words
Results:
Speedup is the tok/s ratio (completion-length-invariant). Wall delta compares median wall-clock time directly; the chat runs generated different amounts of output (baseline 588 vs EAGLE-3 709 tokens), so Speedup is the more reliable indicator there.
Speculative-decoding behaviour (EAGLE-3 side, from SpecDecoding metrics log windows):
Mean acceptance length and acceptance rates are draft-weighted across the SpecDecoding metrics log windows that covered each benchmark run; per-position values are from the sustained-load windows inside each run.
Interpretation:
- EAGLE-3 delivered a ~1.84× speedup on code-refactor and was essentially break-even on general chat (~0.99×) on this single-GPU 8B setup. The two baseline runs sat on top of each other at ~47.8 tok/s, as expected — base decode rate is a model-and-hardware property and does not depend on prompt content. All of the observable gap comes from the EAGLE-3 side.
- Why code wins and chat doesn't — acceptance data tells the mechanism directly. On code the draft head landed ~2.54 tokens per decode step at ~51% acceptance, so most steps emit multiple tokens; per-position acceptance decays slowly (0.50 / 0.40 / 0.33), so even the 3rd speculative slot still pays off a third of the time. On chat mean acceptance length sits at ~1.19 with only ~6% acceptance, and per-position acceptance collapses by the 2nd slot (0.16 / 0.02 / 0.01) — almost every step emits just the verified token and the drafted ones are discarded.
- Realized vs theoretical. Mean acceptance length is the theoretical upper bound on speedup with zero proposer overhead. Code realized 1.84× against a 2.54× ceiling (~72% converted), i.e. proposer CPU work, verification of rejected proposals, and async-scheduling costs ate about a quarter of the headroom. Chat's 1.19× theoretical ceiling was entirely consumed by overhead and tipped into a slight regression. This is consistent with the Caveats: on small models on a single GPU, per-step overhead has little idle decode capacity to hide behind.
The same method on a larger target model (where each verify step costs more), with multi-GPU tensor parallelism, or under higher concurrency may behave very differently. Treat this snapshot as a reminder to measure, not as a verdict on EAGLE-3 itself.
Prerequisites
- A Kubernetes cluster with KServe installed and a namespace where you can create `InferenceService` resources.
- A vLLM serving runtime registered on the platform whose vLLM version supports the speculative method you plan to use. To check the version, exec into a running pod with that runtime: `kubectl exec <pod> -- python3 -c "import vllm; print(vllm.__version__)"`.
- Your target model is accessible to the service through its storage source (model repository, PVC, or OCI image).
- For EAGLE-3: a draft head whose architecture, tokenizer, and base version match the exact target model. A mismatched head silently degrades acceptance rate and may not surface as a startup error.
- For EAGLE-3: a model-artifact loading mechanism that can deliver both target and draft into the same pod. See Providing Model Artifacts on Alauda AI.
Configuration Surface
In vLLM v1, speculative decoding is enabled by a single argument:
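For example (the JSON key names depend on your vLLM version; confirm against the version you ship):

```shell
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'
```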
Common keys:
- `method`: the proposer to use. Values used in this guide: `ngram` and `eagle3`. Other values exist upstream (for example `medusa`, or model-specific MTP names such as `deepseek_mtp`) — confirm the exact value for your method in the vLLM speculative decoding documentation.
- `num_speculative_tokens`: how many tokens to propose per step. Higher values can increase speedup but also waste compute on rejected proposals.
- `model`: for methods that load a separate draft artifact (such as EAGLE-3), the path to that artifact inside the container.
- Method-specific keys, such as `prompt_lookup_max`/`prompt_lookup_min` for N-gram. These names have changed across vLLM releases — verify against the version you ship.
All other vLLM arguments (--model, --tensor-parallel-size, --gpu-memory-utilization, …) work the same as in a non-speculative deployment.
Providing Model Artifacts on Alauda AI
Different methods need different files inside the predictor pod.
Single-artifact pattern (N-gram)
For N-gram only the target model is required. Use storageUri exactly as for any other inference service:
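A minimal sketch of such a manifest (service name, runtime name, and storage URI are placeholders to adapt):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <service-name>
  namespace: <your-namespace>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: <your-vllm-runtime>
      storageUri: <your-storage-uri>   # e.g. pvc://..., oci://..., s3://...
```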
The model lands at /mnt/models and is passed to vLLM through --model.
Two-artifact pattern (EAGLE-3 and similar)
EAGLE-3 needs both the target model and a matching draft head loaded into the same pod. There are three supported ways to deliver them. Pick based on your platform version, network access, and operational preference.
Option A — KServe storageUris (preferred when available)
storageUris is a KServe field that accepts multiple storage locations and mounts each at a declared path. It is the cleanest option when your platform's KServe version supports it (KServe 0.16 and later).
Then point vLLM at the two paths:
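A sketch of the corresponding predictor args fragment — the two paths must match the `mountPath` values you declared in `storageUris`:

```yaml
args:
  - --model=/mnt/models/target
  - --speculative-config={"method": "eagle3", "model": "/mnt/models/draft", "num_speculative_tokens": 3}
```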
Constraints to be aware of:
- `storageUri` (singular) and `storageUris` (plural) are mutually exclusive.
- All `mountPath` values must be absolute and share a common parent directory (for example `/mnt/models/target` and `/mnt/models/draft`).
- For private repositories, attach the appropriate credentials secret to the service account used by the predictor pod.
If your platform's KServe version does not yet include storageUris, use Option B or Option C.
Option B — Single OCI Modelcar containing both artifacts
Package the target model and the draft head into one OCI image under predictable subdirectories (for example /models/target and /models/draft), then deploy with storageUri: oci://.... See Using KServe Modelcar for Model Storage for the packaging steps. Sample on-disk layout to bake into the image:
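For example (subdirectory names are illustrative; any predictable layout works):

```
/models/
├── target/   # target model: config.json, *.safetensors, tokenizer files
└── draft/    # EAGLE-3 draft head: config.json, *.safetensors
```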
The vLLM command then references the same paths:
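A sketch of the args fragment — whether the baked-in paths surface as-is or under the Modelcar mount point depends on your KServe configuration, so verify the actual paths inside the pod:

```yaml
args:
  - --model=/models/target
  - --speculative-config={"method": "eagle3", "model": "/models/draft", "num_speculative_tokens": 3}
```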
This option is well-suited to offline / air-gapped clusters because the artifacts are versioned together and pulled from your own registry.
Option C — Pre-staged on a shared PVC
Stage both artifacts onto a PVC under a known directory layout, mount the PVC, and reference the local paths from the vLLM command. This is the simplest option if you already manage model files on a shared filesystem.
Picking between A / B / C
In short: prefer Option A (`storageUris`) when your KServe version is 0.16 or later, Option B (OCI Modelcar) for offline or air-gapped clusters where the two artifacts should be versioned and pulled together, and Option C (shared PVC) when you already manage model files on a shared filesystem.
End-to-End Examples
The two examples below cover the methods listed in Methods Validated in This Guide on Alauda AI. Replace `<your-namespace>`, `<your-vllm-runtime>`, and storage URIs with values from your environment.
Example 1 — N-gram
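A sketch of the manifest (platform-specific display annotations are omitted; names, runtime, and storage URI are placeholders, and the `prompt_lookup_*` keys should be checked against your vLLM version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen3-8b-ngram
  namespace: <your-namespace>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: <your-vllm-runtime>
      storageUri: <your-model-storage-uri>
      args:
        - --model=/mnt/models
        - --speculative-config={"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 5, "prompt_lookup_min": 2}
```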
- Replace with your actual model name; this annotation is used by the platform for display.
- The
prompt_lookup_*keys belong to the n-gram proposer. Their names have changed between vLLM releases — verify against the version inside your runtime image.
Example 2 — EAGLE-3 with target + draft on a shared PVC
This manifest matches the setup used for the Internal Validation Snapshot — EAGLE-3 above. Both the target model and the EAGLE-3 draft head are pre-staged inside a single PVC under predictable subdirectories; the PVC is mounted at /mnt/models/ by storageUri: pvc://..., and the vLLM command references the two subdirectories directly.
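A sketch of the manifest, with `<target-subdir>` and `<draft-subdir>` standing in for the directory names inside your PVC:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama31-8b-eagle3
  namespace: <your-namespace>
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: <your-vllm-runtime>
      storageUri: pvc://<your-pvc-name>/
      args:
        - --model=/mnt/models/<target-subdir>
        - --gpu-memory-utilization=0.8
        - --speculative-config={"method": "eagle3", "model": "/mnt/models/<draft-subdir>", "num_speculative_tokens": 3}
```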
- Both paths in the vLLM command (`--model` and the `model` key inside `--speculative-config`) must match the directory names inside the PVC exactly. If your PVC lays the artifacts out under different names, adjust these two paths together.
- The EAGLE-3 head occupies GPU memory outside the `--gpu-memory-utilization` budget. Leaving headroom (here `0.8` instead of `0.9`) reduces the chance of OOM when both artifacts are loaded.
- `pvc://<your-pvc-name>/` expects a PVC pre-staged with both the target model and the EAGLE-3 draft head; the PVC root is mounted at `/mnt/models/`, so the two artifacts must live at `/mnt/models/<target-subdir>/` and `/mnt/models/<draft-subdir>/`. See the expected layout below. If you prefer declarative multi-URI mounts (KServe 0.16+) or bundling target + draft into a single OCI image instead, see Option A or Option B in Providing Model Artifacts.
Expected layout inside the PVC (mounted at /mnt/models/ in the pod):
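For example (the subdirectory names are placeholders; the example model names match the EAGLE-3 snapshot above):

```
/mnt/models/
├── <target-subdir>/   # e.g. Meta-Llama-3.1-8B-Instruct, full HuggingFace layout
└── <draft-subdir>/    # e.g. EAGLE3-LLaMA3.1-Instruct-8B
```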
Verify the layout from inside the predictor pod once it starts:
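For example:

```shell
kubectl exec -n <your-namespace> <predictor-pod> -- ls -R /mnt/models/
```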
Apply any of the manifests above with:
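```shell
kubectl apply -n <your-namespace> -f <manifest>.yaml
```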
Verify and Measure the Impact
Verifying that speculative decoding was configured is one step. Verifying that it helps your workload is a different step.
1. Confirm the configuration was applied
Look for --speculative-config in the predictor command and confirm the readiness state:
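For example (the container index assumes the predictor container is first in the pod spec; adjust if your runtime injects sidecars):

```shell
# Is the flag present in the predictor container args?
kubectl get pod -n <your-namespace> <predictor-pod> \
  -o jsonpath='{.spec.containers[0].args}' | grep -o speculative-config

# Is the InferenceService ready?
kubectl get inferenceservice -n <your-namespace> <service-name>
```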
2. Confirm speculative decoding is actually running
The first startup-time signal is the engine-config log line; it prints the speculative_config the engine resolved, so you can verify the method and draft path took effect:
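A quick way to find it, assuming the standard `kserve-container` container name:

```shell
kubectl logs -n <your-namespace> <predictor-pod> -c kserve-container | grep -i speculative_config
```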
For live counters, vLLM exposes Prometheus metrics at /metrics. The exact metric names depend on the vLLM version, so cast a wide net first:
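For example (port 8000 is the common vLLM default; substitute the port your runtime serves on):

```shell
kubectl exec -n <your-namespace> <predictor-pod> -- curl -s localhost:8000/metrics | grep -i spec
```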
If that returns nothing, the pod either hasn't served any requests yet (counters only publish once the first generation completes) or the metric names in your vLLM build differ — in which case fall back to the predictor logs.
vLLM prints a per-window summary line that is the most readable live picture. This is the real shape of the line on vLLM 0.19.1 with num_speculative_tokens=3:
How to read it:
- Mean acceptance length — average tokens delivered per decode step. Baseline is `1`. This is the practical upper bound for the speedup you can hope to get on this workload.
- Avg Draft acceptance rate — overall fraction of proposed tokens that were accepted. A single number for "is the proposer mostly paying off or mostly wasted?".
- Per-position acceptance rate — per-slot acceptance for slots `1..num_speculative_tokens`. You will see exactly `num_speculative_tokens` values — the example above has 3 because the run used `num_speculative_tokens=3`; an `ngram` run with `num_speculative_tokens=5` prints 5 values. A healthy curve decays slowly; a curve that collapses to near-zero by the 2nd slot means the workload is not a fit for this proposer.
3. Measure end-to-end impact
Run the same representative workload twice:
- With `--speculative-config` removed (baseline).
- With it enabled (everything else identical, including `--seed`).
Capture three numbers per run:
- Time to first token (TTFT).
- Per-token latency (or end-to-end latency at fixed output length).
- Throughput (tokens/second) under the QPS you actually serve.
Speculative decoding is worth keeping on if all three improve at your target QPS. A common failure mode is improvement at low QPS but regression at production QPS — measure where you actually run.
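The three numbers can be computed from a recorded token stream with a small helper. This is a sketch, independent of any particular client library: you record `(arrival_time, token)` pairs yourself while consuming the streaming response, then summarize them.

```python
import statistics


def summarize_stream(request_start, events):
    """Summarize one streamed completion.

    request_start: wall-clock time the request was sent (seconds).
    events: list of (arrival_time, token) pairs recorded while consuming
    the streaming response, in arrival order.
    Returns TTFT, median inter-token latency, and tokens/second.
    """
    if not events:
        raise ValueError("no tokens received")
    times = [t for t, _ in events]
    ttft = times[0] - request_start
    gaps = [b - a for a, b in zip(times, times[1:])]
    per_token = statistics.median(gaps) if gaps else 0.0
    elapsed = times[-1] - request_start
    return {
        "ttft_s": ttft,
        "per_token_latency_s": per_token,
        "tokens_per_s": len(events) / elapsed,
    }
```

Run the baseline and the spec-decode service through the same helper on the same prompt list, then compare medians across repeats rather than single runs.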
4. How to report or compare numbers
Performance numbers without their context cannot be reproduced or trusted. Any time you publish a comparison — internally, in a customer report, or back to the platform team — include the five fields below, mirroring the internal validation snapshots earlier on this page. Numbers that omit any of them should be treated as anecdotal, not as evidence.
- Hardware (GPU model, count, memory)
- Model (and, for draft-based methods, the exact draft artifact)
- Runtime (vLLM version and engine)
- Request parameters (temperature, seed, max_tokens, concurrency, warmup protocol)
- Baseline command (no spec decode)
Spec-decode command (only differs by --speculative-config):
Results:
Two practical rules when running the comparison:
- Use the same `--seed` and `temperature=0` for both sides, and warm up each service with 3 discarded requests before timing — otherwise sampling and compile-cache noise will dominate the differences you measure.
- Run baseline and spec-decode against the same fixed prompt list, in the same order, at least 5–10 times per prompt, and compare medians rather than averages.
Rollback
To disable speculative decoding without changing anything else, remove the --speculative-config line from the predictor command and re-apply:
Or re-apply a manifest that omits the flag:
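Either path works; for example:

```shell
# Edit in place and delete the --speculative-config arg...
kubectl edit inferenceservice -n <your-namespace> <service-name>

# ...or re-apply a baseline manifest that omits the flag
kubectl apply -n <your-namespace> -f <baseline-manifest>.yaml
```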
The service rolls to a new revision without the speculative proposer. No model artifact changes are required for N-gram. For EAGLE-3 the draft head remains mounted but is unused — if you want to reclaim disk, remove the draft-head artifact on the next change (delete the matching storageUris entry for Option A, rebuild the OCI image without the draft directory for Option B, or drop the draft subdirectory from the PVC for Option C).
Troubleshooting
For pod-level issues, the standard inference-service troubleshooting commands apply:
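For example (the pod label and `kserve-container` name are KServe conventions; adjust to your platform):

```shell
kubectl get pods -n <your-namespace> -l serving.kserve.io/inferenceservice=<service-name>
kubectl describe pod -n <your-namespace> <predictor-pod>
kubectl logs -n <your-namespace> <predictor-pod> -c kserve-container
```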
Caveats and Known Limitations
- Outcomes swing widely with workload shape — regression and speedup are both real. Upstream V0 benchmarks reported 1.4×–1.8× slowdowns at high QPS. Our own A30 + Qwen3-8B N-gram test (see Internal Validation Snapshot — N-gram) saw a slight regression even on a high-overlap code workload. On the same hardware, EAGLE-3 on Llama-3.1-8B (see Internal Validation Snapshot — EAGLE-3) hit a 1.84× speedup on code-refactor but was break-even on chat (~0.99×) — same model, same method, same pod, 2× swing in realized benefit between two prompt shapes. Always validate against your production traffic profile.
- N-gram disables async scheduling. In recent vLLM versions, enabling the `ngram` method forces async scheduling off (the predictor logs `Async scheduling not supported with ngram-based speculative decoding and will be disabled`). If your service depends on async scheduling for throughput, prefer EAGLE-3, or measure the trade-off explicitly.
- `storageUris` availability. The field is available from KServe 0.16. Older platform releases must use the Modelcar or PVC option.
- Draft head mismatch is silent. A draft head that does not exactly match the target model usually starts up and serves traffic correctly but with very low acceptance rate. Always check acceptance rate after enabling.
- Sampling parameters affect acceptance. High temperature reduces acceptance rate; benchmark with sampling settings that reflect production usage.
- `gpu-memory-utilization` budget. Draft artifacts (EAGLE-3 head, MLP speculator, draft model) are not included in the `--gpu-memory-utilization` budget; reduce that value when adding a draft artifact.
- Image dependencies. The runtime image must include the libraries required by the chosen method. If a method fails to initialize, rebuild or replace the runtime image — see Extend Inference Runtimes.
- `min_p` and `logit_bias` are silently ignored. Under speculative decoding, vLLM logs the warning `min_p and logit_bias parameters won't work with speculative decoding.` during engine init. Requests that pass either of these sampling parameters will still receive a 200 response, but the parameters are not honored — validate this against your client assumptions if your traffic relies on them.
- Composition with other features. Speculative decoding composes with tensor parallelism and continuous batching but interacts with autoscaling and with EP / advanced parallelism in ways that depend on the vLLM version. Cold start is notably more expensive with a draft artifact: on our A30 + Llama-3.1-8B + EAGLE-3 head lab setup, the predictor went from container-ready to `Application startup complete` in ~5 minutes (weight load ~45 s, draft weights ~5 s, `torch.compile` backbone ~48 s, `torch.compile` EAGLE head ~17 s, CUDA-graph capture and warmup ~10 s, plus ~2 minutes of engine profiling and KV-cache sizing). Size your Knative `progress-deadline` annotation and any autoscaling scale-from-zero SLO to this, not to a non-speculative baseline.
- Output equivalence. vLLM states that speculative decoding does not change the output distribution. This is a vLLM property, not an Alauda AI guarantee — if exact equivalence under your runtime image is required, validate it as part of acceptance testing.
References
- Speculative Decoding - vLLM
- How Speculative Decoding Boosts vLLM Performance by up to 2.8x — vLLM Blog
- Speculative Decoding Guide — vllm-ascend
- KServe Multiple Storage URIs
- Using KServe Modelcar for Model Storage
- Extend Inference Runtimes
- Enable Expert Parallel for vLLM Inference Services
- Create Inference Service using CLI