vllm-project


vllm-serving-setup

Design, deploy, and tune vLLM v0.18.2 inference serving on EKS with PagedAttention v2, Multi-LoRA, FP8 KV Cache, Chunked Prefill, and Continuous Batching.

vllm-configuration

Configure vLLM completely — YAML config file format, CLI arg precedence, full VLLM_*/HF_*/TRANSFORMERS_* env-var catalog, end-to-end recipe for air-gapped environments (internal…

vllm-tool-parsers

vLLM tool-calling operator reference — picking `--tool-call-parser` per model family, writing custom parsers via `--tool-parser-plugin`, navigating vLLM source + GitHub tracker to…

vllm-quantization

vLLM datacenter-GPU quantization — picking, configuring, troubleshooting NVFP4, FP8, MXFP4, MXFP8, AWQ, GPTQ, INT8, compressed-tensors, modelopt, quark on…

vllm-performance-tuning

vLLM performance-tuning operator reference — tuning workflow (baseline → bottleneck → knob → re-bench), fused-MoE kernel autotune (`benchmark_moe.py` generates…

vllm-benchmarking

Run production vLLM benchmarks — `vllm bench` (serve, throughput, latency, sweep, startup, mm-processor), request-rate vs max-concurrency semantics, TTFT/TPOT/ITL/E2EL…

serving-llms-vllm

Use when serving LLMs at high throughput via vLLM (OpenAI-compatible API, quantization). Triggers: vllm, serve LLM, openai api endpoint, quantize.

verify-claim

Use BEFORE asserting any fact you're not 100% certain about. Anti-hallucination protocol. Triggers: latest, recent, current, "какая версия" (which version), "что нового в" (what's new in), "как сейчас принято" (what's the current practice), "в новой…" (in the new…)

vllm-observability

Observe production vLLM — `/metrics` Prometheus surface (V1 engine), SLO-driven alerting on TTFT/ITL/queue/KV/preemption/aborts/corrupted-logits, shipping Grafana dashboards in…

vllm-omni

vLLM-Omni output-side multimodal generation — image (FLUX.1/2, Qwen-Image, GLM-Image, BAGEL, SD3.5, HunyuanImage-3.0), video (Wan2.1/2.2, LTX-2, HunyuanVideo-1.5), TTS (Qwen3-TTS,…

vllm-caching

vLLM tiered KV cache configuration for production H100/H200 clusters. Native CPU offload, LMCache (CPU+NVMe+GDS), NixlConnector (disaggregated prefill), MooncakeConnector (RDMA),…
