---
name: nvidia-tensorrt-llm-deployment-review
description: Use this skill when reviewing TensorRT or TensorRT-LLM deployment artifacts statically — ONNX/PyTorch export pipelines, precision selection (FP16/BF16/INT8/FP8/INT4), calibration cache integrity, dynamic shape profiles, custom plugin loading, engine cache and serialized engine provenance, runtime memory pool sizing. Trigger when the user asks whether a TensorRT build script, calibration pipeline, or trtexec invocation follows NVIDIA's published guidance.
allowed-tools: Read Grep Glob
metadata:
  author: "github: Raishin"
  version: "0.1.0"
  updated: "2026-05-10"
  category: platform
---

# NVIDIA TensorRT-LLM Deployment Review

## Purpose

Static review of TensorRT and TensorRT-LLM deployment pipelines against NVIDIA's TensorRT Developer Guide — ONNX/PyTorch export, FP16/INT8/FP8/INT4 precision, calibration data integrity, dynamic shape profiles, plugin trust boundaries, engine cache provenance. This skill is doc-anchored: it grounds review findings in NVIDIA's published documentation rather than in a certification blueprint, because no NVIDIA certification currently covers this developer-facing surface as a standalone exam objective.

## Lean operating rules

- Prefer the user's actual TensorRT build scripts, ONNX export code, and calibration pipelines as evidence; otherwise fall back to documentation-based inference.
- Treat custom TensorRT plugins loaded from non-pinned sources or unsigned object files as a critical finding — native-code execution surface inside the inference engine.
- Treat serialized engines (`.engine`, `.plan`) distributed without sha256 verification or provenance attestation as a high finding — silent model substitution.
- Treat INT8 / FP8 calibration data containing production user traffic without redaction or retention controls as a high finding — confidentiality and PII surface.
- Treat absence of `optimization_profiles` for variable input shapes as a medium finding — builds either fail at runtime or fall back to padded inference.
- Treat hardcoded `--workspace` or `--memory-pool-size` values that exceed the deployment GPU's free memory as a medium finding — engine build will OOM in CI.
- Treat use of `--strict-types` without explicit precision tagging on every layer as a low finding — actual precision drifts from intent.
- Always emit the exact `trtexec`, `polygraphy run`, or `tensorrt_llm/build.py` commands the user should run — do not execute them.

## Response minimum

Return, at minimum:
- the scoped target (model source and export pipeline, precision selection and calibration posture, dynamic shape and profile posture, plugin and engine provenance posture, runtime memory and concurrency posture, recommended trtexec/polygraphy invocations) and evidence level,
- findings labelled critical / high / medium / low,
- recommended NVIDIA-tooling invocations the user should run themselves,
- safe next actions and assumptions or blockers.
