---
name: rag-blueprint
description: "NVIDIA RAG Blueprint — deploy, configure, troubleshoot, and manage. Handles any RAG action: deploy, install, start, enable, disable, toggle, change, configure, troubleshoot, debug, fix, shutdown, stop, or tear down any RAG feature or service (VLM, guardrails, query rewriting, models, search, ingestion, observability, summarization, and more)."
argument-hint: deploy RAG | enable feature | disable feature | configure | troubleshoot | shutdown
allowed-tools: Bash(echo *), Bash(nvidia-smi *), Bash(curl *), Bash(docker ps *), Bash(docker exec *), Bash(docker info *), Bash(docker --version *), Bash(docker compose version *), Bash(docker logs *), Bash(docker system *), Bash(kubectl get *), Bash(kubectl describe *), Bash(kubectl version *), Bash(kubectl logs *), Bash(helm version *), Bash(helm list *), Bash(git rev-parse *), Bash(git describe *), Bash(git status *), Bash(python3 --version *), Bash(pip3 show *), Bash(df *), Bash(du *), Bash(cat /proc/*), Bash(cat /etc/os-release *), Bash(ss *), Bash(netstat *), Bash(ls *), Bash(grep *), Bash(lsof *), Bash(ps aux *), Read, Grep, Glob
license: Apache-2.0
metadata:
  author: nvidia-rag-team
  version: "1.0"
---

# NVIDIA RAG Blueprint

## Autonomy Principles

- Auto-detect everything: GPU, VRAM, drivers, Docker, CUDA, disk, OS, ports, existing services, NGC key, repo state.
- If it can be checked with a command, check it — don't ask the user.
- Ask only when user action is required: providing an API key, confirming data deletion, or choosing between equally valid options.
- Once analysis is done, route to the correct workflow and execute.

## Intent Detection

Determine what the user wants and route immediately:

| User Intent | Action |
|-------------|--------|
| Deploy, install, set up, start RAG | Read and follow `references/deploy.md` |
| Configure, enable, change, toggle a feature | Use the **Configure** section below |
| Troubleshoot, debug, fix, error, unhealthy | Read and follow `references/troubleshoot.md` |
| Stop, shutdown, tear down, clean up | Read and follow `references/shutdown.md` |

If the intent is ambiguous, infer from context (e.g., "RAG isn't working" → troubleshoot; "get RAG running" → deploy). Only ask if genuinely unclear.

---

## Configure

Requires a running RAG deployment. If services are not running, deploy first via `references/deploy.md`.

Match the user's request to a reference file, then read and follow it:

| Feature Keywords | Reference |
|-----------------|-----------|
| VLM, VLM embeddings, image captioning | `references/configure/vlm.md` |
| NeMo Guardrails | `references/configure/guardrails.md` |
| Query rewriting, decomposition, multi-turn | `references/configure/query-and-conversation.md` |
| Ingestion (text-only, audio, Nemotron Parse, OCR, batch CLI, NV-Ingest, volume mount, performance) | `references/configure/ingestion.md` |
| Search, retrieval, hybrid search, multi-collection, metadata, filters, reranker, topK, accuracy/performance | `references/configure/search-and-retrieval.md` |
| LLM/embedding/ranking model changes, vector DB, Milvus/Elasticsearch auth, service keys, model profiles, ports/GPU | `references/configure/models-and-infrastructure.md` |
| Reasoning, self-reflection, prompts, generation params (tokens, temperature, citations), per-request LLM params | `references/configure/reasoning-and-generation.md` |
| Summarization | `references/configure/summarization.md` |
| Observability (tracing, Zipkin, Grafana, Prometheus) | `references/configure/observability.md` |
| Multimodal query (image + text) | `references/configure/multimodal-query.md` |
| Data catalog (collection/document metadata) | `references/configure/data-catalog.md` |
| User interface (UI settings) | `references/configure/user-interface.md` |
| API reference (endpoints, schemas) | `references/configure/api-reference.md` |
| Evaluation (RAGAS metrics) | `references/configure/evaluation.md` |
| MCP server & client, agent toolkit | `references/configure/mcp.md` |
| Migration (version upgrades) | `references/configure/migration.md` |
| Notebooks (setup and catalog) | `references/configure/notebooks.md` |

### Configure Flow

1. Match the user's request to a reference file from the table above.

2. Detect what's running:
   ```bash
   echo "=== NIM ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(nim-llm|nemoretriever-embedding|nemoretriever-ranking|nemo-vlm|nemotron-vlm)' || echo "NO_LOCAL_NIMS"; echo "=== RAG ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(rag-server|ingestor-server|milvus)' || echo "NO_DOCKER_RAG"; echo "=== K8S ===" && kubectl get pods -n rag 2>/dev/null | head -5 || echo "NO_K8S"; echo "=== LIBRARY ===" && ps aux 2>/dev/null | grep -E '(nvidia_rag|uvicorn.*rag)' | grep -v grep || echo "NO_LIBRARY"
   ```

3. Use this table to determine platform, deployment type, and where config lives:

   | Local NIMs running? | RAG services running? | Deployment Type | Config Location |
   |---------------------|-----------------------|-----------------|-----------------|
   | Yes (Docker) | Any | Self-hosted | `deploy/compose/.env` |
   | No | Yes (Docker) | NVIDIA-hosted | `deploy/compose/nvdev.env` |
   | Yes (K8s pods) | Any | Self-hosted | `values.yaml` (NIM sections) |
   | No | Yes (K8s pods) | NVIDIA-hosted | `values.yaml` (envVars) |
   | — | Library processes | Library mode | `notebooks/config.yaml` |
   | No | No | Not running | Deploy first via `references/deploy.md` |

   Tell the user what you detected and ask to confirm. Example: "I see local NIM containers running (nim-llm-ms, nemoretriever-embedding-ms) — this is a self-hosted deployment. Config file is `deploy/compose/.env`. Correct?"

4. Check current feature state before changing anything — read the config location from step 3, then cross-check the live service:
   - Docker: `docker exec rag-server env 2>/dev/null | grep -E "<VAR_NAME>"`
   - Helm: `kubectl get pod -n rag -l app=rag-server -o jsonpath='{.items[0].spec.containers[0].env}' 2>/dev/null`

   If the config file and live service disagree, tell the user the service has stale config and will need a restart.

5. If the feature needs extra GPUs, check availability against hardware restrictions (see below):
   ```bash
   nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader 2>/dev/null || echo "NO_GPU"
   ```

6. Read the reference file and apply changes:
   - **Docker**: edit the env file (uncomment to enable, re-comment to disable — the env file is the source of truth). Then restart the affected service:
     ```
     source <env-file> && docker compose -f deploy/compose/<compose-file> up -d
     ```
     | Service | Compose File |
     |---------|-------------|
     | rag-server | `docker-compose-rag-server.yaml` |
     | ingestor-server | `docker-compose-ingestor-server.yaml` |
     | milvus, etcd, minio | `vectordb.yaml` |
     | NIM containers (LLM, embedding, ranking, VLM, OCR) | `nims.yaml` |
     | guardrails | `docker-compose-nemo-guardrails.yaml` |
     | observability (Grafana, Prometheus, Zipkin) | `observability.yaml` |
   - **Helm**: edit `values.yaml`, then upgrade: `helm upgrade rag <chart> -n rag -f values.yaml`
   - **Library**: edit `notebooks/config.yaml`, then restart the Python process

7. Verify:
   - Docker: `docker ps --format "table {{.Names}}\t{{.Status}}" | head -20; curl -s http://localhost:8081/v1/health?check_dependencies=true 2>/dev/null | head -1`
   - Helm: `kubectl get pods -n rag; kubectl rollout status deployment/rag-server -n rag --timeout=120s`
   - Library: `curl -s http://localhost:8081/v1/health 2>/dev/null | head -1`

8. If restart fails, read `references/troubleshoot.md`. If multiple features requested, repeat from step 1 for each.

### When User Says "Configure" Without Specifics

Run steps 2–3 above, then read the identified config file to list what's currently enabled:
```bash
grep -E "^(export )?(ENABLE_|APP_)" <config-file> 2>/dev/null | sort
```
Summarize what's running and enabled, then ask which feature to change.

---

## Hardware Restrictions

Read `docs/support-matrix.md` for current GPU requirements per deployment mode.
Read `docs/service-port-gpu-reference.md` for port mappings and GPU assignments.

| GPU | Feature Restrictions |
|-----|---------------------|
| B200 | No VLM, No Guardrails, No Nemotron Parse. May need multi-GPU LLM (`LLM_MS_GPU_ID`). |
| RTX PRO 6000 | No Nemotron Parse. No Audio on Helm. |