Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking.
Build automated LLM evaluation pipelines with benchmarks, regression tests, RAGAS, and human eval workflows.
Comprehensive guide to using LLMs throughout the game development lifecycle, from design to implementation to testing. Use when "ai game development, llm game dev, claude game, gpt…
LLM gateway and routing configuration using OpenRouter and LiteLLM. Invoke when: - Setting up multi-model access (OpenRouter, LiteLLM) - Configuring model fallbacks and…
Design input and output guardrails for an LLM app — decide what to check (injection patterns, PII, secrets, policy, schema, leakage, toxicity), place them as input vs.
Optimize LLM response latency with caching and architecture. TRIGGERS - Use when user needs help with llm-latency-optimizer related tasks.
LLM inference load testing for throughput and concurrency limits. Token/s benchmarks, concurrent request sweeps, latency-vs-throughput curves, and breaking-point identification.
Implement semantic and exact-match caching for LLM responses to reduce cost 40-60% and latency. Activate on: LLM caching, semantic cache, reduce API costs, cache AI responses.
LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns.
Build production LLM streaming UIs with Server-Sent Events, real-time token display, cancellation, error recovery. Handles OpenAI/Anthropic/Claude streaming APIs.
Install and configure LLMem for an agent harness. Handles CLI install, plugin deployment, skill registration, and provider setup.
Use the CLI info command to summarize what llms-txt-php-cli detects/configures for the current repository.
Guide the user to generate an initial llms.txt for a repository using llms-txt-php-cli, choosing sensible defaults and verifying output.
Validate an existing llms.txt with llms-txt-php-cli and guide the user through fixing validation errors.
LMCache multiprocess (MP) mode — standalone LMCache server in its own pod/process that vLLM connects to over ZMQ.
Builds and queries code knowledge graph for dependency analysis, references, implementations, and architecture overview.
Creates core project docs (requirements, architecture, tech stack, patterns catalog). Use for any project regardless of type.
Creates infrastructure.md and runbook.md (Docker-conditional). Use for DevOps documentation in any project.
Creates reference docs (ADRs, guides, manuals) for nontrivial tech stack choices. Use when project needs justified architecture decision records.
Creates test documentation (testing-strategy.md, tests/README.md) with Risk-Based Testing philosophy. Use when setting up test strategy for a project.
Executes test tasks (label 'tests') through Todo to To Review with risk-based limits. Use for test task execution. Not for implementation tasks.
Worker that checks DRY/KISS/YAGNI/architecture compliance with quantitative Code Quality Score. Validates architectural decisions via MCP Ref: (1) Optimality - is chosen approach…
Orchestrates test planning pipeline (research → manual → auto tests). Coordinates ln-511, ln-512, ln-513. Invoked by ln-500-story-quality-gate.
Checks DRY/KISS/YAGNI/architecture compliance with quantitative Code Quality Score. Use when implementation tasks are Done and need quality scoring.
Performs manual testing of Story AC via executable bash scripts saved to tests/manual/. Creates reusable test suites per Story. Worker for ln-510.
Auto-fixes low-risk tech debt (unused imports, dead code, commented-out code) with >=90% confidence. Use when audit findings need safe automated cleanup.
Plans automated tests (E2E/Integration/Unit) using Risk-Based Testing after manual testing. Calculates priorities, delegates to ln-301-task-creator. Worker for ln-510.
Analyzes application logs: classifies errors, checks log quality, maps stack traces to source. Use when logs need review after test runs or during development.
Orchestrates test planning pipeline: research, manual testing, automated test planning. Use when Story needs comprehensive test coverage planning.
Performs manual testing of Story AC via executable bash scripts in tests/manual/. Use when Story implementation needs hands-on AC verification.
Plans automated tests (E2E/Integration/Unit) using Risk-Based Testing after manual testing. Use when Story needs a test task with prioritized scenarios.
Audit code comments and docstrings quality across 6 categories (WHY-not-WHAT, Density, Forbidden Content, Docstrings, Actuality, Legacy).
Architecture audit worker (L3). Checks DRY (7 types), KISS/YAGNI, layer breaks, error handling, DI patterns. Returns findings with severity, location, effort, recommendations.
Use when auditing the test surface through the evaluation platform with mandatory research, coordinated test audit workers, and structured summaries.
Detects tests validating framework/library behavior instead of project code. Use when auditing test business logic focus.
Validates E2E coverage for critical paths (money, security, data integrity). Risk-based prioritization. Use when auditing E2E test coverage.
Scores each test by Impact x Probability, returns KEEP/REVIEW/REMOVE decisions. Use when auditing test value and pruning low-value tests.
Identifies missing tests for critical paths (money, security, data integrity, core flows). Use when auditing test coverage gaps.
Checks test isolation (API/DB/FS/Time/Network), determinism, flaky tests, order-dependency, anti-patterns. Use when auditing test isolation.
Checks manual test scripts for harness adoption, golden files, fail-fast, config sourcing, idempotency. Use when auditing manual test quality.
Checks test file organization, directory layout, test-to-source mapping, domain grouping, co-location. Use when auditing test structure.
Audits assertion strength and test oracles that prove real defects. Use when finding weak tests that execute code but prove little.
Checks layer boundary violations, transaction boundaries, session ownership, cross-layer consistency. Use when auditing architecture layers.
Checks layer, resource ownership, and orchestration boundaries. Use when auditing architecture boundary enforcement.
Builds dependency topology, detects cycles, validates import rules, and calculates coupling metrics. Use when auditing architecture topology.
Finds architecture-level modernization opportunities: obsolete custom mechanisms, overbuilt extension points, and simplifiable architecture.
Audits architecture config boundaries: typed settings, scattered env reads, config leakage, and layer ownership. Use for config architecture.
Checks redundant fetches, N+1 loops, over-fetching, missing bulk operations, wrong caching scope. Use when auditing query efficiency.
Scaffolds new or restructures existing projects to Clean Architecture. Use when setting up project structure.
Scaffolds new React projects or restructures monoliths to component-based architecture. Use when setting up frontend structure.
Removes platform-specific artifacts from Replit, StackBlitz, CodeSandbox, Glitch. Use when preparing exported projects for production.
Sets up Docker, CI/CD, and environment configuration with auto-detection. Use when adding DevOps infrastructure to a project.
Sets up test infrastructure with Vitest, xUnit, and pytest. Use when adding testing frameworks and sample tests to a project.
Configures structured JSON logging with Serilog (.NET) or structlog (Python). Use when adding logging to backend projects.
Configures global exception handling middleware. Use when adding centralized error handling to .NET or Python backends.
Configures health check endpoints for Kubernetes readiness/liveness/startup probes. Use when deploying to Kubernetes.