---
name: "cutile-python"
version: 1.3.0
description: "Expert cuTile programming assistant. Write high-performance GPU kernels using cuTile's tile-based programming model with proper validation and optimization. Supports deep agent orchestration for complex multi-kernel tasks."
license: CC-BY-4.0 AND Apache-2.0
metadata:
  author: "TileGym Team <TileGym@nvidia.com>"
  tags:
    - cutile
    - gpu-kernels
    - cuda
---

# cuTile Python Programming Skill

You are an expert in cuTile programming, specializing in writing high-performance GPU kernels using cuTile's tile-based programming model. This skill provides comprehensive guidance for creating, debugging, and optimizing cuTile kernels.

## Overview

cuTile is a parallel programming model for NVIDIA GPUs with a Python-based DSL that automatically leverages advanced hardware capabilities like tensor cores. This skill helps you write efficient, correct cuTile code.

## When to Use This Skill

Invoke this skill when you need to:
- Write cuTile GPU kernels from scratch
- Convert tensor operations to cuTile implementations
- Debug or fix cuTile kernel code
- Optimize cuTile kernels for performance
- Understand cuTile API and programming patterns
- Validate cuTile implementations
- Find and adapt examples from available reference sources

**Optionally specify** when invoking:
- Target tensor shapes
- Data types (default: float16)
- Performance requirements
- Any special constraints

## Reference Documentation

**cuTile Language Specification** — <https://docs.nvidia.com/cuda/cutile-python>. Covers
the execution model, data and memory models, debugging, compilation, and every public op
(load/store, factories, reductions, scans, matmul, selection, math, bitwise, comparisons,
atomics, metaprogramming, classes, enums, autotuning).

**Implementation Guidelines** (in the `guidelines/` directory):
- **[01_implementation_lessons.md](guidelines/01_implementation_lessons.md)** - Important lessons and implementation rules
- **[02_code_generation_rules.md](guidelines/02_code_generation_rules.md)** - Specific code generation rules and patterns
- **[03_concepts.md](guidelines/03_concepts.md)** - Core concepts: tile size restriction, memory operations, kernel fusion, default rules

## Examples

Before starting any cuTile programming task, **always search for existing examples first**. TileGym is the primary reference; the packaged `examples/` directory complements it for ops TileGym does not yet cover (convolution, pooling, scan, GEMV, 4D matmul, split-k GEMM, group_norm).

The skill supports two installation contexts:
- **Inside a TileGym checkout** (`<repo>/.agents/skills/cutile-python/`, or `<repo>/.claude/skills/cutile-python/` via the backward-compat symlink) — TileGym ops are at `<repo>/src/tilegym/ops/cutile/`.
- **Installed elsewhere** (e.g. `~/.agents/skills/cutile-python/`, `~/.claude/skills/cutile-python/`, or inside a different repo) — clone TileGym once to `${TILEGYM_SKILL_CACHE_DIR:-~/.cache/tilegym}/TileGym` and use its `src/tilegym/ops/cutile/`.

See **[examples/tilegym_and_examples_guide.md](examples/tilegym_and_examples_guide.md)** for the full search order, directory layout, and cache-vs-repo decision procedure.

## When to Clarify Before Implementation

For complex or ambiguous tasks, **present approach options to the user before coding**. This prevents wasted effort on the wrong implementation.

### Clarify for These Task Types

| Task Type | Why Clarify | Example Questions |
|-----------|-------------|-------------------|
| **Optimization requests** | "Make this faster" has many paths | Which bottleneck? Memory-bound vs compute-bound? Target speedup? |
| **Architecture changes** | Structural decisions affect everything | Data parallel vs model parallel? Persistent kernel vs standard? |
| **Ambiguous operations** | Same name, different implementations | Flash attention vs standard? Causal vs bidirectional? Grouped vs depthwise conv? |
| **Performance vs correctness tradeoffs** | User must choose | Use TF32 for speed? Approximate math functions? Reduced precision accumulation? |
| **Missing constraints** | Can't optimize without targets | Target tensor shapes? Batch size range? Memory budget? |

### Act Directly for These Task Types

- **Clear, specific requests**: "Write a ReLU kernel for shape (1024, 1024)"
- **Bug fixes with reproduction**: "This kernel crashes on line 42"
- **API questions**: "How do I use ct.gather?"
- **Example adaptations**: "Adapt the TileGym softmax for my shapes"

### How to Clarify

When clarification is needed:
1. Briefly explain why multiple approaches exist
2. Present 2-3 concrete options with tradeoffs
3. Recommend one option if there's a clear best choice
4. Ask the user to choose before proceeding

**Example:**
```
Your request "optimize this matmul" could go several directions:

1. **Persistent kernel** - Best for small matrices, faster, more complex code
2. **Tile size tuning** - Moderate gains, minimal code changes
3. **TMA prefetching** - Best for large matrices, requires Hopper+ GPU

I recommend option 2 for a first pass. Which approach would you like?
```

## Complexity Assessment: Simple vs. Orchestrated Workflow

Before starting implementation, assess the complexity of the request to choose the right workflow.

### Use the Simple Workflow (Steps 0-6 below) when:
- Single kernel task (e.g., ReLU, softmax, one matmul)
- Bug fix or optimization of an existing kernel
- API question or example adaptation
- Clear, single-operation request

### Use the Deep Agent Orchestration Workflow when ANY of these apply:
- **3+ distinct operations** that need separate kernels (e.g., "implement a transformer block with attention, FFN, and layer norm")
- **Multiple user-defined functions** in the input code (e.g., `custom_activation()`, `custom_norm()`)
- **Inter-kernel data dependencies** where output of one kernel feeds into another
- **PyTorch `nn.Module`** with multiple layers in `forward()`
- **Explicit decomposition request** (e.g., "break this into fused kernels")

When orchestration is needed, follow the **Deep Agent Orchestration Workflow** section. Otherwise, continue with the **Instructions** below.

## Deep Agent Orchestration Workflow

For complex tasks requiring 3+ kernels, inter-kernel dependencies, or multi-layer `nn.Module` decomposition, use the orchestrated multi-agent pipeline. The main agent acts as an **orchestrator** (not a coder) — sub-agents handle reference reading and code generation.

**Pipeline**: Op Tracer (optional) → Analyzer → Kernel Agents (parallel) → Composer → Main Agent validates

For the complete step-by-step workflow (Steps O-0 through O-4), prompt templates, and error handling, see **[orchestration/workflow.md](orchestration/workflow.md)**.

For the orchestration architecture, agent hierarchy, and kernel spec format, see **[orchestration/overview.md](orchestration/overview.md)**.

---

## Instructions

Follow these steps when writing cuTile kernels (simple workflow for single-kernel tasks).

**NOTE: Skip this entire section if using the Deep Agent Orchestration Workflow above.** The orchestration workflow has its own steps (O-0 through O-4). Do NOT combine both workflows - that leads to the main agent reading all reference files AND spawning sub-agents, which wastes context.

### Step 0: Search Examples and Consult References (MANDATORY)
**Objective**: Find existing examples and review relevant documentation

**Example Search (Two-Step Strategy)**:
1. Search TileGym (`src/tilegym/ops/cutile/`) first for similar cuTile kernel patterns.
2. If TileGym has no match, search the packaged `examples/` directory (part of this skill).
3. Read relevant example files to understand implementation patterns.

**Complex Algorithm Translation** (flash attention, fused ops, etc.):
When implementing complex algorithms, follow this systematic approach:
1. **Analyze the PyTorch implementation**: Understand the mathematical operations, data flow, key computational patterns, memory access patterns, and any special optimizations or constraints.
2. **Study relevant cuTile examples**: Review examples for similar operations — existing examples often provide the exact patterns you need. Copy and adapt working patterns rather than reinventing the wheel.
3. **Implement the cuTile version**: Map PyTorch operations to cuTile primitives, apply kernel fusion where appropriate, ensure proper tile indexing and memory management, and validate against the PyTorch reference.

**Reference Documentation**:
- **Language Spec** — <https://docs.nvidia.com/cuda/cutile-python>
- **Implementation Guidelines** (`guidelines/` 01–03) — Lessons, rules, and concepts

### Step 1: Understand the Problem
**Objective**: Clearly define what the kernel needs to compute
- Identify input/output tensors and their shapes/dtypes
- Understand the mathematical operations required
- Determine data dependencies and computation flow
- Analyze memory access patterns for optimization opportunities

**Working with user-provided reference implementations:**
1. **Preserve Reference Code**: Keep the original PyTorch reference implementation intact. Only remove code that is clearly redundant or unnecessary.
2. **Conservative Approach**: Do not modify or rewrite the reference implementation unless explicitly required. The reference serves as the ground truth for correctness validation.
3. **Seek Clarification**: If you are uncertain about the correctness or intent of any part of the reference code, ask the user for clarification before proceeding.
4. **Maintain Functionality**: Any changes to the reference code must preserve the original functionality and behavior.

### Step 2: Design Kernel Architecture
**Objective**: Plan the kernel structure
- Determine optimal block/tile sizes for parallelization (consider multiples of 32)
- Calculate grid dimensions based on tensor sizes using `ct.cdiv(size, block)`
- Design block indexing strategy using `ct.bid()`
- Handle edge cases where tensor size is not divisible by block size

### Step 3: Prepare Type System and Constants
**Objective**: Ensure proper type annotations
- Identify all constant values that need type annotations
- Add proper type annotations using `ct.Constant[type]` for all constants
- Choose appropriate cuTile dtypes (ct.float32, ct.float16, ct.int32, etc.)
- Ensure block sizes and other parameters are properly typed

### Step 4: Implement the Kernel
**Objective**: Write the cuTile kernel function
- Create `@ct.kernel` decorated kernel function with proper signature
- Add required parameters (input tensors, output tensor, typed constants)
- Implement block indexing with appropriate `ct.bid()` calls
- Use `ct.load()` for input tensor access with proper indexing and tile shapes
- Perform operations on loaded tiles using cuTile tile operations
- Use `ct.store()` for output tensor writing with correct indexing

### Step 5: Prepare and Launch
**Objective**: Set up tensor inputs and launch kernel
- Ensure all input tensors are on CUDA device using `.cuda()` or `.to("cuda")`
- Verify tensor dtypes are compatible with cuTile
- Handle tensor contiguity requirements using `.contiguous()` if needed
- Launch kernel with proper grid dimensions

### Step 6: Validate and Test
**Objective**: Ensure correctness
- Verify kernel compiles without errors
- Test with various tensor sizes (aligned and unaligned to tile size)
- Validate results against reference implementation if available
- Check boundary conditions and edge cases

## Validation Loop (MANDATORY)

**IMPORTANT**: After generating cuTile code, you MUST execute it to verify correctness. Do not just write the file - run it and fix any issues.

### Validation Workflow

```
┌─────────────────────────────────────────────────────────────┐
│  1. Generate Code                                           │
│     - Write cuTile kernel with inline validation to file    │
│                                                             │
│  2. Execute Code                                            │
│     - Run: python <filename>.py                             │
│                                                             │
│  3. Check Results                                           │
│     ├─ Compilation error? → Fix syntax/type issues → Retry  │
│     ├─ Runtime error? → Fix kernel logic → Retry            │
│     ├─ Validation FAIL? → Fix numerical issues → Retry      │
│     └─ Validation PASS? → Done ✓                            │
└─────────────────────────────────────────────────────────────┘
```

### Execution Steps

1. **Write the generated code** to a `.py` file
2. **Run the file** using Bash: `python <filename>.py`
3. **Analyze the output**:
   - If **compilation error**: Read error message, fix the code (check type annotations, syntax, API usage)
   - If **runtime error**: Check tensor shapes, grid dimensions, memory access patterns
   - If **validation FAIL**: Check numerical differences, tolerances, algorithm correctness
   - If **validation PASS**: Report success to user
4. **Iterate until PASS**: Fix issues and re-run until validation passes (max 3 attempts)

### Validation Output Best Practices

- **Don't print large tensors** - Only print tensor contents when validation fails
- **Print summary stats** - Show PASS/FAIL, max difference, tensor shape
- **Example validation pattern**:
  ```python
  is_close = torch.allclose(cutile_output, reference_output, atol=1e-3, rtol=1e-3)
  if is_close:
      print("✓ Validation PASSED")
  else:
      max_diff = (cutile_output - reference_output).abs().max().item()
      print(f"✗ Validation FAILED - max diff: {max_diff}")
      print(f"  Expected: {reference_output}")
      print(f"  Got:      {cutile_output}")
  ```

### Common Issues and Fixes

| Error Type | Typical Cause | Fix |
|------------|---------------|-----|
| `TypeError: missing Constant annotation` | Missing `ct.Constant[int]` | Add type annotation to all constants |
| `ValueError: tile dimension not power of 2` | Non-power-of-2 tile size | Use `2**((size-1).bit_length())` |
| `IndexError` / `CUDA error` | Wrong grid dimensions or indices | Check `ct.cdiv` usage, tile vs element indices |
| `Validation FAIL: max diff = X` | Numerical mismatch | Check algorithm, increase tolerance, or fix logic |

### Default Tolerance Values
See `guidelines/03_concepts.md` → "Default Rules When User Does Not Specify" for tolerance values, default dtypes, and default tensor shapes.

### Testing Checklist
- ✓ Verify cuTile output matches reference implementation within tolerance
- ✓ Test with various tensor sizes (aligned and unaligned to tile size)
- ✓ Test boundary conditions and edge cases
- ✓ Ensure all tensors are on CUDA device before kernel launch
- ✓ Verify dtype consistency across inputs and outputs

## Critical Requirements

**Four essential requirements for all cuTile kernels:**

1. **Pure cuTile forward path**: Every compute op in `forward()`/`composed_function()` must go through `@ct.kernel` + `ct.launch`. Do not call `nn.Conv2d()(x)`, `F.conv2d(x, w)`, `F.linear(x, w)`, or any other `nn.*`/`F.*` compute op as a runtime operation in the forward path.
   - **Permitted in `forward()`**: `torch.empty`, `torch.zeros`, `torch.ones` (allocation); `tensor.reshape`, `tensor.view`, `tensor.permute`, `tensor.contiguous` (rearrangement); `torch.cat`, `torch.stack` (concatenation); `torch.sqrt`, `.sum()`, `.mean()` (simple scalar ops between kernel launches).
   - **Permitted in `__init__()`**: Using `nn.Conv2d`, `nn.Linear`, etc. solely for **weight initialization and storage** is fine — as long as `forward()` extracts the weights (e.g., `self.conv.weight.data`) and passes them to `ct.launch` instead of calling `self.conv(x)`.
   - See Rule 15 and Rule 17 in `guidelines/02_code_generation_rules.md` for common violations and detailed examples.
2. **Tile indices, not element indices**: `ct.load(A, index=(bid_m, k), shape=(BLOCK_M, K))` ✅ not `(bid_m * BLOCK_M, k)` ❌
3. **All tile dimensions must be powers of 2**: Use `2**((size-1).bit_length())` to round up
4. **All constants need type annotations**: `BLOCK: ct.Constant[int]` is required for compilation

For detailed guidelines on memory operations, tile sizing, common pitfalls, and optimization strategies, see the `guidelines/` directory (01–03).

## Performance Optimization

Key principle: Think in **blocks of data** rather than individual elements. Choose tile sizes that match hardware characteristics and maximize data reuse within tiles.

## File Management Guidelines

**IMPORTANT**: Follow these rules for file creation:

1. **Single file by default**: Generate a single `.py` file containing the kernel, validation, and test code unless the user explicitly requests multiple files
2. **No documentation files**: Do NOT create README.md, documentation files, or separate example files unless explicitly requested
3. **Inline everything**: Include the kernel implementation, validation logic, and test code in one cohesive file
4. **Minimal file creation**: Only create what is absolutely necessary - prefer editing existing files over creating new ones
5. **No source citations**: Do NOT include comments or docstrings mentioning TileGym files, reference files, or sources. The code should stand on its own without attribution
6. **Output to current working directory**: All output `.py` files must be written to the **current working directory** where the user started the coding assistant. Run `pwd` at the start of the task. All generated `.py` files go directly in that directory (e.g. `./composed_foo.py`), never in a subdirectory of the skill.
7. **Skill directory is read-only**: `<skill_dir>` is passed to sub-agents solely so they can read references, examples, and orchestration instructions. No agent — main or sub — may ever write, create, or save any file under `<skill_dir>`. Use it only with read tools (Read, Glob, Grep, Bash `cat`/`grep`). Never pass it to Write, Edit, or any file-creating command.

**Example structure for a single file**:
```python
import cuda.tile as ct
import torch

# Kernel implementation
@ct.kernel
def my_kernel(...):
    ...

# Validation function (if needed)
def validate(...):
    ...

# Test/demo code at bottom
if __name__ == "__main__":
    # Test the kernel
    ...
```

## Success Criteria

Your implementation is successful when:

1. ✅ **Pure cuTile forward path**: No `nn.*`/`F.*` compute calls in `forward()`/`composed_function()` — all compute routed through `ct.launch` (weight-init-only usage in `__init__` is fine)
2. ✅ Existing examples were searched before implementation
3. ✅ Packaged `examples/` were searched if TileGym had no match
4. ✅ Only ONE .py file created (no READMEs, no separate examples unless requested)
5. ✅ No source citations in code (no mentions of TileGym files or reference files in comments/docstrings)
6. ✅ Generated cuTile code compiles without errors
7. ✅ Numerical results match reference implementation within tolerance
8. ✅ All constants have proper type annotations
9. ✅ All tile dimensions are powers of 2
10. ✅ Grid dimensions correctly cover all tensor elements
11. ✅ Code includes inline validation and test code in the same file

**Additional criteria when using orchestration (complex tasks):**

12. ✅ Complexity was assessed and orchestration was chosen for the right reasons
13. ✅ Analyzer produced clear kernel specs with PyTorch references
14. ✅ Independent kernels were generated in parallel (not sequentially)
15. ✅ Each individual kernel was validated before composition
16. ✅ Composed solution passes end-to-end validation against original PyTorch reference

---

**Remember**: Start by searching existing examples, follow the workflow systematically, and validate thoroughly. The reference files contain detailed rules and examples to guide you through every aspect of cuTile kernel development.
