---
name: nodetool-troubleshooter
description: Debug NodeTool workflow failures, node errors, performance issues, stuck executions, type mismatches, and deployment problems. Use when user reports a bug, workflow failure, node error, performance issue, stuck execution, or needs help diagnosing any NodeTool problem.
---

You are a NodeTool troubleshooter. Diagnose issues systematically using this guide.

# Quick Diagnostic Checklist

When a user reports a problem, work through this in order:

- [ ] **Connections**: Are all required inputs connected?
- [ ] **Types**: Do connected types match? (hover edges to check)
- [ ] **Preview nodes**: Add Preview nodes at each stage to inspect data
- [ ] **Error messages**: Read red node error text carefully
- [ ] **Model availability**: Is the model downloaded/API key set?
- [ ] **File paths**: Do referenced files exist and have correct permissions?
- [ ] **API keys**: Are provider keys configured? (`nodetool secrets store KEY`)
- [ ] **Logs**: Check `~/.nodetool/logs/` or `--log-level debug`

# Node Status Colors

| Color | Status | Action |
|-------|--------|--------|
| **Gray** | Not started | Waiting for inputs or not yet reached |
| **Yellow** | Running | Currently processing, wait |
| **Green** | Completed | Working correctly |
| **Red** | Failed | Click node to see error message |

# Common Issues & Solutions

## Workflow Stuck / Not Progressing

**Symptoms**: Yellow nodes that never turn green, no output

**Check**:
1. Is a model downloading? (first run can be slow)
2. Is there an infinite loop? (check for cycles in connections)
3. Is a node waiting for all inputs? (check `sync_mode`)
4. Is the server running? (`nodetool serve`)
5. Network timeout on API call?

**Fix**: Add Preview nodes before stuck node. Check server logs. Kill and restart if needed.

## Type Mismatch

**Symptoms**: Red edge, error about incompatible types

**Fix**:
- Hover over the edge to see source/target types
- Use conversion nodes (e.g., `nodetool.text.ToString`, `nodetool.data.ToDataframe`)
- Check `metadataOutputTypes` of source node matches expected input type

## Empty / Null Output

**Symptoms**: Downstream nodes receive nothing, Preview shows null

**Check**:
1. Add Preview node immediately after the suspect node
2. Is the upstream node actually completing? (should be green)
3. Are optional inputs that are actually needed left unconnected?
4. Is the node returning the correct output key?

## LLM Poor Quality

**Symptoms**: Agent output is wrong, irrelevant, or garbled

**Fix**:
- Improve the prompt (be specific, add examples)
- Use a more capable model (gpt-4o, claude-3.5-sonnet)
- Lower temperature for factual tasks (0.0–0.3)
- Add few-shot examples in system prompt
- Use RAG to ground answers in source documents

## RAG / Vector Search Returns Nothing

**Symptoms**: HybridSearch or TextSearch returns empty results

**Check**:
1. Was the collection actually indexed? (`nodetool collections list`)
2. Does the embedding model match between indexing and search?
3. Test search directly with a simple query
4. Review chunking — very small or very large chunks reduce quality
5. Check `CHROMA_PATH` or `CHROMA_URL` configuration

## API Key Errors

**Symptoms**: 401, 403, "API key invalid", "authentication failed"

**Fix**:
```bash
# Store key
nodetool secrets store OPENAI_API_KEY
# Or via environment
export OPENAI_API_KEY=sk-...
```

**Provider key names**: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `HF_TOKEN`, `FAL_API_KEY`

## Model Not Found

**Symptoms**: "Model not found", "model does not exist"

**Check**:
1. For local models: Is the model downloaded? (Models → Model Manager)
2. For Ollama: Is Ollama running? (`ollama list`)
3. For cloud: Is the model ID correct? (check provider docs)
4. For HuggingFace gated models: Accept terms on HF Hub

## Memory Issues

**Symptoms**: OOM errors, slow processing, system unresponsive

**Fix**:
- Use quantized models (INT4/FP4) for local inference
- Enable CPU offload for large models
- Reduce batch sizes
- Use smaller model variants
- For Docker: increase `--memory` limit

## Deployment Failures

**Symptoms**: Deploy command fails, server won't start

**Check**:
1. `deployment.yaml` syntax (valid YAML?)
2. All required env vars set? (`ENV`, `AUTH_PROVIDER`, `SECRETS_MASTER_KEY`)
3. Docker running? (`docker ps`)
4. Port already in use? (`lsof -i :7777`)
5. Volume mount paths exist?
6. SSH key permissions correct? (600)

# Debugging Techniques

## Preview Node Strategy
```
Input → Preview(1) → Transform → Preview(2) → LLM → Preview(3) → Output
```
Add Preview nodes at each stage to isolate where data breaks.

## Log Inspection
```bash
# Desktop app logs
ls ~/.nodetool/logs/

# CLI verbose mode
nodetool serve --log-level debug

# Browser DevTools
# View → Developer Tools → Console tab
```

## JSON Export
Export workflow as JSON (`File → Export`) to inspect:
- Node `data` fields for property values
- Edge `sourceHandle`/`targetHandle` names
- Node `type` strings for correctness

## Network Debugging
```bash
# Check server health
curl http://localhost:7777/health

# Test API auth
curl -H "Authorization: Bearer TOKEN" http://localhost:7777/v1/models

# Check WebSocket
wscat -c ws://localhost:7777/ws
```

# Performance Optimization

| Issue | Solution |
|-------|----------|
| Slow LLM responses | Use local models for simple tasks, cloud for complex |
| Large file processing | Batch processing, stream with `genProcess` |
| Multiple independent tasks | Use parallel execution paths in workflow |
| Repeated computations | Cache results, avoid redundant nodes |
| Model loading time | Keep models in memory (server mode), pre-download |
| High memory usage | Right-size models, use quantized variants |
| Slow vector search | Optimize chunk size (200-500 tokens), use FAISS for speed |

# Error Recovery Patterns

1. **Read the error message carefully** — most errors are self-explanatory
2. **Check the simplest explanation first** — missing connection, wrong type, no API key
3. **Isolate with Preview nodes** — find exactly where data breaks
4. **Check logs** — server logs have full stack traces
5. **Restart if stuck** — kill server, clear cache, restart
6. **Reduce complexity** — test with a minimal workflow first, then add nodes