---
name: livekit-stt-selfhosted
description: Build self-hosted speech-to-text APIs using Hugging Face models (Whisper, Wav2Vec2) and create LiveKit voice agent plugins. Use when building STT infrastructure, creating custom LiveKit plugins, deploying self-hosted transcription services, or integrating Whisper/HF models with LiveKit agents. Includes FastAPI server templates, LiveKit plugin implementation, model selection guides, and production deployment patterns.
---

# LiveKit Self-Hosted STT Plugin

Build self-hosted speech-to-text APIs and LiveKit voice agent plugins using Hugging Face models.

## Overview

This skill provides templates and guidance for:
1. Building a self-hosted STT API server using FastAPI + Whisper/HF models
2. Creating a LiveKit plugin that connects to your self-hosted API
3. Deploying and scaling in production

## Quick Start

### Option 1: Build Both (API + Plugin)

When the user wants a complete setup:

1. **Create API Server**:
```bash
python scripts/setup_api_server.py my-stt-server --model openai/whisper-medium
cd my-stt-server
pip install -r requirements.txt
python main.py
```

2. **Create Plugin**:
```bash
python scripts/setup_plugin.py custom-stt
cd livekit-plugins-custom-stt
pip install -e .
```

3. **Use in LiveKit Agent**:
```python
from livekit.plugins import custom_stt

# Create the STT instance, then pass it as the `stt=` argument of your agent
stt = custom_stt.STT(api_url="ws://localhost:8000/ws/transcribe")
```

### Option 2: API Server Only

When the user only needs the API server:
- Use `scripts/setup_api_server.py` with the desired model
- See `references/api_server_guide.md` for implementation details
- Template in `assets/api-server/`

### Option 3: Plugin Only

When the user has an existing API and needs a LiveKit plugin:
- Use `scripts/setup_plugin.py` with the plugin name
- See `references/plugin_implementation.md` for details
- Template in `assets/plugin-template/`

## Model Selection

Help the user choose the right model:

| Use Case | Recommended Model | Rationale |
|----------|------------------|-----------|
| Best accuracy | `openai/whisper-large-v3` | Highest quality; needs a GPU for practical latency |
| Production balance | `openai/whisper-medium` | Good quality, reasonable speed |
| Real-time/fast | `openai/whisper-small` | Fast, acceptable quality |
| CPU-only | `openai/whisper-tiny` | Can run without GPU |
| English-only | `facebook/wav2vec2-large-960h` | Optimized for English |

For detailed comparison and optimization tips, see `references/models_comparison.md`.
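
The table above can be encoded as a small lookup so that a setup script or agent picks a model ID consistently. This is only a sketch: the use-case keys and the `pick_model` helper are hypothetical names, and the fallback to `openai/whisper-medium` is an assumption based on the "Production balance" row.

```python
# Model IDs are taken from the table above; the dictionary keys are
# hypothetical labels chosen for this example.
RECOMMENDED_MODELS = {
    "best_accuracy": "openai/whisper-large-v3",
    "production": "openai/whisper-medium",
    "realtime": "openai/whisper-small",
    "cpu_only": "openai/whisper-tiny",
    "english_only": "facebook/wav2vec2-large-960h",
}


def pick_model(use_case: str, default: str = "openai/whisper-medium") -> str:
    """Return the recommended model ID for a use case, or the default."""
    return RECOMMENDED_MODELS.get(use_case, default)


if __name__ == "__main__":
    print(pick_model("realtime"))
```

The result can be passed to `scripts/setup_api_server.py` through `--model`, as in the Quick Start.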

## Implementation Workflow

### Building the API Server

1. **Use the template**: Start with `assets/api-server/main.py`
2. **Key components**:
   - FastAPI app with WebSocket endpoint
   - Model loading at startup (kept in memory)
   - Audio buffer management
   - WebSocket protocol for streaming

3. **Customization points**:
   - Model selection (change `MODEL_ID` in .env)
   - Audio processing parameters
   - Batch size and optimization
   - Error handling

For complete implementation guide, see `references/api_server_guide.md`.
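
The audio buffer management component listed above could look roughly like the following. This is a minimal sketch, not the code in `assets/api-server/main.py`: the `AudioBuffer` class and its methods are hypothetical, and it assumes 16 kHz mono 16-bit PCM delivered in arbitrary frame sizes and transcribed in 1-second chunks.

```python
SAMPLE_RATE = 16_000   # Hz, the audio format used throughout this skill
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono
CHUNK_SECONDS = 1.0    # assumption: transcribe in 1-second chunks


class AudioBuffer:
    """Accumulates raw PCM bytes and returns them in fixed-size chunks."""

    def __init__(self, chunk_seconds: float = CHUNK_SECONDS) -> None:
        self._chunk_bytes = int(SAMPLE_RATE * chunk_seconds) * BYTES_PER_SAMPLE
        self._data = bytearray()

    def add(self, frame: bytes) -> list:
        """Append a frame and return every complete chunk now available."""
        self._data.extend(frame)
        chunks = []
        while len(self._data) >= self._chunk_bytes:
            chunks.append(bytes(self._data[: self._chunk_bytes]))
            del self._data[: self._chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left (for example at end of stream) and reset."""
        remaining = bytes(self._data)
        self._data.clear()
        return remaining
```

In a WebSocket handler, each binary message would go through `add()`, each returned chunk would be sent to the model, and `flush()` would run when the client signals the end of the stream.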

### Building the LiveKit Plugin

1. **Use the template**: Start with `assets/plugin-template/`
2. **Required implementations**:
   - `_recognize_impl()` - Non-streaming recognition
   - `stream()` - Return SpeechStream instance
   - `SpeechStream` class - Handle streaming

3. **Key considerations**:
   - Audio format conversion (16kHz, mono, 16-bit PCM)
   - WebSocket connection management
   - Event emission (interim/final transcripts)
   - Error handling and cleanup

For complete implementation guide, see `references/plugin_implementation.md`.
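
As an illustration of the audio format conversion step, the sketch below downmixes interleaved 16-bit stereo PCM to mono and converts 16-bit PCM to floats in the range [-1.0, 1.0). It uses only the standard library; the function names are hypothetical and not part of the plugin template. Resampling to 16 kHz is not covered here, and little-endian sample order is assumed.

```python
import struct


def stereo_to_mono(pcm: bytes) -> bytes:
    """Average interleaved 16-bit stereo samples (L, R, L, R, ...) into mono."""
    count = len(pcm) // 2  # number of complete 16-bit samples
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    # Step through left/right pairs; an unpaired trailing sample is dropped.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count - 1, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)


def pcm16_to_float(pcm: bytes) -> list:
    """Convert little-endian 16-bit PCM to floats in [-1.0, 1.0)."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [s / 32768.0 for s in samples]
```

A pure-Python loop is fine for short frames; for sustained throughput, a vectorized implementation would likely be preferable.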

## Deployment

### Development
```bash
# API Server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# WebSocket endpoint to test against (a URL for a WebSocket client, not a shell command):
# ws://localhost:8000/ws/transcribe
```

### Production

**Docker** (Recommended):
```bash
docker-compose up
```

**Kubernetes**: Use the manifests in the deployment guide (`references/deployment.md`)

**Cloud Platforms**: AWS ECS, GCP Cloud Run, Azure Container Instances

For complete deployment guide including scaling, monitoring, and security, see `references/deployment.md`.

## WebSocket Protocol

### Client → Server
- **Audio**: Binary frames (16-bit PCM, 16 kHz, mono)
- **Config**: `{"type": "config", "language": "en"}`
- **End**: `{"type": "end"}`

### Server → Client
- **Interim**: `{"type": "interim", "text": "..."}`
- **Final**: `{"type": "final", "text": "...", "language": "en"}`
- **Error**: `{"type": "error", "message": "..."}`
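
A server-side handler for these messages might be sketched as follows. The message shapes come from the protocol above; everything else is an assumption made for illustration: the function names, the `state` dictionary, the `"en"` default language, and the `transcribe` callable, which stands in for the real model call. Interim results are omitted for brevity.

```python
import json
from typing import Callable, Optional, Union


def new_session_state() -> dict:
    """Per-connection state; the "en" default is an assumption."""
    return {"audio": bytearray(), "language": "en"}


def handle_client_message(
    message: Union[str, bytes],
    state: dict,
    transcribe: Callable[[bytes, str], str],
) -> Optional[str]:
    """Process one client message and return a JSON reply, or None."""
    # Binary frames carry raw PCM audio.
    if isinstance(message, (bytes, bytearray)):
        state["audio"].extend(message)
        return None

    # Text frames carry JSON control messages.
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return json.dumps({"type": "error", "message": "invalid JSON"})
    if not isinstance(payload, dict):
        return json.dumps({"type": "error", "message": "expected a JSON object"})

    kind = payload.get("type")
    if kind == "config":
        state["language"] = payload.get("language", state["language"])
        return None
    if kind == "end":
        text = transcribe(bytes(state["audio"]), state["language"])
        state["audio"].clear()
        return json.dumps(
            {"type": "final", "text": text, "language": state["language"]}
        )
    return json.dumps({"type": "error", "message": f"unknown type: {kind!r}"})
```

Keeping the handler free of any WebSocket framework makes the protocol logic easy to unit test before wiring it into the FastAPI endpoint.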

## Common Tasks

### Change Model
Edit `.env`:
```bash
MODEL_ID=openai/whisper-small  # Faster model
```

### Add Language Support
In plugin usage:
```python
stt = custom_stt.STT(language="es")  # Spanish
# or
stt = custom_stt.STT(detect_language=True)  # Auto-detect
```

### Enable GPU
In the API server's environment (e.g. `.env`):
```bash
DEVICE=cuda:0  # Use GPU
```
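
On the server side, `MODEL_ID` and `DEVICE` could be read in one place at startup. The sketch below is an assumption about how that might look, not the template's actual code: the `Settings` class, `load_settings` helper, and both default values are hypothetical. It also assumes the `.env` values have already been exported into the process environment (for example by Docker Compose or a dotenv loader).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_id: str
    device: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read server settings from the environment, with assumed defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_id=env.get("MODEL_ID", "openai/whisper-medium"),  # assumed default
        device=env.get("DEVICE", "cpu"),  # assumed default: CPU unless told otherwise
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.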

### Scale Horizontally
Deploy multiple API server instances behind a load balancer. See `references/deployment.md` for Nginx configuration.

## Troubleshooting

### Out of Memory
- Use a smaller model (`whisper-small` or `whisper-tiny`)
- Reduce `batch_size` in the pipeline
- Enable `low_cpu_mem_usage=True` when loading the model

### Slow Transcription
- Ensure the GPU is enabled (`DEVICE=cuda:0`)
- Use FP16 precision (automatic on GPU)
- Increase `batch_size`
- Use a smaller model

### Connection Issues
- Verify that the load balancer supports WebSocket upgrades
- Check firewall rules
- Increase timeout settings

## Scripts

- `scripts/setup_api_server.py` - Generate API server from template
- `scripts/setup_plugin.py` - Generate LiveKit plugin from template

## References

Load these as needed for detailed information:

- `references/api_server_guide.md` - Complete API implementation guide
- `references/plugin_implementation.md` - LiveKit plugin development
- `references/models_comparison.md` - Model selection and optimization
- `references/deployment.md` - Production deployment best practices

## Assets

Ready-to-use templates:

- `assets/api-server/` - Complete FastAPI server with Whisper
- `assets/plugin-template/` - LiveKit STT plugin structure

## Best Practices

1. **Keep models in memory** - Load once at startup, not per request
2. **Use appropriate model size** - Balance quality vs. speed for your use case
3. **Process audio in chunks** - 1-second chunks work well for streaming
4. **Implement proper cleanup** - Close WebSocket connections gracefully
5. **Monitor metrics** - Track latency, throughput, GPU utilization
6. **Use Docker** - Ensures consistent deployments
7. **Enable authentication** - Secure production APIs
8. **Scale horizontally** - Use a load balancer for high availability