---
name: venice-audio-speech
description: Generate speech from text via POST /audio/speech. Covers TTS models (Kokoro, Qwen 3, xAI, Inworld, Chatterbox, Orpheus, ElevenLabs Turbo, MiniMax, Gemini Flash), voices per family, output formats (mp3/opus/aac/flac/wav/pcm), streaming, prompt/emotion styling, temperature/top_p, and language hints.
---

# Venice TTS (`/audio/speech`)

`POST /api/v1/audio/speech` converts text to an audio stream or file. OpenAI-compatible — the OpenAI SDK's `audio.speech.create()` works as a drop-in.

## Use when

- You want narration, voice replies, or UI audio from text.
- You need a specific voice family (ElevenLabs, Kokoro, xAI, Qwen 3, Orpheus, Chatterbox, MiniMax, Inworld, Gemini Flash).
- You want streaming audio returned sentence-by-sentence.
- You need style/emotion control on supported models.

For music generation (lyrics + instrumental), see [`venice-audio-music`](../venice-audio-music/SKILL.md). For transcription (audio → text), see [`venice-audio-transcription`](../venice-audio-transcription/SKILL.md).

## Minimal request

```bash
curl https://api.venice.ai/api/v1/audio/speech \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-xai-v1",
    "voice": "eve",
    "input": "Hello, welcome to Venice Voice.",
    "response_format": "mp3",
    "speed": 1.0,
    "streaming": false
  }' --output hello.mp3
```

Response is the raw audio (`Content-Type` matches `response_format`).

## Request schema

| Field | Type | Default | Notes |
|---|---|---|---|
| `input` | string | — | **Required.** Up to **4096** characters. |
| `model` | enum | `tts-kokoro` (OpenAPI schema default) | See model list below. `tts-xai-v1` is the recommended frontier default; pick the model that fits your voice + language needs. |
| `voice` | enum | model-specific (e.g. `eve` for `tts-xai-v1`) | **Voice is model-specific** — wrong combo = `400`. See voice families. |
| `response_format` | `mp3` / `opus` / `aac` / `flac` / `wav` / `pcm` | `mp3` | `pcm` returns raw 24 kHz 16-bit signed little-endian samples for pipelines. |
| `speed` | number | `1.0` | Range `0.25–4.0`. |
| `streaming` | bool | `false` | `true` → streamed sentence-by-sentence as audio continues to generate. |
| `language` | string | — | Optional hint. Accepted form depends on model (Qwen 3 = full names like `English`; xAI / ElevenLabs = ISO 639-1 like `en`; MiniMax = full names). Unsupported values silently ignored. |
| `prompt` | string, ≤ 500 | — | Emotion / style cue. Only for models with `supportsPromptParam` (Qwen 3 currently). Examples: *"Very happy."*, *"Sad and slow."*. |
| `temperature` | 0–2 | — | Sampling temperature. Only for models with `supportsTemperatureParam` (Qwen 3, Orpheus, Chatterbox HD). |
| `top_p` | 0–1 | — | Only Qwen 3 currently. |
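The schema above can be sketched as a TypeScript type plus a client-side pre-flight check. This is a hypothetical helper, not part of any SDK — the API performs its own validation; field names and limits mirror the table:

```typescript
// Request body for POST /audio/speech, mirroring the schema table above.
interface SpeechRequest {
  model: string
  voice: string
  input: string
  response_format?: 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm'
  speed?: number
  streaming?: boolean
  language?: string
  prompt?: string
  temperature?: number
  top_p?: number
}

// Return a list of validation problems (empty = request looks well-formed).
function validateSpeechRequest(req: SpeechRequest): string[] {
  const errors: string[] = []
  if (req.input.length === 0 || req.input.length > 4096)
    errors.push('input must be 1-4096 characters')
  if (req.speed !== undefined && (req.speed < 0.25 || req.speed > 4.0))
    errors.push('speed must be within 0.25-4.0')
  if (req.prompt !== undefined && req.prompt.length > 500)
    errors.push('prompt must be at most 500 characters')
  if (req.temperature !== undefined && (req.temperature < 0 || req.temperature > 2))
    errors.push('temperature must be within 0-2')
  return errors
}
```

Catching these locally saves a round trip; the server still enforces the same limits (and the voice/model pairing, which needs the `/models` voice list).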

## Models

| Model ID | Family | Highlights |
|---|---|---|
| `tts-xai-v1` | xAI | **Recommended default.** Conversational style, ISO 639-1 language hints. |
| `tts-kokoro` | Kokoro | OpenAPI schema default. Multilingual, many voices across languages. |
| `tts-qwen3-0-6b` / `tts-qwen3-1-7b` | Qwen 3 | Emotion control via `prompt`, temperature, top_p. |
| `tts-inworld-1-5-max` | Inworld | Character-driven voices (Craig, Ashley, …). |
| `tts-chatterbox-hd` | Chatterbox | HD voices (Aurora, Blade, …), temperature. |
| `tts-orpheus` | Orpheus | Conversational (tara, leah, jess, leo, …), temperature. |
| `tts-elevenlabs-turbo-v2-5` | ElevenLabs Turbo | Rachel, Aria, Charlotte, Roger, … |
| `tts-minimax-speech-02-hd` | MiniMax | WiseWoman, DeepVoiceMan, … |
| `tts-gemini-3-1-flash` | Gemini Flash | Star-named voices (Achernar, Achird, Zephyr, …). |

Always inspect the entry for your model in `GET /models?type=tts` — `model_spec.voices` is the authoritative voice list. Per-model toggles like `supportsPromptParam`, `supportsTemperatureParam`, and `supportsTopPParam` live on the internal model definitions but are not currently exposed on `/models` — treat the request schema above (`prompt`, `temperature`, `top_p`) as the support matrix.
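A small sketch of reading `model_spec.voices` from the `/models` payload. The entry shape is assumed from the description above (only `id` and `model_spec.voices` are relied on), and the `{ data: [...] }` envelope in the usage comment is an assumption based on the OpenAI-style list format:

```typescript
// Assumed shape of one entry in GET /models?type=tts — only the fields used here.
interface TtsModelEntry {
  id: string
  model_spec?: { voices?: string[] }
}

// Return the voice list for a model, or [] if the model is unknown.
function voicesFor(models: TtsModelEntry[], modelId: string): string[] {
  return models.find((m) => m.id === modelId)?.model_spec?.voices ?? []
}

// Usage sketch (network call, assuming an OpenAI-style { data: [...] } envelope):
// const res = await fetch('https://api.venice.ai/api/v1/models?type=tts', {
//   headers: { Authorization: `Bearer ${process.env.VENICE_API_KEY}` },
// })
// const { data } = await res.json()
// if (!voicesFor(data, 'tts-xai-v1').includes('eve')) throw new Error('bad voice')
```

Validating the voice before sending avoids the `400` described in the voice-families section below.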

## Voice families (by prefix)

- **Kokoro** — lowercase + language/gender prefix:
  - `af_*`, `am_*` — American female / male
  - `bf_*`, `bm_*` — British female / male
  - `zf_*`, `zm_*` — Chinese
  - `ff_*`, `hf_*`, `hm_*`, `if_*`, `im_*`, `jf_*`, `jm_*`, `pf_*`, `pm_*`, `ef_*`, `em_*` — French, Hindi, Italian, Japanese, Portuguese, Spanish
  - Examples: `af_sky`, `af_bella`, `am_adam`, `bm_george`, `zf_xiaoxiao`
- **Qwen 3** — `Vivian`, `Serena`, `Ono_Anna`, `Sohee`, `Uncle_Fu`, `Dylan`, `Eric`, `Ryan`, `Aiden`
- **xAI** — `eve`, `ara`, `rex`, `sal`, `leo`
- **Orpheus** — `tara`, `leah`, `jess`, `leo`, `mia`, `zoe`, `dan`, `zac`
- **Inworld** — `Craig`, `Ashley`, `Olivia`, `Sarah`, `Elizabeth`, `Priya`, `Alex`, `Edward`, `Theodore`, `Ronald`, `Mark`, `Hades`, `Luna`, `Pixie`
- **Chatterbox** — `Aurora`, `Britney`, `Siobhan`, `Vicky`, `Blade`, `Carl`, `Cliff`, `Richard`, `Rico`
- **ElevenLabs Turbo** — `Rachel`, `Aria`, `Laura`, `Charlotte`, `Alice`, `Matilda`, `Jessica`, `Lily`, `Roger`, `Charlie`, `George`, `Callum`, `River`, `Liam`, `Will`, `Chris`, `Brian`, `Daniel`, `Bill`
- **MiniMax** — `WiseWoman`, `FriendlyPerson`, `InspirationalGirl`, `CalmWoman`, `LivelyGirl`, `LovelyGirl`, `SweetGirl`, `ExuberantGirl`, `DeepVoiceMan`, `CasualGuy`, `PatientMan`, `YoungKnight`, `DeterminedMan`, `ImposingManner`, `ElegantMan`
- **Gemini 3 Flash** — star names: `Achernar`, `Achird`, `Algenib`, `Algieba`, `Alnilam`, `Aoede`, `Autonoe`, `Callirrhoe`, `Charon`, `Despina`, `Enceladus`, `Erinome`, `Fenrir`, `Gacrux`, `Iapetus`, `Kore`, `Laomedeia`, `Leda`, `Orus`, `Pulcherrima`, `Puck`, `Rasalgethi`, `Sadachbia`, `Sadaltager`, `Schedar`, `Sulafat`, `Umbriel`, `Vindemiatrix`, `Zephyr`, `Zubenelgenubi`

Pass a voice that isn't in the chosen model's list and you get `400`.

## Streaming

```json
{
  "model": "tts-xai-v1",
  "voice": "eve",
  "input": "Hello, this is a long document to narrate. ...",
  "streaming": true,
  "response_format": "mp3"
}
```

With `streaming: true`, the HTTP body is a chunked audio stream. Decode chunks as they arrive — useful for latency-sensitive UIs. `response_format: "pcm"` pairs well with the browser's Web Audio API for raw playback.
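Consuming the chunked body can be sketched like this (assumes a Fetch `ReadableStream` body, i.e. Node 18+ or a browser); the per-chunk callback is where you would feed a player or append to a file:

```typescript
// Read a chunked audio body to completion, handing each chunk to a callback
// as it arrives. Returns the total number of bytes consumed.
async function collectChunks(
  stream: ReadableStream<Uint8Array>,
  onChunk: (chunk: Uint8Array) => void,
): Promise<number> {
  const reader = stream.getReader()
  let total = 0
  for (;;) {
    const { done, value } = await reader.read()
    if (done || !value) break
    total += value.length
    onChunk(value) // e.g. queue into Web Audio, or write to a file stream
  }
  return total
}

// Usage sketch against the endpoint (not run here):
// const res = await fetch('https://api.venice.ai/api/v1/audio/speech', {
//   method: 'POST',
//   headers: {
//     Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
//     'Content-Type': 'application/json',
//   },
//   body: JSON.stringify({ model: 'tts-xai-v1', voice: 'eve',
//     input: 'Hello.', streaming: true, response_format: 'mp3' }),
// })
// await collectChunks(res.body!, (chunk) => player.feed(chunk))
```

`player.feed` above is a placeholder for whatever sink you use; the point is that playback can begin before generation finishes.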

## OpenAI SDK

```ts
import OpenAI from 'openai'
import fs from 'node:fs/promises'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const mp3 = await client.audio.speech.create({
  model: 'tts-xai-v1',
  voice: 'eve',
  input: 'Hello from Venice.',
  response_format: 'mp3',
})

await fs.writeFile('hello.mp3', Buffer.from(await mp3.arrayBuffer()))
```

## Emotion / style (`prompt` is Qwen 3 only)

```json
{
  "model": "tts-qwen3-1-7b",
  "voice": "Vivian",
  "input": "We did it!",
  "prompt": "Excited and energetic.",
  "temperature": 0.9,
  "top_p": 0.95
}
```

For other families, emotion comes from the **voice choice itself** (e.g. Inworld `Hades` vs `Pixie`). Unsupported parameters are silently ignored: `prompt` and `top_p` only work on Qwen 3, while `temperature` is also honored by Orpheus and Chatterbox HD.

## Errors

| Code | Meaning |
|---|---|
| `400` | Invalid voice for the chosen model, input too long (>4096 chars), or language hint rejected by a strict model. |
| `401` | Auth / Pro-only model. |
| `402` | Insufficient balance. |
| `429` | Rate limited. |
| `500` / `503` | Inference / capacity issue — retry with jitter. |
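For the retryable codes (`429`, `500`, `503`), exponential backoff with full jitter is one common pattern — a sketch, with illustrative delay values rather than API guidance:

```typescript
// Exponential backoff with full jitter: delay is uniform in [0, min(cap, base * 2^attempt)).
// `rand` is injectable so the schedule can be tested deterministically.
function backoffMs(
  attempt: number,
  baseMs = 500,
  capMs = 8000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.floor(rand() * ceiling)
}

// Usage sketch: retry a request a few times before giving up.
// for (let attempt = 0; attempt < 5; attempt++) {
//   const res = await fetch(url, init)
//   if (![429, 500, 503].includes(res.status)) return res
//   await new Promise((r) => setTimeout(r, backoffMs(attempt)))
// }
```

Full jitter spreads retries from many clients apart, which matters most for `503` capacity blips.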

## Gotchas

- `input` hard cap is 4096 chars. For books / long content, split on sentence boundaries and concatenate audio client-side.
- `streaming: true` + SDKs: some OpenAI SDK versions don't expose streaming for `audio.speech.create`; call the REST endpoint directly and consume the HTTP body.
- `speed` compounds with model internal speech rate — extreme values (`0.25`, `4.0`) often sound unnatural; keep within `0.8–1.3` for narration.
- Voice names are case-sensitive (`eve` ≠ `EVE`, `af_sky` ≠ `AF_SKY`).
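The first gotcha — splitting long content on sentence boundaries — can be sketched as below. The regex-based sentence detection is naive (it won't handle abbreviations, and a single sentence longer than the cap would still overflow), so treat it as a starting point:

```typescript
// Split text into chunks of at most maxLen characters, breaking only at
// sentence boundaries, for feeding /audio/speech piece by piece.
function splitForTts(text: string, maxLen = 4096): string[] {
  // Greedy sentence tokenizer: runs of non-terminators plus trailing .!? and space.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text]
  const chunks: string[] = []
  let current = ''
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxLen) {
      chunks.push(current.trim())
      current = ''
    }
    current += sentence
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}
```

Request each chunk with the same `model`/`voice`, then concatenate the audio client-side (trivial for `pcm`; for `mp3` most decoders tolerate concatenated frames, but re-encoding is safer).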
