What is VoiceBox?

Voicebox is an open-source desktop app for voice cloning and text-to-speech on macOS, Windows, and Linux. It clones voices from short samples (around 3 seconds) and generates speech across multiple TTS engines, accepting WAV, MP3, FLAC, and WEBM uploads as well as microphone and system-audio capture.

A timeline-based multi-voice editor supports arranging tracks, trimming clips, mixing conversations, and applying audio effects (pitch shift, reverb, delay, compression) with live preview and per-profile presets.

Inference runs locally on Metal, CUDA, ROCm, Intel Arc, and DirectML, which enables offline workflows, or on remote GPUs via one-click server setup. Speech-to-text uses Whisper models in multiple sizes and 99 languages, with optional local LLM transcript refinement for punctuation and disfluency removal.

Generation supports long outputs (up to 50,000 characters per request) with automatic chunking and seamless crossfades between segments. Developer-focused MCP integration exposes a voicebox.speak API for agents, scripts, and toolchains. The tool is aimed at creators, podcasters, voice artists, accessibility users, writers, and developers.
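
As a rough illustration of the chunking and crossfading described above, the Python sketch below splits long text at sentence boundaries and joins per-chunk audio with a linear crossfade. It is not Voicebox's actual implementation: the function names (chunk_text, crossfade, stitch), the 200-character chunk limit, and the linear fade curve are assumptions chosen for illustration, and audio is modeled as plain lists of float samples.

```python
import re


def chunk_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Hypothetical helper: the limit and the splitting rule are assumptions.
    """
    # Each match is one sentence including its trailing whitespace,
    # so joining the pieces reproduces the input exactly.
    sentences = re.findall(r".+?(?:[.!?]+\s+|\Z)", text, flags=re.S)
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split a single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk when adding this sentence would overflow.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def crossfade(a, b, overlap):
    """Join two sample lists, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using linear weights."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    split = len(a) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises toward 1
        mixed.append(a[split + i] * (1 - w) + b[i] * w)
    return list(a[:split]) + mixed + list(b[overlap:])


def stitch(segments, overlap):
    """Crossfade a sequence of audio segments into one sample list."""
    out = []
    for seg in segments:
        out = crossfade(out, seg, overlap) if out else list(seg)
    return out
```

In this sketch, each chunk returned by chunk_text would be synthesized separately by a TTS engine, and stitch would then merge the resulting audio segments. Each crossfade shortens the combined audio by the overlap length, so a real pipeline would need to account for that when aligning audio to text.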

VoiceBox user reviews

Based on 3 reviews, 100% of users recommend VoiceBox, which is rated highly for quality results.

Liked for

Quality results 3 of 3

Easy to use 3 of 3

All key features 3 of 3

Good integrations 3 of 3

Playbooks

Step-by-step guides to get the most out of VoiceBox

Create a Personalized AI Voice Clone

Clone your own natural‑sounding AI voice and generate speech from custom text in minutes using the free Voicebox tool.

VoiceBox's key features

Voice cloning from short audio samples (≈3 seconds)
Multi-engine TTS accepting WAV/MP3/FLAC/WEBM uploads plus microphone and system-audio capture
Timeline-based multi-voice editor for arranging tracks, trimming, mixing, and applying audio effects (pitch shift, reverb, delay, compression) with live preview and per-profile presets
Local inference on Metal, CUDA, ROCm, Intel Arc, and DirectML, with remote GPU support via one-click server setup for local/offline workflows
Speech-to-text using Whisper models (multiple sizes, 99 languages) with optional local LLM transcript refinement for punctuation and disfluency removal

VoiceBox use cases

Produce professional audiobooks and long-form narrated content using Voicebox by cloning a narrator’s voice from a short sample, generating natural-sounding long-form TTS locally or via remote GPU inference, and exporting finished files as WAV/MP3/FLAC for publishers
Assemble multi-voice podcasts, audioplays, and marketing voiceovers in Voicebox’s timeline editor: capture mic takes, clone guest voices from short samples, apply effects, edit across multiple tracks, then mix and export episodes locally, keeping audio private by avoiding cloud services
Integrate Voicebox’s TTS and Whisper STT into apps and workflows via its API to build multilingual voice assistants, automated transcription pipelines, or data-sensitive IVR systems that run offline or on remote GPUs while supporting mic capture and common audio formats