What is VoiceBox?

Voicebox is an open-source desktop app for voice cloning and text-to-speech on macOS, Windows, and Linux. It clones voices from short samples (around 3 seconds) and generates speech across multiple TTS engines, accepting WAV, MP3, FLAC, and WEBM uploads as well as microphone and system-audio capture.

A timeline-based multi-voice editor supports arranging tracks, trimming clips, mixing conversations, and applying audio effects (pitch shift, reverb, delay, compression) with live preview and per-profile presets.

Local inference runs on Metal, CUDA, ROCm, Intel Arc, and DirectML, which allows fully offline workflows; remote GPUs can be used instead via one-click server setup. Speech-to-text uses Whisper models in multiple sizes across 99 languages, with optional local LLM transcript refinement for punctuation and disfluency removal.

Generation supports long outputs (up to 50,000 characters per request) with automatic chunking and seamless crossfades between segments. Developer-focused MCP integration exposes a voicebox.speak API for agents, scripts, and toolchains. The tool is suitable for creators, podcasters, voice artists, accessibility users, writers, and developers.
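
To illustrate what "automatic chunking and seamless crossfades" can mean in practice, here is a minimal Python sketch. It is not Voicebox's actual implementation: the function names (chunk_text, crossfade), the 500-character chunk size, and the linear fade are all assumptions chosen for illustration. The idea is to split long text at sentence boundaries, synthesize each chunk separately, and blend the overlapping edges of adjacent audio segments so the joins are not audible.

```python
import re


def chunk_text(text, max_chars=500):
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    Hypothetical helper: the chunk size is an illustrative default,
    not a value taken from Voicebox.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def crossfade(a, b, overlap):
    """Join two lists of audio samples with a linear crossfade.

    The last `overlap` samples of `a` fade out while the first
    `overlap` samples of `b` fade in (assumed linear ramp).
    """
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # fade-in weight for segment b
        blended.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[: len(a) - overlap] + blended + b[overlap:]


if __name__ == "__main__":
    print(chunk_text("One. Two. Three.", max_chars=9))
    print(crossfade([1.0] * 4, [0.0] * 4, overlap=2))
```

In a real pipeline each chunk would be passed to the TTS engine and the resulting sample arrays joined pairwise with crossfade; an equal-power curve is a common alternative to the linear ramp assumed here.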

VoiceBox user reviews

Based on 1 review, 100.0% of users recommend VoiceBox, rated highly for quality results.

1 recommend, 0 don't (1 review)

Liked for

  • Quality results (1 of 1)
  • Easy to use (1 of 1)
  • All key features (1 of 1)
  • Good integrations (1 of 1)

VoiceBox's key features

  • Voice cloning from short audio samples (≈3 seconds)
  • Multi-engine TTS accepting WAV/MP3/FLAC/WEBM uploads plus microphone and system-audio capture
  • Timeline-based multi-voice editor for arranging tracks, trimming, mixing, and applying audio effects (pitch shift, reverb, delay, compression) with live preview and per-profile presets
  • Local inference on Metal, CUDA, ROCm, Intel Arc, and DirectML for offline workflows, plus remote GPU support via one-click server setup
  • Speech-to-text using Whisper models (multiple sizes, 99 languages) with optional local LLM transcript refinement for punctuation and disfluency removal

VoiceBox use cases

  • Produce professional audiobooks and long-form narrated content using Voicebox by cloning a narrator’s voice from a short sample, generating natural-sounding long-form TTS locally or via remote GPU inference, and exporting finished files as WAV/MP3/FLAC for publishers
  • Assemble multi-voice podcasts, audioplays, and marketing voiceovers in Voicebox’s timeline editor: capture mic takes, clone guest voices from short samples, apply effects across multiple tracks, then mix and export episodes locally, keeping audio private by avoiding cloud services
  • Integrate Voicebox’s TTS and Whisper STT into apps and workflows via its API to build multilingual voice assistants, automated transcription pipelines, or data-sensitive IVR systems that run offline or on remote GPUs while supporting mic capture and common audio formats

Who is it for?

  • Content creators
  • Voice artists
  • Writers
  • Developers
  • Accessibility enthusiasts
