Coqui XTTS: The Complete Guide to Open-Source Voice Cloning

1. What Is Coqui XTTS?

Coqui XTTS is a text-to-speech model, shipped with the Coqui TTS deep learning toolkit, for local voice generation. It is not a cloud service or a SaaS subscription: everything runs on your machine. That distinction matters more than most guides acknowledge.

Core Capabilities: Zero-Shot Voice Cloning

The core capability is zero-shot voice cloning: hand the model a 6-second audio clip of any speaker, and it extracts a voice embedding without any training run. No fine-tuning, no dataset preparation. You skip the gradient updates entirely.

What Is Coqui TTS Used For?

Developers reach for it in narration pipelines, video dubbing workflows, accessibility tooling, and prototype voice assistants. The zero-shot cloning path means you can test a local voice clone in minutes rather than days.

Is Coqui XTTS Free?

The model weights are free to use under the Coqui Public Model License (CPML), which limits the model and its outputs to non-commercial use such as research and personal projects. Read the full license before deploying in a commercial product. The underlying TTS library is licensed separately, under the Mozilla Public License 2.0 (MPL 2.0).

What Happened to Coqui AI?

Coqui AI shut down in January 2024. The model still exists, still works, and community forks are active. But there's no commercial support entity behind it anymore.

2026 Status Update: Note that coqui-ai/TTS is the original repository, not a fork. A widely used community fork is idiap/coqui-ai-TTS, published on PyPI as coqui-tts; search GitHub for "coqui-tts fork" sorted by recent activity to confirm which version is best maintained when you read this. The XTTS-v2 weights on Hugging Face remain publicly available with substantial download activity, and community Discord channels continue active discussion. XTTS-v2 remains the most capable open-source zero-shot voice cloning model for most use cases, though StyleTTS2 has closed the gap on naturalness for single-speaker synthesis, and OpenVoice has expanded multilingual support. Some community forks have extended the base 17-language support; check the specific fork's documentation for current language coverage. I wouldn't build a customer-facing product on unforked XTTS without a contingency plan for the day the Hugging Face weights disappear.

---

2. How XTTS Clones a Voice in 3 Stages (and Why It Matters for Quality)

The pipeline has three stages that determine output quality. Understanding them helps you identify problems and optimize your reference audio.

Speaker Encoder — Processes your reference clip and extracts a fixed-length voice embedding (a numerical fingerprint of the speaker's timbre and prosodic style). Six seconds is the floor, not the target. Most people use the minimum. Don't.

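Transformer Architecture: Takes the voice embedding alongside your input text and predicts an intermediate acoustic representation, described here for simplicity as a spectrogram (a time-frequency representation of the audio); in XTTS itself the transformer is a GPT-style autoregressive model that predicts learned audio codes. This is where XTTS diverges from older models like Tacotron2 and Glow-TTS, which relied on recurrent or flow-based architectures that struggled with longer sequences and cross-lingual transfer.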

Vocoder — Converts the spectrogram into a waveform you can actually hear. Artifacts live here. Breathiness lives here. Most subtle quality differences between versions emerge at this stage.

Why Cross-Lingual Cloning Works

Cross-lingual voice cloning is the genuinely impressive part. The same voice embedding extracted from an English recording can drive synthesis in French, Japanese, or Arabic. The speaker's timbre and prosodic style transfer across languages because the embedding lives in a language-agnostic space.

Zero-Shot vs. Fine-Tuned: Which Should You Use?

Zero-shot means the model is never trained on the target speaker: no gradient updates, only inference. Pass in a reference clip, the model generalizes, and you get output immediately, with no training and no waiting hours for GPU jobs to finish.

The tradeoff is a quality ceiling: the model can only capture what's in that short clip. Fine-tuning means running a training pass on a dataset of that speaker's recordings, typically 30 minutes to several hours of audio. The XTTS codebase supports both inference and fine-tuning, per the Hugging Face model card.

	Zero-Shot	Fine-Tuned
Setup time	Minutes	Hours
Quality ceiling	Good	Very good
Best for	Prototyping, one-off tasks	Production, long-form content

For audiobook narration, a consistent brand voice, or long-form content — fine-tuning is worth the effort. For prototyping or one-off voice conversion tasks, zero-shot is fast enough. See our text-to-speech comparison guide for how this tradeoff plays out across different use cases.

---

3. XTTS-v2 vs. XTTS-v1: What Actually Changed

The naturalness improvement in v2 is audible. I ran the same 15-second reference clip through both versions synthesizing a 200-word paragraph — v1 produced a metallic flutter on every 's' cluster. V2 eliminated it almost entirely. Not perfect, but the difference is obvious to anyone who's spent time evaluating TTS output.

Beyond audio quality, language support expanded to 17 languages in XTTS-v2: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi, per the Hugging Face model card. That's a meaningful jump for multilingual TTS workflows. Some community forks have extended this — check the specific fork's documentation.

Short reference audio handling improved too. XTTS-v1 degraded noticeably below 10 seconds. XTTS-v2 is more tolerant of the 6-second minimum, though output quality still scales with clip length.

XTTS-v2 is also the same model that powered Coqui Studio and the Coqui API before the company closed, per the model card. That's a meaningful quality signal — this was production-grade infrastructure, not a research demo.

The license changed between versions. Check the specific version you're pulling from Hugging Face before assuming the terms match what you read in an older tutorial.

---

4. Clone Your First Voice in 5 Minutes: Install, Configure, and Synthesize

Prerequisites:

Python 3.9–3.11 (tested on Ubuntu 18.04; Windows and macOS work with caveats)
CUDA-capable GPU strongly recommended — CPU inference works but is too slow for interactive use
~2GB disk space for model weights

Install the TTS library from PyPI:

pip install TTS

A Docker image is also available for reproducible environments or teams with strict dependency management, per the GitHub repository.

Step-by-Step Code Walkthrough

Model initialization and your first synthesis call:

from TTS.api import TTS
import torch
# Use the GPU when one is available; CPU works but is slow.
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="The quick brown fox jumps over the lazy dog.",
    speaker_wav="reference_audio.wav",  # your 6-second+ clip
    language="en",
    file_path="output.wav",
)

To synthesize in a different language, change only the `language` parameter — the voice embedding transfers automatically:

tts.tts_to_file(
    text="El zorro marrón rápido salta sobre el perro perezoso.",
    speaker_wav="reference_audio.wav",
    language="es",
    file_path="output_spanish.wav",
)
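Because only the `language` argument changes between calls, a batch over several languages can reuse one reference clip. The sketch below is illustrative and not part of the TTS library: `synthesize_all` and its `synth` parameter are hypothetical names, and the language codes are assumed to be the ones listed on the XTTS-v2 model card. Passing the synthesis function in as an argument makes the helper easy to try without loading the model.

```python
from typing import Callable, Dict, List

# Language codes as listed on the XTTS-v2 model card (17 languages).
XTTS_V2_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}


def synthesize_all(
    synth: Callable[..., object],
    texts_by_language: Dict[str, str],
    speaker_wav: str,
    prefix: str = "output",
) -> List[str]:
    """Call `synth` once per language with the same reference clip.

    `synth` is expected to accept the keyword arguments that
    `tts.tts_to_file` takes in the examples above.
    Returns the list of output paths, in input order.
    """
    unknown = sorted(set(texts_by_language) - XTTS_V2_LANGUAGES)
    if unknown:
        raise ValueError(f"Unsupported language code(s): {unknown}")

    paths = []
    for language, text in texts_by_language.items():
        file_path = f"{prefix}_{language}.wav"
        synth(text=text, speaker_wav=speaker_wav,
              language=language, file_path=file_path)
        paths.append(file_path)
    return paths
```

With the model loaded as above, `synthesize_all(tts.tts_to_file, {"en": "Hello.", "es": "Hola."}, "reference_audio.wav")` would write `output_en.wav` and `output_es.wav`.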

Reference audio quality note: Background noise is the single biggest quality killer. Even modest room reverb degrades the speaker embedding. A consistent vocal performance across the clip matters more than clip length beyond 30 seconds.

Why Am I Getting CUDA Out of Memory Errors?

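Reduce memory pressure first: close other processes using the GPU and synthesize shorter pieces of text per call. Half precision (FP16) can also lower memory use where your installed version exposes it, but confirm the exact option in your version's documentation or `tts --help` rather than assuming a `--half` flag exists. Some users report that FP16 introduces artifacts on certain GPU architectures; if output quality degrades unexpectedly, switch back to FP32.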

How Do I Fix the Hugging Face License Agreement Error?

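XTTS-v2 requires you to accept the Coqui Public Model License before the weights are downloaded. When the TTS library fetches the model for the first time, it asks for confirmation in the terminal; in non-interactive environments such as Docker or CI, where nobody can answer the prompt, setting the environment variable `COQUI_TOS_AGREED=1` records your acceptance instead. If you pull the weights directly from Hugging Face and the download is refused, authenticate with `huggingface-cli login` and accept any agreement shown on the model card.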

Why Is My XTTS Output Quality Poor Despite Clean Reference Audio?

Check that your reference clip contains a single speaker with consistent vocal characteristics. Mixed-speaker clips confuse the speaker encoder and produce blended, unstable output.

---

5. Ship XTTS in Production: Discord Bots and Video Dubbing

XTTS isn't just for standalone scripts. Two workflows come up constantly in production.

How Do I Integrate Coqui XTTS With a Discord Bot?

Load XTTS, receive text commands, synthesize with `tts.tts_to_file()`, and stream the WAV to a voice channel using `discord.py`. Synthesis takes a few seconds on GPU — acceptable for most bot use cases, too slow on CPU.

A minimal pattern:

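import asyncio
import discord
from discord.ext import commands
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
intents = discord.Intents.default()
intents.message_content = True  # needed to read prefix commands
bot = commands.Bot(command_prefix="!", intents=intents)  # "!" is an arbitrary prefix
@bot.command()
async def speak(ctx, *, text):
    # Synthesis blocks, so run it in a worker thread instead of the event loop.
    await asyncio.to_thread(tts.tts_to_file, text=text, speaker_wav="voice.wav", language="en", file_path="out.wav")
    # play() is not a coroutine; this assumes the bot has already joined a voice channel.
    ctx.voice_client.play(discord.FFmpegPCMAudio("out.wav"))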

One limitation to plan around: real-time streaming isn't exposed through the high-level `tts_to_file()` call used here. For low-latency applications, you'd need chunked synthesis: split the text into sentences, synthesize each, and queue them. The lower-level XTTS model class also offers a streaming inference method, and community forks have built on it; search the GitHub issues for "streaming" to find current options.
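The chunked approach (split into sentences, synthesize each, queue them) could look roughly like the sketch below. It is an illustration under stated assumptions rather than how any particular fork implements streaming: `split_sentences` uses a naive punctuation regex, so abbreviations such as "Dr." will split incorrectly, and `synthesize_chunks` and its `synth` callable are hypothetical names. The generator yields each file as soon as it is written, so playback of the first sentence can start while later ones are still being synthesized.

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_chunks(
    synth: Callable[..., object],
    text: str,
    speaker_wav: str,
    language: str = "en",
) -> Iterator[str]:
    """Synthesize one sentence at a time and yield each output path.

    `synth` is expected to accept the same keyword arguments as
    `tts.tts_to_file` in the examples above.
    """
    for index, sentence in enumerate(split_sentences(text)):
        file_path = f"chunk_{index:03d}.wav"
        synth(text=sentence, speaker_wav=speaker_wav,
              language=language, file_path=file_path)
        yield file_path
```

In a bot, each yielded path would be pushed onto a playback queue; because `synth` blocks, the loop belongs in a worker thread rather than on the event loop.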

How Do I Dub a Video With Coqui XTTS?

The core pattern has four steps:

Transcribe original audio (Whisper works well here)
Translate the transcript if needed
Synthesize with XTTS using the original speaker's voice as reference
Replace the audio track with ffmpeg

ffmpeg -i original_video.mp4 -i synthesized_audio.wav \
  -c:v copy -map 0:v:0 -map 1:a:0 \
  dubbed_output.mp4

Timing alignment is the hard part. XTTS doesn't control speech rate precisely enough to match lip movements without post-processing. For rough dubbing — podcasts, internal training videos, content where lip sync doesn't matter — this pipeline works well. For broadcast-quality dubbing, you'll need additional tooling on top.
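One simple form of post-processing is to time-stretch each synthesized segment so it fits the duration of the original one. The sketch below only computes the numbers; `tempo_factor` and `atempo_filter` are hypothetical helper names. It assumes ffmpeg's `atempo` audio filter and conservatively keeps every stage between 0.5 and 2.0, chaining stages for larger changes. Large stretch factors sound unnatural, so treat this as an aid for rough dubbing, not lip sync.

```python
from typing import List


def tempo_factor(synth_seconds: float, target_seconds: float) -> float:
    """Speed multiplier that makes the synthesized clip last target_seconds."""
    if synth_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    return synth_seconds / target_seconds


def atempo_filter(factor: float) -> str:
    """Build a value for ffmpeg's -filter:a, chaining stages within [0.5, 2.0]."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    stages: List[float] = []
    while factor > 2.0:      # speed-ups beyond 2x are split into 2x stages
        stages.append(2.0)
        factor /= 2.0
    while factor < 0.5:      # slow-downs beyond 0.5x are split likewise
        stages.append(0.5)
        factor /= 0.5
    stages.append(factor)
    return ",".join(f"atempo={stage:.4f}" for stage in stages)
```

For a 6-second synthesized line that must fit a 5-second slot, `atempo_filter(tempo_factor(6.0, 5.0))` produces a filter string that can be passed to ffmpeg with `-filter:a`.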

---

6. Coqui XTTS vs. Paid Alternatives: Honest Cost-Benefit Analysis

How good is XTTS? Honest answer: it sits in a tier below ElevenLabs and Azure Neural TTS on out-of-the-box naturalness, but the gap is smaller than most paid-service marketing implies — and it closes significantly with fine-tuning.

I'll be direct about the quality column in the table below. Comparing XTTS to ElevenLabs on voice quality is genuinely context-dependent. For a 30-second clip with a clean reference recording, the gap is smaller than ElevenLabs' marketing implies. For a 6-second clip in a noisy room? ElevenLabs wins clearly. The table reflects the average case.

	XTTS-v2	ElevenLabs	Google Cloud TTS	Azure Neural TTS
Cost	Free (hardware only)	$5–$330/month	$4–$16 per 1M chars	$4–$16 per 1M chars
Voice quality	Good (fine-tuned: very good)	Excellent	Very good	Very good
Languages	17 (base)	32	40+	140+
Cloning speed	~2-5s (GPU)	Near-instant (API)	No cloning	Limited cloning
Privacy/data control	Full — no data leaves machine	Audio processed on servers	Audio processed on servers	Audio processed on servers
Offline use	Yes	No	No	No

Pricing figures are approximate as of June 2026 — verify current tiers directly with each provider.

The Privacy Argument: Why Local Processing Changes the Calculus

Beyond raw cost, most developers overweight voice quality and underweight data privacy in their evaluation. If you're building a tool where users paste personal content — medical notes, legal documents, private correspondence — the fact that XTTS processes everything locally isn't a nice-to-have. It's the whole decision.

We've seen teams switch from cloud TTS to local voice cloning purely because their legal team flagged the data processing terms.

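The cost math at scale reinforces this. Per-character API fees grow linearly with volume: at the table's $4 to $16 per 1M characters, 1 million characters a month is a modest bill, but 100 million characters is $400 to $1,600 every month. XTTS on a used consumer GPU amortizes to near zero over a year of production use.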
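A back-of-the-envelope break-even calculation makes the comparison concrete. Every number in the usage note is a placeholder, not a measurement: the GPU price and electricity cost are hypothetical, and the per-character rate should come from your provider's current pricing. `monthly_api_cost` and `months_to_break_even` are illustrative helper names.

```python
import math


def monthly_api_cost(chars_per_month: int, usd_per_million_chars: float) -> float:
    """Per-character API pricing scales linearly with volume."""
    return chars_per_month / 1_000_000 * usd_per_million_chars


def months_to_break_even(
    hardware_usd: float,
    monthly_power_usd: float,
    chars_per_month: int,
    usd_per_million_chars: float,
) -> float:
    """Months until a one-off hardware purchase beats paying per character.

    Returns math.inf when local running costs meet or exceed the API bill,
    meaning the hardware never pays for itself at that volume.
    """
    api_cost = monthly_api_cost(chars_per_month, usd_per_million_chars)
    saving = api_cost - monthly_power_usd
    if saving <= 0:
        return math.inf
    return hardware_usd / saving
```

With a hypothetical $400 used GPU, $10 a month in electricity, and a $16 per 1M character rate, `months_to_break_even(400, 10, 50_000_000, 16.0)` comes out at under one month. At 500,000 characters a month the function returns infinity: the API is cheaper at that volume.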

Paid services win on managed infrastructure, support SLAs, and quality for one-shot synthesis without fine-tuning. If you need something that sounds great immediately with zero setup, ElevenLabs is faster to ship. That's a real tradeoff.

XTTS vs. Other Open-Source TTS Models

The open-source speech synthesis landscape has shifted since XTTS-v2 launched in late 2023. Three alternatives come up most often:

XTTS-v2 vs. Bark (Suno): Bark produces more expressive, emotionally varied output and handles non-speech sounds (laughter, sighs) better than XTTS. The tradeoff: Bark is slower, less controllable, and doesn't support the same clean voice cloning workflow. For narration with consistent voice identity, XTTS wins. For creative audio generation, Bark is worth evaluating.

XTTS-v2 vs. StyleTTS2: StyleTTS2 has closed the naturalness gap significantly, particularly for single-speaker synthesis. In informal listening tests, StyleTTS2 output is often indistinguishable from the reference speaker. The limitation: multilingual support and zero-shot cloning are less mature than XTTS-v2. For English-only, single-voice production use, StyleTTS2 deserves a serious look.

XTTS-v2 vs. OpenVoice: OpenVoice (from MyShell) focuses specifically on voice conversion and cross-lingual cloning, similar to XTTS. It's more actively maintained as of 2026 and has expanded language support. If XTTS-v2's community fork situation concerns you, OpenVoice is the most direct alternative to evaluate. More on voice conversion approaches in our voice cloning overview.

---

7. XTTS Limitations: What Production Teams Discover the Hard Way

XTTS has real limitations. Accent accuracy degrades on short clips — a 6-second clip of a speaker with a strong regional accent often produces output that flattens that accent. The embedding doesn't have enough data to capture subtle phonetic patterns. Prosody control is limited too; you can't reliably direct emphasis or emotional tone through markup alone.

Real-time streaming requires chunked synthesis or a community fork; see "How Do I Integrate Coqui XTTS With a Discord Bot?" above for implementation notes.

Audio Quality Optimization Checklist

Reference clip length: 15-30 seconds outperforms the 6-second minimum. Beyond 30 seconds, gains flatten — don't bother with hour-long clips.
Noise floor: Record in a quiet room. Background noise is the fastest path to degraded output.
Codec: WAV or FLAC only. MP3 compression artifacts confuse the speaker encoder in ways that are hard to trace.
Speaker consistency: Single speaker, consistent vocal energy. Avoid clips with laughter, coughing, or significant pitch variation.
Sample rate: Use the native rate for your XTTS configuration. Mismatched rates get resampled internally, which adds unnecessary processing.
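The clip-length item in this checklist can be checked mechanically before synthesis. The sketch below uses only Python's standard `wave` module, so it reads uncompressed WAV files only (FLAC or MP3 input raises `wave.Error`). The thresholds mirror the 6-second minimum and the 15-second lower bound above, and `check_reference_clip` is a hypothetical helper name. It cannot detect noise, reverb, or multiple speakers.

```python
import wave
from typing import List

MIN_SECONDS = 6.0           # floor for XTTS-v2 reference audio
RECOMMENDED_SECONDS = 15.0  # lower end of the 15-30 second range above


def clip_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str) -> List[str]:
    """Return human-readable warnings; an empty list means the length is fine."""
    duration = clip_duration_seconds(path)
    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(
            f"clip is {duration:.1f}s; minimum is {MIN_SECONDS:.0f}s")
    elif duration < RECOMMENDED_SECONDS:
        warnings.append(
            f"clip is {duration:.1f}s; 15-30s usually works better")
    return warnings
```

Running `check_reference_clip("reference_audio.wav")` before the first synthesis call catches clips that are too short without waiting for a model load.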

Ethical Use and Consent Framework

Be direct about the consent requirement before you deploy anything.

The Coqui Public Model License explicitly prohibits harmful use — but legal prohibition and technical prevention are different things. Teams deploying XTTS in production should implement a consent layer. Document that the cloned voice belongs to someone who explicitly authorized the use, keep that documentation, and disclose to end users when they're hearing synthesized speech.
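A consent layer can be as simple as a record tied to the exact reference file and checked before any synthesis call. The sketch below is one possible shape, not a compliance standard: `ConsentRecord`, its fields, and `require_consent` are all hypothetical, and a real deployment would need legal review and durable storage. Hashing the audio bytes ties the consent to a specific recording rather than to a filename that can be reused.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Dict


def audio_fingerprint(audio_bytes: bytes) -> str:
    """SHA-256 of the reference audio, used as a stable identifier."""
    return hashlib.sha256(audio_bytes).hexdigest()


@dataclass(frozen=True)
class ConsentRecord:
    speaker_name: str
    audio_sha256: str
    granted_on: date
    permitted_use: str  # free-text scope, e.g. "internal training videos"


def require_consent(
    audio_bytes: bytes, registry: Dict[str, ConsentRecord]
) -> ConsentRecord:
    """Return the matching record, or raise if no consent is on file."""
    key = audio_fingerprint(audio_bytes)
    if key not in registry:
        raise PermissionError("no consent record for this reference audio")
    return registry[key]
```

Calling `require_consent` on the reference clip's bytes immediately before synthesis means an undocumented voice fails loudly instead of being cloned silently.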

Synthetic audio impersonation is the highest-stakes misuse case. Cloning a public figure's voice for impersonation, generating fake audio evidence, or producing non-consensual synthetic speech of private individuals are all prohibited uses and increasingly illegal in multiple jurisdictions.

Watermarking options for XTTS output are limited in the base library. If disclosure is a compliance requirement, look at third-party audio watermarking tools applied post-synthesis.

For accessibility applications — reading tools for users with dyslexia, low vision, or ADHD — XTTS can support documented benefit frameworks around text-to-speech access. It does not treat, assess, or replace medical guidance. The relevant framework is the W3C WCAG accessibility guidelines and Section 508 compliance documentation for any tool deployed in an institutional context.

---

8. The On-Device TTS Tradeoff: Why You Can't Optimize for All Three (And What to Choose)

Every on-device TTS decision involves three competing constraints — what we call the on-device TTS triangle. You can optimize for model size (small enough to ship), quality (output that sounds natural), or latency (fast enough for interactive use). Optimizing for all three simultaneously is the hard problem.

XTTS-v2 sits quality-first on this triangle: model weights are substantial (not lightweight for mobile), and latency is acceptable on GPU but painful on CPU. That's the right tradeoff for a desktop or server deployment. It's the wrong tradeoff for a mobile app.

This is why running production-ready TTS on a phone requires a different architecture entirely — smaller models, hardware-specific inference paths, quantization — the whole stack changes. The gap between desktop open-source TTS and mobile on-device TTS is larger than most developers expect before they try it. More on the mobile-vs-desktop TTS gap in our on-device TTS guide.

---

9. Key Takeaways

XTTS-v2 supports 17 languages and requires only 6 seconds of reference audio for zero-shot voice cloning — no training run required, per the Hugging Face model card
The speaker encoder extracts a voice embedding that transfers across languages, enabling cross-lingual voice cloning from a single reference clip
XTTS-v2 runs entirely on local hardware, making it the default choice for privacy-sensitive applications where audio cannot leave the machine
Local voice cloning via XTTS costs a fraction of commercial API alternatives at scale — hardware amortizes; per-character API fees don't
Coqui AI shut down in January 2024 — the model works, but evaluate community fork health before production commitments
Reference audio quality (noise floor, clip length, codec) has more impact on output quality than almost any other variable
Fine-tuning is available in the XTTS codebase and is worth the effort for consistent long-form production use cases

---

10. Frequently Asked Questions

What is Coqui XTTS and how does it work? Coqui XTTS is an open-source zero-shot voice cloning model that runs entirely on local hardware. It extracts a voice embedding from a short reference clip, uses a transformer to predict spectrograms from input text, and converts those spectrograms to audio via a vocoder. No training required — the whole process runs at inference time. The XTTS codebase also supports fine-tuning for higher-quality production use.

Is Coqui XTTS free to use? The model weights are free to use under the Coqui Public Model License (CPML), which limits the model and its outputs to non-commercial use such as research and personal projects. Read the full license on the Hugging Face model card before deploying in a commercial product. The underlying TTS library is licensed separately, under the Mozilla Public License 2.0 (MPL 2.0).

How many languages does XTTS-v2 support? XTTS-v2 supports 17 languages in the base release: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi, per the Hugging Face model card. Some community forks have extended this — check the specific fork's documentation for current language coverage.

What happened to Coqui AI? Coqui AI shut down in January 2024. The model weights remain publicly available on Hugging Face and the GitHub repository is archived. Community forks continue active development, but there is no official support or maintenance from the original team. As of 2026, the community fork ecosystem remains active — verify current fork health before committing XTTS to a long-term production pipeline.

How long does the reference audio clip need to be? The minimum is 6 seconds, per the Hugging Face model card. Output quality improves meaningfully up to about 15-30 seconds of reference audio. Beyond 30 seconds, quality gains flatten. The clip should be clean (low noise floor), single-speaker, and consistent in vocal energy.

Can XTTS clone a voice in a different language than the reference clip? Yes. Cross-lingual voice cloning is a core XTTS capability. The voice embedding extracted from an English reference clip can drive synthesis in any of the 17 supported languages — the speaker's timbre and prosodic characteristics transfer across the language boundary.

How does XTTS compare to other open-source TTS options like Bark or StyleTTS2? XTTS-v2 leads on multilingual zero-shot voice cloning. Bark produces more expressive output but is slower and less controllable. StyleTTS2 has better naturalness for English single-speaker synthesis but less mature multilingual support. OpenVoice is the most direct alternative for cross-lingual voice conversion and is more actively maintained as of 2026. For multilingual zero-shot cloning, XTTS-v2 remains the strongest open-source option.

---

11. How VoicePod Fits

If the XTTS pipeline interests you but local GPU setup is the barrier, VoicePod solves that specific problem. The LuxTTS model (122M parameters) runs entirely on your iPhone using the Apple Neural Engine — 3-second voice clone, no Python environment, no audio sent to external servers. Same privacy guarantee as XTTS, none of the infrastructure overhead.