IndexTTS - Agent's Guide
Project Overview
IndexTTS is an industrial-level, emotionally expressive and duration-controlled auto-regressive zero-shot Text-to-Speech (TTS) system developed by the Bilibili Index Team. The project enables high-quality voice cloning and emotional speech synthesis from a single reference audio file.
Key Features
- Zero-shot voice cloning: Clone any voice from a single audio prompt
- Emotion control: Independent control over timbre and emotion through multiple input modalities
- Duration control: First autoregressive TTS model with precise synthesis duration control
- Multilingual support: Supports both Chinese and English with mixed modeling
- Pinyin support: Fine-grained pronunciation control via Pinyin annotations
Project Structure
index-tts/
├── indextts/ # Main Python package
│ ├── accel/ # Acceleration engine for GPT2 optimization
│ ├── BigVGAN/ # BigVGAN vocoder implementation
│ ├── gpt/ # GPT-based speech language model
│ │ ├── conformer/ # Conformer encoder components
│ │ ├── model.py # IndexTTS1 model (UnifiedVoice)
│ │ └── model_v2.py # IndexTTS2 model with emotion support
│ ├── s2mel/ # Semantic to mel-spectrogram module
│ │ ├── modules/ # Neural network modules
│ │ │ ├── bigvgan/ # BigVGAN vocoder
│ │ │ ├── campplus/ # Speaker encoder
│ │ │ └── ...
│ │ └── dac/ # DAC (Digital Audio Codec) utilities
│ ├── utils/ # Utility functions
│ │ ├── maskgct/ # MaskGCT codec models
│ │ ├── front.py # Text normalization and tokenization
│ │ └── checkpoint.py # Model checkpoint loading
│ ├── vqvae/ # VQ-VAE for audio tokenization
│ ├── cli.py # Command-line interface
│ ├── infer.py # IndexTTS1 inference (legacy)
│ └── infer_v2.py # IndexTTS2 inference with emotion support
├── checkpoints/ # Model weights directory (downloaded separately)
├── models/ # Auto-downloaded auxiliary models
├── examples/ # Sample audio prompts
├── tests/ # Test scripts
├── tools/ # Utility tools
│ ├── gpu_check.py # GPU diagnostics tool
│ └── i18n/ # Internationalization utilities
├── webui.py # Gradio-based Web UI
├── pyproject.toml # Python package configuration
└── uv.lock # Locked dependency versions
Technology Stack
Core Framework
- PyTorch 2.10+: Deep learning framework with CUDA support
- Transformers 4.52+: Hugging Face transformers for GPT2 architecture
- DeepSpeed 0.17.1 (optional): Inference acceleration
Audio Processing
- torchaudio: Audio I/O and transformations
- librosa: Audio analysis and feature extraction
- soundfile: Audio file reading/writing
- BigVGAN: Neural vocoder for high-quality audio generation
Text Processing
- sentencepiece: Text tokenization (BPE model)
- jieba: Chinese text segmentation
- g2p-en: English grapheme-to-phoneme conversion
- wetext/WeTextProcessing: Text normalization
Key Dependencies
- OmegaConf: YAML configuration management
- safetensors: Safe tensor serialization
- modelscope: Model hub for downloading Chinese models
- huggingface-hub: Model hub for international models
- gradio 5.45+: Web UI framework (optional)
Build and Development
Environment Setup
The project uses uv as the only supported package manager:
# Install uv
pip install -U uv
# Install dependencies (creates .venv automatically)
uv sync --all-extras
# For China users (use local mirror)
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"
Available Extra Features
- --extra webui: Gradio WebUI support
- --extra deepspeed: DeepSpeed inference acceleration
- --all-extras: Install all optional features
Model Download
Download models from HuggingFace:
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
Or from ModelScope:
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
Running the Application
Web UI:
uv run webui.py
# Options:
uv run webui.py --fp16 --deepspeed --cuda_kernel --port 7860
CLI (IndexTTS1 only):
uv run indextts "Text to synthesize" -v examples/voice_01.wav -o output.wav
Python API:
# IndexTTS2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
# IndexTTS1 (legacy)
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer.py
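Beyond running the bundled scripts, the package can be driven directly from Python. A minimal sketch of that usage follows; the constructor arguments and infer() parameters mirror the project README, but verify them against your installed version, and the checkpoint/audio paths are illustrative:

```python
# Hedged sketch of calling IndexTTS2 from Python. Paths are placeholders;
# the heavyweight import happens inside the function so the module can be
# loaded even where the model dependencies are not installed.

def synthesize(text: str, voice_wav: str, out_wav: str) -> None:
    from indextts.infer_v2 import IndexTTS2

    # cfg_path/model_dir point at the downloaded checkpoints directory.
    tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")
    # spk_audio_prompt sets the timbre from a reference recording;
    # an optional emo_audio_prompt (not shown) can set the emotion separately.
    tts.infer(spk_audio_prompt=voice_wav, text=text, output_path=out_wav)
```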
GPU Check
uv run tools/gpu_check.py
Testing
Test Files
- tests/regression_test.py: Regression tests for TTS inference
- tests/padding_test.py: Tests text token padding behavior
Running Tests
# Regression test
uv run tests/regression_test.py
# Padding test
uv run tests/padding_test.py checkpoints
Code Organization
Main Inference Classes
IndexTTS2 (indextts/infer_v2.py):
- Main class for IndexTTS2 inference with emotion support
- Supports speaker prompt, emotion prompt, emotion vectors, and text-based emotion control
- Key methods:
  - infer(): Main inference with multiple emotion control modes
  - normalize_emo_vec(): Normalize emotion vectors
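To make the role of normalize_emo_vec() concrete, here is an illustrative re-implementation in plain Python. It is an assumption about the behavior (clamp each of the eight emotion weights, then rescale so the total stays below a cap), not the project's exact code; the cap value max_total is hypothetical:

```python
def normalize_emo_vec(vec, max_total=0.8):
    """Illustrative sketch (not the project's exact implementation):
    clamp each emotion weight to [0, 1], then rescale so the total
    does not exceed max_total, keeping generation stable."""
    clamped = [min(max(v, 0.0), 1.0) for v in vec]
    total = sum(clamped)
    if total > max_total:
        clamped = [v * max_total / total for v in clamped]
    return clamped

# Two strong emotions requested at once get scaled down proportionally:
print(normalize_emo_vec([0.9, 0.9, 0, 0, 0, 0, 0, 0]))
```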
IndexTTS (indextts/infer.py):
- Legacy IndexTTS1 inference class
- Basic zero-shot voice cloning without emotion control
Model Components
GPT Model (indextts/gpt/model_v2.py):
- UnifiedVoice: Main GPT-based speech language model
- GPT2InferenceModel: Inference wrapper with KV-cache support
- Uses Conformer encoder for audio conditioning
- Perceiver resampler for emotion conditioning
Semantic Codec (indextts/utils/maskgct_utils.py):
- Encodes audio into semantic tokens
- Uses Wav2Vec-BERT 2.0 for feature extraction
S2MEL Module (indextts/s2mel/):
- Converts semantic tokens to mel-spectrogram
- Flow-matching based diffusion transformer
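The flow-matching idea behind the S2MEL module can be illustrated with a toy sampler: a learned velocity field is integrated from t=0 to t=1, carrying noise toward the data distribution. This is a conceptual sketch in plain Python, not the project's DiT code, and the linear velocity field is invented for illustration:

```python
def euler_sample(velocity, x0, steps=10):
    """Conceptual flow-matching sampler: integrate dx/dt = velocity(x, t)
    from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = [xi + dt * vi for xi, vi in zip(x, velocity(x, t))]
    return x

# Toy linear field pointing at a "target" mel frame [1.0, -0.5];
# integration moves the initial point most of the way toward it.
target = [1.0, -0.5]
v = lambda x, t: [ti - xi for xi, ti in zip(x, target)]
print(euler_sample(v, [0.0, 0.0], steps=50))
```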
BigVGAN (indextts/BigVGAN/ and indextts/s2mel/modules/bigvgan/):
- Neural vocoder for final audio generation
- Custom CUDA kernel support for acceleration
Text Processing
TextNormalizer (indextts/utils/front.py):
- Chinese and English text normalization
- Pinyin support for pronunciation control
- Term glossary for custom pronunciations
- Email, name, and technical term patterns
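A term glossary of this kind typically rewrites known terms into pronounceable forms before tokenization. The sketch below is an assumption about the mechanism, not the project's TextNormalizer; the GLOSSARY entries are invented examples:

```python
import re

# Illustrative term-glossary pass: map technical terms to spellings the
# tokenizer can pronounce. Longer keys are matched first so that a term
# containing another term is not partially rewritten.
GLOSSARY = {"IndexTTS": "Index T T S", "GPT": "G P T"}

def apply_glossary(text: str) -> str:
    keys = sorted(GLOSSARY, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: GLOSSARY[m.group(0)], text)

print(apply_glossary("IndexTTS uses a GPT backbone."))
# "Index T T S uses a G P T backbone."
```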
TextTokenizer:
- SentencePiece BPE tokenization
- Supports 12,000 text tokens
Configuration
Main Config File (checkpoints/config.yaml)
Key configuration sections:
- dataset: Audio sampling parameters (24kHz, mel-spectrogram settings)
- gpt: GPT model architecture (1280 dim, 24 layers, 20 heads)
- semantic_codec: Semantic codec parameters
- s2mel: S2MEL module configuration (DiT architecture)
- Model checkpoint paths
Important Paths
- gpt_checkpoint: GPT model weights (gpt.pth)
- s2mel_checkpoint: S2MEL model weights (s2mel.pth)
- w2v_stat: Wav2Vec statistics (wav2vec2bert_stats.pt)
- qwen_emo_path: Qwen emotion model path
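Putting the sections and paths together, the config file plausibly has a shape like the fragment below. Field names come from the lists above, the numeric values repeat figures stated in this guide, and the nesting and remaining values are placeholders; consult the shipped checkpoints/config.yaml for the real layout:

```yaml
# Illustrative shape only — not a copy of the shipped file.
dataset:
  sample_rate: 24000        # 24 kHz audio, per the dataset section
gpt:
  model_dim: 1280           # architecture figures from the gpt section
  layers: 24
  heads: 20
gpt_checkpoint: gpt.pth
s2mel_checkpoint: s2mel.pth
w2v_stat: wav2vec2bert_stats.pt
qwen_emo_path: <path to Qwen emotion model>
```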
Code Style Guidelines
Import Conventions
- Standard library imports first
- Third-party imports (torch, transformers) second
- Internal module imports last
- Use absolute imports for project modules
Type Hints
- Optional type hints for function parameters
- Use the typing module for complex types
Documentation
- Docstrings for classes and public methods
- Chinese comments common in text processing modules
- English comments in model architecture code
Naming Conventions
- snake_case for functions and variables
- PascalCase for classes
- Private methods prefixed with _
Development Conventions
Adding New Features
- Maintain backward compatibility with IndexTTS1
- Use OmegaConf for configuration management
- Add appropriate warnings for experimental features
- Update example cases in examples/cases.jsonl
Device Handling
Always support multiple device types:
- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- XPU (Intel GPUs)
- CPU (fallback)
Example pattern:
if device is not None:
    self.device = device
elif torch.cuda.is_available():
    self.device = "cuda:0"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    self.device = "mps"
else:
    self.device = "cpu"
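The pattern above omits the XPU branch the list calls for. A standalone helper covering all four device types might look like the following; it guards the torch import so it degrades to CPU where torch is absent, and the hasattr checks are defensive assumptions for older torch builds:

```python
def pick_device() -> str:
    """Return the best available device string: CUDA, then XPU (Intel),
    then MPS (Apple Silicon), falling back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed; nothing to probe
    if torch.cuda.is_available():
        return "cuda:0"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```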
Memory Optimization
- Support FP16 inference for lower VRAM usage
- Implement KV-cache for GPT inference
- Use torch.no_grad() context for inference
- Clear CUDA cache when switching devices
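The KV-cache mentioned above is the key memory/speed trade-off in autoregressive decoding: each step's attention keys and values are stored so earlier positions are never re-encoded. A conceptual plain-Python sketch (an illustration, not the project's GPT2InferenceModel) is:

```python
class KVCache:
    """Toy per-layer key/value cache: each decoding step appends one
    key and one value per layer, so cached sequence length equals the
    number of generated steps."""

    def __init__(self, num_layers: int):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer: int, k, v):
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def seq_len(self) -> int:
        return len(self.keys[0])

cache = KVCache(num_layers=24)        # 24 layers, matching the GPT config
for step in range(3):                 # three autoregressive steps
    for layer in range(24):
        cache.append(layer, f"k{step}", f"v{step}")
print(cache.seq_len())  # 3
```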
Security and Usage Restrictions
The project includes a DISCLAIMER file outlining usage restrictions:
- Do NOT synthesize voices of political figures or public figures without authorization
- Do NOT create content that defames, insults, or discriminates
- Do NOT use for fraud or identity theft
- Do NOT generate false information or social panic
- Do NOT use for commercial purposes without authorization
- Do NOT create inappropriate content involving minors
Version History
- IndexTTS2 (2025/09/08): Emotion control, duration control
- IndexTTS1.5 (2025/05/14): Stability improvements, better English
- IndexTTS1.0 (2025/03/25): Initial release
Useful Resources
- Paper (IndexTTS2): https://arxiv.org/abs/2506.21619
- HuggingFace Demo: https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
- ModelScope Demo: https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo
- GitHub: https://github.com/index-tts/index-tts
Troubleshooting
Common Issues
- CUDA errors: Ensure CUDA Toolkit 12.8+ is installed
- Slow inference: Enable --fp16 for faster GPU inference
- Model download fails: Set HF_ENDPOINT="https://hf-mirror.com" for China users
- DeepSpeed fails on Windows: Skip DeepSpeed by installing with --extra webui only
Debug Mode
Run with --verbose flag or set verbose=True in Python API for detailed logging.