IndexTTS - Agent's Guide

Project Overview

IndexTTS is an industrial-level, emotionally expressive and duration-controlled auto-regressive zero-shot Text-to-Speech (TTS) system developed by the Bilibili Index Team. The project enables high-quality voice cloning and emotional speech synthesis from a single reference audio file.

Key Features

  • Zero-shot voice cloning: Clone any voice from a single audio prompt
  • Emotion control: Independent control over timbre and emotion through multiple input modalities
  • Duration control: First autoregressive TTS model with precise synthesis duration control
  • Multilingual support: Supports both Chinese and English with mixed modeling
  • Pinyin support: Fine-grained pronunciation control via Pinyin annotations

Project Structure

index-tts/
├── indextts/                 # Main Python package
│   ├── accel/               # Acceleration engine for GPT2 optimization
│   ├── BigVGAN/             # BigVGAN vocoder implementation
│   ├── gpt/                 # GPT-based speech language model
│   │   ├── conformer/       # Conformer encoder components
│   │   ├── model.py         # IndexTTS1 model (UnifiedVoice)
│   │   └── model_v2.py      # IndexTTS2 model with emotion support
│   ├── s2mel/               # Semantic to mel-spectrogram module
│   │   ├── modules/         # Neural network modules
│   │   │   ├── bigvgan/     # BigVGAN vocoder
│   │   │   ├── campplus/    # Speaker encoder
│   │   │   └── ...
│   │   └── dac/             # DAC (Descript Audio Codec) utilities
│   ├── utils/               # Utility functions
│   │   ├── maskgct/         # MaskGCT codec models
│   │   ├── front.py         # Text normalization and tokenization
│   │   └── checkpoint.py    # Model checkpoint loading
│   ├── vqvae/               # VQ-VAE for audio tokenization
│   ├── cli.py               # Command-line interface
│   ├── infer.py             # IndexTTS1 inference (legacy)
│   └── infer_v2.py          # IndexTTS2 inference with emotion support
├── checkpoints/             # Model weights directory (downloaded separately)
├── models/                  # Auto-downloaded auxiliary models
├── examples/                # Sample audio prompts
├── tests/                   # Test scripts
├── tools/                   # Utility tools
│   ├── gpu_check.py         # GPU diagnostics tool
│   └── i18n/                # Internationalization utilities
├── webui.py                 # Gradio-based Web UI
├── pyproject.toml           # Python package configuration
└── uv.lock                  # Locked dependency versions

Technology Stack

Core Framework

  • PyTorch 2.10+: Deep learning framework with CUDA support
  • Transformers 4.52+: Hugging Face transformers for GPT2 architecture
  • DeepSpeed 0.17.1 (optional): Inference acceleration

Audio Processing

  • torchaudio: Audio I/O and transformations
  • librosa: Audio analysis and feature extraction
  • soundfile: Audio file reading/writing
  • BigVGAN: Neural vocoder for high-quality audio generation

Text Processing

  • sentencepiece: Text tokenization (BPE model)
  • jieba: Chinese text segmentation
  • g2p-en: English grapheme-to-phoneme conversion
  • wetext/WeTextProcessing: Text normalization

Key Dependencies

  • OmegaConf: YAML configuration management
  • safetensors: Safe tensor serialization
  • modelscope: Model hub for downloading Chinese models
  • huggingface-hub: Model hub for international models
  • gradio 5.45+: Web UI framework (optional)

Build and Development

Environment Setup

The project uses uv as the only supported package manager:

# Install uv
pip install -U uv

# Install dependencies (creates .venv automatically)
uv sync --all-extras

# For users in China (use a local PyPI mirror)
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"

Available Extra Features

  • --extra webui: Gradio WebUI support
  • --extra deepspeed: DeepSpeed inference acceleration
  • --all-extras: Install all optional features

Model Download

Download models from HuggingFace:

uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints

Or from ModelScope:

uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints

Running the Application

Web UI:

uv run webui.py
# Options:
uv run webui.py --fp16 --deepspeed --cuda_kernel --port 7860

CLI (IndexTTS1 only):

uv run indextts "Text to synthesize" -v examples/voice_01.wav -o output.wav

Python API:

# IndexTTS2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py

# IndexTTS1 (legacy)
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer.py

GPU Check

uv run tools/gpu_check.py

Testing

Test Files

  • tests/regression_test.py: Regression tests for TTS inference
  • tests/padding_test.py: Tests text token padding behavior

Running Tests

# Regression test
uv run tests/regression_test.py

# Padding test
uv run tests/padding_test.py checkpoints

Code Organization

Main Inference Classes

IndexTTS2 (indextts/infer_v2.py):

  • Main class for IndexTTS2 inference with emotion support
  • Supports speaker prompt, emotion prompt, emotion vectors, and text-based emotion control
  • Key methods:
    • infer(): Main inference with multiple emotion control modes
    • normalize_emo_vec(): Normalize emotion vectors
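
The emotion-vector path can be pictured with a small stand-in. This is a hypothetical sketch, not the project's actual `normalize_emo_vec` implementation: it assumes an 8-dimensional emotion vector whose entries are clamped to [0, 1] and rescaled so their sum stays under a cap (the cap value here is invented for illustration).

```python
# Hypothetical sketch of emotion-vector normalization (NOT the real
# IndexTTS2 code): clamp each weight to [0, 1], then rescale so the
# total emotional intensity does not exceed an assumed cap.
def normalize_emo_vec(vec, cap=0.8):
    clamped = [min(max(v, 0.0), 1.0) for v in vec]
    total = sum(clamped)
    if total > cap:
        clamped = [v * cap / total for v in clamped]
    return clamped
```

A vector like `[1.5, 0.5, 0, 0, 0, 0, 0, 0]` would first be clamped, then scaled down so the summed intensity equals the cap.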

IndexTTS (indextts/infer.py):

  • Legacy IndexTTS1 inference class
  • Basic zero-shot voice cloning without emotion control

Model Components

GPT Model (indextts/gpt/model_v2.py):

  • UnifiedVoice: Main GPT-based speech language model
  • GPT2InferenceModel: Inference wrapper with KV-cache support
  • Uses Conformer encoder for audio conditioning
  • Perceiver resampler for emotion conditioning

Semantic Codec (indextts/utils/maskgct_utils.py):

  • Encodes audio into semantic tokens
  • Uses Wav2Vec-BERT 2.0 for feature extraction

S2MEL Module (indextts/s2mel/):

  • Converts semantic tokens to mel-spectrogram
  • Flow-matching based diffusion transformer (DiT)

BigVGAN (indextts/BigVGAN/ and indextts/s2mel/modules/bigvgan/):

  • Neural vocoder for final audio generation
  • Custom CUDA kernel support for acceleration

Text Processing

TextNormalizer (indextts/utils/front.py):

  • Chinese and English text normalization
  • Pinyin support for pronunciation control
  • Term glossary for custom pronunciations
  • Email, name, and technical term patterns

TextTokenizer:

  • SentencePiece BPE tokenization
  • Uses a 12,000-token vocabulary
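
The normalize-then-tokenize flow can be illustrated with a toy stand-in. Everything here is a placeholder: the real pipeline uses TextNormalizer for full Chinese/English normalization and a SentencePiece BPE model for subword tokenization, neither of which is reproduced below.

```python
import re

# Toy stand-in for the normalize -> tokenize flow (illustrative only).
def toy_normalize(text):
    # Collapse whitespace and expand a tiny digit glossary, as a
    # placeholder for full text normalization.
    digits = {"0": "zero", "1": "one", "2": "two", "3": "three"}
    text = re.sub(r"\s+", " ", text.strip())
    return re.sub(r"\d", lambda m: digits.get(m.group(), m.group()), text)

def toy_tokenize(text):
    # Whitespace split as a placeholder for BPE subword tokenization.
    return text.lower().split(" ")

tokens = toy_tokenize(toy_normalize("Chapter  3 begins"))
```

The real tokenizer produces subword IDs rather than whole words, but the two-stage shape (normalize, then tokenize) is the same.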

Configuration

Main Config File (checkpoints/config.yaml)

Key configuration sections:

  • dataset: Audio sampling parameters (24kHz, mel-spectrogram settings)
  • gpt: GPT model architecture (1280 dim, 24 layers, 20 heads)
  • semantic_codec: Semantic codec parameters
  • s2mel: S2MEL module configuration (DiT architecture)
  • Model checkpoint paths

Important Paths

  • gpt_checkpoint: GPT model weights (gpt.pth)
  • s2mel_checkpoint: S2MEL model weights (s2mel.pth)
  • w2v_stat: Wav2Vec statistics (wav2vec2bert_stats.pt)
  • qwen_emo_path: Qwen emotion model path
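
The overall shape of the file looks roughly like the fragment below. The values shown are those stated above (24 kHz audio, 1280-dim/24-layer/20-head GPT, checkpoint filenames); the exact key names are assumptions, so consult the downloaded `checkpoints/config.yaml` for the authoritative structure.

```yaml
# Illustrative shape only -- key names are guesses, values are from
# the sections listed above.
dataset:
  sample_rate: 24000
gpt:
  model_dim: 1280
  layers: 24
  heads: 20
gpt_checkpoint: gpt.pth
s2mel_checkpoint: s2mel.pth
w2v_stat: wav2vec2bert_stats.pt
```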

Code Style Guidelines

Import Conventions

  • Standard library imports first
  • Third-party imports (torch, transformers) second
  • Internal module imports last
  • Use absolute imports for project modules

Type Hints

  • Optional type hints for function parameters
  • Use typing module for complex types

Documentation

  • Docstrings for classes and public methods
  • Chinese comments common in text processing modules
  • English comments in model architecture code

Naming Conventions

  • snake_case for functions and variables
  • PascalCase for classes
  • Private methods prefixed with _
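
Put together, the conventions look like this (all names below are invented for the example, not taken from the codebase):

```python
# Minimal illustration of the naming conventions above.
class AudioTokenizer:                 # PascalCase class name
    def encode_waveform(self, wav):   # snake_case public method
        return self._quantize(wav)

    def _quantize(self, wav):         # leading underscore: private helper
        frame_length = 4              # snake_case local variable
        return [tuple(wav[i:i + frame_length])
                for i in range(0, len(wav), frame_length)]
```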

Development Conventions

Adding New Features

  1. Maintain backward compatibility with IndexTTS1
  2. Use OmegaConf for configuration management
  3. Add appropriate warnings for experimental features
  4. Update example cases in examples/cases.jsonl

Device Handling

Always support multiple device types:

  • CUDA (NVIDIA GPUs)
  • MPS (Apple Silicon)
  • XPU (Intel GPUs)
  • CPU (fallback)

Example pattern:

if device is not None:
    self.device = device
elif torch.cuda.is_available():
    self.device = "cuda:0"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    self.device = "xpu"
elif hasattr(torch, "mps") and torch.backends.mps.is_available():
    self.device = "mps"
else:
    self.device = "cpu"

Memory Optimization

  • Support FP16 inference for lower VRAM usage
  • Implement KV-cache for GPT inference
  • Use torch.no_grad() context for inference
  • Clear CUDA cache when switching devices

Security and Usage Restrictions

The project includes a DISCLAIMER file outlining usage restrictions:

  • Do NOT synthesize voices of political figures or public figures without authorization
  • Do NOT create content that defames, insults, or discriminates
  • Do NOT use for fraud or identity theft
  • Do NOT generate false information or social panic
  • Do NOT use for commercial purposes without authorization
  • Do NOT create inappropriate content involving minors

Version History

  • IndexTTS2 (2025/09/08): Emotion control, duration control
  • IndexTTS1.5 (2025/05/14): Stability improvements, better English
  • IndexTTS1.0 (2025/03/25): Initial release

Troubleshooting

Common Issues

  1. CUDA errors: Ensure CUDA Toolkit 12.8+ is installed
  2. Slow inference: Enable --fp16 for faster GPU inference
  3. Model download fails: Set HF_ENDPOINT="https://hf-mirror.com" for China users
  4. DeepSpeed fails on Windows: install without it (use --extra webui instead of --all-extras)

Debug Mode

Run with --verbose flag or set verbose=True in Python API for detailed logging.