IndexTTS - Agent's Guide

Project Overview

IndexTTS is an industrial-level, emotionally expressive and duration-controlled auto-regressive zero-shot Text-to-Speech (TTS) system developed by the Bilibili Index Team. The project enables high-quality voice cloning and emotional speech synthesis from a single reference audio file.

Key Features

  • Zero-shot voice cloning: Clone any voice from a single audio prompt
  • Emotion control: Independent control over timbre and emotion through multiple input modalities
  • Duration control: First autoregressive TTS model with precise synthesis duration control
  • Multilingual support: Supports both Chinese and English with mixed modeling
  • Pinyin support: Fine-grained pronunciation control via Pinyin annotations

Project Structure

index-tts/
├── indextts/                 # Main Python package
│   ├── accel/               # Acceleration engine for GPT2 optimization
│   ├── BigVGAN/             # BigVGAN vocoder implementation
│   ├── gpt/                 # GPT-based speech language model
│   │   ├── conformer/       # Conformer encoder components
│   │   ├── model.py         # IndexTTS1 model (UnifiedVoice)
│   │   └── model_v2.py      # IndexTTS2 model with emotion support
│   ├── s2mel/               # Semantic to mel-spectrogram module
│   │   ├── modules/         # Neural network modules
│   │   │   ├── bigvgan/     # BigVGAN vocoder
│   │   │   ├── campplus/    # Speaker encoder
│   │   │   └── ...
│   │   └── dac/             # DAC (Descript Audio Codec) utilities
│   ├── utils/               # Utility functions
│   │   ├── maskgct/         # MaskGCT codec models
│   │   ├── front.py         # Text normalization and tokenization
│   │   └── checkpoint.py    # Model checkpoint loading
│   ├── vqvae/               # VQ-VAE for audio tokenization
│   ├── cli.py               # Command-line interface
│   ├── infer.py             # IndexTTS1 inference (legacy)
│   └── infer_v2.py          # IndexTTS2 inference with emotion support
├── checkpoints/             # Model weights directory (downloaded separately)
├── models/                  # Auto-downloaded auxiliary models
├── examples/                # Sample audio prompts
├── tests/                   # Test scripts
├── tools/                   # Utility tools
│   ├── gpu_check.py         # GPU diagnostics tool
│   └── i18n/                # Internationalization utilities
├── webui.py                 # Gradio-based Web UI
├── pyproject.toml           # Python package configuration
└── uv.lock                  # Locked dependency versions

Technology Stack

Core Framework

  • PyTorch 2.10+: Deep learning framework with CUDA support
  • Transformers 4.52+: Hugging Face transformers for GPT2 architecture
  • DeepSpeed 0.17.1 (optional): Inference acceleration

Audio Processing

  • torchaudio: Audio I/O and transformations
  • librosa: Audio analysis and feature extraction
  • soundfile: Audio file reading/writing
  • BigVGAN: Neural vocoder for high-quality audio generation

Text Processing

  • sentencepiece: Text tokenization (BPE model)
  • jieba: Chinese text segmentation
  • g2p-en: English grapheme-to-phoneme conversion
  • wetext/WeTextProcessing: Text normalization

Key Dependencies

  • OmegaConf: YAML configuration management
  • safetensors: Safe tensor serialization
  • modelscope: Model hub for downloading Chinese models
  • huggingface-hub: Model hub for international models
  • gradio 5.45+: Web UI framework (optional)

Build and Development

Environment Setup

The project uses uv as the only supported package manager:

# Install uv
pip install -U uv

# Install dependencies (creates .venv automatically)
uv sync --all-extras

# For users in China (use a local PyPI mirror)
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"

Available Extra Features

  • --extra webui: Gradio WebUI support
  • --extra deepspeed: DeepSpeed inference acceleration
  • --all-extras: Install all optional features

Model Download

Download models from HuggingFace:

uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints

Or from ModelScope:

uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints

Running the Application

Web UI:

uv run webui.py
# Options:
uv run webui.py --fp16 --deepspeed --cuda_kernel --port 7860

CLI (IndexTTS1 only):

uv run indextts "Text to synthesize" -v examples/voice_01.wav -o output.wav

Python API:

# IndexTTS2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py

# IndexTTS1 (legacy)
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer.py

GPU Check

uv run tools/gpu_check.py

Testing

Test Files

  • tests/regression_test.py: Regression tests for TTS inference
  • tests/padding_test.py: Tests text token padding behavior

Running Tests

# Regression test
uv run tests/regression_test.py

# Padding test
uv run tests/padding_test.py checkpoints

Code Organization

Main Inference Classes

IndexTTS2 (indextts/infer_v2.py):

  • Main class for IndexTTS2 inference with emotion support
  • Supports speaker prompt, emotion prompt, emotion vectors, and text-based emotion control
  • Key methods:
    • infer(): Main inference with multiple emotion control modes
    • normalize_emo_vec(): Normalize emotion vectors
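
The emotion-vector path can be pictured with a small stand-in. This is a hypothetical sketch, not the project's actual `normalize_emo_vec` implementation: it assumes an 8-dimensional emotion vector whose entries are clamped to [0, 1] and rescaled so their sum stays under a cap (the cap value here is invented for illustration).

```python
# Hypothetical sketch of emotion-vector normalization (NOT the real
# IndexTTS2 code): clamp each weight to [0, 1], then rescale so the
# total emotional intensity does not exceed an assumed cap.
def normalize_emo_vec(vec, cap=0.8):
    clamped = [min(max(v, 0.0), 1.0) for v in vec]
    total = sum(clamped)
    if total > cap:
        clamped = [v * cap / total for v in clamped]
    return clamped
```

A vector like `[1.5, 0.5, 0, 0, 0, 0, 0, 0]` would first be clamped, then scaled down so the summed intensity equals the cap.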

IndexTTS (indextts/infer.py):

  • Legacy IndexTTS1 inference class
  • Basic zero-shot voice cloning without emotion control

Model Components

GPT Model (indextts/gpt/model_v2.py):

  • UnifiedVoice: Main GPT-based speech language model
  • GPT2InferenceModel: Inference wrapper with KV-cache support
  • Uses Conformer encoder for audio conditioning
  • Perceiver resampler for emotion conditioning

Semantic Codec (indextts/utils/maskgct_utils.py):

  • Encodes audio into semantic tokens
  • Uses Wav2Vec-BERT 2.0 for feature extraction

S2MEL Module (indextts/s2mel/):

  • Converts semantic tokens to mel-spectrogram
  • Flow-matching based diffusion transformer (DiT)

BigVGAN (indextts/BigVGAN/ and indextts/s2mel/modules/bigvgan/):

  • Neural vocoder for final audio generation
  • Custom CUDA kernel support for acceleration

Text Processing

TextNormalizer (indextts/utils/front.py):

  • Chinese and English text normalization
  • Pinyin support for pronunciation control
  • Term glossary for custom pronunciations
  • Email, name, and technical term patterns

TextTokenizer:

  • SentencePiece BPE tokenization
  • Uses a 12,000-token vocabulary
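
The normalize-then-tokenize flow can be illustrated with a toy stand-in. Everything here is a placeholder: the real pipeline uses TextNormalizer for full Chinese/English normalization and a SentencePiece BPE model for subword tokenization, neither of which is reproduced below.

```python
import re

# Toy stand-in for the normalize -> tokenize flow (illustrative only).
def toy_normalize(text):
    # Collapse whitespace and expand a tiny digit glossary, as a
    # placeholder for full text normalization.
    digits = {"0": "zero", "1": "one", "2": "two", "3": "three"}
    text = re.sub(r"\s+", " ", text.strip())
    return re.sub(r"\d", lambda m: digits.get(m.group(), m.group()), text)

def toy_tokenize(text):
    # Whitespace split as a placeholder for BPE subword tokenization.
    return text.lower().split(" ")

tokens = toy_tokenize(toy_normalize("Chapter  3 begins"))
```

The real tokenizer produces subword IDs rather than whole words, but the two-stage shape (normalize, then tokenize) is the same.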

Configuration

Main Config File (checkpoints/config.yaml)

Key configuration sections:

  • dataset: Audio sampling parameters (24kHz, mel-spectrogram settings)
  • gpt: GPT model architecture (1280 dim, 24 layers, 20 heads)
  • semantic_codec: Semantic codec parameters
  • s2mel: S2MEL module configuration (DiT architecture)
  • Model checkpoint paths

Important Paths

  • gpt_checkpoint: GPT model weights (gpt.pth)
  • s2mel_checkpoint: S2MEL model weights (s2mel.pth)
  • w2v_stat: Wav2Vec statistics (wav2vec2bert_stats.pt)
  • qwen_emo_path: Qwen emotion model path
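
The overall shape of the file looks roughly like the fragment below. The values shown are those stated above (24 kHz audio, 1280-dim/24-layer/20-head GPT, checkpoint filenames); the exact key names are assumptions, so consult the downloaded `checkpoints/config.yaml` for the authoritative structure.

```yaml
# Illustrative shape only -- key names are guesses, values are from
# the sections listed above.
dataset:
  sample_rate: 24000
gpt:
  model_dim: 1280
  layers: 24
  heads: 20
gpt_checkpoint: gpt.pth
s2mel_checkpoint: s2mel.pth
w2v_stat: wav2vec2bert_stats.pt
```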

Code Style Guidelines

Import Conventions

  • Standard library imports first
  • Third-party imports (torch, transformers) second
  • Internal module imports last
  • Use absolute imports for project modules

Type Hints

  • Optional type hints for function parameters
  • Use typing module for complex types

Documentation

  • Docstrings for classes and public methods
  • Chinese comments common in text processing modules
  • English comments in model architecture code

Naming Conventions

  • snake_case for functions and variables
  • PascalCase for classes
  • Private methods prefixed with _
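
Put together, the conventions look like this (all names below are invented for the example, not taken from the codebase):

```python
# Minimal illustration of the naming conventions above.
class AudioTokenizer:                 # PascalCase class name
    def encode_waveform(self, wav):   # snake_case public method
        return self._quantize(wav)

    def _quantize(self, wav):         # leading underscore: private helper
        frame_length = 4              # snake_case local variable
        return [tuple(wav[i:i + frame_length])
                for i in range(0, len(wav), frame_length)]
```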

Development Conventions

Adding New Features

  1. Maintain backward compatibility with IndexTTS1
  2. Use OmegaConf for configuration management
  3. Add appropriate warnings for experimental features
  4. Update example cases in examples/cases.jsonl

Device Handling

Always support multiple device types:

  • CUDA (NVIDIA GPUs)
  • MPS (Apple Silicon)
  • XPU (Intel GPUs)
  • CPU (fallback)

Example pattern:

if device is not None:
    self.device = device
elif torch.cuda.is_available():
    self.device = "cuda:0"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    self.device = "xpu"
elif hasattr(torch, "mps") and torch.backends.mps.is_available():
    self.device = "mps"
else:
    self.device = "cpu"

Memory Optimization

  • Support FP16 inference for lower VRAM usage
  • Implement KV-cache for GPT inference
  • Use torch.no_grad() context for inference
  • Clear CUDA cache when switching devices

Security and Usage Restrictions

The project includes a DISCLAIMER file outlining usage restrictions:

  • Do NOT synthesize voices of political figures or public figures without authorization
  • Do NOT create content that defames, insults, or discriminates
  • Do NOT use for fraud or identity theft
  • Do NOT generate false information or social panic
  • Do NOT use for commercial purposes without authorization
  • Do NOT create inappropriate content involving minors

Version History

  • IndexTTS2 (2025/09/08): Emotion control, duration control
  • IndexTTS1.5 (2025/05/14): Stability improvements, better English
  • IndexTTS1.0 (2025/03/25): Initial release

Troubleshooting

Common Issues

  1. CUDA errors: Ensure CUDA Toolkit 12.8+ is installed
  2. Slow inference: Enable --fp16 for faster GPU inference
  3. Model download fails: Set HF_ENDPOINT="https://hf-mirror.com" for China users
  4. DeepSpeed fails on Windows: install without it (use --extra webui instead of --all-extras)

Debug Mode

Run with --verbose flag or set verbose=True in Python API for detailed logging.