# IndexTTS - Agent's Guide
## Project Overview
**IndexTTS** is an industrial-level, emotionally expressive and duration-controlled auto-regressive zero-shot Text-to-Speech (TTS) system developed by the Bilibili Index Team. The project enables high-quality voice cloning and emotional speech synthesis from a single reference audio file.
### Key Features
- **Zero-shot voice cloning**: Clone any voice from a single audio prompt
- **Emotion control**: Independent control over timbre and emotion through multiple input modalities
- **Duration control**: First autoregressive TTS model with precise synthesis duration control
- **Multilingual support**: Supports both Chinese and English with mixed modeling
- **Pinyin support**: Fine-grained pronunciation control via Pinyin annotations
### Project Structure
```
index-tts/
├── indextts/              # Main Python package
│   ├── accel/             # Acceleration engine for GPT2 optimization
│   ├── BigVGAN/           # BigVGAN vocoder implementation
│   ├── gpt/               # GPT-based speech language model
│   │   ├── conformer/     # Conformer encoder components
│   │   ├── model.py       # IndexTTS1 model (UnifiedVoice)
│   │   └── model_v2.py    # IndexTTS2 model with emotion support
│   ├── s2mel/             # Semantic to mel-spectrogram module
│   │   ├── modules/       # Neural network modules
│   │   │   ├── bigvgan/   # BigVGAN vocoder
│   │   │   ├── campplus/  # Speaker encoder
│   │   │   └── ...
│   │   └── dac/           # DAC (Digital Audio Codec) utilities
│   ├── utils/             # Utility functions
│   │   ├── maskgct/       # MaskGCT codec models
│   │   ├── front.py       # Text normalization and tokenization
│   │   └── checkpoint.py  # Model checkpoint loading
│   ├── vqvae/             # VQ-VAE for audio tokenization
│   ├── cli.py             # Command-line interface
│   ├── infer.py           # IndexTTS1 inference (legacy)
│   └── infer_v2.py        # IndexTTS2 inference with emotion support
├── checkpoints/           # Model weights directory (downloaded separately)
├── models/                # Auto-downloaded auxiliary models
├── examples/              # Sample audio prompts
├── tests/                 # Test scripts
├── tools/                 # Utility tools
│   ├── gpu_check.py       # GPU diagnostics tool
│   └── i18n/              # Internationalization utilities
├── webui.py               # Gradio-based Web UI
├── pyproject.toml         # Python package configuration
└── uv.lock                # Locked dependency versions
```
## Technology Stack
### Core Framework
- **PyTorch 2.10+**: Deep learning framework with CUDA support
- **Transformers 4.52+**: Hugging Face transformers for GPT2 architecture
- **DeepSpeed 0.17.1** (optional): Inference acceleration
### Audio Processing
- **torchaudio**: Audio I/O and transformations
- **librosa**: Audio analysis and feature extraction
- **soundfile**: Audio file reading/writing
- **BigVGAN**: Neural vocoder for high-quality audio generation
### Text Processing
- **sentencepiece**: Text tokenization (BPE model)
- **jieba**: Chinese text segmentation
- **g2p-en**: English grapheme-to-phoneme conversion
- **wetext/WeTextProcessing**: Text normalization
### Key Dependencies
- **OmegaConf**: YAML configuration management
- **safetensors**: Safe tensor serialization
- **modelscope**: Model hub for downloading Chinese models
- **huggingface-hub**: Model hub for international models
- **gradio 5.45+**: Web UI framework (optional)
## Build and Development
### Environment Setup
The project uses **uv** as the only supported package manager:
```bash
# Install uv
pip install -U uv
# Install dependencies (creates .venv automatically)
uv sync --all-extras
# For China users (use local mirror)
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"
```
### Available Extra Features
- `--extra webui`: Gradio WebUI support
- `--extra deepspeed`: DeepSpeed inference acceleration
- `--all-extras`: Install all optional features
### Model Download
Download models from HuggingFace:
```bash
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
```
Or from ModelScope:
```bash
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
```
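The download can also be scripted. A minimal sketch using `huggingface_hub.snapshot_download` (the repo id matches the commands above; verify it against the current model card):

```python
# Sketch of a programmatic alternative to the CLI downloads above,
# assuming huggingface-hub is installed in the environment.
def download_checkpoints(local_dir: str = "checkpoints") -> str:
    from huggingface_hub import snapshot_download  # deferred import
    return snapshot_download(repo_id="IndexTeam/IndexTTS-2", local_dir=local_dir)
```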
### Running the Application
**Web UI:**
```bash
uv run webui.py
# Options:
uv run webui.py --fp16 --deepspeed --cuda_kernel --port 7860
```
**CLI (IndexTTS1 only):**
```bash
uv run indextts "Text to synthesize" -v examples/voice_01.wav -o output.wav
```
**Python API:**
```bash
# IndexTTS2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
# IndexTTS1 (legacy)
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer.py
```
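The entry scripts above run built-in demo code; for scripted use, the `IndexTTS2` class can also be called directly. A minimal sketch, assuming models are in `checkpoints/`; the constructor and `infer()` parameter names shown here are assumptions to check against the current release:

```python
# Hedged sketch: calling IndexTTS2 from Python. The import is deferred so
# the helper can be defined without the indextts package installed;
# cfg_path / model_dir / spk_audio_prompt are assumed parameter names.
def synthesize(text: str, voice_wav: str, out_path: str = "gen.wav") -> None:
    from indextts.infer_v2 import IndexTTS2  # requires downloaded checkpoints
    tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")
    tts.infer(spk_audio_prompt=voice_wav, text=text, output_path=out_path)

# Example call (after `uv sync` and model download):
# synthesize("Hello from IndexTTS2.", "examples/voice_01.wav")
```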
### GPU Check
```bash
uv run tools/gpu_check.py
```
## Testing
### Test Files
- `tests/regression_test.py`: Regression tests for TTS inference
- `tests/padding_test.py`: Tests text token padding behavior
### Running Tests
```bash
# Regression test
uv run tests/regression_test.py
# Padding test
uv run tests/padding_test.py checkpoints
```
## Code Organization
### Main Inference Classes
**IndexTTS2** (`indextts/infer_v2.py`):
- Main class for IndexTTS2 inference with emotion support
- Supports speaker prompt, emotion prompt, emotion vectors, and text-based emotion control
- Key methods:
- `infer()`: Main inference with multiple emotion control modes
- `normalize_emo_vec()`: Normalize emotion vectors
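To picture what emotion-vector normalization involves, here is a hypothetical re-implementation (not the code in `infer_v2.py`; the real method, the cap value, and the vector's dimensionality may differ):

```python
def normalize_emo_vec(vec, cap=0.8):
    # Hypothetical sketch: rescale an emotion vector so its total
    # intensity does not exceed `cap`, leaving weak vectors untouched.
    total = sum(vec)
    if total > cap:
        vec = [v * cap / total for v in vec]
    return vec
```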
**IndexTTS** (`indextts/infer.py`):
- Legacy IndexTTS1 inference class
- Basic zero-shot voice cloning without emotion control
### Model Components
**GPT Model** (`indextts/gpt/model_v2.py`):
- `UnifiedVoice`: Main GPT-based speech language model
- `GPT2InferenceModel`: Inference wrapper with KV-cache support
- Uses Conformer encoder for audio conditioning
- Perceiver resampler for emotion conditioning
**Semantic Codec** (`indextts/utils/maskgct_utils.py`):
- Encodes audio into semantic tokens
- Uses Wav2Vec-BERT 2.0 for feature extraction
**S2MEL Module** (`indextts/s2mel/`):
- Converts semantic tokens to mel-spectrogram
- Flow-matching based diffusion transformer
**BigVGAN** (`indextts/BigVGAN/` and `indextts/s2mel/modules/bigvgan/`):
- Neural vocoder for final audio generation
- Custom CUDA kernel support for acceleration
### Text Processing
**TextNormalizer** (`indextts/utils/front.py`):
- Chinese and English text normalization
- Pinyin support for pronunciation control
- Term glossary for custom pronunciations
- Email, name, and technical term patterns
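The term-glossary behaviour can be pictured with a toy sketch (hypothetical; the actual `TextNormalizer` in `indextts/utils/front.py` uses richer pattern matching):

```python
def apply_glossary(text: str, glossary: dict) -> str:
    # Toy illustration: replace glossary terms with their custom
    # pronunciations before tokenization. Longest terms first so that
    # overlapping entries do not clobber each other.
    for term in sorted(glossary, key=len, reverse=True):
        text = text.replace(term, glossary[term])
    return text
```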
**TextTokenizer**:
- SentencePiece BPE tokenization
- Vocabulary of 12,000 text tokens
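Loading the tokenizer directly can be sketched as follows (the `bpe.model` filename under `checkpoints/` is an assumption; check the downloaded files):

```python
def load_text_tokenizer(model_path: str = "checkpoints/bpe.model"):
    # Deferred import so this sketch parses without sentencepiece installed.
    import sentencepiece as spm
    return spm.SentencePieceProcessor(model_file=model_path)
```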
## Configuration
### Main Config File (`checkpoints/config.yaml`)
Key configuration sections:
- `dataset`: Audio sampling parameters (24kHz, mel-spectrogram settings)
- `gpt`: GPT model architecture (1280 dim, 24 layers, 20 heads)
- `semantic_codec`: Semantic codec parameters
- `s2mel`: S2MEL module configuration (DiT architecture)
- Model checkpoint paths
### Important Paths
- `gpt_checkpoint`: GPT model weights (`gpt.pth`)
- `s2mel_checkpoint`: S2MEL model weights (`s2mel.pth`)
- `w2v_stat`: Wav2Vec statistics (`wav2vec2bert_stats.pt`)
- `qwen_emo_path`: Qwen emotion model path
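Put together, the checkpoint-path keys above look roughly like this in `checkpoints/config.yaml` (an illustrative fragment only; the shipped file has additional sections and may nest these keys differently):

```yaml
# Illustrative fragment -- consult the downloaded config.yaml
gpt_checkpoint: gpt.pth
s2mel_checkpoint: s2mel.pth
w2v_stat: wav2vec2bert_stats.pt
qwen_emo_path: <path to the Qwen emotion model>
```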
## Code Style Guidelines
### Import Conventions
- Standard library imports first
- Third-party imports (torch, transformers) second
- Internal module imports last
- Use absolute imports for project modules
### Type Hints
- Optional type hints for function parameters
- Use `typing` module for complex types
### Documentation
- Docstrings for classes and public methods
- Chinese comments common in text processing modules
- English comments in model architecture code
### Naming Conventions
- `snake_case` for functions and variables
- `PascalCase` for classes
- Private methods prefixed with `_`
## Development Conventions
### Adding New Features
1. Maintain backward compatibility with IndexTTS1
2. Use OmegaConf for configuration management
3. Add appropriate warnings for experimental features
4. Update example cases in `examples/cases.jsonl`
### Device Handling
Always support multiple device types:
- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- XPU (Intel GPUs)
- CPU (fallback)
Example pattern:
```python
if device is not None:
    self.device = device
elif torch.cuda.is_available():
    self.device = "cuda:0"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    self.device = "xpu"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    self.device = "mps"
else:
    self.device = "cpu"
```
### Memory Optimization
- Support FP16 inference for lower VRAM usage
- Implement KV-cache for GPT inference
- Use `torch.no_grad()` context for inference
- Clear CUDA cache when switching devices
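These optimizations can be combined in a single wrapper. A sketch under the assumption of a CUDA device (torch is imported lazily so the snippet parses without it; the function and parameter names are illustrative, not project API):

```python
import contextlib

def infer_with_memory_opts(model, inputs, device: str = "cuda:0", fp16: bool = True):
    # Sketch of the memory optimizations above: no_grad plus optional
    # fp16 autocast, and a cache clear afterwards.
    import torch  # deferred import
    autocast = (
        torch.autocast(device_type="cuda", dtype=torch.float16)
        if fp16 and device.startswith("cuda")
        else contextlib.nullcontext()
    )
    with torch.no_grad(), autocast:
        out = model(inputs)
    if device.startswith("cuda"):
        torch.cuda.empty_cache()  # e.g. before switching devices
    return out
```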
## Security and Usage Restrictions
The project includes a **DISCLAIMER** file outlining usage restrictions:
- Do NOT synthesize voices of political figures or public figures without authorization
- Do NOT create content that defames, insults, or discriminates
- Do NOT use for fraud or identity theft
- Do NOT generate false information or social panic
- Do NOT use for commercial purposes without authorization
- Do NOT create inappropriate content involving minors
## Version History
- **IndexTTS2** (2025/09/08): Emotion control, duration control
- **IndexTTS1.5** (2025/05/14): Stability improvements, better English
- **IndexTTS1.0** (2025/03/25): Initial release
## Useful Resources
- Paper (IndexTTS2): https://arxiv.org/abs/2506.21619
- HuggingFace Demo: https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
- ModelScope Demo: https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo
- GitHub: https://github.com/index-tts/index-tts
## Troubleshooting
### Common Issues
1. **CUDA errors**: Ensure CUDA Toolkit 12.8+ is installed
2. **Slow inference**: Enable `--fp16` for faster GPU inference
3. **Model download fails**: Set `HF_ENDPOINT="https://hf-mirror.com"` for China users
4. **DeepSpeed fails on Windows**: install without it, e.g. `uv sync --extra webui` instead of `--all-extras`
### Debug Mode
Run with `--verbose` flag or set `verbose=True` in Python API for detailed logging.