# IndexTTS - Agent's Guide

## Project Overview

**IndexTTS** is an industrial-grade, emotionally expressive, duration-controllable, autoregressive zero-shot Text-to-Speech (TTS) system developed by the Bilibili Index Team. It enables high-quality voice cloning and emotional speech synthesis from a single reference audio file.

### Key Features

- **Zero-shot voice cloning**: Clone any voice from a single audio prompt
- **Emotion control**: Independent control over timbre and emotion through multiple input modalities
- **Duration control**: First autoregressive TTS model with precise control over synthesis duration
- **Multilingual support**: Mixed Chinese and English modeling
- **Pinyin support**: Fine-grained pronunciation control via Pinyin annotations

### Project Structure

```
index-tts/
├── indextts/                  # Main Python package
│   ├── accel/                 # Acceleration engine for GPT2 optimization
│   ├── BigVGAN/               # BigVGAN vocoder implementation
│   ├── gpt/                   # GPT-based speech language model
│   │   ├── conformer/         # Conformer encoder components
│   │   ├── model.py           # IndexTTS1 model (UnifiedVoice)
│   │   └── model_v2.py        # IndexTTS2 model with emotion support
│   ├── s2mel/                 # Semantic-to-mel-spectrogram module
│   │   ├── modules/           # Neural network modules
│   │   │   ├── bigvgan/       # BigVGAN vocoder
│   │   │   ├── campplus/      # Speaker encoder
│   │   │   └── ...
│   │   └── dac/               # DAC (Digital Audio Codec) utilities
│   ├── utils/                 # Utility functions
│   │   ├── maskgct/           # MaskGCT codec models
│   │   ├── front.py           # Text normalization and tokenization
│   │   └── checkpoint.py      # Model checkpoint loading
│   ├── vqvae/                 # VQ-VAE for audio tokenization
│   ├── cli.py                 # Command-line interface
│   ├── infer.py               # IndexTTS1 inference (legacy)
│   └── infer_v2.py            # IndexTTS2 inference with emotion support
├── checkpoints/               # Model weights directory (downloaded separately)
├── models/                    # Auto-downloaded auxiliary models
├── examples/                  # Sample audio prompts
├── tests/                     # Test scripts
├── tools/                     # Utility tools
│   ├── gpu_check.py           # GPU diagnostics tool
│   └── i18n/                  # Internationalization utilities
├── webui.py                   # Gradio-based Web UI
├── pyproject.toml             # Python package configuration
└── uv.lock                    # Locked dependency versions
```

## Technology Stack

### Core Framework

- **PyTorch 2.10+**: Deep learning framework with CUDA support
- **Transformers 4.52+**: Hugging Face Transformers for the GPT-2 architecture
- **DeepSpeed 0.17.1** (optional): Inference acceleration

### Audio Processing

- **torchaudio**: Audio I/O and transformations
- **librosa**: Audio analysis and feature extraction
- **soundfile**: Audio file reading/writing
- **BigVGAN**: Neural vocoder for high-quality audio generation

### Text Processing

- **sentencepiece**: Text tokenization (BPE model)
- **jieba**: Chinese text segmentation
- **g2p-en**: English grapheme-to-phoneme conversion
- **wetext/WeTextProcessing**: Text normalization

### Key Dependencies

- **OmegaConf**: YAML configuration management
- **safetensors**: Safe tensor serialization
- **modelscope**: Model hub for downloading models in China
- **huggingface-hub**: Model hub for international users
- **gradio 5.45+**: Web UI framework (optional)

## Build and Development

### Environment Setup

The project uses **uv** as the only supported package manager:

```bash
# Install uv
pip install -U uv

# Install dependencies (creates .venv automatically)
uv sync --all-extras

# For users in China (local mirror)
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"
```

### Available Extra Features

- `--extra webui`: Gradio WebUI support
- `--extra deepspeed`: DeepSpeed inference acceleration
- `--all-extras`: Install all optional features

### Model Download

Download models from HuggingFace:

```bash
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
```

Or from ModelScope:

```bash
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
```

### Running the Application

**Web UI:**

```bash
uv run webui.py

# With options:
uv run webui.py --fp16 --deepspeed --cuda_kernel --port 7860
```

**CLI (IndexTTS1 only):**

```bash
uv run indextts "Text to synthesize" -v examples/voice_01.wav -o output.wav
```

**Python API:**

```bash
# IndexTTS2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py

# IndexTTS1 (legacy)
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer.py
```

### GPU Check

```bash
uv run tools/gpu_check.py
```

## Testing

### Test Files

- `tests/regression_test.py`: Regression tests for TTS inference
- `tests/padding_test.py`: Tests text-token padding behavior

### Running Tests

```bash
# Regression test
uv run tests/regression_test.py

# Padding test
uv run tests/padding_test.py checkpoints
```

## Code Organization

### Main Inference Classes

**IndexTTS2** (`indextts/infer_v2.py`):

- Main class for IndexTTS2 inference with emotion support
- Supports speaker prompts, emotion prompts, emotion vectors, and text-based emotion control
- Key methods:
  - `infer()`: Main inference entry point with multiple emotion control modes
  - `normalize_emo_vec()`: Normalize emotion vectors
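
This guide names `normalize_emo_vec()` but not its formula. As a loose illustration of what normalizing an emotion vector can mean (not the actual implementation in `indextts/infer_v2.py`), one common scheme clamps each weight to `[0, 1]` and rescales when the total exceeds a cap; the `max_total=0.8` cap below is an invented placeholder:

```python
def normalize_emo_vec(vec, max_total=0.8):
    """Illustrative sketch only: clamp each emotion weight to [0, 1],
    then rescale so the weights sum to at most max_total. The real
    method in indextts/infer_v2.py may differ."""
    clamped = [min(max(v, 0.0), 1.0) for v in vec]
    total = sum(clamped)
    if total > max_total:
        clamped = [v * max_total / total for v in clamped]
    return clamped

print(normalize_emo_vec([0.5, 0.5, 0.0]))  # -> [0.4, 0.4, 0.0]
```

Keeping the summed emotion weight below 1.0 is a typical way to avoid over-saturating the conditioning signal, but the cap value here is purely for demonstration.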

**IndexTTS** (`indextts/infer.py`):

- Legacy IndexTTS1 inference class
- Basic zero-shot voice cloning without emotion control

### Model Components

**GPT Model** (`indextts/gpt/model_v2.py`):

- `UnifiedVoice`: Main GPT-based speech language model
- `GPT2InferenceModel`: Inference wrapper with KV-cache support
- Uses a Conformer encoder for audio conditioning
- Uses a Perceiver resampler for emotion conditioning

**Semantic Codec** (`indextts/utils/maskgct_utils.py`):

- Encodes audio into semantic tokens
- Uses Wav2Vec-BERT 2.0 for feature extraction

**S2MEL Module** (`indextts/s2mel/`):

- Converts semantic tokens to mel-spectrograms
- Flow-matching-based diffusion transformer

**BigVGAN** (`indextts/BigVGAN/` and `indextts/s2mel/modules/bigvgan/`):

- Neural vocoder for final audio generation
- Optional custom CUDA kernels for acceleration

### Text Processing

**TextNormalizer** (`indextts/utils/front.py`):

- Chinese and English text normalization
- Pinyin support for pronunciation control
- Term glossary for custom pronunciations
- Patterns for emails, names, and technical terms
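
The term glossary is essentially a find-and-replace pass that runs before tokenization. The real `TextNormalizer` in `indextts/utils/front.py` is far more involved; this standalone sketch only illustrates the glossary idea, and the entries in `GLOSSARY` are made-up examples, not project data:

```python
import re

# Hypothetical glossary entries (not the project's actual data):
# map written terms to the spelled-out form the TTS should read.
GLOSSARY = {
    "GPU": "G P U",
    "TTS": "text to speech",
}

def apply_glossary(text: str, glossary: dict) -> str:
    """Replace whole-word glossary terms with their custom readings."""
    # Longest terms first so overlapping entries resolve predictably.
    for term in sorted(glossary, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(term)}\b", glossary[term], text)
    return text

print(apply_glossary("The TTS runs on one GPU.", GLOSSARY))
# -> The text to speech runs on one G P U.
```

The `\b` word boundaries keep the replacement from firing inside longer words (e.g. "GPUs" is left alone), which matters for acronym-heavy technical text.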

**TextTokenizer**:

- SentencePiece BPE tokenization
- 12,000-token text vocabulary

## Configuration

### Main Config File (`checkpoints/config.yaml`)

Key configuration sections:

- `dataset`: Audio sampling parameters (24 kHz, mel-spectrogram settings)
- `gpt`: GPT model architecture (1280-dim, 24 layers, 20 heads)
- `semantic_codec`: Semantic codec parameters
- `s2mel`: S2MEL module configuration (DiT architecture)
- Model checkpoint paths

### Important Paths

- `gpt_checkpoint`: GPT model weights (`gpt.pth`)
- `s2mel_checkpoint`: S2MEL model weights (`s2mel.pth`)
- `w2v_stat`: Wav2Vec statistics (`wav2vec2bert_stats.pt`)
- `qwen_emo_path`: Qwen emotion model path
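
Putting the two lists together, the file's overall shape can be sketched as follows. Only the top-level section names and the checkpoint filenames come from this guide; every nested key and placeholder value is illustrative, so treat the shipped `checkpoints/config.yaml` as the sole authority:

```yaml
# Illustrative sketch only -- nested key names are placeholders.
dataset:            # audio sampling parameters (24 kHz, mel settings)
  sample_rate: 24000
gpt:                # GPT architecture: 1280-dim, 24 layers, 20 heads
  # ...
semantic_codec:
  # ...
s2mel:              # DiT-based S2MEL configuration
  # ...
gpt_checkpoint: gpt.pth
s2mel_checkpoint: s2mel.pth
w2v_stat: wav2vec2bert_stats.pt
qwen_emo_path:      # path to the Qwen emotion model
```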

## Code Style Guidelines

### Import Conventions

- Standard-library imports first
- Third-party imports (torch, transformers) second
- Internal module imports last
- Use absolute imports for project modules

### Type Hints

- Optional type hints for function parameters
- Use the `typing` module for complex types

### Documentation

- Docstrings for classes and public methods
- Chinese comments are common in text-processing modules
- English comments in model-architecture code

### Naming Conventions

- `snake_case` for functions and variables
- `PascalCase` for classes
- Private methods prefixed with `_`

## Development Conventions

### Adding New Features

1. Maintain backward compatibility with IndexTTS1
2. Use OmegaConf for configuration management
3. Add appropriate warnings for experimental features
4. Update the example cases in `examples/cases.jsonl`

### Device Handling

Always support multiple device types:

- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- XPU (Intel GPUs)
- CPU (fallback)

Example pattern:

```python
if device is not None:
    self.device = device
elif torch.cuda.is_available():
    self.device = "cuda:0"
elif hasattr(torch, "mps") and torch.backends.mps.is_available():
    self.device = "mps"
else:
    self.device = "cpu"
```
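
The same fallback chain can be factored into a small standalone helper. The sketch below (a hypothetical `pick_device`, not project code) takes the availability checks as plain booleans so the priority logic is visible and testable without importing torch; where XPU sits relative to MPS is a project choice, shown here arbitrarily:

```python
def pick_device(override=None, cuda=False, xpu=False, mps=False):
    """Return a device string using the priority order above:
    an explicit override wins, then CUDA, then XPU, then MPS,
    with CPU as the universal fallback."""
    if override is not None:
        return override
    if cuda:
        return "cuda:0"
    if xpu:
        return "xpu"
    if mps:
        return "mps"
    return "cpu"

print(pick_device(cuda=True, mps=True))  # -> cuda:0
print(pick_device())                     # -> cpu
```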

### Memory Optimization

- Support FP16 inference for lower VRAM usage
- Implement KV-caching for GPT inference
- Use a `torch.no_grad()` context for inference
- Clear the CUDA cache when switching devices
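
To see why the KV-cache matters: a naive autoregressive decoder recomputes keys and values for the entire prefix at every step (1 + 2 + … + n computations for n tokens), while a cache computes exactly one new pair per step. A toy pure-Python sketch of the bookkeeping, with strings standing in for real key/value tensors:

```python
class KVCache:
    """Toy illustration of KV-cache bookkeeping: each step stores its
    key/value pair once, so decoding computes only the newest token's
    pair instead of reprocessing the whole prefix."""
    def __init__(self):
        self.keys, self.values = [], []
        self.computed_steps = 0  # counts how many k/v computations ran

    def step(self, token):
        # Without a cache, step t would recompute t+1 pairs;
        # with the cache we compute exactly one new pair per step.
        k, v = f"k({token})", f"v({token})"
        self.computed_steps += 1
        self.keys.append(k)
        self.values.append(v)
        return list(zip(self.keys, self.values))  # full context for attention

cache = KVCache()
for tok in ["a", "b", "c"]:
    ctx = cache.step(tok)
print(cache.computed_steps)  # 3 computations for 3 tokens (vs 6 uncached)
print(len(ctx))              # attention still sees all 3 positions
```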

## Security and Usage Restrictions

The project includes a **DISCLAIMER** file outlining usage restrictions:

- Do NOT synthesize the voices of political or other public figures without authorization
- Do NOT create content that defames, insults, or discriminates
- Do NOT use the system for fraud or identity theft
- Do NOT generate false information or incite social panic
- Do NOT use the system commercially without authorization
- Do NOT create inappropriate content involving minors

## Version History

- **IndexTTS2** (2025/09/08): Emotion control, duration control
- **IndexTTS1.5** (2025/05/14): Stability improvements, better English support
- **IndexTTS1.0** (2025/03/25): Initial release

## Useful Resources

- Paper (IndexTTS2): https://arxiv.org/abs/2506.21619
- HuggingFace Demo: https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
- ModelScope Demo: https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo
- GitHub: https://github.com/index-tts/index-tts

## Troubleshooting

### Common Issues

1. **CUDA errors**: Ensure CUDA Toolkit 12.8+ is installed
2. **Slow inference**: Enable `--fp16` for faster GPU inference
3. **Model download fails**: Users in China can set `HF_ENDPOINT="https://hf-mirror.com"`
4. **DeepSpeed fails on Windows**: Install only the `webui` extra (`uv sync --extra webui`) instead of `--all-extras`

### Debug Mode

Run with the `--verbose` flag, or set `verbose=True` in the Python API, for detailed logging.