# IndexTTS - Agent's Guide
## Project Overview
**IndexTTS** is an industrial-level, emotionally expressive and duration-controlled auto-regressive zero-shot Text-to-Speech (TTS) system developed by the Bilibili Index Team. The project enables high-quality voice cloning and emotional speech synthesis from a single reference audio file.
### Key Features
- **Zero-shot voice cloning**: Clone any voice from a single audio prompt
- **Emotion control**: Independent control over timbre and emotion through multiple input modalities
- **Duration control**: First autoregressive TTS model with precise synthesis duration control
- **Multilingual support**: Supports both Chinese and English with mixed modeling
- **Pinyin support**: Fine-grained pronunciation control via Pinyin annotations
### Project Structure
```
index-tts/
├── indextts/              # Main Python package
│   ├── accel/             # Acceleration engine for GPT2 optimization
│   ├── BigVGAN/           # BigVGAN vocoder implementation
│   ├── gpt/               # GPT-based speech language model
│   │   ├── conformer/     # Conformer encoder components
│   │   ├── model.py       # IndexTTS1 model (UnifiedVoice)
│   │   └── model_v2.py    # IndexTTS2 model with emotion support
│   ├── s2mel/             # Semantic to mel-spectrogram module
│   │   ├── modules/       # Neural network modules
│   │   │   ├── bigvgan/   # BigVGAN vocoder
│   │   │   ├── campplus/  # Speaker encoder
│   │   │   └── ...
│   │   └── dac/           # DAC (Digital Audio Codec) utilities
│   ├── utils/             # Utility functions
│   │   ├── maskgct/       # MaskGCT codec models
│   │   ├── front.py       # Text normalization and tokenization
│   │   └── checkpoint.py  # Model checkpoint loading
│   ├── vqvae/             # VQ-VAE for audio tokenization
│   ├── cli.py             # Command-line interface
│   ├── infer.py           # IndexTTS1 inference (legacy)
│   └── infer_v2.py        # IndexTTS2 inference with emotion support
├── checkpoints/           # Model weights directory (downloaded separately)
├── models/                # Auto-downloaded auxiliary models
├── examples/              # Sample audio prompts
├── tests/                 # Test scripts
├── tools/                 # Utility tools
│   ├── gpu_check.py       # GPU diagnostics tool
│   └── i18n/              # Internationalization utilities
├── webui.py               # Gradio-based Web UI
├── pyproject.toml         # Python package configuration
└── uv.lock                # Locked dependency versions
```
## Technology Stack
### Core Framework
- **PyTorch 2.10+**: Deep learning framework with CUDA support
- **Transformers 4.52+**: Hugging Face transformers for GPT2 architecture
- **DeepSpeed 0.17.1** (optional): Inference acceleration
### Audio Processing
- **torchaudio**: Audio I/O and transformations
- **librosa**: Audio analysis and feature extraction
- **soundfile**: Audio file reading/writing
- **BigVGAN**: Neural vocoder for high-quality audio generation
### Text Processing
- **sentencepiece**: Text tokenization (BPE model)
- **jieba**: Chinese text segmentation
- **g2p-en**: English grapheme-to-phoneme conversion
- **wetext/WeTextProcessing**: Text normalization
### Key Dependencies
- **OmegaConf**: YAML configuration management
- **safetensors**: Safe tensor serialization
- **modelscope**: Model hub for downloading Chinese models
- **huggingface-hub**: Model hub for international models
- **gradio 5.45+**: Web UI framework (optional)
## Build and Development
### Environment Setup
The project uses **uv** as the only supported package manager:
```bash
# Install uv
pip install -U uv
# Install dependencies (creates .venv automatically)
uv sync --all-extras
# For China users (use local mirror)
uv sync --all-extras --default-index "https://mirrors.aliyun.com/pypi/simple"
```
### Available Extra Features
- `--extra webui`: Gradio WebUI support
- `--extra deepspeed`: DeepSpeed inference acceleration
- `--all-extras`: Install all optional features
### Model Download
Download models from HuggingFace:
```bash
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
```
Or from ModelScope:
```bash
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
```
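The download can also be scripted. A minimal sketch using `huggingface_hub.snapshot_download` (the repo id matches the commands above; verify it against the current model card):

```python
# Sketch of a programmatic alternative to the CLI downloads above,
# assuming huggingface-hub is installed in the environment.
def download_checkpoints(local_dir: str = "checkpoints") -> str:
    from huggingface_hub import snapshot_download  # deferred import
    return snapshot_download(repo_id="IndexTeam/IndexTTS-2", local_dir=local_dir)
```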
### Running the Application
**Web UI:**
```bash
uv run webui.py
# Options:
uv run webui.py --fp16 --deepspeed --cuda_kernel --port 7860
```
**CLI (IndexTTS1 only):**
```bash
uv run indextts "Text to synthesize" -v examples/voice_01.wav -o output.wav
```
**Python API:**
```bash
# IndexTTS2
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer_v2.py
# IndexTTS1 (legacy)
PYTHONPATH="$PYTHONPATH:." uv run indextts/infer.py
```
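The entry scripts above run built-in demo code; for scripted use, the `IndexTTS2` class can also be called directly. A minimal sketch, assuming models are in `checkpoints/`; the constructor and `infer()` parameter names shown here are assumptions to check against the current release:

```python
# Hedged sketch: calling IndexTTS2 from Python. The import is deferred so
# the helper can be defined without the indextts package installed;
# cfg_path / model_dir / spk_audio_prompt are assumed parameter names.
def synthesize(text: str, voice_wav: str, out_path: str = "gen.wav") -> None:
    from indextts.infer_v2 import IndexTTS2  # requires downloaded checkpoints
    tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")
    tts.infer(spk_audio_prompt=voice_wav, text=text, output_path=out_path)

# Example call (after `uv sync` and model download):
# synthesize("Hello from IndexTTS2.", "examples/voice_01.wav")
```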
### GPU Check
```bash
uv run tools/gpu_check.py
```
## Testing
### Test Files
- `tests/regression_test.py`: Regression tests for TTS inference
- `tests/padding_test.py`: Tests text token padding behavior
### Running Tests
```bash
# Regression test
uv run tests/regression_test.py
# Padding test
uv run tests/padding_test.py checkpoints
```
## Code Organization
### Main Inference Classes
**IndexTTS2** (`indextts/infer_v2.py`):
- Main class for IndexTTS2 inference with emotion support
- Supports speaker prompt, emotion prompt, emotion vectors, and text-based emotion control
- Key methods:
- `infer()`: Main inference with multiple emotion control modes
- `normalize_emo_vec()`: Normalize emotion vectors
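To picture what emotion-vector normalization involves, here is a hypothetical re-implementation (not the code in `infer_v2.py`; the real method, the cap value, and the vector's dimensionality may differ):

```python
def normalize_emo_vec(vec, cap=0.8):
    # Hypothetical sketch: rescale an emotion vector so its total
    # intensity does not exceed `cap`, leaving weak vectors untouched.
    total = sum(vec)
    if total > cap:
        vec = [v * cap / total for v in vec]
    return vec
```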
**IndexTTS** (`indextts/infer.py`):
- Legacy IndexTTS1 inference class
- Basic zero-shot voice cloning without emotion control
### Model Components
**GPT Model** (`indextts/gpt/model_v2.py`):
- `UnifiedVoice`: Main GPT-based speech language model
- `GPT2InferenceModel`: Inference wrapper with KV-cache support
- Uses Conformer encoder for audio conditioning
- Perceiver resampler for emotion conditioning
**Semantic Codec** (`indextts/utils/maskgct_utils.py`):
- Encodes audio into semantic tokens
- Uses Wav2Vec-BERT 2.0 for feature extraction
**S2MEL Module** (`indextts/s2mel/`):
- Converts semantic tokens to mel-spectrogram
- Flow-matching based diffusion transformer
**BigVGAN** (`indextts/BigVGAN/` and `indextts/s2mel/modules/bigvgan/`):
- Neural vocoder for final audio generation
- Custom CUDA kernel support for acceleration
### Text Processing
**TextNormalizer** (`indextts/utils/front.py`):
- Chinese and English text normalization
- Pinyin support for pronunciation control
- Term glossary for custom pronunciations
- Email, name, and technical term patterns
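The term-glossary behaviour can be pictured with a toy sketch (hypothetical; the actual `TextNormalizer` in `indextts/utils/front.py` uses richer pattern matching):

```python
def apply_glossary(text: str, glossary: dict) -> str:
    # Toy illustration: replace glossary terms with their custom
    # pronunciations before tokenization. Longest terms first so that
    # overlapping entries do not clobber each other.
    for term in sorted(glossary, key=len, reverse=True):
        text = text.replace(term, glossary[term])
    return text
```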
**TextTokenizer**:
- SentencePiece BPE tokenization
- Vocabulary of 12,000 text tokens
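Loading the tokenizer directly can be sketched as follows (the `bpe.model` filename under `checkpoints/` is an assumption; check the downloaded files):

```python
def load_text_tokenizer(model_path: str = "checkpoints/bpe.model"):
    # Deferred import so this sketch parses without sentencepiece installed.
    import sentencepiece as spm
    return spm.SentencePieceProcessor(model_file=model_path)
```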
## Configuration
### Main Config File (`checkpoints/config.yaml`)
Key configuration sections:
- `dataset`: Audio sampling parameters (24kHz, mel-spectrogram settings)
- `gpt`: GPT model architecture (1280 dim, 24 layers, 20 heads)
- `semantic_codec`: Semantic codec parameters
- `s2mel`: S2MEL module configuration (DiT architecture)
- Model checkpoint paths
### Important Paths
- `gpt_checkpoint`: GPT model weights (`gpt.pth`)
- `s2mel_checkpoint`: S2MEL model weights (`s2mel.pth`)
- `w2v_stat`: Wav2Vec statistics (`wav2vec2bert_stats.pt`)
- `qwen_emo_path`: Qwen emotion model path
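Put together, the checkpoint-path keys above look roughly like this in `checkpoints/config.yaml` (an illustrative fragment only; the shipped file has additional sections and may nest these keys differently):

```yaml
# Illustrative fragment -- consult the downloaded config.yaml
gpt_checkpoint: gpt.pth
s2mel_checkpoint: s2mel.pth
w2v_stat: wav2vec2bert_stats.pt
qwen_emo_path: <path to the Qwen emotion model>
```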
## Code Style Guidelines
### Import Conventions
- Standard library imports first
- Third-party imports (torch, transformers) second
- Internal module imports last
- Use absolute imports for project modules
### Type Hints
- Optional type hints for function parameters
- Use `typing` module for complex types
### Documentation
- Docstrings for classes and public methods
- Chinese comments common in text processing modules
- English comments in model architecture code
### Naming Conventions
- `snake_case` for functions and variables
- `PascalCase` for classes
- Private methods prefixed with `_`
## Development Conventions
### Adding New Features
1. Maintain backward compatibility with IndexTTS1
2. Use OmegaConf for configuration management
3. Add appropriate warnings for experimental features
4. Update example cases in `examples/cases.jsonl`
### Device Handling
Always support multiple device types:
- CUDA (NVIDIA GPUs)
- MPS (Apple Silicon)
- XPU (Intel GPUs)
- CPU (fallback)
Example pattern:
```python
if device is not None:
    self.device = device
elif torch.cuda.is_available():
    self.device = "cuda:0"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    self.device = "xpu"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    self.device = "mps"
else:
    self.device = "cpu"
```
### Memory Optimization
- Support FP16 inference for lower VRAM usage
- Implement KV-cache for GPT inference
- Use `torch.no_grad()` context for inference
- Clear CUDA cache when switching devices
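These optimizations can be combined in a single wrapper. A sketch under the assumption of a CUDA device (torch is imported lazily so the snippet parses without it; the function and parameter names are illustrative, not project API):

```python
import contextlib

def infer_with_memory_opts(model, inputs, device: str = "cuda:0", fp16: bool = True):
    # Sketch of the memory optimizations above: no_grad plus optional
    # fp16 autocast, and a cache clear afterwards.
    import torch  # deferred import
    autocast = (
        torch.autocast(device_type="cuda", dtype=torch.float16)
        if fp16 and device.startswith("cuda")
        else contextlib.nullcontext()
    )
    with torch.no_grad(), autocast:
        out = model(inputs)
    if device.startswith("cuda"):
        torch.cuda.empty_cache()  # e.g. before switching devices
    return out
```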
## Security and Usage Restrictions
The project includes a **DISCLAIMER** file outlining usage restrictions:
- Do NOT synthesize voices of political figures or public figures without authorization
- Do NOT create content that defames, insults, or discriminates
- Do NOT use for fraud or identity theft
- Do NOT generate false information or social panic
- Do NOT use for commercial purposes without authorization
- Do NOT create inappropriate content involving minors
## Version History
- **IndexTTS2** (2025/09/08): Emotion control, duration control
- **IndexTTS1.5** (2025/05/14): Stability improvements, better English
- **IndexTTS1.0** (2025/03/25): Initial release
## Useful Resources
- Paper (IndexTTS2): https://arxiv.org/abs/2506.21619
- HuggingFace Demo: https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
- ModelScope Demo: https://modelscope.cn/studios/IndexTeam/IndexTTS-2-Demo
- GitHub: https://github.com/index-tts/index-tts
## Troubleshooting
### Common Issues
1. **CUDA errors**: Ensure CUDA Toolkit 12.8+ is installed
2. **Slow inference**: Enable `--fp16` for faster GPU inference
3. **Model download fails**: Set `HF_ENDPOINT="https://hf-mirror.com"` for China users
4. **DeepSpeed fails on Windows**: install without it, e.g. `uv sync --extra webui` instead of `--all-extras`
### Debug Mode
Run with `--verbose` flag or set `verbose=True` in Python API for detailed logging.