Yrom 35b6514ee5
Enhance text normalization and tokenization
- Introduced `de_tokenized_by_CJK_char` for restoring original text from tokenized format.
- Added `TextTokenizer` class for improved tokenization, including sentence splitting and handling of special tokens.
- Enhanced `TextNormalizer` to handle names and pinyin tones with placeholder mechanisms.
- Added regression tests for new features in `regression_test.py`.
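
The round trip that `de_tokenized_by_CJK_char` is described as providing can be sketched as follows. This is a minimal illustration, not the code from this commit: the function names `tokenize_by_cjk_char_sketch` and `de_tokenize_by_cjk_char_sketch` are hypothetical, and the character range is assumed to be the basic CJK Unified Ideographs block only. The sketch also cannot recover whether a space originally separated a CJK character from a Latin word, so it always keeps one.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as "CJK" here.
CJK_CHAR = r"[\u4e00-\u9fff]"


def tokenize_by_cjk_char_sketch(text: str) -> str:
    """Separate every CJK character with spaces; leave non-CJK runs intact."""
    # The capturing group makes re.split keep each CJK character as its own piece.
    parts = re.split(f"({CJK_CHAR})", text.strip())
    return " ".join(p.strip() for p in parts if p.strip())


def de_tokenize_by_cjk_char_sketch(text: str) -> str:
    """Remove spaces that sit between two CJK characters.

    Spaces next to non-CJK text (for example English words) are kept.
    """
    return re.sub(f"(?<={CJK_CHAR}) +(?={CJK_CHAR})", "", text)


if __name__ == "__main__":
    original = "你好世界是 hello world 的中文"
    tokenized = tokenize_by_cjk_char_sketch(original)
    restored = de_tokenize_by_cjk_char_sketch(tokenized)
    print(tokenized)
    print(restored)
```

A regression test for this kind of helper would typically assert that restoring a tokenized string returns the input for text that already has spaces at CJK/Latin boundaries.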
2025-04-24 20:28:44 +08:00
2025-04-18 18:09:13 +08:00