
# TTSizer 🎙️✨

**Transform Raw Audio/Video into Production-Ready TTS Datasets**


Watch TTSizer in action: TTSizer Demo Video. (The demo showcases the AnimeVox Character TTS Corpus, a dataset created with TTSizer.)

## 🎯 What It Does

TTSizer automates the tedious process of creating high-quality Text-To-Speech datasets from raw media. Input a video or audio file, and get back perfectly aligned audio-text pairs for each speaker.

## ✨ Key Features

- 🎯 **End-to-End Automation:** From raw media files to cleaned, TTS-ready datasets
- 🗣️ **Advanced Multi-Speaker Diarization:** Handles complex audio with multiple speakers
- 🤖 **State-of-the-Art Models:** MelBandRoformer, Gemini, CTC aligner, WeSpeaker
- 🧐 **Quality Control:** Automatic outlier detection and flagging
- ⚙️ **Fully Configurable:** Control every aspect via `configs/config.yaml`

## 📊 Pipeline Flow

```mermaid
graph LR
    A[🎬 Raw Media] --> B[🎤 Extract Audio]
    B --> C[🔇 Vocal Separation]
    C --> D[🔊 Normalize Volume]
    D --> E[✍️ Speaker Diarization]
    E --> F[⏱️ Forced Alignment]
    F --> G[🧐 Outlier Detection]
    G --> H[🚩 ASR Validation]
    H --> I[✅ TTS Dataset]
```
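The diagram's stages correspond to named pipeline stages. A minimal illustrative sketch of the ordering (only `llm_diarize`, `ctc_align`, and `outlier_detect` are stage names confirmed by the configuration; the other identifiers here are assumptions, not TTSizer's actual code):

```python
# Stage order mirroring the pipeline diagram. "llm_diarize", "ctc_align"
# and "outlier_detect" are stage names taken from the config examples;
# the remaining names are illustrative placeholders.
STAGES = [
    "extract_audio",     # 🎤 pull the audio track from the source media
    "vocal_separation",  # 🔇 isolate vocals (MelBandRoformer)
    "normalize_volume",  # 🔊 loudness normalization
    "llm_diarize",       # ✍️ speaker diarization (Gemini)
    "ctc_align",         # ⏱️ forced alignment
    "outlier_detect",    # 🧐 flag suspicious clips
    "asr_validate",      # 🚩 cross-check transcripts with ASR
]

def describe(stages: list) -> str:
    """Render the stage order as an arrow chain."""
    return " -> ".join(stages)

print(describe(STAGES))
```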

πŸƒ Quick Start

### 1. Clone & Install

```bash
git clone https://github.com/taresh18/TTSizer.git
cd TTSizer
pip install -r requirements.txt
```

### 2. Setup Models & API Key

- Download pre-trained models (see Setup Guide)
- Add `GEMINI_API_KEY` to a `.env` file in the project root:

```
GEMINI_API_KEY="YOUR_API_KEY_HERE"
```
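As a stdlib-only sketch, the key could be loaded at runtime like this (the project may instead use a library such as python-dotenv; `load_env` is a hypothetical helper, not part of TTSizer):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines, '#' comments, quoted values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Don't clobber variables already set in the real environment.
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_env()
api_key = os.environ.get("GEMINI_API_KEY")
```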

### 3. Configure

Edit `configs/config.yaml`:

```yaml
project_setup:
  video_input_base_dir: "/path/to/your/videos"
  output_base_dir: "/path/to/output"
  target_speaker_labels: ["Speaker1", "Speaker2"]
```
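Once parsed (e.g. with PyYAML's `yaml.safe_load`), this section can be sanity-checked before the pipeline starts. A sketch of such a check; `validate_project_setup` is a hypothetical helper, not part of TTSizer's API:

```python
def validate_project_setup(cfg: dict) -> list:
    """Return a list of problems found in the project_setup section (sketch)."""
    problems = []
    setup = cfg.get("project_setup", {})
    for key in ("video_input_base_dir", "output_base_dir", "target_speaker_labels"):
        if key not in setup:
            problems.append(f"missing key: {key}")
    labels = setup.get("target_speaker_labels", [])
    if len(labels) != len(set(labels)):
        problems.append("duplicate speaker labels")
    return problems

cfg = {
    "project_setup": {
        "video_input_base_dir": "/path/to/your/videos",
        "output_base_dir": "/path/to/output",
        "target_speaker_labels": ["Speaker1", "Speaker2"],
    }
}
print(validate_project_setup(cfg))  # -> []
```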

### 4. Run TTSizer!

```bash
python -m ttsizer.main
```

πŸ› οΈ Setup & Installation


### Prerequisites

- Python 3.9+
- CUDA-enabled GPU (>4 GB VRAM)
- FFmpeg (must be installed and on your system's PATH)
- Google Gemini API key

### Manual Model Downloads

1. **Vocal Extraction:** Download `kimmel_unwa_ft2_bleedless.ckpt` from HuggingFace
2. **Speaker Embeddings:** Download from `wespeaker-voxceleb-resnet293-LM`

Then update the model paths in `configs/config.yaml`.

βš™οΈ Advanced Configuration


### Selective Stage Execution

You can control which parts of the pipeline run, which is useful for debugging or reprocessing:

```yaml
pipeline_control:
  run_only_stage: "ctc_align"    # Run a specific stage only
  start_stage: "llm_diarize"     # Start from a specific stage
  end_stage: "outlier_detect"    # Stop at a specific stage
```
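These three options resolve to a concrete list of stages to execute. A sketch of the selection logic, assuming stages run in the diagram's order (`select_stages` and the full `STAGES` list are illustrative, not TTSizer's actual code):

```python
from typing import List, Optional

# Assumed stage order; only "llm_diarize", "ctc_align" and
# "outlier_detect" are confirmed stage names from the config above.
STAGES = ["extract_audio", "vocal_separation", "normalize_volume",
          "llm_diarize", "ctc_align", "outlier_detect", "asr_validate"]

def select_stages(run_only: Optional[str] = None,
                  start: Optional[str] = None,
                  end: Optional[str] = None) -> List[str]:
    """Resolve pipeline_control options into the stages to execute."""
    if run_only:
        return [run_only]                    # run_only_stage wins outright
    i = STAGES.index(start) if start else 0
    j = STAGES.index(end) + 1 if end else len(STAGES)
    return STAGES[i:j]

print(select_stages(start="llm_diarize", end="outlier_detect"))
# -> ['llm_diarize', 'ctc_align', 'outlier_detect']
```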

πŸ—οΈ Project Structure

The project is organized as follows:

TTSizer/ β”œβ”€β”€ configs/ β”‚ └── config.yaml # Pipeline & model configurations β”œβ”€β”€ ttsizer/ β”‚ β”œβ”€β”€ __init__.py β”‚ β”œβ”€β”€ main.py # Main script to run the pipeline β”‚ │── core/ # Core components of the pipeline β”‚ β”œβ”€β”€ models/ # Vocal removal models β”‚ └── utils/ # Utility programs β”œβ”€β”€ .env # For API keys β”œβ”€β”€ README.md # This file β”œβ”€β”€ requirements.txt # Python package dependencies └── weights/ # For storing downloaded model weights (gitignored) 

## 📜 License

This project is released under the Apache License 2.0. See the LICENSE file for details.

