
# TTSizer 🎙️✨

**Transform Raw Audio/Video into Production-Ready TTS Datasets**


Watch TTSizer in action: TTSizer Demo Video. (The demo showcases the AnimeVox Character TTS Corpus, a dataset created with TTSizer.)

## 🎯 What It Does

TTSizer automates the tedious process of creating high-quality Text-To-Speech datasets from raw media. Input a video or audio file, and get back perfectly aligned audio-text pairs for each speaker.

## ✨ Key Features

- 🎯 **End-to-End Automation:** From raw media files to cleaned, TTS-ready datasets
- 🗣️ **Advanced Multi-Speaker Diarization:** Handles complex audio with multiple speakers
- 🤖 **State-of-the-Art Models:** MelBandRoformer, Gemini, CTC aligner, WeSpeaker
- 🧐 **Quality Control:** Automatic outlier detection and flagging
- ⚙️ **Fully Configurable:** Control every aspect via `configs/config.yaml`

## 📊 Pipeline Flow

```mermaid
graph LR
    A[🎬 Raw Media] --> B[🎤 Extract Audio]
    B --> C[🔇 Vocal Separation]
    C --> D[🔊 Normalize Volume]
    D --> E[✍️ Speaker Diarization]
    E --> F[⏱️ Forced Alignment]
    F --> G[🧐 Outlier Detection]
    G --> H[🚩 ASR Validation]
    H --> I[✅ TTS Dataset]
```
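The diagram's stages correspond to named pipeline stages. A minimal illustrative sketch of the ordering (only `llm_diarize`, `ctc_align`, and `outlier_detect` are stage names confirmed by the configuration; the other identifiers here are assumptions, not TTSizer's actual code):

```python
# Stage order mirroring the pipeline diagram. "llm_diarize", "ctc_align"
# and "outlier_detect" are stage names taken from the config examples;
# the remaining names are illustrative placeholders.
STAGES = [
    "extract_audio",     # 🎤 pull the audio track from the source media
    "vocal_separation",  # 🔇 isolate vocals (MelBandRoformer)
    "normalize_volume",  # 🔊 loudness normalization
    "llm_diarize",       # ✍️ speaker diarization (Gemini)
    "ctc_align",         # ⏱️ forced alignment
    "outlier_detect",    # 🧐 flag suspicious clips
    "asr_validate",      # 🚩 cross-check transcripts with ASR
]

def describe(stages: list) -> str:
    """Render the stage order as an arrow chain."""
    return " -> ".join(stages)

print(describe(STAGES))
```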

πŸƒ Quick Start

### 1. Clone & Install

```bash
git clone https://github.com/taresh18/TTSizer.git
cd TTSizer
pip install -r requirements.txt
```

### 2. Setup Models & API Key

- Download pre-trained models (see Setup Guide)
- Add `GEMINI_API_KEY` to a `.env` file in the project root:

```
GEMINI_API_KEY="YOUR_API_KEY_HERE"
```
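As a stdlib-only sketch, the key could be loaded at runtime like this (the project may instead use a library such as python-dotenv; `load_env` is a hypothetical helper, not part of TTSizer):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines, '#' comments, quoted values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Don't clobber variables already set in the real environment.
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_env()
api_key = os.environ.get("GEMINI_API_KEY")
```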

### 3. Configure

Edit `configs/config.yaml`:

```yaml
project_setup:
  video_input_base_dir: "/path/to/your/videos"
  output_base_dir: "/path/to/output"
  target_speaker_labels: ["Speaker1", "Speaker2"]
```
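Once parsed (e.g. with PyYAML's `yaml.safe_load`), this section can be sanity-checked before the pipeline starts. A sketch of such a check; `validate_project_setup` is a hypothetical helper, not part of TTSizer's API:

```python
def validate_project_setup(cfg: dict) -> list:
    """Return a list of problems found in the project_setup section (sketch)."""
    problems = []
    setup = cfg.get("project_setup", {})
    for key in ("video_input_base_dir", "output_base_dir", "target_speaker_labels"):
        if key not in setup:
            problems.append(f"missing key: {key}")
    labels = setup.get("target_speaker_labels", [])
    if len(labels) != len(set(labels)):
        problems.append("duplicate speaker labels")
    return problems

cfg = {
    "project_setup": {
        "video_input_base_dir": "/path/to/your/videos",
        "output_base_dir": "/path/to/output",
        "target_speaker_labels": ["Speaker1", "Speaker2"],
    }
}
print(validate_project_setup(cfg))  # -> []
```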

### 4. Run TTSizer!

```bash
python -m ttsizer.main
```

πŸ› οΈ Setup & Installation


### Prerequisites

- Python 3.9+
- CUDA-enabled GPU (>4 GB VRAM)
- FFmpeg (must be installed and on your system's PATH)
- Google Gemini API key

### Manual Model Downloads

1. **Vocal Extraction:** Download `kimmel_unwa_ft2_bleedless.ckpt` from HuggingFace
2. **Speaker Embeddings:** Download from `wespeaker-voxceleb-resnet293-LM`

Then update the model paths in `configs/config.yaml`.

βš™οΈ Advanced Configuration


### Selective Stage Execution

You can control which parts of the pipeline run, which is useful for debugging or reprocessing:

```yaml
pipeline_control:
  run_only_stage: "ctc_align"    # Run a specific stage only
  start_stage: "llm_diarize"     # Start from a specific stage
  end_stage: "outlier_detect"    # Stop at a specific stage
```
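These three options resolve to a concrete list of stages to execute. A sketch of the selection logic, assuming stages run in the diagram's order (`select_stages` and the full `STAGES` list are illustrative, not TTSizer's actual code):

```python
from typing import List, Optional

# Assumed stage order; only "llm_diarize", "ctc_align" and
# "outlier_detect" are confirmed stage names from the config above.
STAGES = ["extract_audio", "vocal_separation", "normalize_volume",
          "llm_diarize", "ctc_align", "outlier_detect", "asr_validate"]

def select_stages(run_only: Optional[str] = None,
                  start: Optional[str] = None,
                  end: Optional[str] = None) -> List[str]:
    """Resolve pipeline_control options into the stages to execute."""
    if run_only:
        return [run_only]                    # run_only_stage wins outright
    i = STAGES.index(start) if start else 0
    j = STAGES.index(end) + 1 if end else len(STAGES)
    return STAGES[i:j]

print(select_stages(start="llm_diarize", end="outlier_detect"))
# -> ['llm_diarize', 'ctc_align', 'outlier_detect']
```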

πŸ—οΈ Project Structure

The project is organized as follows:

TTSizer/ β”œβ”€β”€ configs/ β”‚ └── config.yaml # Pipeline & model configurations β”œβ”€β”€ ttsizer/ β”‚ β”œβ”€β”€ __init__.py β”‚ β”œβ”€β”€ main.py # Main script to run the pipeline β”‚ │── core/ # Core components of the pipeline β”‚ β”œβ”€β”€ models/ # Vocal removal models β”‚ └── utils/ # Utility programs β”œβ”€β”€ .env # For API keys β”œβ”€β”€ README.md # This file β”œβ”€β”€ requirements.txt # Python package dependencies └── weights/ # For storing downloaded model weights (gitignored) 

## 📜 License

This project is released under the Apache License 2.0. See the LICENSE file for details.

