ASR pipeline for VAD, chunking, and transcription of Indian languages.
This pipeline processes audio files through a series of stages:
- Voice Activity Detection (VAD): Detects regions of speech, removes silence, and splits the audio at those regions.
- Audio Chunking: Splits audio into smaller chunks.
- Transcription with Forced Alignment: Transcribes audio, aligns each word with its timestamps, and writes the result to a JSON file.
- Speaker Diarization: Identifies unique speakers in the audio using speaker embeddings and a cosine-similarity threshold.
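
The diarization stage can be illustrated with a minimal sketch. It assumes each speech segment has already been turned into a fixed-length embedding, and it uses a simple greedy assignment against running speaker centroids. The function names, the 0.75 threshold, and the toy 2-D vectors are all hypothetical; the actual script may cluster differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_speakers(embeddings, threshold=0.75):
    """Greedily label each embedding with a speaker index.

    An embedding joins the most similar existing speaker if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new speaker.
    """
    centroids = []  # running mean embedding per speaker
    counts = []     # number of segments merged into each centroid
    labels = []
    for emb in embeddings:
        best_idx, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            n = counts[best_idx]
            # Update the running mean of the matched speaker.
            centroids[best_idx] = [
                (c * n + e) / (n + 1) for c, e in zip(centroids[best_idx], emb)
            ]
            counts[best_idx] = n + 1
            labels.append(best_idx)
        else:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy 2-D "embeddings"; real speaker embeddings have far more dimensions.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = assign_speakers(toy_embeddings, threshold=0.75)
print(labels)  # [0, 0, 1, 1]
```

A higher threshold tends to split one speaker into several labels, while a lower one tends to merge distinct speakers, so the value usually needs tuning per embedding model.
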
Run the script asr_pipeline.py to process audio files step by step: when prompted, enter the number of the stage to run. Enter 0 to exit.
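
The prompt loop described above might look roughly like the following sketch. The stage numbering (1 to 4), the `run_menu` function, and the placeholder handlers are assumptions made for illustration and are not taken from asr_pipeline.py. The demo feeds scripted answers instead of reading the keyboard so that it runs non-interactively.

```python
def make_placeholder(name):
    """Return a stand-in handler; the real stages would process audio files."""
    def handler():
        print(f"[placeholder] would run: {name}")
    return handler


# Assumed numbering; the actual script may order or number stages differently.
STAGES = {
    1: ("Voice Activity Detection", make_placeholder("VAD")),
    2: ("Audio Chunking", make_placeholder("chunking")),
    3: ("Transcription with Forced Alignment", make_placeholder("transcription")),
    4: ("Speaker Diarization", make_placeholder("diarization")),
}


def run_menu(stages, read_choice=input, write=print):
    """Prompt for a stage number until 0 is entered; return the stages run."""
    executed = []
    while True:
        for number, (name, _handler) in sorted(stages.items()):
            write(f"{number}. {name}")
        raw = read_choice("Enter stage number (0 to exit): ").strip()
        try:
            choice = int(raw)
        except ValueError:
            write(f"Not a number: {raw!r}")
            continue
        if choice == 0:
            break
        if choice not in stages:
            write(f"Unknown stage: {choice}")
            continue
        _name, handler = stages[choice]
        handler()
        executed.append(choice)
    return executed


# Scripted demo: run stage 1, then stage 3, then exit.
answers = iter(["1", "3", "0"])
executed = run_menu(STAGES, read_choice=lambda prompt: next(answers))
print(executed)  # [1, 3]
```

Passing `read_choice` and `write` as parameters keeps the loop testable; calling `run_menu(STAGES)` with the defaults would prompt on the terminal instead.
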