
First commit of youtube transcriber #2

Open · wants to merge 24 commits into main

Conversation

juancopi81 (Collaborator)

I started to work on the YouTube transcriber using Whisper. I added the DTOs, a JSON serializer, a LoaderIterator, and started on the transcriber class.

@juancopi81 juancopi81 changed the title First commit of youtube transcriber [WIP] First commit of youtube transcriber Oct 20, 2022
@juancopi81 juancopi81 changed the title [WIP] First commit of youtube transcriber First commit of youtube transcriber Oct 25, 2022
@juancopi81 (Collaborator, Author)

Hi @matallanas, I think it is now in good shape for a first merge, and we can keep improving it from there with GitHub issues!

@matallanas (Owner) left a comment

Very good job. It is a bit more than we need, but it looks really good. I will try to condense the code a bit and compact it more. I also suggested modifying the download process to use another library that makes it easier.

from youtube_transcriber.datapipeline import create_hardcoded_data_pipeline
from youtube_transcriber.threadeddatapipeline import ThreadedDataPipeline

NUM_THREADS = 2
@matallanas (Owner) · Nov 1, 2022

We could expose this as a parameter so users can increase the number of threads. I was also thinking it could be a problem if each thread has to load its own Whisper model: that is going to take a lot of memory. What do you think?
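One way around the memory concern is to load the model once and share it across the worker threads, serializing inference with a lock. A minimal sketch of that pattern follows; `SharedModel` is a stand-in for `whisper.load_model("base")`, not code from this PR:

```python
import threading
from queue import Queue

# Stand-in for whisper.load_model("base"): the point is that it is created
# once and shared, instead of one copy per thread multiplying memory use.
class SharedModel:
    def transcribe(self, audio_path: str) -> dict:
        return {"text": f"transcript of {audio_path}"}

model = SharedModel()          # loaded once, shared by every worker
model_lock = threading.Lock()  # serialize inference: one pass at a time
jobs: Queue = Queue()
results = {}

def worker() -> None:
    while True:
        path = jobs.get()
        if path is None:       # sentinel: no more work for this thread
            break
        with model_lock:
            results[path] = model.transcribe(path)["text"]

num_threads = 2  # could become the CLI parameter discussed above
threads = [threading.Thread(target=worker) for _ in range(num_threads)]
for t in threads:
    t.start()
for path in ["a.mp3", "b.mp3", "c.mp3"]:
    jobs.put(path)
for _ in threads:
    jobs.put(None)             # one sentinel per worker
for t in threads:
    t.join()
```

The lock trades parallel inference for bounded memory; if the model itself were thread-safe and memory allowed, per-thread models would parallelize better.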


NUM_THREADS = 2

def parse_args():
@matallanas (Owner)

Here we could use click as a library to generate the CLI.
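A click-based replacement for `parse_args()` could look like the sketch below; the option names (`--channel-name`, `--num-threads`, `--whisper-model`) are illustrative, not the script's actual flags:

```python
import click

# Hypothetical click CLI modeled on the script's parameters.
@click.command()
@click.option("--channel-name", required=True,
              help="YouTube channel to transcribe.")
@click.option("--num-threads", default=2, show_default=True,
              help="Number of worker threads.")
@click.option("--whisper-model", default="base", show_default=True,
              type=click.Choice(["tiny", "base", "small", "medium", "large"]),
              help="Whisper model size.")
def main(channel_name: str, num_threads: int, whisper_model: str) -> None:
    """Transcribe the videos of a YouTube channel with Whisper."""
    click.echo(f"channel={channel_name} threads={num_threads} model={whisper_model}")
```

click generates `--help`, validates the model choice, and removes the hand-written argument parsing.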

for thread in threads:
    thread.join()

# Fetch entries and print them
@matallanas (Owner)

About the database: how about making it optional, so people with low resources can decide not to create it? It is better to keep it simple.
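Making the database optional could be as simple as a nullable path and no-op writes. This is a hypothetical helper (not in the PR), using the `VIDEO` columns from the snippet below:

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical helper: only create the database when a path is given.
def maybe_create_db(db_path: Optional[str]) -> Optional[sqlite3.Connection]:
    if db_path is None:
        return None  # low-resource users skip the database entirely
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS VIDEO (
               CHANNEL_NAME TEXT, URL TEXT PRIMARY KEY, TITLE TEXT,
               DESCRIPTION TEXT, TRANSCRIPTION TEXT, SEGMENTS TEXT)"""
    )
    return conn

def store_video(conn: Optional[sqlite3.Connection], row: Tuple) -> None:
    if conn is None:
        return  # no-op when the database was not requested
    conn.execute("INSERT OR REPLACE INTO VIDEO VALUES (?, ?, ?, ?, ?, ?)", row)
    conn.commit()
```

The rest of the pipeline then calls `store_video` unconditionally and never needs to know whether a database exists.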

hf_dataset_identifier = "Whispering-GPT/test_whisper"

dataset = Dataset.from_sql("SELECT CHANNEL_NAME, URL, TITLE, DESCRIPTION, TRANSCRIPTION, SEGMENTS FROM VIDEO", connection)

@matallanas (Owner)

Before pushing the dataset, we need to check that the current dataset does not already contain the processed videos. This could also be done as a previous step: check whether the video has already been processed, and only transcribe or translate it if not.
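The pre-check could be a simple set difference on URLs. In the real pipeline `existing_urls` would be the URL column of the hub dataset (e.g. loaded with the `datasets` library); here it is just an iterable of strings, and `filter_unprocessed` is a hypothetical helper:

```python
from typing import Iterable, List

# Hypothetical helper: keep only the videos not yet in the published dataset.
def filter_unprocessed(candidate_urls: Iterable[str],
                       existing_urls: Iterable[str]) -> List[str]:
    seen = set(existing_urls)  # O(1) membership checks
    return [url for url in candidate_urls if url not in seen]
```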

create_db("test.db")

# Create necessary resources
yt_video_processor = YoutubeVideoPreprocessor(mode="channel_name",
@matallanas (Owner)

Maybe check whether a video has already been processed before processing everything. It is a good check to avoid redoing the entire job.

Comment on lines +1 to +4
class DifferentNumberOfArgumentsError(Exception):

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message
@matallanas (Owner)

This class doesn't add anything, so maybe we can delete it.

Comment on lines +1 to +46
from pathlib import Path
from typing import List, Dict, Optional

from youtube_transcriber.loading.serialization import Serializer

class LoaderIterator:
    """Iterator that loads data from multiple files in batches."""

    def __init__(self,
                 serializer: Serializer,
                 num_files_per_iteration: int,
                 load_paths: Optional[List[Path]] = None) -> None:
        self.serializer = serializer
        self.num_files_per_iteration = num_files_per_iteration
        self._load_paths = load_paths
        # Reset in __iter__; starting at 0 avoids a TypeError if __next__
        # is called before __iter__.
        self._current_iteration = 0

    @property
    def load_paths(self) -> Optional[List[Path]]:
        return self._load_paths

    @load_paths.setter
    def load_paths(self, load_paths: List[Path]) -> None:
        self._load_paths = load_paths

    def __iter__(self):
        self._current_iteration = 0
        return self

    def __next__(self) -> List[Dict]:
        if self._did_load_all_batches():
            raise StopIteration
        data_batch = self._load_data_batch()
        self._current_iteration += 1
        return data_batch

    def _did_load_all_batches(self) -> bool:
        return self._current_iteration >= len(self._load_paths) / self.num_files_per_iteration

    def _load_data_batch(self) -> List[Dict]:
        start_index = self._current_iteration * self.num_files_per_iteration
        stop_index = start_index + self.num_files_per_iteration
        return [self.serializer.load(load_path) for load_path in
                self._load_paths[start_index:stop_index] if load_path.exists()]
@matallanas (Owner)

Is this class really necessary? As I understand it, it loads all the files and then processes them in the data pipeline, so maybe we can make it simpler. Maybe there is something I am missing here.
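If the class can be dropped, a plain generator would yield the same batches. This sketch assumes only the PR's `Serializer` protocol (an object with a `.load(path)` method):

```python
from pathlib import Path
from typing import Dict, Iterator, List, Sequence

# Sketch: generator equivalent of LoaderIterator.
def load_batches(serializer,
                 load_paths: Sequence[Path],
                 num_files_per_iteration: int) -> Iterator[List[Dict]]:
    for start in range(0, len(load_paths), num_files_per_iteration):
        batch = load_paths[start:start + num_files_per_iteration]
        # Skip files that disappeared between listing and loading.
        yield [serializer.load(p) for p in batch if p.exists()]
```

It keeps the batching and the existence check, but drops the iterator-state bookkeeping entirely.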


from youtube_transcriber.datapipeline import DataPipeline

logging.basicConfig(level=logging.INFO, format="(%(threadName)-5s) %(message)s")
@matallanas (Owner)

Maybe this could be initialized at the beginning of the main program.

Comment on lines +1 to +31
from typing import Any

from pytube import YouTube

from youtube_transcriber.video import YoutubeVideo
from youtube_transcriber.utils import accepts_types
from youtube_transcriber.transforming.transform import Transform

class AddTitleTransform(Transform):
    """
    Transform a Video object using PyTube. Adds title to YouTube video DTO.
    It's a concrete Transform.
    """

    @accepts_types(YoutubeVideo)
    def apply(self, video: YoutubeVideo) -> YoutubeVideo:
        yt = YouTube(video.url)

        video_with_title_params = {
            "channel_name": video.channel_name,
            "url": video.url,
            "title": self._get_video_title(yt),
            "description": video.description,
            "transcription": video.transcription,
            "segments": video.segments,
        }

        return YoutubeVideo(**video_with_title_params)

    def _get_video_title(self, yt: Any) -> str:
        return str(yt.title)
@matallanas (Owner)

Can we merge all the transformations (description, title, etc.) so we can simplify the code?
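A merged transform could fetch title and description from a single pytube object. This is a hypothetical sketch, not the PR's code: plain dicts stand in for the `YoutubeVideo` DTO, and `youtube_factory` (which would default to `pytube.YouTube` in practice) is injectable so the sketch runs without the network:

```python
from typing import Any, Callable, Dict

# Hypothetical merged transform replacing AddTitleTransform plus the
# description transform: one metadata fetch per video.
class AddMetadataTransform:
    def __init__(self, youtube_factory: Callable[[str], Any]) -> None:
        self.youtube_factory = youtube_factory

    def apply(self, video: Dict) -> Dict:
        yt = self.youtube_factory(video["url"])  # single pytube object
        return {**video,
                "title": str(yt.title),
                "description": str(yt.description)}
```

Besides less code, this halves the metadata requests, since each per-field transform currently constructs its own `YouTube` object.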

Comment on lines +1 to +66
import os
from typing import Any
from collections import OrderedDict

from pytube import YouTube
import whisper

from youtube_transcriber.transforming.transform import Transform
from youtube_transcriber.video import YoutubeVideo
from youtube_transcriber.utils import accepts_types

class WhisperTransform(Transform):
    """
    Transform a Video object using a Whisper model. It's a concrete Transform.

    Args:
        model (`str`):
            Size of Whisper model. Can be tiny, base (default), small, medium, or large.
        without_timestamps (`bool`, defaults to `False`):
            Whether to omit phrase-level timestamps.
    """

    def __init__(self, model: str = "base", without_timestamps: bool = False) -> None:
        self.model = whisper.load_model(model)
        self.without_timestamps = without_timestamps

    @accepts_types(YoutubeVideo)
    def apply(self, video: YoutubeVideo) -> YoutubeVideo:
        """Creates a new video with transcriptions created by Whisper."""
        # Create a YouTube object
        yt = YouTube(video.url)

        # Get audio from video; re-raise so a failed download is not
        # silently followed by transcribing an undefined file
        try:
            audio_file = self._get_audio_from_video(yt)
        except Exception as e:
            print(f"Exception: {e}")
            raise

        result = self.model.transcribe(audio_file,
                                       without_timestamps=self.without_timestamps)
        transcription = result["text"]

        data = []
        for seg in result["segments"]:
            data.append(OrderedDict({"start": seg["start"],
                                     "end": seg["end"],
                                     "text": seg["text"]}))

        os.remove(audio_file)

        return YoutubeVideo(channel_name=video.channel_name,
                            url=video.url,
                            title=video.title,
                            description=video.description,
                            transcription=transcription,
                            segments=data)

    def _get_audio_from_video(self, yt: Any) -> str:
        # TODO: Add credits
        video = yt.streams.filter(only_audio=True).first()
        out_file = video.download(output_path=".")
        base, _ = os.path.splitext(out_file)
        new_file = base + ".mp3"
        os.rename(out_file, new_file)
        return new_file
@matallanas (Owner)

This transformation is enough: with it we can create the transcripts and then send them to the Hub. How do you see it?

…ideo. Add parameter so users can select which whisper model to use
Labels: None yet
Projects: None yet
2 participants