First commit of youtube transcriber #2
base: main
Conversation
… to just using YouTubeVideo. This way it is easier to add more features
Hi @matallanas, I think it is now in good shape for a first merge, and we can start working from there with GitHub issues to improve it!
Very good job. It is a bit more than we need, but it seems like really good work. I will try to condense the code and compact it a bit more. I also suggested modifying the download process to use another library that makes it easier.
from youtube_transcriber.datapipeline import create_hardcoded_data_pipeline
from youtube_transcriber.threadeddatapipeline import ThreadedDataPipeline

NUM_THREADS = 2
We could make this a parameter so users can increase the number of threads. I was also thinking it could be a problem if each thread needs to load its own Whisper model: that is going to take a lot of memory. What do you think?
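One way to avoid the per-thread memory cost raised above is to load the model once and share it across workers, serializing inference with a lock. This is only a sketch under that assumption (the `SharedModel` name and `load_fn` parameter are illustrative, not part of the PR):

```python
import threading

class SharedModel:
    """Share one model instance across worker threads instead of loading
    a copy per thread; a lock serializes calls, since the underlying
    inference is not guaranteed to be thread-safe."""

    def __init__(self, load_fn):
        self._model = load_fn()          # loaded once, not per thread
        self._lock = threading.Lock()

    def transcribe(self, audio_path, **kwargs):
        with self._lock:
            return self._model.transcribe(audio_path, **kwargs)
```

The trade-off is that a single lock removes transcription parallelism; the threads would still overlap on download and I/O work.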
NUM_THREADS = 2

def parse_args():
Here maybe we can use click as a library to generate the CLI.
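A click-based CLI could look roughly like this; the option names and defaults are assumptions for illustration, not the project's actual interface:

```python
import click

@click.command()
@click.option("--channel-name", required=True, help="YouTube channel to transcribe.")
@click.option("--num-threads", default=2, show_default=True, help="Worker threads.")
@click.option("--whisper-model", default="base", show_default=True,
              type=click.Choice(["tiny", "base", "small", "medium", "large"]))
def main(channel_name, num_threads, whisper_model):
    """Transcribe the videos of a YouTube channel with Whisper."""
    click.echo(f"Processing {channel_name} with {num_threads} threads")

if __name__ == "__main__":
    main()
```

Compared to argparse, click generates `--help` output and input validation (e.g. the model-size choices) from the decorators.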
for thread in threads:
    thread.join()

# Fetch entries and print them
About the database: how about making it optional, so people with low resources can decide not to create it? It is better to keep it simpler.
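Making the database opt-in could be as simple as a flag on the pipeline entry point. A minimal sketch, assuming a `VIDEO`-style table as in this PR (the function and column subset here are illustrative):

```python
import sqlite3

def run_pipeline(videos, transcribe, use_db=False, db_path="videos.db"):
    """Run the transforms; persist results only when the user opts in."""
    results = [transcribe(v) for v in videos]
    if use_db:
        # Only users who ask for the database pay its storage cost.
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS VIDEO (URL TEXT, TRANSCRIPTION TEXT)")
        conn.executemany("INSERT INTO VIDEO VALUES (?, ?)",
                         [(r["url"], r["transcription"]) for r in results])
        conn.commit()
        conn.close()
    return results
```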
hf_dataset_identifier = "Whispering-GPT/test_whisper"

dataset = Dataset.from_sql("SELECT CHANNEL_NAME, URL, TITLE, DESCRIPTION, TRANSCRIPTION, SEGMENTS FROM VIDEO", connection)
Before pushing the dataset, we need to check that the current dataset does not already contain the processed videos. This could also be done as a previous step: check whether a video is already processed, and only transcribe or translate it if not.
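The duplicate check could work by comparing video URLs against those already in the pushed dataset. A sketch of the filtering step (the function name is illustrative; the existing URLs would come from loading the Hub dataset):

```python
def filter_unprocessed(candidate_urls, existing_urls):
    """Return only the URLs not already present in the pushed dataset,
    so they are not transcribed or uploaded twice."""
    existing = set(existing_urls)
    return [url for url in candidate_urls if url not in existing]
```

Running this before the transforms, as suggested, saves the Whisper inference cost for already-processed videos rather than just deduplicating at push time.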
create_db("test.db")

# Create necessary resources
yt_video_processor = YoutubeVideoPreprocessor(mode="channel_name",
Maybe before processing everything, check whether it has already been processed. It is a good check to avoid reprocessing everything.
class DifferentNumberOfArgumentsError(Exception):

    def __init__(self, message: str) -> None:
        self.message = message
This class doesn't add anything, so maybe we can delete it.
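If the custom class is removed, a built-in exception can carry the same message. A minimal sketch (the function and message text are illustrative, not code from the PR):

```python
def check_argument_count(expected, actual):
    """Raise a built-in exception instead of a custom subclass that only
    stores a message."""
    if expected != actual:
        raise ValueError(f"Expected {expected} arguments, got {actual}")
```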
from pathlib import Path
from typing import List, Dict, Optional

from youtube_transcriber.loading.serialization import Serializer

class LoaderIterator:
    """Iterator that loads data from multiple files in batches."""

    def __init__(self,
                 serializer: Serializer,
                 num_files_per_iteration: int,
                 load_paths: Optional[List[Path]] = None) -> None:
        self.serializer = serializer
        self.num_files_per_iteration = num_files_per_iteration
        self._load_paths = load_paths
        self._current_iteration = None

    @property
    def load_paths(self) -> Optional[List[Path]]:
        return self._load_paths

    @load_paths.setter
    def load_paths(self, load_paths: List[Path]) -> None:
        self._load_paths = load_paths

    def __iter__(self):
        self._current_iteration = 0
        return self

    def __next__(self) -> List[Dict]:
        if self._did_load_all_batches():
            raise StopIteration
        data_batch = self._load_data_batch()
        self._current_iteration += 1
        return data_batch

    def _did_load_all_batches(self) -> bool:
        if self._current_iteration >= len(self._load_paths) / self.num_files_per_iteration:
            return True
        return False

    def _load_data_batch(self) -> List[Dict]:
        start_index = self._current_iteration * self.num_files_per_iteration
        stop_index = start_index + self.num_files_per_iteration
        return [self.serializer.load(load_path) for load_path in
                self._load_paths[start_index:stop_index] if load_path.exists()]
Is this class really necessary? As I understood it, it loads all the files and then processes them in the data pipeline, so maybe we can make it simpler. Maybe there is something I am missing here.
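The simplification suggested above could be a plain generator instead of an iterator class; roughly (the `load` callable stands in for the serializer's load method, an assumption on my part):

```python
from pathlib import Path

def load_in_batches(paths, batch_size, load):
    """Yield batches of deserialized files, skipping paths that no longer
    exist, mirroring LoaderIterator's behavior in a few lines."""
    for i in range(0, len(paths), batch_size):
        yield [load(p) for p in paths[i:i + batch_size] if Path(p).exists()]
```

A generator drops the manual `__iter__`/`__next__` state tracking and the mutable `load_paths` property, at the cost of not being resettable mid-iteration.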
from youtube_transcriber.datapipeline import DataPipeline

logging.basicConfig(level=logging.INFO, format="(%(threadName)-5s) %(message)s")
Maybe this could be initialized at the beginning of the main program.
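Moving the call into the entry point would look like this sketch, so logging is configured once at startup rather than as a side effect of importing the module:

```python
import logging

def main():
    # Configure logging once, at program start, using the PR's format string.
    logging.basicConfig(level=logging.INFO,
                        format="(%(threadName)-5s) %(message)s")
    logging.info("pipeline starting")

if __name__ == "__main__":
    main()
```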
from typing import Any

from pytube import YouTube

from youtube_transcriber.video import YoutubeVideo
from youtube_transcriber.utils import accepts_types
from youtube_transcriber.transforming.transform import Transform

class AddTitleTransform(Transform):
    """
    Transform a Video object using PyTube. Adds title to YouTube video DTO.
    It's a concrete Transform.
    """

    @accepts_types(YoutubeVideo)
    def apply(self, video: YoutubeVideo) -> YoutubeVideo:
        yt = YouTube(video.url)

        video_with_title_params = {
            "channel_name": video.channel_name,
            "url": video.url,
            "title": self._get_video_title(yt),
            "description": video.description,
            "transcription": video.transcription,
            "segments": video.segments
        }

        return YoutubeVideo(**video_with_title_params)

    def _get_video_title(self, yt: Any) -> str:
        return str(yt.title)
Can we merge all the transformations (description, title, etc.), so we can simplify the code?
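A merged metadata transform might set title and description in one pass. This sketch assumes the video DTO is (or could be) a dataclass, which the PR does not confirm; the `Video` stand-in and `fetch` callable are illustrative:

```python
import dataclasses

@dataclasses.dataclass
class Video:                      # stand-in for YoutubeVideo
    url: str
    title: str = ""
    description: str = ""

def add_metadata(video, fetch):
    """Fill title and description in one transform; fetch(url) is expected
    to return a (title, description) pair, e.g. from one pytube lookup."""
    title, description = fetch(video.url)
    return dataclasses.replace(video, title=title, description=description)
```

Besides less boilerplate, merging means pytube contacts YouTube once per video instead of once per field.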
import os
from pathlib import Path
from typing import Any
from collections import OrderedDict

from pytube import YouTube
import whisper

from youtube_transcriber.transforming.transform import Transform
from youtube_transcriber.video import YoutubeVideo
from youtube_transcriber.utils import accepts_types

class WhisperTransform(Transform):
    """
    Transform a Video object using Whisper model. It's a
    concrete Transform.
    Args:
        model (`str`):
            Size of Whisper model. Can be tiny, base (default), small, medium, and large.
        without_timestamps (`bool`, defaults to `False`):
            To add phrase-level timestamps.
    """

    def __init__(self, model: str = "base", without_timestamps: bool = False) -> None:
        self.model = whisper.load_model(model)
        self.without_timestamps = without_timestamps

    @accepts_types(YoutubeVideo)
    def apply(self, video: YoutubeVideo) -> YoutubeVideo:
        """Creates a new video with transcriptions created by Whisper."""
        # Create a YouTube object
        yt = YouTube(video.url)

        # Get audio from video
        try:
            audio_file = self._get_audio_from_video(yt)
        except Exception as e:
            print(f"Exception: {e}")
            raise  # audio_file is unset here; re-raise instead of continuing

        result = self.model.transcribe(audio_file,
                                       without_timestamps=self.without_timestamps)
        transcription = result["text"]

        data = []
        for seg in result['segments']:
            data.append(OrderedDict({'start': seg['start'], 'end': seg['end'], 'text': seg['text']}))

        os.remove(audio_file)

        return YoutubeVideo(channel_name=video.channel_name,
                            url=video.url,
                            title=video.title,
                            description=video.description,
                            transcription=transcription,
                            segments=data)

    def _get_audio_from_video(self, yt: Any) -> Path:
        # TODO: Add credits
        video = yt.streams.filter(only_audio=True).first()
        out_file = video.download(output_path=".")
        base, _ = os.path.splitext(out_file)
        new_file = base + ".mp3"
        os.rename(out_file, new_file)
        return new_file
This transformation is enough: with it we can create the transcripts and then send them to the Hub. How do you see it?
…ideo. Add parameter so users can select which whisper model to use
I started working on the YouTube transcriber using Whisper. I added the DTOs, a JsonSerializer, a LoaderIterator, and started on the transcriber class.