
First commit of youtube transcriber #2

Open · wants to merge 24 commits into main

Conversation

juancopi81 (Collaborator)

I started to work on the YouTube transcriber using Whisper. I added the DTOs, a JSON serializer, a LoaderIterator, and started on the transcriber class.

@juancopi81 juancopi81 changed the title First commit of youtube transcriber [WIP] First commit of youtube transcriber Oct 20, 2022
@juancopi81 juancopi81 changed the title [WIP] First commit of youtube transcriber First commit of youtube transcriber Oct 25, 2022
@juancopi81 (Collaborator, Author)

Hi @matallanas, I think it is now in good shape for a first merge, and we can keep improving it from there with GitHub issues!

@matallanas (Owner) left a comment

Very good job. It is a bit more than we need, but it looks really good. I will try to condense the code a bit and compact it more. I also suggested modifying the download process to use another library that makes it easier.

from youtube_transcriber.datapipeline import create_hardcoded_data_pipeline
from youtube_transcriber.threadeddatapipeline import ThreadedDataPipeline

NUM_THREADS = 2
@matallanas (Owner) · Nov 1, 2022

We could expose this as a parameter so users can increase the number of threads. I was also thinking it could be a problem if each thread has to load its own Whisper model: that is going to take a lot of memory. What do you think?
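One way around the memory concern is to load the model once and share it across the worker threads, serializing inference with a lock. A minimal sketch of that pattern follows; `SharedModel` is a stand-in for `whisper.load_model("base")`, not code from this PR:

```python
import threading
from queue import Queue

# Stand-in for whisper.load_model("base"): the point is that it is created
# once and shared, instead of one copy per thread multiplying memory use.
class SharedModel:
    def transcribe(self, audio_path: str) -> dict:
        return {"text": f"transcript of {audio_path}"}

model = SharedModel()          # loaded once, shared by every worker
model_lock = threading.Lock()  # serialize inference: one pass at a time
jobs: Queue = Queue()
results = {}

def worker() -> None:
    while True:
        path = jobs.get()
        if path is None:       # sentinel: no more work for this thread
            break
        with model_lock:
            results[path] = model.transcribe(path)["text"]

num_threads = 2  # could become the CLI parameter discussed above
threads = [threading.Thread(target=worker) for _ in range(num_threads)]
for t in threads:
    t.start()
for path in ["a.mp3", "b.mp3", "c.mp3"]:
    jobs.put(path)
for _ in threads:
    jobs.put(None)             # one sentinel per worker
for t in threads:
    t.join()
```

The lock trades parallel inference for bounded memory; if the model itself were thread-safe and memory allowed, per-thread models would parallelize better.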


NUM_THREADS = 2

def parse_args():
@matallanas (Owner)

Here we could use click as a library to generate the CLI.
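A click-based replacement for `parse_args()` could look like the sketch below; the option names (`--channel-name`, `--num-threads`, `--whisper-model`) are illustrative, not the script's actual flags:

```python
import click

# Hypothetical click CLI modeled on the script's parameters.
@click.command()
@click.option("--channel-name", required=True,
              help="YouTube channel to transcribe.")
@click.option("--num-threads", default=2, show_default=True,
              help="Number of worker threads.")
@click.option("--whisper-model", default="base", show_default=True,
              type=click.Choice(["tiny", "base", "small", "medium", "large"]),
              help="Whisper model size.")
def main(channel_name: str, num_threads: int, whisper_model: str) -> None:
    """Transcribe the videos of a YouTube channel with Whisper."""
    click.echo(f"channel={channel_name} threads={num_threads} model={whisper_model}")
```

click generates `--help`, validates the model choice, and removes the hand-written argument parsing.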

for thread in threads:
    thread.join()

# Fetch entries and print them
@matallanas (Owner)

About the database: how about making it optional, so people with low resources can decide not to create it? It is better to keep it simple.
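Making the database optional could be as simple as a nullable path and no-op writes. This is a hypothetical helper (not in the PR), using the `VIDEO` columns from the snippet below:

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical helper: only create the database when a path is given.
def maybe_create_db(db_path: Optional[str]) -> Optional[sqlite3.Connection]:
    if db_path is None:
        return None  # low-resource users skip the database entirely
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS VIDEO (
               CHANNEL_NAME TEXT, URL TEXT PRIMARY KEY, TITLE TEXT,
               DESCRIPTION TEXT, TRANSCRIPTION TEXT, SEGMENTS TEXT)"""
    )
    return conn

def store_video(conn: Optional[sqlite3.Connection], row: Tuple) -> None:
    if conn is None:
        return  # no-op when the database was not requested
    conn.execute("INSERT OR REPLACE INTO VIDEO VALUES (?, ?, ?, ?, ?, ?)", row)
    conn.commit()
```

The rest of the pipeline then calls `store_video` unconditionally and never needs to know whether a database exists.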

hf_dataset_identifier = "Whispering-GPT/test_whisper"

dataset = Dataset.from_sql("SELECT CHANNEL_NAME, URL, TITLE, DESCRIPTION, TRANSCRIPTION, SEGMENTS FROM VIDEO", connection)

@matallanas (Owner)

Before pushing the dataset, we need to check that the current dataset does not already contain the processed videos. This could also be done as a previous step: check whether the video has already been processed, and only transcribe or translate it if not.
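The pre-check could be a simple set difference on URLs. In the real pipeline `existing_urls` would be the URL column of the hub dataset (e.g. loaded with the `datasets` library); here it is just an iterable of strings, and `filter_unprocessed` is a hypothetical helper:

```python
from typing import Iterable, List

# Hypothetical helper: keep only the videos not yet in the published dataset.
def filter_unprocessed(candidate_urls: Iterable[str],
                       existing_urls: Iterable[str]) -> List[str]:
    seen = set(existing_urls)  # O(1) membership checks
    return [url for url in candidate_urls if url not in seen]
```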

create_db("test.db")

# Create necessary resources
yt_video_processor = YoutubeVideoPreprocessor(mode="channel_name",
@matallanas (Owner)

Maybe check whether a video has already been processed before processing everything. It is a good check to avoid redoing the entire job.

Comment on lines +1 to +4
class DifferentNumberOfArgumentsError(Exception):

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message
@matallanas (Owner)

This class doesn't add anything, so maybe we can delete it.

Comment on lines +1 to +46
from pathlib import Path
from typing import List, Dict, Optional

from youtube_transcriber.loading.serialization import Serializer

class LoaderIterator:
    """Iterator that loads data from multiple files in batches."""

    def __init__(self,
                 serializer: Serializer,
                 num_files_per_iteration: int,
                 load_paths: Optional[List[Path]] = None) -> None:
        self.serializer = serializer
        self.num_files_per_iteration = num_files_per_iteration
        self._load_paths = load_paths
        # Reset in __iter__; starting at 0 avoids a TypeError if __next__
        # is called before __iter__.
        self._current_iteration = 0

    @property
    def load_paths(self) -> Optional[List[Path]]:
        return self._load_paths

    @load_paths.setter
    def load_paths(self, load_paths: List[Path]) -> None:
        self._load_paths = load_paths

    def __iter__(self):
        self._current_iteration = 0
        return self

    def __next__(self) -> List[Dict]:
        if self._did_load_all_batches():
            raise StopIteration
        data_batch = self._load_data_batch()
        self._current_iteration += 1
        return data_batch

    def _did_load_all_batches(self) -> bool:
        return self._current_iteration >= len(self._load_paths) / self.num_files_per_iteration

    def _load_data_batch(self) -> List[Dict]:
        start_index = self._current_iteration * self.num_files_per_iteration
        stop_index = start_index + self.num_files_per_iteration
        return [self.serializer.load(load_path) for load_path in
                self._load_paths[start_index:stop_index] if load_path.exists()]
@matallanas (Owner)

Is this class really necessary? As I understand it, it loads all the files and then processes them in the data pipeline, so maybe we can make it simpler. Maybe there is something I am missing here.
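If the class can be dropped, a plain generator would yield the same batches. This sketch assumes only the PR's `Serializer` protocol (an object with a `.load(path)` method):

```python
from pathlib import Path
from typing import Dict, Iterator, List, Sequence

# Sketch: generator equivalent of LoaderIterator.
def load_batches(serializer,
                 load_paths: Sequence[Path],
                 num_files_per_iteration: int) -> Iterator[List[Dict]]:
    for start in range(0, len(load_paths), num_files_per_iteration):
        batch = load_paths[start:start + num_files_per_iteration]
        # Skip files that disappeared between listing and loading.
        yield [serializer.load(p) for p in batch if p.exists()]
```

It keeps the batching and the existence check, but drops the iterator-state bookkeeping entirely.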


from youtube_transcriber.datapipeline import DataPipeline

logging.basicConfig(level=logging.INFO, format="(%(threadName)-5s) %(message)s")
@matallanas (Owner)

Maybe this could be initialized at the beginning of the main program.

Comment on lines +1 to +31
from typing import Any

from pytube import YouTube

from youtube_transcriber.video import YoutubeVideo
from youtube_transcriber.utils import accepts_types
from youtube_transcriber.transforming.transform import Transform

class AddTitleTransform(Transform):
    """
    Transform a Video object using PyTube. Adds title to YouTube video DTO.
    It's a concrete Transform.
    """

    @accepts_types(YoutubeVideo)
    def apply(self, video: YoutubeVideo) -> YoutubeVideo:
        yt = YouTube(video.url)

        video_with_title_params = {
            "channel_name": video.channel_name,
            "url": video.url,
            "title": self._get_video_title(yt),
            "description": video.description,
            "transcription": video.transcription,
            "segments": video.segments,
        }

        return YoutubeVideo(**video_with_title_params)

    def _get_video_title(self, yt: Any) -> str:
        return str(yt.title)
@matallanas (Owner)

Can we merge all the transformations (description, title, etc.) so we can simplify the code?
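A merged transform could fetch title and description from a single pytube object. This is a hypothetical sketch, not the PR's code: plain dicts stand in for the `YoutubeVideo` DTO, and `youtube_factory` (which would default to `pytube.YouTube` in practice) is injectable so the sketch runs without the network:

```python
from typing import Any, Callable, Dict

# Hypothetical merged transform replacing AddTitleTransform plus the
# description transform: one metadata fetch per video.
class AddMetadataTransform:
    def __init__(self, youtube_factory: Callable[[str], Any]) -> None:
        self.youtube_factory = youtube_factory

    def apply(self, video: Dict) -> Dict:
        yt = self.youtube_factory(video["url"])  # single pytube object
        return {**video,
                "title": str(yt.title),
                "description": str(yt.description)}
```

Besides less code, this halves the metadata requests, since each per-field transform currently constructs its own `YouTube` object.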

Comment on lines +1 to +66
import os
from typing import Any
from collections import OrderedDict

from pytube import YouTube
import whisper

from youtube_transcriber.transforming.transform import Transform
from youtube_transcriber.video import YoutubeVideo
from youtube_transcriber.utils import accepts_types

class WhisperTransform(Transform):
    """
    Transform a Video object using a Whisper model. It's a concrete Transform.

    Args:
        model (`str`):
            Size of Whisper model. Can be tiny, base (default), small, medium, or large.
        without_timestamps (`bool`, defaults to `False`):
            Whether to omit phrase-level timestamps.
    """

    def __init__(self, model: str = "base", without_timestamps: bool = False) -> None:
        self.model = whisper.load_model(model)
        self.without_timestamps = without_timestamps

    @accepts_types(YoutubeVideo)
    def apply(self, video: YoutubeVideo) -> YoutubeVideo:
        """Creates a new video with transcriptions created by Whisper."""
        # Create a YouTube object
        yt = YouTube(video.url)

        # Get audio from video; re-raise so a failed download is not
        # silently followed by transcribing an undefined file
        try:
            audio_file = self._get_audio_from_video(yt)
        except Exception as e:
            print(f"Exception: {e}")
            raise

        result = self.model.transcribe(audio_file,
                                       without_timestamps=self.without_timestamps)
        transcription = result["text"]

        data = []
        for seg in result["segments"]:
            data.append(OrderedDict({"start": seg["start"],
                                     "end": seg["end"],
                                     "text": seg["text"]}))

        os.remove(audio_file)

        return YoutubeVideo(channel_name=video.channel_name,
                            url=video.url,
                            title=video.title,
                            description=video.description,
                            transcription=transcription,
                            segments=data)

    def _get_audio_from_video(self, yt: Any) -> str:
        # TODO: Add credits
        video = yt.streams.filter(only_audio=True).first()
        out_file = video.download(output_path=".")
        base, _ = os.path.splitext(out_file)
        new_file = base + ".mp3"
        os.rename(out_file, new_file)
        return new_file
@matallanas (Owner)

This transformation is enough: with it we can create the transcripts and then send them to the Hub. How do you see it?

…ideo. Add parameter so users can select which whisper model to use
Labels: None yet
Projects: None yet
2 participants