Skip to content

Graf3x/ThorsCodex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Thor's Codex

Thor's Codex is a project that aims to index all of Pirate Software's available VODs from YouTube and the content of the TTS Discord channel. The project provides a search engine for the words said during these VODs, making it easier to find specific moments and information. In the future, we plan to run a local RAG and LLM to summarize large parts of the VODs. This page hosts the HTML and JavaScript to interact with the application.

Annoucements

  • The Full text search from Cosmos leaves some things to be desired, it's BM25 and that just doesn't work for phrases.
  • There is a config button in the top right to allow you to use Azure's search service. Which isn't working correctly either. That is the real reason i added a configure button, because it's not as good as the already not quite what i want cosmos search.
  • I am adding a vectorizor now... it's taking a while, there is a large chunk of data for this.
  • Yes, i know about monogodb, i'll be looking into that next.

Project Overview

  • Project Name: Thor's Codex
  • Description: A static HTML/JS page to host an index of Pirate Software's VODs and Discord content, providing a searchable interface.
  • Future Plans: Implement a local RAG and LLM to summarize VOD content.

Features

  • Indexes PirateSoftware's YouTube VODs and Discord channel content (only the tts section, soon to be added).
  • Provides a search engine to find specific words said during VODs.
  • Give context to each search result using an LLM to summerize a portion of the video.
  • This repo is just the static files to host the HTML and JavaScript to interact with the application.

LLM Summerization

Here is list of what i am currently doing & what i want to do with AI/LLM's in this project. I will do more if i can figure out how to do it cost-effectively. Currently there is over 4m sentences from thor and 100's of hours of video/audio. I may look at local LLM's to do some of the future ideas listed below.

  • Currently chunking the videos into 30 minute segements for display/performance purposes
  • The LLM is reading the transcript that is generated by YT to make the summary.
  • Future enhancements: LLM Audio transcript - strip out the audio track of each video and send it up to an LLM to do a transcription on that. This will remove the [__] curse filtering that is happening right now.
  • Future enhancements: Ground the LLM in all of the transcripts and additional info before doing summerization
  • Future enhancements: Improve summaries by running them through different LLM models, to change up the "Same intro every time" to each summary.
  • Future enhancements: Add even more context to summerization by taking out frames of the video and having them parsed and fed into the prompt.

Architecture

This system uses a combination of:

  • Static HTML/JS pages : This repo, to provide the user interface.
  • ThorsCodex.Transcripts.to.CSV : Quick app to gather transcript text.
  • Serverless functions: API's to route requests and scale based on traffic load. These services will later down-step traffic to different experences for searching based on server load/traffic. Those experences are listed below:

Searching

  • Current Search Experence: NoSQL DB and a Full Text search, it uses BM25 which from my understanding can not handle multiword phrases. Currently working around this by making "cursed quest" into "cursed" "quest" and saying it needs to find all of the strings.
  • Advanced Experience: Uses RAG and LLM to return responses. (Hybrid Search?)
  • Prime Experience: Uses asure ai search or elastic/solar or similar tools for searches.
  • Fallback Experience: Uses a normal DB search (ie: contains) as an initial, simple fallback.

Known Issues

  • Using "word1 word2" will not return that exact phrase. This is a current limitation of the searching algo. It uses BM25 and that, from my understanding, can not handle multi word phrases. I break the words up and tell it to do a "ALL" right now as a work around till the newer search experence is added.
  • Using "" results in a shorter than expected word list of videos. Non-quoted words hit a summery list of words used in a segment of a video, so it is able to index a few 100 documents instead of a few 1000 and return very quickly. When you use a quote and other words you are saying "I want this exact word and other words in my transcript result". So i do the searching differently and it takes a significantly more resources. To keep this reasonable for the user and the free service, i return less results per search button click.

Getting Started

To get started with Thor's Codex, follow these steps:

  1. Clone the repository:

    git clone https://github.com/yourusername/thorscodex.git
  2. Navigate to the project directory:

    cd thorscodex
  3. Open the index.html file in your web browser to interact with the application.

Contributing

We welcome contributions to improve Thor's Codex. Feel free to submit pull requests or open issues for any bugs or feature requests.

License

This project is licensed under the MIT License.

Contact

For questions or suggestions, please reach out to us via the repository's issue tracker.


Happy searching through Pirate Software's archives!

About

Static html files so the goblins can get their bytes.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published