title | emoji | colorFrom | colorTo | sdk | sdk_version | app_file | pinned |
---|---|---|---|---|---|---|---|
Multimodal AI Assistant |
🤖 |
blue |
teal |
streamlit |
1.36.0 |
python main.py |
false |
The Multilingual AI Assistant project leverages cutting-edge technologies to provide a versatile assistant capable of handling text, image, and video inputs. Powered by Google's Gemini-1.5-pro
model and integrated with gTTS
(Google Text-to-Speech), PIL
(Python Imaging Library), and SpeechRecognition
, this assistant offers a seamless user experience across multiple modalities.
- At the heart of Gemini 1.5 Pro's advancements is its
1M token context window
. This feature allows the model to analyze and understand inputs of unparalleled length, enabling deeper and more nuanced content generation and data analysis. - Gemini 1.5 Pro leverages the
Mixture of Experts (MoE)
architecture, enhancing its ability to process and respond to complex queries efficiently. This architecture allows the model to delegate tasks to specialized components, significantly boosting its performance and versatility. - With its
vast context window
and advanced architecture, Gemini 1.5 Pro demonstratessuperior information retrieval capabilities
. This allows for more precise and relevant outputs, particularly in tasks requiring the extraction of specific information from large datasets.
Google Gemini-1.5-pro
: Utilized for generating contextually relevant responses across various inputs such as text, images, and videos.
gTTS (Google Text-to-Speech)
: Converts textual responses into natural-sounding speech for user interaction.
PIL (Python Imaging Library)
: Handles image processing tasks, allowing the assistant to analyze and interpret visual content.
SpeechRecognition
: Enables voice input recognition, enhancing user interaction through spoken commands.
google-generativeai
: Provides tools for interacting with Google's generative AI services
Streamlit
: Used for developing the interactive user interface (UI), making it easy to deploy and visualize outputs.
Python(3.10.9)
: Primary programming language used for implementing the backend logic and integrating various components.
Other libraries
: Includes standard Python libraries for file handling, audio manipulation, and API integrations.
Multimodal Inputs
: Supports text queries, image uploads, and video submissions for generating informative responses.
Voice Interaction
: Enables users to interact with the assistant using voice commands, which are processed using SpeechRecognition.
Real-time Response Generation
: Utilizes Gemini-pro's capabilities to generate context-aware responses in real-time, enhancing user engagement and satisfaction.
Speech Output
: Converts textual responses into audio format using gTTS, allowing users to listen to responses directly.
-
Input Handling
-
Speech Recognition (SpeechRecognition):
- Functionality: Converts spoken language into text.
- Tool: Utilizes various speech recognition engines and APIs.
- Implementation: Integrated to capture user queries through voice input.
-
Text Input:
- Functionality: Accepts textual queries directly from users.
- Implementation: Utilizes Streamlit for creating interactive input fields.
-
File Upload (Streamlit):
- Functionality: Allows users to upload images and videos.
- Implementation: Streamlit's file uploader widget integrated for image and video input.
-
-
Data Processing
-
Audio Processing (PyAudio):
- Functionality: Enables recording and processing of audio data.
- Tool: Interfaces with the PortAudio library.
- Implementation: Used for capturing and handling voice inputs.
-
Image Processing (PIL - Python Imaging Library):
- Functionality: Performs operations on image data, such as opening, manipulating, and saving.
- Implementation: Used to handle image uploads and processing before model input.
-
Video Processing (Streamlit):
- Functionality: Handles video file uploads and processing.
- Implementation: Streamlit utilized to manage video file uploads and subsequent actions.
-
-
Model Integration
- Google Generative AI Services (google-generativeai):
- Functionality: Accesses and interacts with Google's generative AI models.
- Tool: Facilitates model integration for generating responses based on input queries.
- Implementation: Used to generate content or responses based on user inputs, both text and multimodal.
- Google Generative AI Services (google-generativeai):
-
Output Generation
- Text-to-Speech (gTTS - Google Text-to-Speech):
- Functionality: Converts generated text responses into spoken audio.
- Tool: Utilizes Google's Text-to-Speech API.
- Implementation: Generates audio output for user queries and responses.
- Text-to-Speech (gTTS - Google Text-to-Speech):
-
User Interface (Streamlit and Markdown)
-
Streamlit:
- Functionality: Framework for building and deploying interactive web applications.
- Implementation: Used to create a user-friendly interface for inputting queries, displaying responses, and managing multimedia inputs.
-
Markdown:
- Functionality: Formats and structures text in the application interface.
- Implementation: Used to enhance readability and structure in the Streamlit application's content.
-
-
Input Acquisition:
- Handles various types of input:
- Voice input converted to text using SpeechRecognition.
- Textual queries entered directly by the user.
- Multimedia uploads (images and videos) managed through Streamlit.
- Handles various types of input:
-
Data Preparation:
- Processes input data:
- Audio data handled by PyAudio for voice inputs.
- Images processed using PIL to prepare them for model input.
- Videos managed and processed via Streamlit for subsequent operations.
- Processes input data:
-
Model Execution:
- Utilizes Google's generative AI services (google-generativeai):
- Receives processed inputs (text, images, videos).
- Generates responses or content based on the received inputs.
- Utilizes Google's generative AI services (google-generativeai):
-
Output Generation:
- Converts generated text responses into speech:
- gTTS used to generate audio output from textual responses.
- Audio files made available for listening and download through Streamlit.
- Converts generated text responses into speech:
-
User Interface:
- Streamlit manages the entire workflow:
- Provides interactive widgets for input handling (text fields, file uploaders).
- Displays generated responses (text and audio) in a user-friendly interface.
- Enables multimedia playback and download functionalities.
- Streamlit manages the entire workflow:
- Clone the repo .
- Make a virtual environment .
- visit Google AI Studio to get your API key .
- Install the dependencies
pip install -r requirements.txt
. - Run the server
python template.py
. - Run the server
python setup.py
after changing the information. - Finally, run the server
python app.py
. - If facing issues, message me on LinkedIn .
Big kudos to Google DeepMind 🚀 for crafting this cutting-edge model and championing open-source innovation !!