This project implements a web application that generates a search query from an uploaded image combined with a user-provided query. The application integrates several components: BLIP for generating image captions, LLaMA for refining user queries, and the Zenserp API for returning relevant search results.
The goal is to develop an intelligent pipeline that takes an image of any object along with a user-defined prompt as input. The prompt specifies the user's requirement, such as finding similar objects or variations based on the image. For instance, if the input image is a towel, the prompt might request "show me a similar towel" or "show me towels of the same type but in different colors." The pipeline should then process the input image and prompt, and return relevant web search results that align with the user's specific needs. This system aims to enhance search efficiency by combining visual inputs with contextual user queries.
This project addresses the problem by implementing:
- Image Captioning: Uses the BLIP model to analyze the uploaded image and generate a relevant caption.
- Query Refinement: Leverages the LLaMA model to refine the user’s text query by integrating the image caption and the user’s original prompt.
- Search Results: Fetches results based on the refined query using the Zenserp API, providing a list of relevant websites.
The application processes each request in the following order:
- User Input: The user uploads an image and provides a query or prompt.
- BLIP Model: The application processes the uploaded image using the BLIP (Bootstrapping Language-Image Pretraining) model to generate a caption for the image.
- LLaMA Model: This caption is combined with the user’s query, and the LLaMA model refines the combined input to produce a single-line search query.
- Zenserp API: The refined query is then used to search for relevant web results through the Zenserp API.
- Results Display: The app displays the generated search query and provides a list of search results with clickable links.
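The captioning and refinement steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the checkpoint name `Salesforce/blip-image-captioning-base`, the prompt wording, and the helper names `load_blip`, `caption_image`, `build_prompt`, `extract_query`, and `refine_query` are all illustrative. The LLaMA call is represented by a plain `generate` callable that is assumed to return only the newly generated text, so any text-generation backend can be plugged in.

```python
from typing import Callable


def load_blip(checkpoint: str = "Salesforce/blip-image-captioning-base"):
    """Load a BLIP processor and captioning model (checkpoint name is an assumption)."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def caption_image(image, processor, model, max_new_tokens: int = 30) -> str:
    """Generate a caption for a PIL image with an already-loaded BLIP model."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


def build_prompt(caption: str, user_query: str) -> str:
    """Combine the caption and the user's request (the wording is hypothetical)."""
    return (
        "Write one short web search query on a single line.\n"
        f"Image description: {caption}\n"
        f"User request: {user_query}\n"
        "Search query:"
    )


def extract_query(generated_text: str) -> str:
    """Keep only the first non-empty line and strip surrounding quotes."""
    for line in generated_text.splitlines():
        cleaned = line.strip().strip("\"'“”").strip()
        if cleaned:
            return cleaned
    return ""


def refine_query(caption: str, user_query: str, generate: Callable[[str], str]) -> str:
    """Run the text model on the combined prompt and return a single-line query."""
    return extract_query(generate(build_prompt(caption, user_query)))
```

Passing the model objects and the `generate` callable in as arguments keeps the helpers easy to exercise with stand-ins before the real models are downloaded.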
The project requires several Python libraries and external services:
- PyTorch: For working with machine learning models.
- Transformers: To load and interact with the BLIP and LLaMA models.
- Gradio: A simple UI framework to create the web interface.
- Pyngrok: To provide public access to the Gradio app.
- Requests: To handle API calls for fetching the image and querying Zenserp.
- Zenserp API: An external service to fetch search results based on a query.
- Clone the Repository:
    git clone https://github.com/your-username/image-search-query-generator.git
    cd image-search-query-generator
- Install the Dependencies:
    pip install -r requirements.txt
- Get API Keys:
  - Zenserp API Key: Sign up at Zenserp to get an API key and replace the placeholder in the code with your key.
- Ngrok Token (Optional): If you want to run the Gradio app publicly, sign up at Ngrok and set up your auth token using:
    ngrok config add-authtoken YOUR_AUTH_TOKEN
- Run the Application:
    python app.py
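As an alternative to pasting the Zenserp key into the source, the key could be read from an environment variable at startup. This is a minimal sketch, not how the project currently handles the key (it uses a placeholder in the code); the variable name `ZENSERP_API_KEY` and the helper `get_zenserp_key` are hypothetical.

```python
import os


def get_zenserp_key(env_var: str = "ZENSERP_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    # The variable name is an assumption; pick whatever suits your setup.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Zenserp API key."
        )
    return key
```

With this in place, the key would be exported in the shell before running `python app.py`, which keeps it out of version control.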
- Input: The function takes two inputs:
  - Uploaded Image: An image uploaded by the user (PIL format).
  - User Prompt: A question or search query provided by the user.
- Steps:
  - Image Captioning (BLIP):
    - The function uses the BLIP processor to convert the image into a caption. The caption provides contextual information about the image.
  - Query Refinement (LLaMA):
    - The BLIP-generated caption and user’s prompt are combined and passed to the LLaMA model.
    - The LLaMA model refines the input into a concise search query that integrates both the visual and text information.
  - Fetching Search Results:
    - The refined query is sent to the Zenserp API, which returns search results (titles, URLs, etc.) in JSON format.
    - The function formats the results into a user-friendly output with clickable links.
- Return: The function returns the final refined query, search results, and image caption.
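The search step described above might look roughly like the sketch below. It is a sketch under assumptions rather than the project's code: the endpoint URL, the `apikey` header, and the `organic` / `title` / `url` response fields reflect Zenserp's documented search API as best understood here and should be checked against the current Zenserp documentation. The helper names `fetch_results` and `format_results` are hypothetical.

```python
from typing import Any

# Assumed endpoint; verify against the Zenserp documentation.
ZENSERP_URL = "https://app.zenserp.com/api/v2/search"


def fetch_results(query: str, api_key: str, timeout: float = 10.0) -> dict[str, Any]:
    """Send the refined query to Zenserp and return the decoded JSON payload."""
    # Imported lazily so format_results can be used without requests installed.
    import requests

    response = requests.get(
        ZENSERP_URL,
        headers={"apikey": api_key},
        params={"q": query},
        timeout=timeout,
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()


def format_results(payload: dict[str, Any], limit: int = 5) -> str:
    """Turn organic results into a Markdown list of clickable links."""
    lines = []
    for item in payload.get("organic", []):
        title, url = item.get("title"), item.get("url")
        if not title or not url:
            continue  # skip entries that cannot be rendered as a link
        lines.append(f"- [{title}]({url})")
        if len(lines) == limit:
            break
    return "\n".join(lines) if lines else "No results found."
```

Keeping the network call and the formatting in separate functions means the formatting can be checked against a saved JSON payload without spending API quota.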
- Inputs:
  - gr.Image: Allows the user to upload an image for analysis.
  - gr.Textbox: Lets the user input their search prompt or question.
- Outputs:
  - gr.Markdown: Displays the formatted search results as clickable links.
  - gr.Textbox: Shows the refined search query and the caption generated by BLIP.
- Pyngrok: Allows you to share the app publicly using Ngrok. After setting up the Ngrok token, the app can be accessed via a public URL.
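The interface components and the optional Ngrok tunnel described above could be wired together along these lines. This is a sketch under assumptions: `pipeline` stands for a function taking `(image, prompt)` and returning `(refined_query, results_markdown, caption)`, the helper names `to_outputs`, `build_interface`, and `launch` are hypothetical, and port 7860 is Gradio's usual default rather than something the project specifies.

```python
def to_outputs(refined_query: str, results_markdown: str, caption: str) -> tuple[str, str]:
    """Map the three pipeline values onto the two output components."""
    summary = f"Refined query: {refined_query}\nImage caption: {caption}"
    return results_markdown, summary


def build_interface(pipeline):
    """Wrap pipeline(image, prompt) -> (query, results, caption) in a Gradio UI."""
    import gradio as gr  # imported lazily; requires the gradio package

    def handler(image, prompt):
        return to_outputs(*pipeline(image, prompt))

    return gr.Interface(
        fn=handler,
        inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Prompt")],
        outputs=[gr.Markdown(), gr.Textbox(label="Query and caption")],
    )


def launch(pipeline, port: int = 7860, public: bool = False) -> None:
    """Start the app locally and optionally expose it through an Ngrok tunnel."""
    if public:
        from pyngrok import ngrok  # requires a configured Ngrok auth token

        tunnel = ngrok.connect(str(port))
        print("Public URL:", tunnel.public_url)
    build_interface(pipeline).launch(server_port=port)
```

Opening the tunnel before calling `launch` matters here because Gradio's `launch` call blocks until the server is stopped.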
- Upload an Image: Drag and drop an image or upload one from your device.
- Enter a Query: Input a text query that you'd like to refine based on the image context.
- View Results: The app will generate a search query based on the image and your text, then display relevant search results.
For example:
- Image: An image of a black t-shirt with a Superman logo.
- User Prompt: “Find similar t-shirts.”
- Refined Query: “Black Superman logo t-shirt buy.”
- Search Results: The app will return several links to stores or articles where similar t-shirts are available.