💡 [REQUEST] - How to create multimodal embeddings for text, images and videos using this model #506
Closed
Description
起始日期 | Start Date
No response
实现PR | Implementation PR
No response
相关Issues | Reference Issues
No response
摘要 | Summary
I want to create embeddings for text, images and videos using the MiniCPM model, similar to what is possible with the LLaVA model. How can I create multimodal embeddings with this model?
基本示例 | Basic Example
There are a few multimodal embedding models, such as CLIP and LLaVA, that can be used to create embeddings for text and images as well as videos. This is what I have tried so far:
code ="""from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
import torch
# Load the pre-trained model
model_path = "MLLM path "
model = AutoModel.from_pretrained(model_path, trust_remote_code = True, device_map = device)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code = True)
processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code = True)
# Preprocess the text
text = "This is a sample text"
inputs = processor(text, return_tensors="pt")
# Generate text embeddings
with torch.no_grad():
outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0, :]
# Use the embeddings
print(embeddings.shape) """
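
For comparison, a CLIP-style model already exposes this kind of interface in transformers. The sketch below is only illustrative: the checkpoint name openai/clip-vit-base-patch32 and the image URL are placeholders, not part of my actual setup. I would like an equivalent way to obtain text, image and (frame-wise) video embeddings from MiniCPM:

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import requests
import torch

# Any CLIP-style checkpoint should behave the same way here (placeholder checkpoint)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image; a video could be handled by embedding sampled frames
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)
text = "This is a sample text"

inputs = clip_processor(text=[text], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Projected embeddings in the shared text/image space
    text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])

print(text_emb.shape, image_emb.shape)

With MiniCPM I am not sure which part of the model (for example, its vision encoder) should be called to obtain comparable image or video-frame embeddings, which is what I am asking about here.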
缺陷 | Drawbacks
I am trying to do the same with this model, but I run into an error when I do.
未解决问题 | Unresolved questions
No response