💡 [REQUEST] - How to create multimodal embeddings for text, images, and videos using this model #506

Closed
@vimal00r

Description

Start Date

No response

Implementation PR

No response

Reference Issues

No response

Summary

I want to create embeddings for text, images, and videos with the MiniCPM model, the same way models like LLaVA are used. How can I create multimodal embeddings with this model?

Basic Example

There are a few multimodal embedding models, such as CLIP and LLaVA, that can produce embeddings for text and images, and by extension videos.
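For reference, here is a minimal sketch of how a contrastive embedder like CLIP produces text and image embeddings in a shared space. The openai/clip-vit-base-patch32 checkpoint and the local file name are assumptions for illustration:

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

clip_name = "openai/clip-vit-base-patch32"
clip_model = CLIPModel.from_pretrained(clip_name)
clip_processor = CLIPProcessor.from_pretrained(clip_name)

# Text embedding
text_inputs = clip_processor(text=["This is a sample text"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = clip_model.get_text_features(**text_inputs)      # shape (1, 512)

# Image embedding, projected into the same space
image = Image.open("example.jpg")  # hypothetical local image
image_inputs = clip_processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_emb = clip_model.get_image_features(**image_inputs)   # shape (1, 512)

print(text_emb.shape, image_emb.shape)

For video, a common workaround is to sample frames and average their image embeddings. A sketch with OpenCV, reusing clip_model and clip_processor from above ("video.mp4" and the every-30th-frame sampling rate are assumptions):

import cv2
import torch
from PIL import Image

cap = cv2.VideoCapture("video.mp4")
frame_embs = []
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % 30 == 0:  # roughly one frame per second at 30 fps
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        inputs = clip_processor(images=Image.fromarray(rgb), return_tensors="pt")
        with torch.no_grad():
            frame_embs.append(clip_model.get_image_features(**inputs))
    idx += 1
cap.release()

# Mean of per-frame embeddings as a single video embedding
video_emb = torch.cat(frame_embs).mean(dim=0, keepdim=True)  # shape (1, 512)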

code ="""from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
import torch

            # Load the pre-trained  model
            model_path = "MLLM path "
            model = AutoModel.from_pretrained(model_path, trust_remote_code = True, device_map = device)
            tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code = True)
            processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code = True)
            
            # Preprocess the text
            text = "This is a sample text"
            inputs = processor(text, return_tensors="pt")
            
            # Generate text embeddings
            with torch.no_grad():
                outputs = model(**inputs)
                embeddings = outputs.last_hidden_state[:, 0, :]
            
            # Use the embeddings
            print(embeddings.shape)   """
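Since MiniCPM is a generative model rather than a contrastive embedder like CLIP, it does not document a get_text_features-style API. One workaround (a heuristic sketch, not an official interface, and assuming the remote code follows the standard transformers output convention) is to request hidden states and mean-pool them over non-padding tokens, reusing the model and tokenizer loaded above:

import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average token representations, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                  # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # (B, 1)
    return summed / counts

inputs = tokenizer("This is a sample text", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# hidden_states[-1] is the final layer; mean pooling is a common heuristic choice
embedding = mean_pool(outputs.hidden_states[-1], inputs["attention_mask"])
print(embedding.shape)  # (1, hidden_size)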

Drawbacks

I am trying to do the same with this model, but I run into an error when I do.
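Without the traceback it is hard to say, but two likely culprits (hedged guesses): AutoImageProcessor expects image inputs, so passing a raw string to it fails, and models loaded with trust_remote_code=True can return custom output objects, in which case outputs.last_hidden_state may not exist under that name; printing type(outputs) shows what the forward pass actually returns.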

Unresolved questions

No response
