- Zhejiang University
- Hangzhou
Starred repositories
Codebase for Aria - an Open Multimodal Native MoE
Make huge neural nets fit in memory
A paper list of recent work on token compression for ViT and VLM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
g1: Using Llama-3.1 70b on Groq to create o1-like reasoning chains
Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Tr…
Official inference repo for FLUX.1 models
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
High-resolution models for human tasks.
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Controllable video and image Generation, SVD, Animate Anyone, ControlNet, ControlNeXt, LoRA
LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
llama3 implementation one matrix multiplication at a time
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
Official Implementation of ICLR'24: Kosmos-G: Generating Images in Context with Multimodal Large Language Models
ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation (ICCV 2023, Oral)
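AIGC-interview/CV-interview/LLMs-interview: a collection of interview questions and answers, along with new ideas, new problems, new resources, and new projects from work and research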
The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images from an image prompt.
InstantID: Zero-shot Identity-Preserving Generation in Seconds 🔥
Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.