Cosmos World Foundation Model Platform for Physical AI

nvidia/cosmos 7 Jan 2025

We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications.

Position

6,422
6.56 stars / hour

TransPixar: Advancing Text-to-Video Generation with Transparency

wileewang/TransPixar 6 Jan 2025

Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education.

Text-to-Video Generation, Video Generation

547
2.84 stars / hour

LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

bytedance/LatentSync 12 Dec 2024

Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet.

Portrait Animation

1,567
2.59 stars / hour

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

automl/tabpfn 5 Jul 2022

We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods.

AutoML, Bayesian Inference, +5 more

1,917
2.51 stars / hour

Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

hhhuang/cag 20 Dec 2024

With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval.

RAG, Retrieval

664
1.83 stars / hour

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

igl-hkust/diffusionasshader 7 Jan 2025

Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images.

Video Generation

286
2.10 stars / hour

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

VITA-MLLM/VITA 3 Jan 2025

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction.

1,810
1.80 stars / hour

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

ictnlp/llava-mini 7 Jan 2025

To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens play a crucial role only in the early layers of the LLM backbone, where they mainly fuse visual information into text tokens.

Visual Question Answering (VQA), Zero-Shot Video Question Answer

189
1.48 stars / hour

KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation

openspg/kag 10 Sep 2024

The recently developed retrieval-augmented generation (RAG) technology has enabled the efficient construction of domain-specific applications.

Knowledge Graphs, Question Answering, +2 more

3,952
1.55 stars / hour

NVILA: Efficient Frontier Visual Language Models

nvlabs/vila 5 Dec 2024

This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy.

2,698
1.47 stars / hour