Trending Research

Cosmos World Foundation Model Platform for Physical AI

nvidia/cosmos • 7 Jan 2025

We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications.

Position

6,422

6.56 stars / hour

TransPixar: Advancing Text-to-Video Generation with Transparency

wileewang/TransPixar • 6 Jan 2025

Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education.

Text-to-Video Generation, Video Generation

547

2.84 stars / hour

LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

bytedance/LatentSync • 12 Dec 2024

Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet.

Portrait Animation

1,567

2.59 stars / hour

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

automl/tabpfn • 5 Jul 2022

We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods.

AutoML, Bayesian Inference, +5 more

1,917

2.51 stars / hour

Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

hhhuang/cag • 20 Dec 2024

With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), which bypasses real-time retrieval.

RAG, Retrieval

664

1.83 stars / hour

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

igl-hkust/diffusionasshader • 7 Jan 2025

Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images.

Video Generation

286

2.10 stars / hour

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

VITA-MLLM/VITA • 3 Jan 2025

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction.

1,810

1.80 stars / hour

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

ictnlp/llava-mini • 7 Jan 2025

To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of the LLM backbone, where they mainly fuse visual information into text tokens.

Ranked #8 on Zero-Shot Video Question Answer on ActivityNet-QA