
Open-source codebases, datasets and benchmarks we've released, survey write-ups, and LaTeX templates I've built for my own papers, posters, and slides.

Terminal-Bench: A benchmark for AI agents on ~100 terminal tasks.

Zero-shot visual object tracking with motion-aware memory, built on SAM.

Evaluation suite accelerating the development of large multimodal models.

Large multimodal models for long-form video understanding with a memory mechanism.

Text-driven, consistency-aware diffusion video editing.

A benchmark for visual reasoning that evaluates fundamental visual skills independent of language shortcuts.

A 1,000-example benchmark with rubric-based scoring for evaluating models that generate both images and text.

An open-ended benchmark for challenging computer science problems with objective, fine-grained evaluation.

First benchmark dedicated to Dense Video Understanding, focusing on QA-driven high-frame-rate comprehension.

A benchmark of hard competitive programming problems on which models like o3-high, o4-mini, and Gemini 2.5 Pro score 0%.

A massive benchmark designed to evaluate LMMs in understanding Multi-Discipline Lectures.

A dataset of 1M reasoning samples about causal event relationships, with fine-grained, timestamped descriptions of untrimmed videos.

First benchmark for reasoning-informed visual editing across four reasoning types: temporal, causal, spatial, logical.

Over 20k image pairs for training a language-guided reward model for text-to-image alignment with scientific knowledge.

First benchmark for detailed video captioning: over 1,000 videos with significantly longer captions than in existing datasets, plus training recipes.

Human pose estimation dataset with calibrated radar ADC data, 4D radar tensors, stereo RGB images, and LiDAR.

Manually labeled long-video QA and caption dataset: 1,000 videos, each longer than ten thousand frames.