Research

Wenhao Chai

Currently, I am interested in several topics, including but not limited to:

  • Efficient architectures for long-sequence modeling in video (both understanding and generation), language, and other modalities, using techniques such as linear attention, state space models (SSMs), RNNs, hybrid models, or sparse attention mechanisms. My previous work includes MovieChat at CVPR 2024, AuroraCap and VDC at ICLR 2025, and LongVidRWKV at CVPRW 2025.
  • Foundations and applications of generative models, such as denoising diffusion probabilistic models (DDPMs), flow matching, and both discrete and continuous variants of diffusion models applied to text generation and scientific domains. My previous work includes StableVideo at ICCV 2023, DTPM at CVPR 2024, and Science-T2I at CVPR 2025.
  • Spatial and video understanding through efficient architectures, novel paradigm design, reinforcement learning approaches, and synthetic data generation for training. My previous work includes STEVE at ECCV 2024 and Dynamic Token Compression at CVPR 2025.
  • Unified models for both multi-modal understanding and generation, covering architecture design, training data, and benchmarks. My previous work includes Dream Engine, RISE, and An Empirical Study.
  • Benchmarking and evaluation, which must be designed to be non-trivial (state-of-the-art models achieve an accuracy of less than 20%), meaningful (relevant to real-world applications), robust (fewer annotation errors), and rich in analysis (with experts involved). This is a highly demanding endeavor. My previous work often introduces new benchmarks and evaluation metrics alongside its technical contributions.

The following presents my research experience and areas of focus, along with a timeline of the periods when I was most actively engaged in each field. The template is from here.


Large Multi-Modal Models (06/2023 - Present)

How to efficiently build and evaluate large multi-modal models?

How to involve large multi-modal models in embodied agent systems?



Generative Models (03/2023 - Present)

How to generate high-quality images, videos, and 3D worlds?

How to control and evaluate the generated content?



Human Pose and Motion (08/2022 - 08/2023)

How to estimate human pose and motion from images and videos?

How to generate realistic and controllable human motion?





Organized

Workshops | Tutorials | Talks