Research

Wenhao Chai

I believe that long-context multimodal modeling is the essential path to bringing AI (whether AGI or ASI) to everyone, and we have never been closer to this goal than we are today. Broadly speaking, I see two fundamental challenges we must solve: encoding and decoding.

Encoding. An AI system must be capable of perceiving and understanding long-context multimodal content—for example, an entire day of human activity or the full historical record of a project. Such contexts naturally interleave text, video, images, audio, code, actions, and more. I have contributed to this direction through my work on this, this, this, and this, which pushes AI systems toward deeper video understanding. Yet it is evident that today's models are still far from achieving this capability; many even fail on long-context, text-only tasks. Is this purely a data limitation, or are architectural and training-strategy bottlenecks also to blame?

Another subtle challenge is the conflict between context and weights. A model's interpretation of the current context often conflicts with its pretrained knowledge. For instance, a model pretrained heavily on PyTorch 1.0 documentation may struggle to handle PyTorch 2.0 codebases. If an AI system always follows its pretrained knowledge, it becomes less usable. But if it always obeys the context, it becomes manipulable and unsafe. This raises an important question: can an AI system continually and selectively update itself through long-context signals at deployment time? I believe this is a compelling and underexplored direction.
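The API-drift half of this tension can be made concrete with a toy sketch (the library classes and `call_model` helper below are purely illustrative, not real PyTorch APIs): a caller that rigidly follows the old "pretrained" convention crashes on the new version, while a context-aware caller inspects the interface it is actually given before committing.

```python
import inspect

# Purely illustrative stand-ins for two versions of a library whose API
# drifted, analogous to the PyTorch 1.x -> 2.x example above.
class LibV1:
    def run(self, x):            # old signature: positional argument only
        return x * 2

class LibV2:
    def run(self, x, *, scale):  # new signature: requires a keyword argument
        return x * scale

def call_model(lib, x):
    # A caller that only "remembers" the v1 convention would always invoke
    # lib.run(x) and crash on v2. Instead, check the current interface
    # (the "context") and adapt to it.
    params = inspect.signature(lib.run).parameters
    if "scale" in params:
        return lib.run(x, scale=2)
    return lib.run(x)
```

The analogy is loose, but it captures the point: neither blind trust in stale knowledge nor blind trust in arbitrary new input is safe; the system must reconcile the two.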

Decoding. An AI system must also be able to generate long-context multimodal content—ideally within a single end-to-end model. Current LLM-based systems are impressive at generating long-form text, but for multimodal outputs (e.g., images), only a few systems like Nano Banana have reached practical usability.

I see many open questions in today's dominant paradigms. Is it truly satisfactory to use diffusion for visual generation while remaining autoregressive for text? Why can't we bring multimodal reasoning paradigms into visual generation? Is there a better visual tokenization strategy beyond patch-wise representations? Can end-to-end training outperform diffusion? And can reinforcement learning be equally powerful for multimodal generation?

Beyond algorithmic paradigms, efficiency is impossible to ignore. Modern architectures such as sparse attention, linear attention, and hybrid models remain underexplored, yet they are urgently needed for scaling to long contexts.
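To illustrate why linear attention matters for long contexts, here is a minimal NumPy sketch of the standard reassociation trick (a generic textbook formulation, not any specific paper's method): replacing softmax with a positive feature map phi lets the O(n²) product (phi(Q) phi(K)ᵀ) V be regrouped as phi(Q) (phi(K)ᵀ V), which is linear in sequence length n.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the n x n score matrix makes this O(n^2) in length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear attention: apply a positive feature map phi, then reassociate
    # so the n x n matrix is never formed. KV is (d, d_v), independent of n.
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # O(n * d * d_v)
    Z = Qf @ Kf.sum(axis=0)       # per-query normalizer, shape (n,)
    return (Qf @ KV) / Z[:, None]
```

Both functions map (n, d) queries/keys and (n, d_v) values to an (n, d_v) output; only the cost in n differs, which is exactly what matters for day-long multimodal streams.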

At the same time, benchmarking long-context multimodal modeling is intrinsically difficult. Even evaluating generated visual content alone is challenging—today we still lack reliable alternatives to human preference for assessing video generation. My contributions in this area include this and this.

By 2025, long-context multimodal modeling has reached a point where its potential is tangible. Yet significant effort is still required to truly achieve it—and ultimately bring AI into the hands of everyone.

My research philosophy and approach are guided by several core principles:

  • RIGHT and interesting first, not so-called novelty
  • Define the task at least one year ahead
  • Work on fundamental problems, but not naive ones
  • Insight first, then experiments
  • Extensible projects first
  • Work on general tasks, not specialized ones
  • For benchmarks: HARD, real-world, controlled, or even synthetic
  • I read arXiv papers every day
  • The fundamental approach in deep learning is to identify pretraining tasks and construct downstream tasks that fit the form of pretraining.




Invited Talks

Better and Longer Video Understanding

Sky9 Fellowship, Oct 2025
2077AI and Abaka AI, Sept 2025
Bitdeer AI, Aug 2025

Slides

Towards Universal Animal Perception in Vision

1st Workshop on Imageomics: Discovering Biological Knowledge from Images using AI at AAAI 2024

Vancouver, Canada

Organized Workshops | Tutorials

5th International Workshop on Multimodal Video Agent

CVPR 2025, Nashville, TN

Workshop Organizer (Track 1A and 1B)