Research
Wenhao Chai
Currently, I am interested in several topics, including but not limited to:
- Efficient architectures for long-context modeling in video (both understanding and generation), language, and other modalities, using techniques such as linear attention, state space models (SSMs), RNNs, hybrid models, and sparse attention mechanisms (see the linear-attention sketch after this list). My previous work includes MovieChat at CVPR 2024, AuroraCap and VDC at ICLR 2025, and AuroraLong at ICCV 2025. On the application side of long context, I am also interested in spatial understanding through efficient architectures, novel paradigm design, reinforcement learning approaches, and synthetic data generation for training; my previous work there includes STEVE at ECCV 2024, Dynamic Token Compression at CVPR 2025, and ToSA at IROS 2025. Also check the slides.
- Foundations and applications of generative models, such as denoising diffusion probabilistic models (DDPMs), flow matching (see the flow-matching sketch after this list), and both discrete and continuous variants of diffusion models applied to text generation and scientific domains. Beyond this, I am also interested in visual tokenizers. My ideal tokenizer would be 1D, non-patch-based, semantic, variable-length, and capable of evaluation beyond training, but we are still far from achieving this goal. My previous work includes StableVideo at ICCV 2023, DTPM at CVPR 2024, Science-T2I at CVPR 2025, and DiffPO at ACL 2025. Also check the slides.
- Unified models for multi-modal understanding and generation, in terms of architecture design, training data, and benchmarks. I believe this should be solved by test-time scaling in reasoning unified models. Going one step further, the reasoning process should contain some visual draft output; the model then inspects the draft and fixes or improves it. Alternatively, we can decompose the instruction into several subtasks, and the model generates them one by one, building on each visual draft (see the draft-and-revise sketch after this list). This is very similar to the reasoning process that happens in text. My previous work includes Dream Engine (a predecessor of Qwen-Image), RISE, and An Empirical Study.
- Benchmarking and evaluation, which must be designed to be non-trivial (state-of-the-art models achieve less than 20% accuracy, as on LiveCodeBench Pro), meaningful (grounded in real-world applications), robust (with few annotation errors), and rich in analysis (with expert involvement). This is, without a doubt, a highly non-trivial endeavor. My previous work often introduces new benchmarks and evaluation metrics alongside its technical contributions.
- Fundamental problems in network training, which I have recently become interested in, including matrix-form optimizers (Muon; see the Newton-Schulz sketch after this list), hyperparameter transfer (muP), model merging, recurrent networks, cross-layer KV cache sharing, test-time training, slow-fast weights, and other core training methodologies.
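A minimal sketch of the linear-attention idea from the first bullet, assuming the ELU+1 feature map of Katharopoulos et al. (2020); everything here is my own illustration, not code from any of the papers above. Replacing softmax(QK^T)V with a feature map phi turns attention into running sums over keys and values, so the cost is linear rather than quadratic in sequence length.

```python
import torch

def phi(x):
    # Positive-valued ELU(x) + 1 feature map (an assumption for this sketch).
    return torch.nn.functional.elu(x) + 1.0

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k: (seq_len, d_k); v: (seq_len, d_v).
    q, k = phi(q), phi(k)
    s = torch.zeros(k.shape[-1], v.shape[-1])  # running sum of outer(k_t, v_t)
    z = torch.zeros(k.shape[-1])               # running sum of k_t (normalizer)
    out = []
    for t in range(q.shape[0]):                # recurrent form: fixed-size state per step
        s = s + torch.outer(k[t], v[t])
        z = z + k[t]
        out.append((q[t] @ s) / (q[t] @ z + eps))
    return torch.stack(out)

q, k, v = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 32)
print(causal_linear_attention(q, k, v).shape)  # torch.Size([8, 32])
```

The running-sum state (s, z) is what makes the recurrent/SSM view of these models natural: inference carries a fixed-size state instead of a growing KV cache.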
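A similarly minimal sketch of the flow-matching objective from the second bullet, on 2-D toy data; the tiny MLP, the toy data distribution, and the Euler sampler are all assumptions for illustration. The training target is the constant velocity of the straight-line interpolant between noise and data.

```python
import torch
import torch.nn as nn

# Velocity network v_theta(x, t): input (x, t) in R^3, output velocity in R^2.
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    x1 = torch.randn(256, 2) * 0.1 + 1.0          # toy "data" samples
    x0 = torch.randn(256, 2)                      # noise samples
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                    # straight-line interpolant
    target = x1 - x0                              # its constant velocity
    loss = ((net(torch.cat([xt, t], dim=-1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dx/dt = v_theta(x, t) from noise (t=0) to data (t=1).
x = torch.randn(256, 2)
with torch.no_grad():
    for i in range(100):
        t = torch.full((256, 1), i / 100)
        x = x + 0.01 * net(torch.cat([x, t], dim=-1))
print(x.mean(dim=0))  # drifts toward the toy data mean (~1.0, ~1.0)
```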
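The draft-and-revise loop from the unified-models bullet, as a control-flow skeleton. The function names (generate_draft, critique, refine) are hypothetical placeholders for the generation and understanding heads of a unified model; the stubs only exist to make the loop runnable.

```python
def generate_draft(instruction):
    # Hypothetical: the unified model emits an initial visual draft.
    return {"image": None, "round": 0}

def critique(instruction, draft):
    # Hypothetical: the model inspects its own draft and lists remaining issues.
    return [] if draft["round"] >= 2 else ["placeholder issue"]

def refine(draft, issues):
    # Hypothetical: the model edits the draft conditioned on its critique.
    return {"image": None, "round": draft["round"] + 1}

def draft_and_revise(instruction, max_rounds=4):
    draft = generate_draft(instruction)
    for _ in range(max_rounds):
        issues = critique(instruction, draft)
        if not issues:           # the model accepts its own draft
            break
        draft = refine(draft, issues)
    return draft

print(draft_and_revise("a red cube on a blue sphere"))
```

Decomposing the instruction into subtasks is the same loop with the draft step run once per subtask, each conditioned on the previous visual draft.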
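And a sketch of the Newton-Schulz orthogonalization at the core of matrix-form optimizers such as Muon. This uses the simple cubic iteration, not Muon's tuned quintic coefficients; it pushes the singular values of a matrix update toward 1 without an explicit SVD.

```python
import torch

def newton_schulz_orthogonalize(g, steps=10):
    # Normalize so all singular values are <= 1 (the iteration then converges).
    x = g / (g.norm() + 1e-7)
    for _ in range(steps):
        # Cubic Newton-Schulz step: each singular value sigma -> 1.5*sigma - 0.5*sigma^3.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

g = torch.randn(64, 32)                 # stand-in for a momentum/gradient matrix
o = newton_schulz_orthogonalize(g)
print(torch.linalg.svdvals(o)[:5])      # singular values approach 1
```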
My research philosophy and approach are guided by several core principles:
- RIGHT and interesting first, not so-called novelty
- Work on fundamental things, but not naive ones
- Insight first, then experiments
- If someone else can do it better (compute, expertise), I won't do it
- Extensible projects first
- Work on general tasks, not specialized ones
- (For benchmarks) HARD, real-world, or synthetic
- I read papers every day
Featured Videos

Video-MMLU [Project Page]

EMMOE [Project Page]

AuroraCap [Project Page]

SAMURAI [Project Page]

Ego3DT [Paper]

STEVE [Project Page]

StableVideo [Hugging Face Demo]

MovieChat [Project Page]

UniAP [Paper]
Invited Talks

Towards Universal Animal Perception in Vision
1st Workshop on Imageomics: Discovering Biological Knowledge from Images using AI at AAAI 2024
Vancouver, Canada
Organized Workshops | Tutorials

5th International Workshop on Multimodal Video Agent
CVPR 2025, Nashville, TN
Workshop Organizer (Track 1A and 1B)

4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2024, Seattle, WA
Workshop Organizer (Track 1)