Research
Wenhao Chai
I believe that long-context multimodal modeling is a key path toward making advanced AI (whether AGI or ASI) truly useful for everyone. Today we are closer than ever, but major challenges remain. Broadly, I see two core problems: encoding and decoding.
Encoding. An AI system must be able to perceive and understand long-context multimodal content—for example, an entire day of human activity or the full history of a project. Such contexts naturally mix text, video, images, audio, code, actions, and more. My work on this, this, this, and this pushes models toward deeper video understanding. Yet current systems still struggle with this goal; many even fail on long-context text-only tasks. This raises a central question: are we limited mainly by data, or also by model architectures and training strategies?
A further challenge is the tension between context and weights. A model's understanding of the current context can be distorted by its pretraining. For example, a model trained heavily on PyTorch 1.0 documentation may mishandle PyTorch 2.0 codebases. If an AI system always trusts its pretraining, it becomes less adaptable. If it always follows the given context, it becomes easy to manipulate and unsafe. This leads to an important question:
Can an AI system continuously and selectively update itself from long-context signals at deployment time?
I believe this is a promising and still underexplored direction.
Decoding. An AI system must also be able to generate long-context multimodal content—ideally within a single end-to-end model. Current LLM-based systems are strong at long-form text generation, but for multimodal outputs (e.g., images and video), only a few systems such as Nano Banana have reached practical usability.
This reveals many open questions in today’s dominant paradigms.
Is it enough to use diffusion for visual generation while staying autoregressive for text?
Can we bring multimodal reasoning strategies into visual generation?
Is there a better visual tokenization method than patch-based representations? (A sketch of the patch-based baseline follows this list.)
Can fully end-to-end training outperform diffusion?
And can reinforcement learning be equally powerful for multimodal generation?
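For context, the patch-based representation referenced above is the standard ViT-style tokenization: an image is cut into fixed-size squares and each square becomes one token. A minimal NumPy sketch, purely illustrative (the function name and the 16-pixel patch size are generic defaults, not tied to any specific model discussed here):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an image of shape (H, W, C) into non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch * patch * C); a learned
    linear projection would then map each flattened patch to the model width.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must divide evenly"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

tokens = patchify(np.zeros((224, 224, 3)), patch=16)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each flattened to 768 dims
```

The question above is whether this fixed, uniform gridding is really the right interface for long multimodal contexts, or whether something more adaptive can do better.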
Efficiency is also critical. Modern architectures such as sparse attention, linear attention, and hybrid models are still underexplored, yet are likely essential for scaling.
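As one illustration of why such architectures matter for long contexts, the sketch below contrasts standard attention, whose cost grows quadratically with sequence length, with a kernelized "linear attention" variant that reorders the computation so the cost grows linearly. The feature map and shapes are a generic toy example, not a specific published model:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes an (n x n) score matrix,
    so compute and memory scale quadratically with sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1.0):
    """Kernelized attention: with a positive feature map phi, the context is
    summarized by a (d x d) matrix, so cost is linear in n instead of quadratic."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                      # (d, d) summary of the whole context
    z = Qf @ Kf.sum(axis=0)            # per-query normalizer
    return (Qf @ kv) / z[:, None]

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (4096, 64), computed without ever forming an n x n matrix
```

Sparse and hybrid designs make a similar trade in different ways: they shrink the quadratic term rather than removing it, which is why the right mix for long multimodal contexts is still an open question.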
Benchmarking long-context multimodal models is itself difficult. Even evaluating generated visual content alone is challenging—we still lack reliable alternatives to human preference for video evaluation. My work on this and this aims to improve evaluation for complex multimodal and scientific content.
As of 2025, long-context multimodal modeling has reached a stage where its potential is clear and within reach. However, significant work is still needed to fully realize this vision and to bring powerful, reliable AI into the hands of everyone.
My research is guided by several core principles:
- RIGHT and interesting first, not so-called novelty
- Define the task at least one year ahead
- Work on fundamental things, but not naive ones
- Insight first, then experiments
- Extensible projects first
- Work on general tasks, not specialized
- For benchmarks: HARD, real-world, controlled, or even synthetic
- I read arXiv papers every day
- The fundamental approach in deep learning is to identify pretraining tasks and construct downstream tasks that fit the form of pretraining.
Featured Videos
Video-MMLU [Project Page]
EMMOE [Project Page]
AuroraCap [Project Page]
SAMURAI [Project Page]
Ego3DT [Paper]
STEVE [Project Page]
StableVideo [Hugging Face Demo]
MovieChat [Project Page]
UniAP [Paper]
Invited Talks
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
NeurIPS 2025 Oral, Dec 2025
Better and Longer Video Understanding
Sky9 Fellowship, Oct 2025
2077AI and Abaka AI, Sept 2025
Bitdeer AI, Aug 2025
Towards Universal Animal Perception in Vision
1st Workshop on Imageomics: Discovering Biological Knowledge from Images using AI at AAAI 2024
Vancouver, Canada
Organized Workshops | Tutorials
5th International Workshop on Multimodal Video Agent
CVPR 2025, Nashville, TN
Workshop Organizer (Track 1A and 1B)
4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2024, Seattle, WA
Workshop Organizer (Track 1)