Research
Wenhao Chai
I believe that long-context multimodal modeling is the essential path to bringing AI (whether AGI or ASI) to everyone, and we have never been closer to this goal than we are today. Broadly speaking, I see two fundamental challenges we must solve: encoding and decoding.
Encoding. An AI system must be capable of perceiving and understanding long-context multimodal content—for example, an entire day of human activity or the full historical record of a project. Such contexts naturally interleave text, video, images, audio, code, actions, and more. I have contributed to this direction through my work on this, this, this, and this, which pushes AI systems toward deeper video understanding. Yet it is evident that today's models are still far from achieving this capability; many even fail on long-context, text-only tasks. Is this purely a data limitation, or are architectural and training-strategy bottlenecks also to blame?
Another subtle challenge is the conflict between context and weights. A model's interpretation of the current context is often overridden by its pretraining knowledge. For instance, a model pretrained heavily on PyTorch 1.0 documentation may struggle to handle PyTorch 2.0 codebases. If an AI system always follows its pretrained knowledge, it becomes less useful. But if it always obeys the context, it becomes manipulable and unsafe. This raises an important question:
Can an AI system continually and selectively update itself through long-context signals at deployment time?
I believe this is a compelling and underexplored direction.
Decoding. An AI system must also be able to generate long-context multimodal content—ideally within a single end-to-end model. Current LLM-based systems are impressive at generating long-form text, but for multimodal outputs (e.g., images), only a few systems like Nano Banana have reached practical usability.
I see many open questions in today's dominant paradigms.
Is it truly satisfactory to use diffusion for visual generation while remaining autoregressive for text?
Why can't we bring multimodal reasoning paradigms into visual generation?
Is there a better visual tokenization strategy beyond patch-wise representations?
Can end-to-end training outperform diffusion?
And can reinforcement learning be equally powerful for multimodal generation?
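To make the tokenization question concrete, here is a minimal sketch of the standard patch-wise visual tokenization (ViT-style) that the question above takes as its baseline. The image size, patch size, and flattening scheme are illustrative assumptions, not any specific model's implementation.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

tokens = patchify(np.zeros((224, 224, 3)), patch=16)
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

Every patch becomes one token regardless of its information content, which is exactly why alternatives to uniform patch-wise representations are worth asking about.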
Beyond algorithmic paradigms, efficiency is impossible to ignore. Modern architectures such as sparse attention, linear attention, and hybrid models remain underexplored for long-context multimodal modeling, yet are urgently needed for scaling.
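As a hedged sketch of the linear-attention idea mentioned above: instead of softmax(QK^T)V, which is quadratic in sequence length, one computes phi(Q)(phi(K)^T V), which is linear in length. The elu(x)+1 feature map and the tensor shapes here are illustrative choices, not a specific published kernel.

```python
import numpy as np

def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q, k: (T, d); v: (T, d_v). Cost is O(T) in sequence length."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                   # (d, d_v): fixed size, independent of T
    z = q @ k.sum(axis=0)          # (T,) positive normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (8, 4)
```

The key design point is that the (d, d_v) state `kv` can be updated token by token, which is what makes such kernels attractive for very long contexts.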
At the same time, benchmarking long-context multimodal modeling is intrinsically difficult. Even evaluating generated visual content alone is challenging—today we still lack reliable alternatives to human preference for assessing video generation. My contributions in this area include this and this.
By 2025, long-context multimodal modeling has reached a point where its potential is tangible and within sight. Yet significant effort is still required to truly achieve it—and ultimately bring AI into the hands of everyone.
My research philosophy and approach are guided by several core principles:
- RIGHT and interesting first, not so-called novelty
- Define the task at least one year ahead
- Work on fundamental things, but not naive
- Insight first, then experiments
- Extensible projects first
- Work on general tasks, not specialized
- For benchmarks: HARD, real-world, controlled, or even synthetic
- I read arXiv papers every day
- The fundamental approach in deep learning is to identify pretraining tasks and construct downstream tasks that fit the form of pretraining
Featured Videos
Video-MMLU [Project Page]
EMMOE [Project Page]
AuroraCap [Project Page]
SAMURAI [Project Page]
Ego3DT [Paper]
STEVE [Project Page]
StableVideo [Hugging Face Demo]
MovieChat [Project Page]
UniAP [Paper]
Invited Talks
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
NeurIPS 2025 Oral, Dec 2025
Better and Longer Video Understanding
Sky9 Fellowship, Oct 2025
2077AI and Abaka AI, Sept 2025
Bitdeer AI, Aug 2025
Towards Universal Animal Perception in Vision
1st Workshop on Imageomics: Discovering Biological Knowledge from Images using AI at AAAI 2024
Vancouver, Canada
Organized Workshops | Tutorials
5th International Workshop on Multimodal Video Agent
CVPR 2025, Nashville, TN
Workshop Organizer (Track 1A and 1B)
4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2024, Seattle, WA
Workshop Organizer (Track 1)