Research

Wenhao Chai

I believe that long-context multimodal modeling is the essential path to bringing AI (whether AGI or ASI) to everyone, and we have never been closer to this goal than we are today. Broadly speaking, I see two fundamental challenges we must solve: encoding and decoding.

Encoding. An AI system must be capable of perceiving and understanding long-context multimodal content—for example, an entire day of human activity or the full historical record of a project. Such contexts naturally interleave text, video, images, audio, code, actions, and more. I have contributed to this direction through my work on this, this, this, and this, which pushes AI systems toward deeper video understanding. Yet it is evident that today's models are still far from achieving this capability; many even fail on long-context, text-only tasks. Is this purely a data limitation, or are architectural and training-strategy bottlenecks also to blame?

Another subtle challenge is the conflict between context and weights. A model's interpretation of the current context often conflicts with its pretrained knowledge. For instance, a model pretrained heavily on PyTorch 1.0 documentation may struggle to handle PyTorch 2.0 codebases. If an AI system always follows its pretrained knowledge, it becomes less usable. But if it always obeys the context, it becomes manipulable and unsafe. This raises an important question: can an AI system continually and selectively update itself through long-context signals at deployment time? I believe this is a compelling and underexplored direction.
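The API-drift half of this tension can be made concrete with a toy sketch (the library classes and `call_model` helper below are purely illustrative, not real PyTorch APIs): a caller that rigidly follows the old "pretrained" convention crashes on the new version, while a context-aware caller inspects the interface it is actually given before committing.

```python
import inspect

# Purely illustrative stand-ins for two versions of a library whose API
# drifted, analogous to the PyTorch 1.x -> 2.x example above.
class LibV1:
    def run(self, x):            # old signature: positional argument only
        return x * 2

class LibV2:
    def run(self, x, *, scale):  # new signature: requires a keyword argument
        return x * scale

def call_model(lib, x):
    # A caller that only "remembers" the v1 convention would always invoke
    # lib.run(x) and crash on v2. Instead, check the current interface
    # (the "context") and adapt to it.
    params = inspect.signature(lib.run).parameters
    if "scale" in params:
        return lib.run(x, scale=2)
    return lib.run(x)
```

The analogy is loose, but it captures the point: neither blind trust in stale knowledge nor blind trust in arbitrary new input is safe; the system must reconcile the two.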

Decoding. An AI system must also be able to generate long-context multimodal content—ideally within a single end-to-end model. Current LLM-based systems are impressive at generating long-form text, but for multimodal outputs (e.g., images), only a few systems like Nano Banana have reached practical usability.

I see many open questions in today's dominant paradigms. Is it truly satisfactory to use diffusion for visual generation while remaining autoregressive for text? Why can't we bring multimodal reasoning paradigms into visual generation? Is there a better visual tokenization strategy beyond patch-wise representations? Can end-to-end training outperform diffusion? And can reinforcement learning be equally powerful for multimodal generation?

Beyond algorithmic paradigms, efficiency is impossible to ignore. Modern architectures such as sparse attention, linear attention, and hybrid models remain underexplored, yet they are urgently needed for scaling to long contexts.
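To illustrate why linear attention matters for long contexts, here is a minimal NumPy sketch of the standard reassociation trick (a generic textbook formulation, not any specific paper's method): replacing softmax with a positive feature map phi lets the O(n²) product (phi(Q) phi(K)ᵀ) V be regrouped as phi(Q) (phi(K)ᵀ V), which is linear in sequence length n.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the n x n score matrix makes this O(n^2) in length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear attention: apply a positive feature map phi, then reassociate
    # so the n x n matrix is never formed. KV is (d, d_v), independent of n.
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # O(n * d * d_v)
    Z = Qf @ Kf.sum(axis=0)       # per-query normalizer, shape (n,)
    return (Qf @ KV) / Z[:, None]
```

Both functions map (n, d) queries/keys and (n, d_v) values to an (n, d_v) output; only the cost in n differs, which is exactly what matters for day-long multimodal streams.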

At the same time, benchmarking long-context multimodal modeling is intrinsically difficult. Even evaluating generated visual content alone is challenging—today we still lack reliable alternatives to human preference for assessing video generation. My contributions in this area include this and this.

By 2025, long-context multimodal modeling has reached a point where its potential is tangible. Yet significant effort is still required to truly achieve it—and ultimately bring AI into the hands of everyone.

My research philosophy and approach are guided by several core principles:

  • RIGHT and interesting first, not so-called novelty
  • Define the task at least one year ahead
  • Work on fundamental problems, but not naive ones
  • Insight first, then experiments
  • Extensible projects first
  • Work on general tasks, not specialized ones
  • For benchmarks: HARD, real-world, controlled, or even synthetic
  • I read arXiv papers every day
  • The fundamental approach in deep learning is to identify pretraining tasks and construct downstream tasks that fit the form of pretraining.




Invited Talks

Better and Longer Video Understanding

Sky9 Fellowship, Oct 2025
2077AI and Abaka AI, Sept 2025
Bitdeer AI, Aug 2025

Slides

Towards Universal Animal Perception in Vision

1st Workshop on Imageomics: Discovering Biological Knowledge from Images using AI at AAAI 2024

Vancouver, Canada

Organized Workshops | Tutorials

5th International Workshop on Multimodal Video Agent

CVPR 2025, Nashville, TN

Workshop Organizer (Track 1A and 1B)