Research
Currently, I am interested in several topics, including but not limited to:
- Efficient architectures for long-sequence modeling in video (both understanding and generation), language, and other modalities, using techniques such as linear attention, state space models (SSMs), RNNs, hybrid models, and sparse attention mechanisms. My previous work includes MovieChat at CVPR 2024, AuroraCap and VDC at ICLR 2025, and LongVidRWKV at CVPRW 2025.
- Foundations and applications of generative models, such as denoising diffusion probabilistic models (DDPMs), flow matching, and both discrete and continuous variants of diffusion models applied to text generation and scientific domains. My previous work includes StableVideo at ICCV 2023, DTPM at CVPR 2024, and Science-T2I at CVPR 2025.
- Spatial and video understanding through efficient architectures, novel paradigm design, reinforcement learning approaches, and synthetic data generation for training. My previous work includes STEVE at ECCV 2024 and Dynamic Token Compression at CVPR 2025.
- Unified models for multi-modal understanding and generation, in terms of architecture design, training data, and benchmarks. My previous work includes Dream Engine, RISE, and An Empirical Study.
- Benchmarking and evaluation, which must be designed to be non-trivial (state-of-the-art models achieve an accuracy below 20%), meaningful (grounded in real-world applications), robust (with fewer annotation errors), and rich in analysis (with expert involvement). My previous work often introduces new benchmarks and evaluation metrics alongside the technical contributions.
The following presents my research experience and areas of focus, along with a timeline highlighting the periods when I was most actively engaged in each field. The template is from here.
How to efficiently build and evaluate large multi-modal models?
How to integrate large multi-modal models into embodied agent systems?
Video Understanding
- Long Video with Memory MovieChat
- Gated Memory MovieChat+
- Video Detailed Captioning AuroraCap
- Transfer from Image to Video MTransLLAMA
- + RWKV LongVidRWKV
- Lecture Benchmark Video-MMLU
- Masked Prediction TEMPURA
How to generate high-quality images, videos and 3D worlds?
How to control and evaluate the generated content?
Image
- Style Transfer in Fashion Diffashion
- Restoration with Diffusion Prior DTPM
- + Reinforcement Learning VersaT2I
- Science Benchmark Science-T2I
- + LMM Dream Engine
Video
- Video Editing with Layered Representation StableVideo
How to estimate human pose and motion from images and videos?
How to generate realistic and controllable human motion?
Featured
Videos

Video-MMLU [Project Page]

EMMOE [Project Page]

AuroraCap [Project Page]

SAMURAI [Project Page]

Ego3DT [Paper]

STEVE [Project Page]

StableVideo [Hugging Face Demo]

MovieChat [Project Page]

UniAP [Paper]
Organized
Workshops | Tutorials | Talks

5th International Workshop on Multimodal Video Agent
CVPR 2025, Nashville, TN
Workshop Organizer (Track 1A and 1B)

4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2024, Seattle, WA
Workshop Organizer (Track 1)

1st Workshop on Imageomics: Discovering Biological Knowledge from Images using AI
AAAI 2024, Vancouver, Canada
Invited Talk