Research
Currently, I am interested in several topics, including but not limited to:
- Foundations and applications of generative models, such as denoising diffusion probabilistic models (DDPMs), flow matching, and both discrete and continuous variants of diffusion models applied to text generation and scientific domains. My previous work includes StableVideo at ICCV 2023, DTPM at CVPR 2024, Science-T2I at CVPR 2025, and DiffPO at ACL 2025.
- Efficient architectures for long-context modeling in video (both understanding and generation), language, and other modalities, using techniques such as linear attention, state space models (SSMs), RNNs, hybrid models, or sparse attention mechanisms. My previous work includes MovieChat at CVPR 2024, AuroraCap and VDC at ICLR 2025, and AuroraLong at ICCV 2025.
- Spatial and video understanding through efficient architectures, novel paradigm design, reinforcement learning approaches, and synthetic data generation for training. My previous work includes STEVE at ECCV 2024, Dynamic Token Compression at CVPR 2025, and ToSA at IROS 2025.
- Unified models for both multi-modal understanding and generation in terms of architecture design, training data, and benchmarks. My previous work, including Dream Engine, RISE, and An Empirical Study, is also available.
- Benchmarking and evaluation, which must be designed to be non-trivial (state-of-the-art models achieve an accuracy of less than 20%; for example, LiveCodeBench Pro), meaningful (relevant to real-world applications), robust (fewer annotation errors), and rich in analysis (with expert involvement). This is, without a doubt, a highly non-trivial endeavor. My previous work often introduces new benchmarks and evaluation metrics along with the technical contributions.
Featured
Videos

Video-MMLU [Project Page]

EMMOE [Project Page]

AuroraCap [Project Page]

SAMURAI [Project Page]

Ego3DT [Paper]

STEVE [Project Page]

StableVideo [Hugging Face Demo]

MovieChat [Project Page]

UniAP [Paper]
Organized
Workshops | Tutorials | Talks

5th International Workshop on Multimodal Video Agent
CVPR 2025, Nashville, TN
Workshop Organizer (Track 1A and 1B)

4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2024, Seattle, WA
Workshop Organizer (Track 1)

1st Workshop on Imageomics: Discovering Biological Knowledge from Images using AI
AAAI 2024, Vancouver, Canada
Invited Talk