About
Wenhao Chai
Docs (updated on 03/19/2025):
CV
|
Email
Research:
Google Scholar
|
GitHub
|
Hugging Face
Social Media:
Twitter
|
Instagram
|
LinkedIn
|
Zhihu
|
Xiaohongshu
Wenhao Chai is currently a graduate student at the University of Washington, in the Information Processing Lab advised by Prof. Jenq-Neng Hwang. Previously, he was an undergraduate student at Zhejiang University, in the CVNext Lab advised by Prof. Gaoang Wang. He is fortunate to work with Prof. Christopher D. Manning at Stanford University, and has also worked with Prof. Saining Xie and Prof. Yilun Du. He has interned at Pika Labs and Microsoft Research Asia. His research spans a wide range of topics in computer vision and deep learning. He has published papers in top-tier conferences and journals such as ICLR, CVPR, ICCV, ECCV, and AAAI. He has also organized workshops and tutorials at CVPR and AAAI, and served as a reviewer for NeurIPS, ICLR, ICML, CVPR, ICCV, ECCV, AAAI, COLM, AISTATS, and IJCV.
Check Out
News and Highlights
- To junior master's/undergraduate students: if you would like to chat about life, career plans, or research ideas related to AI/ML, feel free to send me a Zoom / Google Meet invitation via email (wchai [at] uw [dot] edu) to schedule a meeting. I dedicate at least 30 minutes every week to such meetings. It would be great if you could share a resume or webpage so I can learn about you before our conversation! I encourage students from underrepresented groups to reach out.
- We are hosting a Discord server for professors and students to share daily arXiv papers and discuss research. Join us.
- 2025 Fall CS Ph.D. application record.
- I am making a schedule for CVPR 2025, from June 11th to June 15th, 2025, at the Music City Center in Nashville, TN. Message me if you'd like to join a road trip, have a coffee chat, or share a meal together.
- 04/2025: We host the CVPR 2025 Video Understanding Challenge @ LOVEU.
- 03/2025: I graduated from the University of Washington with the thesis Large Multimodal Models for Video Captioning.
- 02/2025: Three papers accepted to CVPR 2025.
- 01/2025: Two papers accepted to ICLR 2025.
- 12/2024: Two papers accepted to AAAI 2025.
- 07/2024: Two papers accepted to ECCV 2024.
- 06/2024: One technical report accepted to CVPR 2024 workshop @ NTIRE.
- 06/2024: I joined Pika Labs as an intern to develop next-generation video understanding and generation models.
- 05/2024: One paper accepted to CVPR 2024 workshop @ Embodied AI.
- 04/2024: We host the CVPR 2024 Long-form Video Understanding Challenge @ LOVEU.
- 04/2024: Invited talk at the AgentX seminar about our STEVE series of works.
- 03/2024: One paper accepted to ICLR 2024 workshop @ LLM Agents.
- 02/2024: Two papers accepted to CVPR 2024, with one selected as a highlight (top 2.81%).
- 02/2024: Invited talk at AAAI 2024 workshop @ IMAGEOMICS.
- 12/2023: One paper accepted to AAAI 2024.
- 09/2023: One paper accepted to ICCV 2023 workshop @ TNGCV-DataComp.
- 07/2023: Two papers accepted to ICCV 2023.
Recent
Projects
* Equal contribution. † Project lead. ‡ Corresponding author.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai†, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning
International Conference on Learning Representations (ICLR), 2025
Project Page
|
Paper
|
Video
|
Model
|
Benchmark
|
Leaderboard
|
Poster
|
Code
AuroraCap is a multimodal LLM designed for detailed image and video captioning. We also release VDC, the first benchmark for detailed video captioning.
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang‡
arXiv preprint, 2024
Project Page
|
Paper
|
Video
|
Raw Result
|
Code
SAMURAI is a zero-shot visual tracking framework that adapts the Segment Anything Model (SAM) for tracking with motion-aware memory.
StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Wenhao Chai, Xun Guo‡, Gaoang Wang, Yan Lu
International Conference on Computer Vision (ICCV), 2023
Project Page
|
Paper
|
Video
|
Demo
|
Code
We introduce temporal dependency into existing text-driven diffusion models, which allows them to generate a consistent appearance for the edited objects.
MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
Enxin Song*, Wenhao Chai*†, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang‡
Computer Vision and Pattern Recognition (CVPR), 2024
Project Page
|
Paper
|
Blog
|
Video
|
Dataset
|
Leaderboard
|
Code
MovieChat achieves state-of-the-art performance in extra-long video (more than 10K frames) understanding by introducing a memory mechanism.