Video-MMLU


A Massive Multi-Discipline Lecture Understanding Benchmark


Wenhao Chai

Princeton University

Author 2

Princeton University

Author 3

Princeton University


Author 4

Princeton University

Video-MMLU pushes LMMs to their limits: can models really understand real-world lectures?


FAQs

What is Video-MMLU?

Video-MMLU is a benchmark designed to evaluate large multimodal models on their ability to understand and reason about real-world lecture videos across multiple domains and disciplines.

How does Video-MMLU differ from existing benchmarks?

Unlike text-only or image-only benchmarks, Video-MMLU specifically tests comprehension of educational video content, requiring models to integrate visual, auditory, and temporal information to answer challenging questions.

How can I access the benchmark?

The benchmark dataset and evaluation code are available through our GitHub repository. Researchers can use them to test their models and compare results against existing baselines.
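As an illustration of how comparing model answers against baselines typically works, here is a minimal scoring sketch. All names and data below are hypothetical; the actual Video-MMLU evaluation protocol is defined in the GitHub repository and may differ.

```python
# Hypothetical sketch of a benchmark scoring loop, not the actual
# Video-MMLU evaluation code.

def exact_match_accuracy(predictions, references):
    """Fraction of model answers that match the gold answers,
    ignoring case and surrounding whitespace."""
    if not references:
        return 0.0
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Toy example with made-up QA pairs (not drawn from the benchmark):
preds = ["mitosis", "Newton's second law", "photosynthesis"]
golds = ["Mitosis", "newton's second law", "respiration"]
print(f"Accuracy: {exact_match_accuracy(preds, golds):.2f}")
```

Real lecture-understanding evaluation usually goes beyond exact match (e.g., judging free-form answers), but the overall pattern of pairing predictions with references and aggregating a score is the same.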
