https://www.nature.com/articles/s40494-025-01897-3
Abstract
Although evaluation benchmarks for general multimodal large language models (MLLMs) are increasingly prevalent, systematic evaluation of their capabilities for processing ancient texts remains underdeveloped. Ancient books, as cultural heritage artifacts, integrate rich textual and visual elements. Their unique cross-linguistic complexity and multimodal composition make it challenging to evaluate the multifaceted capabilities of MLLMs. To address this issue, we propose BABMLLM (Benchmarking the Ancient Book capabilities of MLLMs), a specialized benchmark designed to evaluate their performance specifically within the domain of ancient books. The benchmark comprises seven curated datasets, enabling comprehensive evaluation across four core tasks relevant to ancient book processing: ancient book translation, text recognition, image captioning, and image-text consistency judgment. Furthermore, BABMLLM provides a standardized reference for evaluating MLLMs in the context of ancient books and establishes a foundation for selecting suitable base models for subsequent domain-specific development.
Journal: npj Heritage Science
- DOI: 10.1038/s40494-025-01897-3