We introduce CulturalPangea-7B, a culturally-aware multilingual multimodal large language model (MLLM) fine-tuned to interpret and reason about long-tail cultural entities from around the world. CulturalPangea-7B is trained on the CulturalGround dataset, which contains 22M open-ended and 8M multiple-choice high-quality, culturally-rich Visual Question Answering (VQA) pairs spanning 42 countries and 39 languages. Evaluated on PangeaBench, ALMBench, and MERLIN, CulturalPangea-7B achieves state-of-the-art performance among open models on culture-focused benchmarks without degrading performance on mainstream vision-language tasks.
CulturalGround is structured around three key stages: cultural entity identification, multilingual question-answer generation, and cultural relevance filtering, described in the dataset section below.
CulturalPangea-7B addresses the common issue of MLLMs misinterpreting long-tail cultural entities by directly grounding the model in diverse cultural knowledge. Starting from the powerful Pangea-7B model, CulturalPangea is further trained on our CulturalGround dataset, achieving state-of-the-art performance among open models on culture-focused benchmarks without degrading performance on mainstream vision-language tasks. Our aim is to facilitate the development of culturally-grounded multilingual MLLMs that can better understand and reason about diverse cultural contexts worldwide.
Grounding multilingual multimodal LLMs in diverse cultural knowledge presents unique challenges. We developed CulturalGround, a large-scale, culturally-rich dataset specifically designed for training culturally-aware models. CulturalGround contains 22 million open-ended and 8 million multiple-choice high-quality Visual Question Answering (VQA) pairs spanning 42 countries and 39 languages. The dataset was curated from Wikidata, focusing on culturally significant entities that are often underrepresented in standard training corpora. Training on this data allows CulturalPangea to reach state-of-the-art performance among open models on culture-focused benchmarks while preserving performance on mainstream vision-language tasks. Figure 2 shows the distribution of CulturalGround across countries and languages. We implemented a scalable pipeline that leverages Wikidata to identify cultural concepts, gather corresponding images from Wikimedia Commons, and automatically generate factually grounded VQA pairs.
CulturalGround was constructed using a scalable pipeline that leverages Wikidata to identify culturally significant entities from 42 countries. We collected 1-3 images per entity from Wikimedia Commons and generated questions based on 76 cultural properties. The dataset comprises over 1.8 million unique entities with nearly 2.9 million associated images. This approach ensures comprehensive coverage of diverse cultural contexts while maintaining factual accuracy through Wikidata's structured knowledge.
CulturalGround provides extensive coverage across countries and languages, with Germany (2.8M), France (2.7M), and the United Kingdom (2.1M) having the largest representation. The dataset is released in unfiltered open-ended (22M) and multiple-choice (8M) splits, along with filtered versions (14.2M OE, 6.6M MCQ) for higher-quality training.
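If the splits are published as separate dataset configurations, selecting one for training could look like the following sketch with the Hugging Face datasets library; the repository ID, configuration name, and field names are illustrative assumptions, not confirmed release identifiers.

```python
# Hypothetical example of loading a CulturalGround split with the `datasets` library.
# The repo ID "neulab/CulturalGround", the config "filtered_oe", and the field names
# below are placeholders, not verified against the actual release.
from datasets import load_dataset

ds = load_dataset("neulab/CulturalGround", name="filtered_oe", split="train")
print(ds[0]["question"], "->", ds[0]["answer"])
```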
Cultural Entity Identification. We systematically identified culturally significant entities from Wikidata for each of the 42 countries, focusing on people, places, organizations, and other cultural artifacts. Entities were selected based on their cultural relevance and availability of high-quality images in Wikimedia Commons. This approach ensures that our dataset captures diverse cultural perspectives while maintaining factual grounding through structured knowledge bases.
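As a rough illustration of this step, the sketch below queries the public Wikidata SPARQL endpoint for entities that have both a country statement (P17) and a Commons image (P18); the country QID, result limit, and property choice here are illustrative, not the exact selection criteria used for CulturalGround.

```python
# Minimal sketch of entity identification via the public Wikidata SPARQL endpoint.
# P17 (country) and P18 (image) are real Wikidata properties; the QID and limit
# below are only examples.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def entities_with_images(country_qid: str, limit: int = 100):
    """Return (entity, image) pairs for items whose country statement matches country_qid."""
    query = f"""
    SELECT ?item ?itemLabel ?image WHERE {{
      ?item wdt:P17 wd:{country_qid} ;   # country
            wdt:P18 ?image .             # Wikimedia Commons image
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "culturalground-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(r["item"]["value"], r["image"]["value"]) for r in rows]

# Example: Q1033 is Nigeria on Wikidata.
# pairs = entities_with_images("Q1033", limit=50)
```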
Multilingual Question-Answer Generation. For each entity-image pair, we generated multiple QA instances using template-based approaches covering 76 different cultural properties. Questions span various aspects including biographical information, cultural significance, historical context, and visual attributes. The multilingual nature is achieved by generating questions and answers in all languages associated with each country, resulting in comprehensive coverage across 39 languages. A single entity can thus carry the same question in different languages, which strengthens cross-lingual understanding.
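A minimal sketch of how template-based multilingual QA generation can work is shown below; the templates, translations, and the two properties used (P571, P149) are made-up examples rather than the actual 76 properties or wording used to build CulturalGround.

```python
# Illustrative sketch of template-based QA generation from Wikidata facts.
# Property IDs are real (P571 = inception, P149 = architectural style), but the
# templates and translations are invented examples.
QUESTION_TEMPLATES = {
    "P571": {  # inception / founding date
        "en": "When was {entity} shown in the image established?",
        "sw": "{entity} inayoonekana kwenye picha ilianzishwa lini?",
    },
    "P149": {  # architectural style
        "en": "What architectural style does {entity} in the image belong to?",
        "de": "Welchem Baustil gehört {entity} auf dem Bild an?",
    },
}

def generate_qa(entity_label: str, prop: str, value: str, languages: list[str]):
    """Instantiate the same question for one (entity, property, value) fact in several languages."""
    qa_pairs = []
    for lang in languages:
        template = QUESTION_TEMPLATES.get(prop, {}).get(lang)
        if template is None:
            continue  # no template for this property/language combination
        qa_pairs.append({
            "language": lang,
            "question": template.format(entity=entity_label),
            "answer": value,  # the answer comes straight from the Wikidata statement
        })
    return qa_pairs

# e.g. generate_qa("Lalibela Church", "P571", "12th century", ["en", "sw"])
```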
Cultural Relevance Filtering. To ensure high-quality training data, we perform filtering including question-answer-image cultural relevance scoring and factual verification. The filtered splits remove low-quality or irrelevant QA pairs, resulting in cleaner subsets suitable for training high-performance models. This multi-stage approach balances dataset scale with quality, providing both comprehensive coverage and reliable training signals.
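The sketch below illustrates the shape of such a filtering pass, assuming each candidate QA pair carries a relevance score and a factual-verification flag; the field names and threshold are illustrative, not the exact criteria used for CulturalGround.

```python
# Sketch of a cultural-relevance / factuality filtering pass; field names and the
# threshold value are assumptions for illustration only.
def filter_qa_pairs(candidates, relevance_threshold: float = 0.5):
    """Keep QA pairs that are culturally relevant to the image and factually verified."""
    kept = []
    for qa in candidates:
        # score in [0, 1]: how well the question/answer relate to the depicted entity
        if qa["relevance_score"] < relevance_threshold:
            continue
        # factual check: answer must match the underlying structured statement
        if not qa["fact_verified"]:
            continue
        kept.append(qa)
    return kept
```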
CulturalGround is used to fine-tune CulturalPangea-7B from the base Pangea-7B model. The training process uses 13 million open-ended and 5 million multiple-choice questions drawn from the filtered splits. During training, only the connector and language model components are fine-tuned while keeping the vision encoder frozen, following established practices for efficient multimodal fine-tuning. The training data from CulturalGround is interleaved with the original Pangea instruction data to maintain general capabilities while enhancing cultural understanding.
CulturalPangea-7B is a fine-tuned version of Pangea-7B, specifically designed to interpret and reason about long-tail cultural entities from around the world. The model is based on the LLaVA-Next architecture with Qwen2-7B-Instruct as the language model backbone and uses a frozen CLIP-ViT-Large vision encoder. CulturalPangea-7B is fine-tuned on the CulturalGround dataset using 18 million VQA pairs (13M open-ended + 5M multiple-choice) from the filtered splits. During training, only the connector and language model components are fine-tuned while keeping the vision encoder frozen, ensuring efficient adaptation to cultural knowledge while preserving visual understanding capabilities.
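A minimal PyTorch-style sketch of this recipe (freeze the vision encoder, train only the connector and language model) is given below; the module names follow common LLaVA-style conventions and are assumptions, not the actual training code.

```python
# Sketch: freeze everything, then re-enable gradients only for the multimodal
# connector and the language model. Module name substrings ("mm_projector",
# "language_model") are assumed LLaVA-style names, not verified identifiers.
import torch

def configure_trainable_params(model):
    for param in model.parameters():
        param.requires_grad = False  # freeze all, including the vision encoder
    for name, param in model.named_parameters():
        if any(key in name for key in ("mm_projector", "language_model")):
            param.requires_grad = True  # unfreeze connector + LLM
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(configure_trainable_params(model), lr=1e-5)
```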
To assess the capabilities of CulturalPangea-7B across cultural understanding, entity recognition, and multilingual multimodal tasks, we conduct comprehensive evaluation on multiple benchmarks. We evaluate on PangeaBench (comprehensive multilingual multimodal evaluation), ALMBench (cultural understanding), and MERLIN (entity recognition). The evaluation demonstrates CulturalPangea's superior performance on culture-focused benchmarks while maintaining strong performance on general multilingual tasks.
We evaluate CulturalPangea-7B against several state-of-the-art open-source baselines, including English-centric models (Llava-Next-7B, Molmo-7B-D, Llama3.2-11B) and multilingual models (PaliGemma-3B, mBLIP-mT0-XL, AyaVision-8B, Pangea-7B). Our evaluation covers multiple benchmarks: PangeaBench for comprehensive multilingual multimodal evaluation, ALMBench for cultural understanding, and MERLIN for entity recognition. We integrate our evaluation tasks into the lmms-eval framework for consistent evaluation across models. Table 1 shows the performance comparison across all evaluated tasks.
In Table 1, CVQA, MARVL, and ALM measure cultural understanding; MERLIN measures entity recognition; MAXM and M3Exam measure multilingual VQA; XM100 measures captioning.

| Model | CVQA | MARVL | ALM | MERLIN | MAXM | M3Exam | XM100 | Avg |
|---|---|---|---|---|---|---|---|---|
| Llava-Next-7B | 40.9 | 50.9 | 42.4 | 34.1 | 21.4 | 28.4 | 15.5 | 33.4 |
| Molmo-7B-D | 58.7 | 54.9 | 49.1 | 42.9 | 37.5 | 39.1 | 6.0 | 41.2 |
| Llama3.2-11B | 69.6 | 58.1 | 56.6 | 49.1 | 43.9 | 36.6 | 5.8 | 45.7 |
| PaliGemma-3B | 42.5 | 52.2 | 35.7 | 13.1 | 19.9 | 25.6 | 0.6 | 27.1 |
| mBLIP-mT0-XL | 37.5 | 66.7 | 36.9 | 15.8 | 36.8 | 25.0 | 6.8 | 32.2 |
| AyaVision-8B | 50.8 | 64.5 | 55.1 | 55.3 | 52.1 | 41.7 | 10.0 | 47.1 |
| Pangea-7B | 56.9 | 78.7 | 59.9 | 66.0 | 53.3 | 42.0 | 29.7 | 55.3 |
| CulturalPangea-7B | 59.1 | 80.3 | 63.5 | 81.1 | 53.9 | 46.7 | 36.9 | 60.3 |
| Δ over Pangea-7B | +2.2 | +1.6 | +3.6 | +15.1 | +0.6 | +4.7 | +7.2 | +5.0 |
Key Results: CulturalPangea-7B demonstrates significant improvements over its base model Pangea-7B across all evaluation benchmarks. Most notably, it achieves a +15.1-point improvement on MERLIN (entity recognition), +7.2 points on XM100 (captioning), and +4.7 points on M3Exam (multilingual VQA). The model maintains strong performance on cultural understanding tasks while substantially improving on entity recognition and multilingual capabilities. CulturalPangea also retains excellent English performance, not only in cultural settings but also on general tasks, as shown in Figure 3. Overall, CulturalPangea-7B achieves the highest average performance (60.3) among all evaluated models, demonstrating the effectiveness of cultural grounding.
| Model | SC | AS | EG | YO | GU | BH | LA | SI | SA | DA | GL | AF | IC | AZ | SH | SK | FI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pangea-7B | 28.3 | 40.5 | 63.6 | 21.5 | 35.1 | 49.7 | 19.2 | 37.0 | 64.4 | 59.9 | 65.3 | 58.8 | 45.3 | 51.1 | 26.2 | 38.4 | 41.7 | 45.0 |
| CulturalPangea-7B | 39.4 | 50.9 | 68.3 | 25.8 | 39.1 | 53.2 | 22.0 | 39.6 | 66.8 | 62.2 | 66.9 | 60.3 | 46.5 | 52.2 | 26.8 | 39.0 | 42.1 | 48.3 |
| Δ Gain | +11.1 | +10.4 | +4.7 | +4.3 | +4.0 | +3.5 | +2.9 | +2.6 | +2.4 | +2.3 | +1.5 | +1.5 | +1.3 | +1.1 | +0.7 | +0.6 | +0.4 | +3.3 |
CulturalPangea demonstrates cross-cultural and cross-lingual transfer on languages that are not in CulturalGround. To analyze transfer behavior across cultures and languages, we compare performance with the baseline on 17 such languages from ALMBench. As shown in Table 2, our model consistently improves over Pangea-7B. This trend suggests that the model is especially adept at transferring knowledge to languages with limited training data, alleviating the typical drop-off seen in low-resource settings.
As additional culturally grounded data is introduced during training, CulturalPangea's performance steadily improves on all culture-sensitive benchmarks (shown in Figure 4), while its general multilingual vision–language proficiency is concurrently preserved and even enhanced, as shown in Figure 5. This outcome indicates that our interleaved training strategy successfully avoided catastrophic forgetting by continuously mixing standard multilingual examples into the cultural fine-tuning process. In essence, this approach parallels replay-based continual learning, wherein revisiting earlier tasks helps maintain broad competence.
CulturalPangea demonstrates substantial gains on ALMBench for languages with limited training data, with improvements most pronounced for resource-poor languages within CulturalGround. As shown in Figure 6, we observe absolute accuracy gains of 15.0 points on Sinhala, 10.9 on Hebrew, and 9.1 on Irish. Additional low-resource languages exhibit consistent improvements: Tamil (+6.3), Amharic (+5.3), Bengali (+4.8), and Telugu (+4.2). These gains occur without sacrificing performance elsewhere: nearly all languages improve, with only Norwegian (-0.5) showing a negligible regression. That the largest improvements come in traditionally underrepresented languages indicates that culturally-aware grounding scales effectively to the long tail of languages, enhancing multilingual inclusivity.
The model's gains vary substantially across cultural domains, with culturally rich categories showing the largest improvements. As shown in Figure 7, Heritage achieves the highest relative improvement at 11.5% (from 64.4% to 71.8%), followed by Media at 10.6% and Food at 9.2%. Other culturally salient domains such as Architecture (7.1%), Economy (7.0%), and Music (6.2%) also demonstrate substantial gains. These results indicate that domains requiring broad cultural knowledge and context benefit most from our approach. In contrast, generic visual domains show minimal improvement or slight regression: Sketch decreases by 0.6%, while Meme improves only marginally at 2.3%. Similarly, Festivals (0.7%) and Religion (2.1%) show limited gains. This pattern confirms that improvements concentrate in areas requiring genuine cultural understanding, while abstract visual content or highly localized traditions remain challenging. The largest accuracy gains occur in well-represented cultural domains, reinforcing the value of targeting cultural knowledge in model training.
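For clarity, the domain numbers above are relative improvements; the short sketch below shows how Heritage's 11.5% follows from the absolute scores reported (64.4% to 71.8%).

```python
# Relative vs. absolute improvement, using the Heritage numbers quoted above.
baseline, improved = 64.4, 71.8
absolute_gain = improved - baseline                      # 7.4 percentage points
relative_gain = (improved - baseline) / baseline * 100   # ~11.5 %
print(f"absolute: {absolute_gain:.1f} pts, relative: {relative_gain:.1f} %")
```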
CulturalGround covers a wide range of long-tail cultural entities across different domains, showcasing culturally-rich visual question-answering pairs spanning 42 countries and 39 languages.
CulturalGround treats long-tail entities as first-class citizens. As shown in the entity connectivity distribution, most entities in the dataset have only a small number of connections, such as very few incoming or outgoing links in the knowledge graph or minimal Wikipedia backlinks. Consequently, median link counts are extremely low, indicating that the typical entity in CulturalGround is sparsely connected. This distribution is highly right-skewed, with a long tail: a handful of entities are linked to many others, but the vast majority are referenced only a few times. Such patterns underscore the dataset's focus on culturally specific, niche entities that lie beyond the well-connected head of popular or globally known concepts. Notably, a substantial portion of the entities included have no dedicated Wikipedia page at all. The prevalence of entries lacking Wikipedia articles further highlights the dataset's extension into culturally important yet under-documented regions of knowledge, reinforcing the long-tail presence.
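As one way to illustrate how such connectivity statistics can be gathered, the sketch below counts Wikipedia sitelinks per entity through the public Wikidata API as a simple connectivity proxy; the actual analysis also considers incoming/outgoing knowledge-graph links and backlinks, which are not reproduced here.

```python
# Rough sketch: count Wikipedia sitelinks per Wikidata entity as a connectivity
# proxy, then summarize the (typically right-skewed) distribution. The QIDs used
# here are arbitrary examples, not entities from CulturalGround.
import statistics
import requests

def sitelink_counts(qids):
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "|".join(qids),   # the API accepts up to 50 IDs per request
            "props": "sitelinks",
            "format": "json",
        },
        headers={"User-Agent": "culturalground-sketch/0.1"},
        timeout=60,
    )
    entities = resp.json()["entities"]
    return {qid: len(e.get("sitelinks", {})) for qid, e in entities.items()}

counts = sitelink_counts(["Q42", "Q5", "Q937"])  # illustrative QIDs only
print("median sitelinks:", statistics.median(counts.values()))
```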
We introduced CulturalPangea, a culturally-aware multilingual multimodal large language model specifically designed to interpret and reason about long-tail cultural entities from around the world. By leveraging CulturalGround, our newly curated dataset containing 22 million open-ended and 8 million multiple-choice culturally-rich VQA pairs spanning 42 countries and 39 languages, we demonstrated significant improvements in cultural understanding and entity recognition. Our comprehensive evaluation using PangeaBench, ALMBench, and MERLIN revealed CulturalPangea's superior performance compared to existing open-source models, achieving state-of-the-art results on culture-focused benchmarks without degrading performance on mainstream vision-language tasks. The model addresses the common issue of MLLMs misinterpreting cultural entities by directly grounding the model in diverse cultural knowledge from Wikidata. We make CulturalPangea and the CulturalGround dataset fully available to facilitate future research in culturally-aware multimodal AI.
The authors would like to thank the Cambrian team for their project webpage template. Cover image and data/model icons co-created with AI.
@misc{nyandwi2025groundingmultilingualmultimodalllms,
title={Grounding Multilingual Multimodal LLMs With Cultural Knowledge},
author={Jean de Dieu Nyandwi and Yueqi Song and Simran Khanuja and Graham Neubig},
year={2025},
eprint={2508.07414},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.07414}
}