CulturalGround

Grounding Multilingual Multimodal LLMs
With Cultural Knowledge

CulturalGround: 22M open-ended and 8M multiple-choice culturally-rich VQA pairs spanning 42 countries and 39 languages, curated from Wikidata.
CulturalPangea-7B: a culturally-aware multilingual multimodal LLM fine-tuned from Pangea-7B on CulturalGround.
Comprehensive Evaluation: evaluated on PangeaBench, ALMBench, and MERLIN for cultural understanding and multilingual capabilities.

Correspondence to: {jeandedi,gneubig}@cs.cmu.edu

We introduce CulturalPangea-7B, a culturally-aware multilingual multimodal large language model (MLLM) fine-tuned to interpret and reason about long-tail cultural entities from around the world. CulturalPangea-7B is trained on the CulturalGround dataset, which contains 22M open-ended and 8M multiple-choice high-quality, culturally-rich Visual Question Answering (VQA) pairs spanning 42 countries and 39 languages. Evaluated on PangeaBench, ALMBench, and MERLIN, CulturalPangea-7B achieves state-of-the-art performance among open models on culture-focused benchmarks without degrading performance on mainstream vision-language tasks.

teaser
Figure 1: Our data curation pipeline involves gathering culturally relevant entities from the Wikidata knowledge base, creating several questions and answers about each entity, rephrasing them using an LLM, and filtering low-quality samples using a VLM.

CulturalGround is structured around three key aspects:

  1. §CulturalGround Dataset: We present CulturalGround, a large-scale dataset with over 22M open-ended and 8M multiple-choice culturally-rich VQA pairs spanning 42 countries and 39 languages, meticulously curated from Wikidata. Figure 2 shows the data distribution of CulturalGround.
  2. §CulturalPangea-7B: a culturally-aware multilingual multimodal LLM fine-tuned from Pangea-7B to interpret long-tail cultural entities.
  3. §Comprehensive Evaluation: We evaluate CulturalPangea-7B on PangeaBench, ALMBench, and MERLIN, demonstrating superior performance on culture-focused benchmarks.

train data distribution
Figure 2: The distribution of samples across languages, highlighting the percentage of each language in CulturalGround.

CulturalPangea-7B addresses the common issue of MLLMs misinterpreting long-tail cultural entities by directly grounding the model in diverse cultural knowledge. Starting from the powerful Pangea-7B model, CulturalPangea is further trained on our CulturalGround dataset, achieving state-of-the-art performance among open models on culture-focused benchmarks without degrading performance on mainstream vision-language tasks. Our aim is to facilitate the development of culturally-grounded multilingual MLLMs that can better understand and reason about diverse cultural contexts worldwide.



CulturalGround Dataset

Grounding multilingual multimodal LLMs in diverse cultural knowledge presents unique challenges. We developed CulturalGround, a large-scale, culturally-rich dataset specifically designed for training culturally-aware models. CulturalGround contains 22 million open-ended and 8 million multiple-choice high-quality Visual Question Answering (VQA) pairs spanning 42 countries and 39 languages. The dataset was meticulously curated from Wikidata, focusing on culturally significant entities often underrepresented in standard training corpora. By training on this data, CulturalPangea achieves state-of-the-art performance among open models on culture-focused benchmarks without degrading performance on mainstream vision-language tasks. Figure 2 shows the distribution of CulturalGround across countries and languages. We implemented a scalable pipeline that leverages Wikidata to identify cultural concepts, gather corresponding images from Wikimedia Commons, and automatically generate factually grounded VQA pairs.

Dataset Construction Pipeline

CulturalGround was constructed using a scalable pipeline that leverages Wikidata to identify culturally significant entities from 42 countries. We collected 1-3 images per entity from Wikimedia Commons and generated questions based on 76 cultural properties. The dataset comprises over 1.8 million unique entities with nearly 2.9 million associated images. This approach ensures comprehensive coverage of diverse cultural contexts while maintaining factual accuracy through Wikidata's structured knowledge.

Data Distribution and Quality

CulturalGround provides extensive coverage across countries and languages, with Germany (2.8M), France (2.7M), and the United Kingdom (2.1M) having the largest representation. The dataset includes three main splits: Open-Ended VQA (22M unfiltered), Multiple-Choice Questions (8M unfiltered), and filtered versions (14.2M OE, 6.6M MCQ) for higher quality training.
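For readers who want to inspect these splits programmatically, the minimal sketch below loads them with the Hugging Face `datasets` library; the repository id, configuration names, and field names are placeholders, not the released identifiers.

```python
from datasets import load_dataset

# Placeholder repository id and configuration names; substitute the released ones.
open_ended = load_dataset("your-org/CulturalGround", "open_ended_filtered", split="train")
mcq = load_dataset("your-org/CulturalGround", "mcq_filtered", split="train")

# Fields such as image, question, answer, language, and country are assumed here.
print(open_ended[0])
```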

Cultural Entity Identification. We systematically identified culturally significant entities from Wikidata for each of the 42 countries, focusing on people, places, organizations, and other cultural artifacts. Entities were selected based on their cultural relevance and availability of high-quality images in Wikimedia Commons. This approach ensures that our dataset captures diverse cultural perspectives while maintaining factual grounding through structured knowledge bases.
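As an illustration of this step, the sketch below queries the public Wikidata SPARQL endpoint for entities tied to one country that carry a Commons image; the actual selection uses 76 cultural properties and additional relevance criteria, so the query shown is only a simplified assumption.

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: entities whose country (P17) is Japan (Q17), that have an
# image (P18) and a Japanese label. The real selection criteria are much richer.
QUERY = """
SELECT ?entity ?entityLabel ?image WHERE {
  ?entity wdt:P17 wd:Q17 ;
          wdt:P18 ?image ;
          rdfs:label ?entityLabel .
  FILTER(LANG(?entityLabel) = "ja")
}
LIMIT 100
"""

def fetch_entities(query: str):
    """Run a SPARQL query against Wikidata and return the JSON result bindings."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "CulturalGround-sketch/0.1 (research demo)"},
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

for row in fetch_entities(QUERY)[:5]:
    print(row["entityLabel"]["value"], row["image"]["value"])
```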

Multilingual Question-Answer Generation. For each entity-image pair, we generated multiple QA instances using template-based approaches covering 76 different cultural properties. Questions span various aspects, including biographical information, cultural significance, historical context, and visual attributes. The multilingual nature is achieved by generating questions and answers in all languages associated with each country, resulting in comprehensive coverage across 39 languages. An entity can thus have the same question asked in different languages, enhancing cross-lingual understanding.
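The template expansion can be pictured with the toy sketch below; the property ids are real Wikidata properties, but the templates, languages, and entity record are invented for illustration and are far simpler than the released templates.

```python
# Toy template table: Wikidata property id -> language -> (question, answer) templates.
QA_TEMPLATES = {
    "P140": {  # religion or worldview
        "en": ("Which religion is associated with this entity?",
               "{label} is associated with {value}."),
        "es": ("¿Con qué religión está asociada esta entidad?",
               "{label} está asociada con {value}."),
    },
    "P2596": {  # culture
        "en": ("Which culture is this entity associated with?",
               "{label} is associated with the {value}."),
    },
}

def generate_qa(entity: dict, languages: list[str]):
    """Yield (language, question, answer) triples for every property/language
    combination that has both a template and a value for this entity."""
    for prop, values in entity["claims"].items():
        for lang in languages:
            template = QA_TEMPLATES.get(prop, {}).get(lang)
            if template is None or lang not in values:
                continue
            question, answer_tmpl = template
            yield lang, question, answer_tmpl.format(
                label=entity["labels"][lang], value=values[lang])

entity = {
    "labels": {"en": "Christ Church", "es": "Christ Church"},
    "claims": {"P140": {"en": "Anglicanism", "es": "el anglicanismo"}},
}
for lang, q, a in generate_qa(entity, ["en", "es"]):
    print(lang, "|", q, "|", a)
```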

Cultural Relevance Filtering. To ensure high-quality training data, we perform filtering including question-answer-image cultural relevance scoring and factual verification. The filtered splits remove low-quality or irrelevant QA pairs, resulting in cleaner subsets suitable for training high-performance models. This multi-stage approach balances dataset scale with quality, providing both comprehensive coverage and reliable training signals.
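A minimal version of the relevance filter is sketched below; the prompt wording, the yes/no parsing, and the idea of passing the judge in as a callable are all assumptions about how such a VLM-based binary filter could be wired up, not the exact released setup.

```python
from typing import Callable

YES_TOKENS = {"yes", "y", "true", "relevant"}

def build_filter_prompt(question: str, answer: str) -> str:
    """Binary relevance prompt; the exact wording used in the pipeline is not specified here."""
    return (
        "You are shown an image together with a question and its answer.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Does the image meaningfully support answering this question? Reply yes or no."
    )

def keep_sample(image_path: str, question: str, answer: str,
                vlm_judge: Callable[[str, str], str]) -> bool:
    """Keep the (image, question, answer) triple only if the VLM judge answers yes.

    vlm_judge(image_path, prompt) is any function that runs a vision-language
    model on the image and prompt and returns its text reply."""
    reply = vlm_judge(image_path, build_filter_prompt(question, answer))
    words = reply.strip().lower().split()
    return bool(words) and words[0] in YES_TOKENS
```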

Training Data

CulturalGround is used to fine-tune CulturalPangea-7B from the base Pangea-7B model. The training process uses 13 million open-ended and 5 million multiple-choice questions from the filtered splits. During training, only the connector and language model components are fine-tuned while the vision encoder is kept frozen, following established practices for efficient multimodal fine-tuning. The training data from CulturalGround is interleaved with the original Pangea instruction data to maintain general capabilities while enhancing cultural understanding.
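The interleaving itself can be as simple as a weighted random mix of the two sources, as in the sketch below; the 70/30 ratio and the generator-based streaming are illustrative assumptions, since the exact mixing schedule is not specified here.

```python
import random

def interleave(cultural_samples, pangea_samples, cultural_ratio=0.7, seed=0):
    """Stream a random mix of CulturalGround samples and Pangea instruction data.

    The cultural_ratio of 0.7 is an illustrative assumption; the mix stops as
    soon as either source is exhausted.
    """
    rng = random.Random(seed)
    cultural, pangea = iter(cultural_samples), iter(pangea_samples)
    while True:
        source = cultural if rng.random() < cultural_ratio else pangea
        try:
            yield next(source)
        except StopIteration:
            return
```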

Culturally Grounded Multimodal Dataset Curation Algorithm
Inputs: Knowledge base 𝒦, regions R, languages L, cultural properties P
1. Cultural Entity Selection: Extract entities E′ ⊆ E connected to target regions via culturally meaningful properties, ensuring multilingual label coverage
2. Image Collection: Gather an image set ℐ_e for each entity e from its Wikidata P18 property and Wikimedia Commons categories
3. Template-Based QA Generation: Create factual question-answer pairs (q^(l)_{e,p}, a^(l)_{e,p}) using language-specific templates for each entity-property-language combination
4. LLM Refinement: Improve fluency and cultural naturalness of QA pairs using large language models while preserving factual accuracy and removing entity-name leakage
5. Image-Text Relevance Filtering: Apply VLM-based binary filtering to ensure meaningful alignment between images and refined QA pairs
Output: Dataset D of (image, question, answer) triplets (i_e, q′^(l)_{e,p}, a′^(l)_{e,p}) spanning 39 languages, 42 regions, ~22M samples
Data Curation Summary: Our data curation pipeline transforms structured knowledge from Wikidata into culturally grounded multimodal training data through systematic entity selection, image collection, template-based QA generation, LLM refinement, and relevance filtering.
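To make the flow concrete, a minimal driver composing the five stages is sketched below; every callable, field name, and the control flow itself are illustrative assumptions, and the real pipeline runs at far larger scale with batching, caching, and error handling.

```python
def build_cultural_ground(regions, languages, properties,
                          fetch_entities, collect_images, generate_qa,
                          refine_with_llm, keep_sample):
    """Compose the five curation stages into (image, question, answer) records.

    Each stage is passed in as a callable (for example, variants of the sketches
    above); the signatures and record fields here are illustrative only.
    """
    dataset = []
    for region in regions:
        for entity in fetch_entities(region, properties):           # 1. entity selection
            for image in collect_images(entity):                     # 2. image collection
                for lang, q, a in generate_qa(entity, languages):    # 3. template-based QA
                    q, a = refine_with_llm(q, a, lang)                # 4. LLM refinement
                    if keep_sample(image, q, a):                      # 5. relevance filtering
                        dataset.append({"image": image, "question": q, "answer": a,
                                        "language": lang, "country": region})
    return dataset
```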

CulturalPangea-7B

CulturalPangea-7B is a fine-tuned version of Pangea-7B, specifically designed to interpret and reason about long-tail cultural entities from around the world. The model is based on the LLaVA-Next architecture with Qwen2-7B-Instruct as the language model backbone and uses a frozen CLIP-ViT-Large vision encoder. CulturalPangea-7B is fine-tuned on the CulturalGround dataset using 18 million VQA pairs (13M open-ended + 5M multiple-choice) from the filtered splits. During training, only the connector and language model components are fine-tuned while keeping the vision encoder frozen, ensuring efficient adaptation to cultural knowledge while preserving visual understanding capabilities.
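A rough sketch of this freezing scheme, using the LLaVA-NeXT classes in transformers, is shown below; the hub id and the exact module names are assumptions about how the checkpoint is packaged, and the full training loop (optimizer, data collator, interleaved sampler) is omitted.

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Placeholder hub id; the released Pangea-7B checkpoint may use a different id
# or require a converted LLaVA-NeXT-format repository.
model = LlavaNextForConditionalGeneration.from_pretrained(
    "neulab/Pangea-7B-hf", torch_dtype=torch.bfloat16)

# Freeze the vision encoder; leave the connector and language model trainable.
for p in model.vision_tower.parameters():
    p.requires_grad = False
for p in model.multi_modal_projector.parameters():
    p.requires_grad = True
for p in model.language_model.parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e9:.2f}B")
```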

Comprehensive Evaluation

To assess the capabilities of CulturalPangea-7B across cultural understanding, entity recognition, and multilingual multimodal tasks, we conduct comprehensive evaluation on multiple benchmarks. We evaluate on PangeaBench (comprehensive multilingual multimodal evaluation), ALMBench (cultural understanding), and MERLIN (entity recognition). The evaluation demonstrates CulturalPangea's superior performance on culture-focused benchmarks while maintaining strong performance on general multilingual tasks.

Evaluation Results

We evaluate CulturalPangea-7B against several state-of-the-art open-source baselines, including English-centric models (Llava-Next-7B, Molmo-7B-D, Llama3.2-11B) and multilingual models (PaliGemma-3B, mBLIP-mT0-XL, AyaVision-8B, Pangea-7B). Our evaluation covers multiple benchmarks: PangeaBench for comprehensive multilingual multimodal evaluation, ALMBench for cultural understanding, and MERLIN for entity recognition. We integrate our evaluation tasks into the lmms-eval framework for consistent evaluation across models. Table 1 shows the comprehensive performance comparison across all evaluated tasks.

| Model | CVQA | MARVL | ALM | MERLIN | MAXM | M3Exam | XM100 | Avg |
|---|---|---|---|---|---|---|---|---|
| Llava-Next-7B | 40.9 | 50.9 | 42.4 | 34.1 | 21.4 | 28.4 | 15.5 | 33.4 |
| Molmo-7B-D | 58.7 | 54.9 | 49.1 | 42.9 | 37.5 | 39.1 | 6.0 | 41.2 |
| Llama3.2-11B | 69.6 | 58.1 | 56.6 | 49.1 | 43.9 | 36.6 | 5.8 | 45.7 |
| PaliGemma-3B | 42.5 | 52.2 | 35.7 | 13.1 | 19.9 | 25.6 | 0.6 | 27.1 |
| mBLIP-mT0-XL | 37.5 | 66.7 | 36.9 | 15.8 | 36.8 | 25.0 | 6.8 | 32.2 |
| AyaVision-8B | 50.8 | 64.5 | 55.1 | 55.3 | 52.1 | 41.7 | 10.0 | 47.1 |
| Pangea-7B | 56.9 | 78.7 | 59.9 | 66.0 | 53.3 | 42.0 | 29.7 | 55.3 |
| CulturalPangea-7B | 59.1 | 80.3 | 63.5 | 81.1 | 53.9 | 46.7 | 36.9 | 60.3 |
| Δ over Pangea-7B | +2.2 | +1.6 | +3.6 | +15.1 | +0.6 | +4.7 | +7.2 | +5.0 |
Table 1: Performance comparison across models on cultural understanding (CVQA, MARVL, ALM), entity recognition (MERLIN), multilingual VQA (MAXM, M3Exam), and captioning (XM100). The best-performing model on each dataset is in bold and the second best is underlined.

Key Results: CulturalPangea-7B demonstrates significant improvements over its base model Pangea-7B across all evaluation benchmarks. Most notably, it achieves a +15.1 point improvement on MERLIN (entity recognition), +7.2 points on XM100 (captioning), and +4.7 points on M3Exam (multilingual VQA). The model maintains strong performance on cultural understanding tasks while substantially improving on entity recognition and multilingual capabilities. CulturalPangea maintains excellent English performance not only in cultural settings but also on general tasks, as shown in Figure 3. Overall, CulturalPangea-7B achieves the highest average performance (60.3) among all evaluated models, demonstrating the effectiveness of cultural grounding.

english vs multilingual
Figure 3: Overall performance comparison on English and multilingual understanding.

Discussion

Cross-Cultural and Cross-Lingual Transfer

| Model | SC | AS | EG | YO | GU | BH | LA | SI | SA | DA | GL | AF | IC | AZ | SH | SK | FI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pangea-7B | 28.3 | 40.5 | 63.6 | 21.5 | 35.1 | 49.7 | 19.2 | 37.0 | 64.4 | 59.9 | 65.3 | 58.8 | 45.3 | 51.1 | 26.2 | 38.4 | 41.7 | 45.0 |
| CulturalPangea-7B | 39.4 | 50.9 | 68.3 | 25.8 | 39.1 | 53.2 | 22.0 | 39.6 | 66.8 | 62.2 | 66.9 | 60.3 | 46.5 | 52.2 | 26.8 | 39.0 | 42.1 | 48.3 |
| Δ Gain | +11.1 | +10.4 | +4.7 | +4.3 | +4.0 | +3.5 | +2.9 | +2.6 | +2.4 | +2.3 | +1.5 | +1.5 | +1.3 | +1.1 | +0.7 | +0.6 | +0.4 | +3.3 |
Table 2: Cross-lingual performance on ALM-Bench. Language codes: SC = Scots Gaelic, AS = Assamese, EG = Egyptian Arabic, YO = Yoruba, GU = Gujarati, BH = Bhojpuri, LA = Lao, SI = Sindhi, SA = Saudi Arabic, DA = Danish, GL = Galician, AF = Afrikaans, IC = Icelandic, AZ = Azerbaijani, SH = Shona, SK = Sanskrit, FI = Filipino.

CulturalPangea demonstrates cross-cultural and cross-lingual transfer to languages that are not in CulturalGround. To analyze transfer behavior across cultures and languages, we compare performance with the baseline on 17 languages from ALMBench. As shown in Table 2, our model consistently improves over Pangea-7B. This trend suggests that the model is especially adept at transferring knowledge to languages with limited training data, alleviating the typical drop-off seen in low-resource settings.

Cultural Data Scaling and General Skill Preservation

As additional culturally grounded data is introduced during training, CulturalPangea's performance steadily improves on all culture-sensitive benchmarks (shown in Figure 4), while its general multilingual vision–language proficiency is concurrently preserved and even enhanced, as shown in Figure 5. This outcome indicates that our interleaved training strategy successfully avoided catastrophic forgetting by continuously mixing standard multilingual examples into the cultural fine-tuning process. In essence, this approach parallels replay-based continual learning, wherein revisiting earlier tasks helps maintain broad competence.

cultural_progress
Figure 4: Training performance curves on four culture-centric benchmarks (CVQA, MaRVL, ALM-Bench, and MERLIN) show accuracy steadily rising as CulturalPangea is trained on more culturally grounded data, relative to the baseline model. Higher training step counts, i.e., greater exposure to CulturalGround, consistently translate into improved accuracy.
multilingual_progress
Figure 5: Training performance curves on three general multilingual benchmarks and their overall average show accuracy improvements as CulturalPangea is exposed to more data, compared to Pangea-7B.

Does CulturalGround Help Low-Resource Languages?

CulturalPangea demonstrates substantial gains on ALMBench for languages with limited training data, with improvements most pronounced for resource-poor languages within CulturalGround. As shown in Figure 6, we observe absolute accuracy gains of 15.0 points on Sinhala, 10.9 on Hebrew, and 9.1 on Irish. Additional low-resource languages exhibit consistent improvements: Tamil (+6.3), Amharic (+5.3), Bengali (+4.8), and Telugu (+4.2). These gains come without sacrificing performance elsewhere: nearly all languages improve, with only Norwegian (-0.5) showing a negligible regression. That the largest improvements occur in traditionally underrepresented languages indicates that culturally-aware grounding effectively scales to the long tail of languages, enhancing multilingual inclusivity.

low_resource
Figure 6: CulturalPangea achieves the largest gains on underrepresented languages, demonstrating effective scaling to the long tail of languages.

Which Cultural Domains Benefit Most?

The model's gains vary substantially across cultural domains, with culturally rich categories showing the largest improvements. As shown in Figure 7, Heritage achieves the highest relative improvement at 11.5% (from 64.4% to 71.8%), followed by Media at 10.6% and Food at 9.2%. Other culturally salient domains like Architecture (7.1%), Economy (7.0%), and Music (6.2%) also demonstrate substantial gains. These results indicate that domains requiring broad cultural knowledge and context benefit most from our approach. In contrast, generic visual domains show minimal improvement or slight regression. Sketch decreases by 0.6%, while Meme improves only marginally at 2.31%. Similarly, Festivals (0.73%) and Religion (2.08%) show limited gains. This pattern confirms that improvements concentrate in genuine cultural understanding areas, while abstract visual content or highly localized traditions remain challenging. The largest accuracy gains occur in well-represented cultural domains, reinforcing the value of targeting cultural knowledge in model training.

low_resource
Figure 7: Absolute accuracy gains of CulturalPangea over the baseline across 18 ALM‑Bench cultural domains. Improvements cluster in culture‑rich categories (Media, Heritage, Music), while Sketch and Meme offer minimal or negative change.

Performance Gains via Checkpoint Merging

Following prior work, we merge five strong CulturalPangea checkpoints from different training stages using the TIES method, which recovers complementary model strengths often lost during continual training. Although linear and DARE-TIES variants show comparable results, TIES yields the highest average accuracy. As shown in Figure 8, using our strongest early checkpoint as the base outperforms using the original Pangea-7B, evidence that the mixed-data regime had already mitigated catastrophic forgetting. The merged model improves mean accuracy by roughly +0.8 points over the best single model, illustrating the value of checkpoint combination.
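For intuition, a TIES-style merge over checkpoint state dicts can be sketched as below: trim each checkpoint's delta from the base, elect a per-parameter sign, and average only the agreeing deltas. The density and scaling values are illustrative hyperparameters, and the actual merge was likely run with existing merging tooling rather than this hand-rolled version.

```python
import torch

def ties_merge(base_state, checkpoint_states, density=0.2, lam=1.0):
    """TIES-style merge sketch: trim, elect sign, then disjoint-mean the task vectors.

    density is the fraction of entries kept per task vector and lam scales the
    merged delta; both are illustrative, not the values used for CulturalPangea.
    """
    merged = {}
    for name, base in base_state.items():
        if not torch.is_floating_point(base):
            merged[name] = base  # skip integer buffers and the like
            continue
        # Task vectors: per-checkpoint deltas from the base model.
        deltas = torch.stack([ckpt[name] - base for ckpt in checkpoint_states])
        # 1. Trim: keep only the top-density fraction of entries by magnitude.
        flat = deltas.abs().reshape(deltas.shape[0], -1)
        k = max(1, int(density * flat.shape[1]))
        thresh = flat.kthvalue(flat.shape[1] - k + 1, dim=1).values
        mask = deltas.abs() >= thresh.view(-1, *([1] * (deltas.dim() - 1)))
        deltas = deltas * mask
        # 2. Elect sign: the sign with the larger total magnitude per entry wins.
        sign = torch.sign(deltas.sum(dim=0))
        agree = (torch.sign(deltas) == sign) & mask
        # 3. Disjoint mean: average only the deltas agreeing with the elected sign.
        merged_delta = (deltas * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
        merged[name] = base + lam * merged_delta
    return merged

# Usage sketch: state dicts of the base checkpoint and several later checkpoints.
# merged_state = ties_merge(base.state_dict(), [m.state_dict() for m in checkpoints])
```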

merge
Figure 8: Accuracy improvements from merging checkpoints. CP stands for CulturalPangea.

Long-tail and Culturally Diverse Entities

CulturalGround covers a wide range of long-tail cultural entities across different domains, showcasing culturally-rich visual question-answering pairs spanning 42 countries and 39 languages.

CulturalGround (Japanese)
Shinnakagawa-machi Station
Shinnakagawa-machi Station
Entity Info
Language: Japanese
Entity name: 新中川町停留場 (Shinnakagawa-machi Station)
Country: Japan
Wikidata ID: Q11501118
Question type: Entity-level question
Template question: この画像に表示されているものは何ですか?(What is the entity shown in this image?)
Template answer: この画像に表示されているのは新中川町停留場です。長崎県長崎市にある長崎電気軌道の路面電車停留場. (The entity shown in this image is Shinnakagawa-machi Station, a tram stop of the Nagasaki Electric Tramway in Nagasaki City, Nagasaki Prefecture, Japan.)
Question
画像に映っている停留場はどこにありますか
(Where is the tram stop shown in this image located?)
Answer
画像に映っているのは新中川町停留場で、長崎県長崎市にある長崎電気軌道の路面電車停留場です。
(The image shows Shinnakagawa-machi Station, a tram stop of the Nagasaki Electric Tramway in Nagasaki City, Nagasaki Prefecture, Japan.)
CulturalGround (Spanish)
Castro archaeological site
castro
Entity Info
Language: Spanish
Entity name: castro
Country: Spain
Wikidata ID: Q5050823
Question type: Property-level question
Property: P2596 (culture)
Template question: ¿Con qué cultura está asociada esta entidad? (Which culture is this entity associated with?)
Template answer: El castro está asociado a la cultura castreña, una cultura arqueológica ibérica. (The castro is associated with the Castro culture, an Iberian archaeological culture.)
Question
¿A qué cultura o tradición arqueológica se asocia normalmente este tipo de asentamiento?
(What culture or archaeological tradition is this type of settlement normally associated with?)
Answer
El castro está asociado a la cultura castreña, una cultura arqueológica propia del noroeste de España. Estos asentamientos fortificados son característicos de la época prerromana y reflejan las formas de vida de las comunidades locales antes de la llegada del Imperio Romano.
(The castro is associated with the Castro culture, an archaeological culture native to northwestern Spain. These fortified settlements are characteristic of the pre-Roman period and reflect the ways of life of local communities before the arrival of the Roman Empire.)
CulturalGround (English)
Christ Church
Christ Church
Entity Info
Language: English
Entity name: Christ Church
Country: United Kingdom
Wikidata ID: Q19362878
Question type: Property-level question
Property: P140 (religion or worldview)
Template question: Which religion is associated with this entity?
Template answer: Christ Church is associated with Anglicanism, a Christian tradition developing out of the practices, liturgy and identity of the Church of England.
Question
What Christian tradition does this church belong to?
Answer
Christ Church belongs to the Anglican tradition, which developed from the practices and identity of the Church of England. As part of the Anglican faith, Christ Church observes a blend of Catholic and Protestant traditions and is common throughout the United Kingdom.

CulturalGround treats long-tail entities as first-class citizens. As shown in the entity connectivity distribution, most entities in the dataset have only a small number of connections, such as very few incoming or outgoing links in the knowledge graph or minimal Wikipedia backlinks. Consequently, median link counts are extremely low, indicating that the typical entity in CulturalGround is sparsely connected. This distribution is highly right-skewed, with a long tail: a handful of entities are linked to many others, but the vast majority are referenced only a few times. Such patterns underscore the dataset's focus on culturally specific, niche entities that lie beyond the well-connected head of popular or globally known concepts. Notably, a substantial portion of the entities included have no dedicated Wikipedia page at all. The prevalence of entries lacking Wikipedia articles further highlights the dataset's extension into culturally important yet under-documented regions of knowledge, reinforcing the long-tail presence.
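The Wikipedia-presence statistic can be reproduced per entity from Wikidata sitelinks, as in the sketch below; the connectivity counts themselves (incoming and outgoing links) were presumably computed over full dumps, which this per-entity API call does not attempt.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def has_wikipedia_page(qid: str, wiki: str = "enwiki") -> bool:
    """Return True if the Wikidata entity has a sitelink to the given Wikipedia edition."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "ids": qid,
        "props": "sitelinks",
        "format": "json",
    })
    resp.raise_for_status()
    entity = resp.json()["entities"][qid]
    return wiki in entity.get("sitelinks", {})

# Example: the Japanese tram stop entity from the samples above.
print(has_wikipedia_page("Q11501118", wiki="jawiki"))
```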

Wikidata Incoming Links Distribution
Entity incoming links distribution showing long-tail characteristics
Wikidata Outgoing Links Distribution
Entity outgoing links distribution showing long-tail characteristics
Wikipedia Presence by Country
Wikipedia presence distribution across countries showing prevalence of entities lacking Wikipedia articles

Conclusion

We introduced CulturalPangea, a culturally-aware multilingual multimodal large language model specifically designed to interpret and reason about long-tail cultural entities from around the world. By leveraging CulturalGround, our newly curated dataset containing 22 million open-ended and 8 million multiple-choice culturally-rich VQA pairs spanning 42 countries and 39 languages, we demonstrated significant improvements in cultural understanding and entity recognition. Our comprehensive evaluation using PangeaBench, ALMBench, and MERLIN revealed CulturalPangea's superior performance compared to existing open-source models, achieving state-of-the-art results on culture-focused benchmarks without degrading performance on mainstream vision-language tasks. The model addresses the common issue of MLLMs misinterpreting cultural entities by directly grounding the model in diverse cultural knowledge from Wikidata. We make CulturalPangea and the CulturalGround dataset fully available to facilitate future research in culturally-aware multimodal AI.

Acknowledgement

The authors would like to thank the Cambrian team for their project webpage template. Cover image and data/model icons co-created with AI.

BibTeX

@misc{nyandwi2025groundingmultilingualmultimodalllms,
  title={Grounding Multilingual Multimodal LLMs With Cultural Knowledge},
  author={Jean de Dieu Nyandwi and Yueqi Song and Simran Khanuja and Graham Neubig},
  year={2025},
  eprint={2508.07414},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.07414}
}