We introduce Pangea-7B, a fully open multilingual multimodal large language model (MLLM) designed to bridge multilingual and multicultural gaps in visual understanding tasks. Pangea-7B is trained on PangeaIns, a diverse instruction dataset of 6M samples spanning 39 languages, and is evaluated on PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. As shown in Figure 1, Pangea-7B achieves state-of-the-art results, outperforming existing open models in multilingual and culturally diverse contexts.
Pangea is structured around three key aspects, each offering important insights into the design space of MLLMs:
Pangea-7B not only sets a new state of the art among open multilingual, multicultural multimodal models, but also serves as an extensive, open-source guide for developing instruction-tuned multilingual and multicultural MLLMs. Pangea is completely open-source, including the model weights, the instruction-tuning data PangeaIns, and the instruction-tuning and evaluation code. Our aim is to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
There are four major challenges in training a multilingual MLLM:
1) Data scarcity: High-quality multilingual multimodal data is scarce, especially in low-resource languages, which makes it difficult to create large-scale training data.
2) Cultural nuances: Visual interpretations are context-dependent and vary across cultures.
3) Catastrophic forgetting: Training on many languages or modalities often results in suboptimal performance on some subsets and requires careful balancing.
4) Evaluation complexity: Developing an evaluation suite that accurately measures performance across languages and cultures requires substantial resources and expertise.
To address these challenges, we introduce Pangea, an open-sourced multilingual MLLM designed to bridge linguistic and cultural gaps in visual understanding tasks.
1) 6M multilingual instruction tuning data: Pangea is trained on PangeaIns, a high-quality multilingual multimodal instruction tuning dataset comprising 6 million samples in 39 typologically diverse languages, addressing data scarcity. PangeaIns combines existing open-source resources with newly created instructions focused on multicultural understanding. We curate high-quality English instructions, carefully translate them, and adapt them for multilingual contexts.
2) Multicultural instruction generation pipeline: To address Western-centric biases in visual representations, we source images from the LAION-Multi dataset and build a pipeline that selects culturally diverse images, recaptions them in their native languages, and generates culturally grounded multilingual instructions about them.
3) Balanced data distribution: PangeaIns features an extensive and balanced distribution of languages, tasks, and cultural contexts (as shown in Figure 2).
4) PangeaBench: To evaluate Pangea-7B's capabilities, we present PangeaBench, a comprehensive multilingual and multimodal evaluation suite comprising five multimodal and three text-based tasks across 14 datasets in 47 languages. PangeaBench assesses MLLMs' performance on open-domain multimodal chat, image captioning, cultural understanding, multimodal reasoning, and text-only tasks including question answering and complex math reasoning.
Creating a truly multilingual, multicultural MLLM presents unique challenges. We developed PangeaIns, a diverse and high-quality dataset specifically designed for instruction tuning, featuring an extensive and balanced distribution of languages, tasks, and cultural contexts. Comprising 6 million samples in 39 languages, PangeaIns was curated with a focus on linguistic and cultural diversity. We empirically set the final ratio of English to multilingual data at 40%:60%, as we found that a significant portion of English data plays an important role in cross-lingual transfer; this is discussed in more detail in the Discussion section. Figure 2 shows the distribution of PangeaIns. We implemented three key strategies to ensure comprehensive coverage, each addressing specific hurdles encountered in multilingual multimodal learning.
We first create a high-quality set of English multimodal instructions, which serve as the foundation for translation into other languages.
Then, we use the proprietary model Gemini 1.5 Pro to translate these English instructions into the other languages covered by PangeaIns. Figure 2 shows the statistics of the translated datasets.
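To make the translation step concrete, here is a minimal sketch of translating a single instruction with Gemini 1.5 Pro through the google-generativeai Python SDK. The prompt wording and any post-editing or quality filtering are illustrative assumptions, not the exact PangeaIns pipeline.

```python
# Minimal sketch of the translation step with Gemini 1.5 Pro, assuming the
# `google-generativeai` SDK and an API key in GOOGLE_API_KEY. The prompt and
# any post-editing/quality filtering are illustrative, not the exact pipeline.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def translate_instruction(instruction: str, target_language: str) -> str:
    prompt = (
        f"Translate the following multimodal instruction into {target_language}. "
        "Keep placeholders such as <image> unchanged and preserve the meaning.\n\n"
        f"{instruction}"
    )
    response = model.generate_content(prompt)
    return response.text.strip()

# Example usage:
# print(translate_instruction("Describe the image in detail.", "Swahili"))
```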
While machine translation enables us to scale across multiple languages, data translated from English remains Anglo-centric in its coverage of cultural content. To address this, we design a pipeline that creates culturally grounded instructions directly from images originating in diverse cultures.
Curation of Culturally Diverse Images.
We began by sampling 10 million images from the LAION-Multi dataset and then filtered them to retain a high-quality, culturally diverse subset.
Captioning of Multicultural Images in Different Languages. To provide context and enhance the model's ability to interpret the images accurately, we regenerated a more detailed caption for each image using Gemini 1.5 Pro, grounded in its high-quality original text. Each caption was written in the language corresponding to the image's cultural origin.
Generating Multilingual and Cross-Cultural Instructions. For each image, we used Gemini 1.5 Pro to generate captions in native languages, leveraging high-quality alt text to enrich context. This alt text provided crucial cultural and contextual information, such as identifying key figures or locations. We carefully engineered prompts to create multilingual instructions based on 13 task types like Information Seeking and Cultural Interpretation. Each image had up to two QA pairs, ensuring diverse interactions. This approach enabled the model to better capture visual, cultural, and contextual nuances and respond effectively across various linguistic contexts.
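The sketch below illustrates the shape of this generation step: one call per image that asks for up to two QA pairs in the image's native language, conditioned on its alt text. The prompt, the JSON output schema, and the partial list of task types are illustrative; the full set of 13 task types and the exact prompts are not reproduced here.

```python
# Sketch of per-image instruction generation. Assumes genai was configured as in
# the translation sketch above; the prompt, JSON schema, and the partial list of
# task types are illustrative (PangeaIns uses 13 task types in total).
import json
import PIL.Image
import google.generativeai as genai

TASK_TYPES = ["Information Seeking", "Cultural Interpretation"]  # plus 11 more

model = genai.GenerativeModel("gemini-1.5-pro")

def generate_qa_pairs(image_path: str, alt_text: str, language: str) -> list[dict]:
    prompt = (
        f'The image came with the original alt text: "{alt_text}".\n'
        f"Write up to two question-answer pairs in {language} about the image, "
        f"each labeled with one task type from {TASK_TYPES}. "
        "Ground the questions in culturally relevant details visible in the image. "
        "Return only a JSON list of objects with keys: task_type, question, answer."
    )
    image = PIL.Image.open(image_path)
    response = model.generate_content([prompt, image])
    # A production pipeline would strip code fences and validate the schema here.
    return json.loads(response.text)
```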
To further enrich PangeaIns, we conducted an extensive survey of the available multilingual multimodal literature and datasets, including those hosted on HuggingFace. As a result, we incorporated several high-quality, open-source datasets into the PangeaIns mixture, including Chinese ALLaVA-4V and other existing instruction datasets covering additional languages and tasks.
We train Pangea-7B on PangeaIns, our multilingual multimodal dataset comprising 6 million samples across 39 languages. The model follows the LLaVA-Next architecture and uses Qwen2-7B-Instruct as its language backbone.
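Since Pangea-7B is released as a LLaVA-Next-style model, inference should look roughly like standard LLaVA-NeXT usage in HuggingFace transformers. The model ID and prompt format below are placeholder assumptions; consult the released checkpoint's model card for the exact values.

```python
# Minimal inference sketch, assuming the released checkpoint is compatible with
# the HuggingFace LLaVA-NeXT classes. The model ID and prompt format below are
# placeholders; check the released model card for the exact values.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "neulab/Pangea-7B"  # illustrative; confirm against the release
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
# Qwen2-style chat markup is assumed here; the actual template may differ.
prompt = (
    "<|im_start|>user\n<image>\n"
    "Describe this image in Swahili.<|im_end|>\n<|im_start|>assistant\n"
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```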
To assess the capabilities of Pangea-7B across a variety of languages, cultures, and task types, we have developed PangeaBench, a comprehensive multilingual and multimodal evaluation suite. The overview and examples of PangeaBench from each task are shown in Figure 4.
Multimodal Chat:
The Multimodal Chat task evaluates a model's ability to engage in dynamic conversations using both text and images.
We use two datasets for this task: the Multilingual LlavaBench (M-LlavaBench) and xChatBench.
Captioning:
The XM100 dataset was created to evaluate models in multilingual image captioning, consisting of images paired with captions in 36 languages
Multilingual VQA:
This task evaluates a model's ability to answer questions about images in multiple languages.
The xGQA and MaXM datasets are used for this task.
Multi-Subject Reasoning:
The xMMMU and M3Exam datasets evaluate reasoning over questions spanning multiple academic subjects.
While multimodal tasks are critical for evaluating the holistic capabilities of MLLMs, text-only multilingual tasks provide an equally essential dimension to assess.
We include three tasks (question answering, translation, and reasoning) covering five datasets for the text-only evaluations in PangeaBench. Specifically, we include TyDiQA, FLORES, XStoryCloze, MGSM, and MMMLU (see Table 2).
For evaluation, we compare Pangea-7B against several state-of-the-art open-source baselines, including the English-centric models Llava-1.5-7B, Llava-Next-7B, Phi-3.5-Vision, Cambrian-8B, LLaVA-OV-7B, Molmo-7B-D, and Llama3.2-11B, as well as the multilingual models PaliGemma-3B, PALO-7B, mBLIP mT0-XL, and mBLIP BLOOMZ. We also report the proprietary models Gemini-1.5-Pro and GPT4o as reference points.
Table 1. Results on the multimodal tasks of PangeaBench (en = English, mul = multilingual). Task groups: Multimodal Chat (xChatBench, M-LlavaBench), Cultural Understanding (CVQA, MaRVL), Captioning (XM100), Short VQA (xGQA, MaXM), Multi-subject Reasoning (xMMMU, M3Exam).

| Models | AVG (en) | AVG (mul) | xChatBench (en) | xChatBench (mul) | M-LlavaBench (en) | M-LlavaBench (mul) | CVQA (en) | CVQA (mul) | MaRVL (en) | MaRVL (mul) | XM100 (en) | XM100 (mul) | xGQA (en) | xGQA (mul) | MaXM (en) | MaXM (mul) | xMMMU (en) | xMMMU (mul) | M3Exam (en) | M3Exam (mul) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-Pro | 67.1 | 62.5 | 67.0 | 54.4 | 103.4 | 106.6 | 75.9 | 75.7 | 76.4 | 72.0 | 27.6 | 19.1 | 54.2 | 48.7 | 56.4 | 63.5 | 65.8 | 57.7 | 77.4 | 64.7 |
| GPT4o | 68.6 | 64.6 | 71.0 | 64.4 | 104.6 | 100.4 | 79.1 | 79.4 | 81.4 | 82.1 | 27.7 | 19.1 | 55.8 | 51.0 | 60.7 | 65.4 | 69.1 | 58.3 | 68.0 | 61.0 |
| **English Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Llava-1.5-7B | 45.4 | 28.4 | 28.5 | 11.8 | 66.1 | 40.8 | 48.9 | 36.5 | 56.2 | 53.7 | 28.6 | 1.1 | 62.0 | 30.6 | 49.8 | 20.4 | 36.2 | 31.5 | 32.3 | 29.0 |
| Llava-Next-7B | 51.1 | 32.7 | 40.5 | 18.9 | 78.9 | 50.7 | 55.7 | 42.6 | 62.8 | 50.9 | 29.3 | 9.4 | 64.8 | 37.8 | 54.9 | 21.4 | 36.7 | 34.3 | 36.5 | 28.4 |
| Phi-3.5-Vision | 54.0 | 35.0 | 38.5 | 13.2 | 70.8 | 58.0 | 56.3 | 42.3 | 72.1 | 56.5 | 30.2 | 5.2 | 64.7 | 38.4 | 55.3 | 25.0 | 42.6 | 38.8 | 55.8 | 37.2 |
| Cambrian-8B | 50.9 | 36.4 | 27.5 | 11.3 | 78.4 | 61.8 | 59.7 | 47.5 | 75.4 | 61.8 | 20.6 | 9.9 | 64.6 | 39.8 | 55.3 | 28.7 | 41.8 | 33.2 | 34.7 | 33.4 |
| LLaVA-OV-7B | 59.5 | 41.3 | 51.0 | 28.5 | 89.7 | 55.3 | 65.2 | 53.7 | 72.7 | 57.5 | 30.6 | 7.0 | 64.4 | 48.2 | 54.9 | 34.8 | 46.3 | 41.0 | 60.4 | 45.8 |
| Molmo-7B-D | 55.4 | 34.1 | 49.5 | 21.1 | 95.9 | 13.8 | 59.4 | 48.3 | 65.3 | 54.9 | 22.1 | 9.1 | 51.5 | 43.0 | 52.9 | 37.5 | 44.5 | 40.4 | 57.1 | 39.1 |
| Llama3.2-11B | 57.2 | 41.9 | 49.0 | 27.8 | 93.9 | 58.2 | 70.2 | 61.4 | 64.5 | 58.1 | 27.6 | 4.5 | 55.6 | 45.4 | 55.3 | 43.9 | 46.5 | 41.4 | 51.8 | 36.6 |
| **Multilingual Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| PaliGemma-3B | 37.3 | 25.8 | 6.0 | 3.5 | 32.1 | 31.9 | 52.9 | 42.9 | 56.5 | 52.2 | 18.7 | 0.8 | 59.7 | 30.5 | 47.9 | 19.9 | 26.3 | 25.2 | 36.0 | 25.6 |
| PALO-7B | 46.3 | 32.2 | 27.0 | 11.8 | 68.9 | 71.2 | 50.9 | 39.2 | 63.3 | 54.2 | 30.4 | 0.8 | 60.5 | 37.8 | 51.4 | 16.3 | 33.1 | 30.5 | 30.8 | 27.8 |
| mBLIP mT0-XL | 35.1 | 29.8 | 2.5 | 0.5 | 32.7 | 28.2 | 40.5 | 37.5 | 67.3 | 66.7 | 31.9 | 3.1 | 44.2 | 39.9 | 44.7 | 36.8 | 29.3 | 30.4 | 22.8 | 25.0 |
| mBLIP BLOOMZ | 36.1 | 30.0 | 4.0 | 1.6 | 43.5 | 41.0 | 44.9 | 36.9 | 62.3 | 58.6 | 22.5 | 10.3 | 43.3 | 36.9 | 44.7 | 24.8 | 29.2 | 30.8 | 30.3 | 29.5 |
| **Pangea** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Pangea-7B (Ours) | 59.9 | 52.7 | 46.0 | 35.6 | 84.2 | 89.5 | 64.4 | 57.2 | 87.0 | 79.0 | 30.4 | 14.2 | 64.7 | 60.2 | 55.3 | 53.2 | 45.7 | 43.7 | 61.4 | 42.1 |
| Δ over SoTA Open | +0.4 | +10.8 | -3.5 | +7.1 | -11.7 | +18.3 | -5.8 | -4.2 | +11.6 | +12.3 | -0.2 | +3.9 | -0.1 | +12.0 | 0.0 | +9.3 | -0.8 | +2.3 | +1.0 | -3.7 |
Multilingual Multimodal Results
We show the performance of models on the multimodal tasks from PangeaBench in Table 1.
The results provide clear insights into the strengths and remaining challenges of Pangea-7B in multilingual and multimodal tasks. Key observations from the evaluation include:
1) Superior English and Multilingual Performance: Pangea-7B outperforms existing open-source models across both English and multilingual tasks. Particularly in cultural understanding (CVQA, MaRVL), it has achieved substantial gains, highlighting its effectiveness in both cross-lingual and cross-cultural contexts.
2) Balanced Cross-Language Capabilities: Unlike many models that exhibit a significant drop in performance when moving from English to multilingual tasks, Pangea-7B is relatively consistent. For instance, in Multimodal Chat tasks, the performance gap between English and multilingual remains relatively small, indicating its ability to handle multiple languages effectively.
3) Challenges Compared to Proprietary Models: While Pangea-7B leads in open-source models, some gaps remain when compared to closed-source models like GPT4o. Additionally, though Pangea-7B narrows the gap between English and multilingual performance, there is still room for improvement in fully closing this divide across all tasks.
Multilingual Text-only Results
We further evaluate our model in text-only scenarios in Table 2. Interesting findings include:
1) Best Text Performance Among Multimodal LLMs: Pangea-7B demonstrates the strongest performance among all multimodal LLMs on the text-only tasks, consistently outperforming baselines like Llava-Next-7B. This highlights that, despite being trained as a multimodal model, Pangea-7B maintains superior text understanding and reasoning capabilities compared to other MLLMs.
2) Maintained Performance from its Text Backbone: Pangea-7B generally maintains or sees slight drops in performance on most text-only benchmarks compared with its text backbone Qwen2-7B-Instruct. Notably, the model shows a significant improvement in MGSM. This improvement is directly attributable to the inclusion of math-related instructions in PangeaIns, which enhances the model's capability to handle complex multilingual reasoning and mathematical tasks.
Table 2. Results on the text-only tasks of PangeaBench (en = English, mul = multilingual; for FLORES-Sub, x→en and en→x translation directions).

| Models | AVG (en) | AVG (mul) | FLORES-Sub (x→en) | FLORES-Sub (en→x) | TyDiQA (en) | TyDiQA (mul) | XStoryCloze (en) | XStoryCloze (mul) | MGSM (en) | MGSM (mul) | MMMLU (en) | MMMLU (mul) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-1.5-7B | 52.1 | 38.7 | 55.6 | 42.4 | 59.7 | 52.7 | 78.1 | 57.4 | 17.6 | 6.4 | 49.5 | 34.7 |
| Qwen2-7B-Instruct | 66.6 | 54.5 | 61.8 | 46.0 | 72.2 | 71.2 | 80.3 | 61.9 | 48.8 | 40.4 | 70.1 | 53.1 |
| Llava-1.5-7B | 53.1 | 39.0 | 54.7 | 41.5 | 66.8 | 52.8 | 79.1 | 57.6 | 14.8 | 7.6 | 50.2 | 35.7 |
| Llava-Next-7B | 54.0 | 38.9 | 54.8 | 41.4 | 68.3 | 52.1 | 79.1 | 57.1 | 15.6 | 7.5 | 52.1 | 36.5 |
| Phi-3.5-Vision | 60.7 | 41.7 | 28.5 | 32.5 | 75.9 | 51.3 | 77.9 | 54.8 | 59.2 | 33.1 | 62.0 | 36.7 |
| PALO-7B | 52.0 | 37.5 | 52.9 | 40.4 | 69.4 | 50.8 | 77.4 | 57.2 | 13.6 | 5.8 | 46.7 | 33.4 |
| Pangea-7B (Ours) | 72.8 | 54.3 | 60.7 | 44.9 | 73.7 | 66.0 | 79.1 | 61.2 | 82.0 | 47.4 | 68.4 | 52.2 |
Scaling Effect of Number of Instructions: Understanding how the quantity of instructions affects model performance is crucial for optimizing training strategies and resource allocation. Figure 5 reveals a clear scaling effect related to the number of instructions used during training: both English and multilingual performance improved consistently as we increased the number of multilingual instructions in PangeaIns. This demonstrates the necessity of scaling multilingual multimodal instruction tuning.
Role of English Data: In multilingual scenarios, English data plays a pivotal role in cross-lingual transfer. To investigate this, we sampled 500K examples from the translated data described in Machine Translated Instructions, ensuring a consistent data distribution. We varied the ratio of English data while keeping the total number of training samples fixed at 500K, distributing the non-English samples evenly across the 17 multilingual languages in the translated subset. As shown in Figure 6, English performance generally improves as the percentage of English data increases. More surprisingly, using no English data (fully multilingual data) results in relatively lower multilingual performance. As we introduce more English data, multilingual performance improves, peaking at 38.7% with 40% English; however, performance drops sharply when English data reaches 100%. This suggests that English data aids cross-lingual transfer, but over-reliance on it harms multilingual performance.
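A minimal sketch of how such fixed-budget mixtures can be assembled, assuming per-language instruction pools; the helper name and data layout are illustrative, not the actual experiment code.

```python
# Sketch of building one fixed-budget ablation mixture: a given fraction of
# English samples, the remainder split evenly across the 17 translated
# languages. Helper name and data layout are illustrative.
import random

def build_mixture(pools, english_ratio, total=500_000, seed=0):
    """pools: dict mapping language code -> list of instruction records ("en" included)."""
    rng = random.Random(seed)
    n_en = int(total * english_ratio)
    other_langs = sorted(lang for lang in pools if lang != "en")
    per_lang = (total - n_en) // len(other_langs)

    mixture = rng.sample(pools["en"], n_en)
    for lang in other_langs:
        mixture += rng.sample(pools[lang], per_lang)
    rng.shuffle(mixture)
    return mixture

# Sweep over English ratios as in the Figure 6 experiment:
# mixes = {r: build_mixture(pools, r) for r in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)}
```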
How does the proportion of training samples in a language affect downstream performance? An interesting question to ask is whether the downstream task performance is correlated with the number of training samples. Our analysis in Figure 7 revealed a nuanced relationship between training sample proportion and downstream performance. While there is a general positive correlation, the impact varies significantly across languages and tasks. For widely spoken languages with rich linguistic resources, we observed a near-linear relationship. However, for low-resource languages, even a small increase in proportion yielded disproportionately large performance gains. Interestingly, we also noted instances of positive transfer between typologically similar languages. These findings suggest that strategic allocation of training samples, considering both language prevalence and linguistic similarities, can optimize overall model performance.
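One simple way to quantify this relationship, assuming per-language training counts and per-task scores are available (the data layout here is illustrative), is a rank correlation per task:

```python
# Sketch of the correlation analysis: relate each language's share of PangeaIns
# training samples to its downstream score on each task. Data layout is illustrative.
from scipy.stats import spearmanr

def proportion_score_correlation(train_counts, scores_by_task):
    """train_counts: dict lang -> number of training samples.
    scores_by_task: dict task -> dict lang -> downstream score."""
    total = sum(train_counts.values())
    correlations = {}
    for task, scores in scores_by_task.items():
        langs = [lang for lang in scores if lang in train_counts]
        proportions = [train_counts[lang] / total for lang in langs]
        rho, pvalue = spearmanr(proportions, [scores[lang] for lang in langs])
        correlations[task] = (rho, pvalue)
    return correlations
```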
Preliminary Explorations of Multilingual OCR: Multilingual OCR emerged as a particularly challenging aspect of Pangea's functionality. We made efforts to enhance its multilingual OCR capabilities. Specifically, we constructed a dataset of 500K multilingual OCR instructions spanning 10 languages, with 50K examples per language, sourced from web user interfaces. Webpages naturally serve as image-rich environments containing abundant text, and by capturing screenshots of websites from various countries in different languages, we were able to gather a substantial number of OCR images. We employed the same model architecture as Pangea but trained it exclusively on these OCR images, reserving a portion of the data as a test set. As shown in Figure 8, the results indicate that improving multilingual OCR performance is feasible with an increase in training samples. However, the OCR accuracy for non-Latin scripts (e.g., Chinese, Japanese, and Korean) remains lower than for Latin-based languages. Looking ahead, we aim to further expand the multilingual OCR training dataset to include more languages and integrate this data into PangeaIns.
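As a rough illustration of how such webpage screenshots could be collected, the sketch below uses Playwright to render and capture pages per language. The seed URLs and viewport settings are illustrative, and pairing each screenshot with the page's rendered text as OCR supervision is an assumption about the data format rather than a description of the actual pipeline.

```python
# Sketch of collecting multilingual OCR images by screenshotting webpages with
# Playwright. Seed URLs and viewport are illustrative; ground-truth text for
# each screenshot (e.g., the page's rendered text) would still need to be
# extracted, which is an assumption about the supervision format.
import os
from playwright.sync_api import sync_playwright

SEED_URLS = {
    "ja": ["https://www.asahi.com"],
    "ko": ["https://www.chosun.com"],
    "zh": ["https://www.people.com.cn"],
}

def capture_screenshots(out_dir="ocr_screens"):
    os.makedirs(out_dir, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        for lang, urls in SEED_URLS.items():
            for i, url in enumerate(urls):
                page.goto(url, wait_until="networkidle")
                page.screenshot(path=os.path.join(out_dir, f"{lang}_{i:05d}.png"))
        browser.close()
```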
We introduced Pangea, a novel multilingual multimodal large language model designed to bridge linguistic and cultural gaps in visual understanding tasks. By leveraging PangeaIns, our newly curated 6M multilingual multimodal instruction data samples, we demonstrated significant improvements in cross-lingual and cross-cultural understanding across 39 typologically diverse languages. Our comprehensive evaluation using PangeaBench revealed Pangea's superior performance compared to existing open-source models, particularly in tasks requiring nuanced cultural interpretation. We also highlight ongoing challenges in areas such as low-resource language support and multilingual OCR. We fully open-source Pangea, PangeaIns, and PangeaBench to facilitate future research to build open and inclusive MLLMs.
The authors would like to thank the Cambrian team for their project webpage template.
@article{yue2024pangeafullyopenmultilingual,
title={Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages},
author={Xiang Yue and Yueqi Song and Akari Asai and Seungone Kim and Jean de Dieu Nyandwi and Simran Khanuja and Anjali Kantharuban and Lintang Sutawika and Sathyanarayanan Ramamoorthy and Graham Neubig},
year={2024},
journal={arXiv preprint arXiv:2410.16153},
url={https://arxiv.org/abs/2410.16153}
}