We introduce Pangea-7B, a fully open multilingual multimodal large language model (MLLM) designed to bridge multilingual and multicultural gaps in visual understanding tasks. Pangea-7B is trained on PangeaIns, a diverse instruction dataset of 6M samples spanning 39 languages, and is evaluated on PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. As shown in Figure 1, Pangea-7B achieves state-of-the-art results, outperforming existing open models in multilingual and culturally diverse contexts.
Pangea is structured around three key aspects, each offering important insights into the design space of MLLMs:
Pangea-7B not only sets a new state of the art among open multilingual, multicultural multimodal models, but also serves as an extensive, open-source guide for developing instruction-tuned multilingual and multicultural MLLMs. Pangea is completely open-source, including the model weights, the instruction-tuning data PangeaIns, and the instruction-tuning and evaluation code. Our aim is to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
There are four major challenges in training a multilingual MLLM:
1) Data scarcity: High-quality multilingual multimodal data is scarce, especially in low-resource languages, which makes it difficult to create large-scale training data.
2) Cultural nuances: Visual interpretations are context-dependent and vary across cultures.
3) Catastrophic forgetting: Training on many languages or modalities often results in suboptimal performance on some subsets and requires careful balancing.
4) Evaluation complexity: Developing an evaluation suite that accurately measures performance across languages and cultures requires substantial resources and expertise.
To address these challenges, we introduce Pangea, an open-sourced multilingual MLLM designed to bridge linguistic and cultural gaps in visual understanding tasks.
1) 6M multilingual instruction tuning data: Pangea is trained on PangeaIns, a high-quality multilingual multimodal instruction tuning dataset comprising 6 million samples in 39 typologically diverse languages, addressing data scarcity. PangeaIns combines existing open-source resources with newly created instructions focused on multicultural understanding. We curate high-quality English instructions, carefully translate them, and adapt them for multilingual contexts.
2) Multicultural instruction generation pipeline: To address Western-centric biases in visual representations, we source images from the LAION-Multi dataset and build a pipeline that selects culturally diverse images, recaptions them in their native languages, and generates culturally grounded multilingual instructions about them.
3) Balanced data distribution: PangeaIns features an extensive and balanced distribution of languages, tasks, and cultural contexts (as shown in Figure 2).
4) PangeaBench: To evaluate Pangea-7B's capabilities, we present PangeaBench, a comprehensive multilingual and multimodal evaluation suite comprising five multimodal and three text-based tasks across 14 datasets in 47 languages. PangeaBench assesses MLLMs' performance on open-domain multimodal chat, image captioning, cultural understanding, multimodal reasoning, and text-only tasks including question answering and complex math reasoning.
Creating a truly multilingual, multicultural MLLM presents unique challenges. We developed PangeaIns, a diverse and high-quality dataset specifically designed for instruction tuning, featuring an extensive and balanced distribution of languages, tasks, and cultural contexts. Comprising 6 million samples in 39 languages, PangeaIns was curated with a focus on linguistic and cultural diversity. We empirically set the final ratio of English to multilingual data at 40%:60%, as we found that a significant portion of English data plays an important role in cross-lingual transfer; this is discussed in more detail in the Discussion section. Figure 2 shows the distribution of PangeaIns. We implemented three key strategies to ensure comprehensive coverage, each addressing specific hurdles encountered in multilingual multimodal learning.
We first create a high-quality set of English multimodal instructions, which serve as the foundation for translation into other languages.
Then, we use the proprietary model Gemini 1.5 Pro to translate these English instructions into the other languages covered by PangeaIns. Figure 2 shows the statistics of the translated datasets.
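To make the translation step concrete, here is a minimal sketch of translating a single instruction with Gemini 1.5 Pro through the google-generativeai Python SDK. The prompt wording and any post-editing or quality filtering are illustrative assumptions, not the exact PangeaIns pipeline.

```python
# Minimal sketch of the translation step with Gemini 1.5 Pro, assuming the
# `google-generativeai` SDK and an API key in GOOGLE_API_KEY. The prompt and
# any post-editing/quality filtering are illustrative, not the exact pipeline.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def translate_instruction(instruction: str, target_language: str) -> str:
    prompt = (
        f"Translate the following multimodal instruction into {target_language}. "
        "Keep placeholders such as <image> unchanged and preserve the meaning.\n\n"
        f"{instruction}"
    )
    response = model.generate_content(prompt)
    return response.text.strip()

# Example usage:
# print(translate_instruction("Describe the image in detail.", "Swahili"))
```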
While machine translation enables us to scale across multiple languages, data translated from English remains Anglo-centric in its coverage of cultural content. To address this, we design a pipeline that creates culturally grounded instructions directly from images originating in diverse cultures.
Curation of Culturally Diverse Images.
We began by sampling 10 million images from the LAION-Multi dataset and then filtered them to retain a high-quality, culturally diverse subset.
Captioning of Multicultural Images in Different Languages. To provide context and enhance the model's ability to interpret the images accurately, we regenerated a more detailed caption for each image using Gemini 1.5 Pro, grounded in its high-quality original text. Each caption was written in the language corresponding to the image's cultural origin.
Generating Multilingual and Cross-Cultural Instructions. For each image, we used Gemini 1.5 Pro to generate captions in native languages, leveraging high-quality alt text to enrich context. This alt text provided crucial cultural and contextual information, such as identifying key figures or locations. We carefully engineered prompts to create multilingual instructions based on 13 task types like Information Seeking and Cultural Interpretation. Each image had up to two QA pairs, ensuring diverse interactions. This approach enabled the model to better capture visual, cultural, and contextual nuances and respond effectively across various linguistic contexts.
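The sketch below illustrates the shape of this generation step: one call per image that asks for up to two QA pairs in the image's native language, conditioned on its alt text. The prompt, the JSON output schema, and the partial list of task types are illustrative; the full set of 13 task types and the exact prompts are not reproduced here.

```python
# Sketch of per-image instruction generation. Assumes genai was configured as in
# the translation sketch above; the prompt, JSON schema, and the partial list of
# task types are illustrative (PangeaIns uses 13 task types in total).
import json
import PIL.Image
import google.generativeai as genai

TASK_TYPES = ["Information Seeking", "Cultural Interpretation"]  # plus 11 more

model = genai.GenerativeModel("gemini-1.5-pro")

def generate_qa_pairs(image_path: str, alt_text: str, language: str) -> list[dict]:
    prompt = (
        f'The image came with the original alt text: "{alt_text}".\n'
        f"Write up to two question-answer pairs in {language} about the image, "
        f"each labeled with one task type from {TASK_TYPES}. "
        "Ground the questions in culturally relevant details visible in the image. "
        "Return only a JSON list of objects with keys: task_type, question, answer."
    )
    image = PIL.Image.open(image_path)
    response = model.generate_content([prompt, image])
    # A production pipeline would strip code fences and validate the schema here.
    return json.loads(response.text)
```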
To further enrich PangeaIns, we conducted an extensive survey of the available multilingual multimodal literature and datasets, including those hosted on HuggingFace. As a result, we incorporated several high-quality, open-source datasets into the PangeaIns mixture, including Chinese ALLaVA-4V and other existing instruction datasets covering additional languages and tasks.
We train Pangea-7B on PangeaIns, our multilingual multimodal dataset comprising 6 million samples across 39 languages. The model follows the LLaVA-Next architecture and uses Qwen2-7B-Instruct as its language backbone.
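Since Pangea-7B is released as a LLaVA-Next-style model, inference should look roughly like standard LLaVA-NeXT usage in HuggingFace transformers. The model ID and prompt format below are placeholder assumptions; consult the released checkpoint's model card for the exact values.

```python
# Minimal inference sketch, assuming the released checkpoint is compatible with
# the HuggingFace LLaVA-NeXT classes. The model ID and prompt format below are
# placeholders; check the released model card for the exact values.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "neulab/Pangea-7B"  # illustrative; confirm against the release
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
# Qwen2-style chat markup is assumed here; the actual template may differ.
prompt = (
    "<|im_start|>user\n<image>\n"
    "Describe this image in Swahili.<|im_end|>\n<|im_start|>assistant\n"
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```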
To assess the capabilities of Pangea-7B across a variety of languages, cultures, and task types, we have developed PangeaBench, a comprehensive multilingual and multimodal evaluation suite. The overview and examples of PangeaBench from each task are shown in Figure 4.
Multimodal Chat:
The Multimodal Chat task evaluates a model's ability to engage in dynamic conversations using both text and images.
We use two datasets for this task: the Multilingual LlavaBench (M-LlavaBench) and xChatBench.
Captioning:
The XM100 dataset was created to evaluate models in multilingual image captioning, consisting of images paired with captions in 36 languages
Multilingual VQA:
This task evaluates a model's ability to answer questions about images in multiple languages.
The xGQA and MaXM datasets are used for this task.
Multi-Subject Reasoning:
The xMMMU and M3Exam datasets evaluate reasoning over questions spanning multiple academic subjects.
While multimodal tasks are critical for evaluating the holistic capabilities of MLLMs, text-only multilingual tasks provide an equally essential dimension to assess.
We include three tasks (question answering, translation, and reasoning) covering five datasets for the text-only evaluations in PangeaBench. Specifically, we include TyDiQA, FLORES, XStoryCloze, MGSM, and MMMLU (see Table 2).
For evaluation, we compare Pangea-7B against several state-of-the-art open-source baselines, including the English-centric models Llava-1.5-7B, Llava-Next-7B, Phi-3.5-Vision, Cambrian-8B, LLaVA-OV-7B, Molmo-7B-D, and Llama3.2-11B, as well as the multilingual models PaliGemma-3B, PALO-7B, mBLIP mT0-XL, and mBLIP BLOOMZ. We also report the proprietary models Gemini-1.5-Pro and GPT4o as reference points.
Table 1. Results on the multimodal tasks of PangeaBench (en = English, mul = multilingual). Task groups: Multimodal Chat (xChatBench, M-LlavaBench), Cultural Understanding (CVQA, MaRVL), Captioning (XM100), Short VQA (xGQA, MaXM), Multi-subject Reasoning (xMMMU, M3Exam).

| Models | AVG (en) | AVG (mul) | xChatBench (en) | xChatBench (mul) | M-LlavaBench (en) | M-LlavaBench (mul) | CVQA (en) | CVQA (mul) | MaRVL (en) | MaRVL (mul) | XM100 (en) | XM100 (mul) | xGQA (en) | xGQA (mul) | MaXM (en) | MaXM (mul) | xMMMU (en) | xMMMU (mul) | M3Exam (en) | M3Exam (mul) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemini-1.5-Pro | 67.1 | 62.5 | 67.0 | 54.4 | 103.4 | 106.6 | 75.9 | 75.7 | 76.4 | 72.0 | 27.6 | 19.1 | 54.2 | 48.7 | 56.4 | 63.5 | 65.8 | 57.7 | 77.4 | 64.7 |
| GPT4o | 68.6 | 64.6 | 71.0 | 64.4 | 104.6 | 100.4 | 79.1 | 79.4 | 81.4 | 82.1 | 27.7 | 19.1 | 55.8 | 51.0 | 60.7 | 65.4 | 69.1 | 58.3 | 68.0 | 61.0 |
| **English Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Llava-1.5-7B | 45.4 | 28.4 | 28.5 | 11.8 | 66.1 | 40.8 | 48.9 | 36.5 | 56.2 | 53.7 | 28.6 | 1.1 | 62.0 | 30.6 | 49.8 | 20.4 | 36.2 | 31.5 | 32.3 | 29.0 |
| Llava-Next-7B | 51.1 | 32.7 | 40.5 | 18.9 | 78.9 | 50.7 | 55.7 | 42.6 | 62.8 | 50.9 | 29.3 | 9.4 | 64.8 | 37.8 | 54.9 | 21.4 | 36.7 | 34.3 | 36.5 | 28.4 |
| Phi-3.5-Vision | 54.0 | 35.0 | 38.5 | 13.2 | 70.8 | 58.0 | 56.3 | 42.3 | 72.1 | 56.5 | 30.2 | 5.2 | 64.7 | 38.4 | 55.3 | 25.0 | 42.6 | 38.8 | 55.8 | 37.2 |
| Cambrian-8B | 50.9 | 36.4 | 27.5 | 11.3 | 78.4 | 61.8 | 59.7 | 47.5 | 75.4 | 61.8 | 20.6 | 9.9 | 64.6 | 39.8 | 55.3 | 28.7 | 41.8 | 33.2 | 34.7 | 33.4 |
| LLaVA-OV-7B | 59.5 | 41.3 | 51.0 | 28.5 | 89.7 | 55.3 | 65.2 | 53.7 | 72.7 | 57.5 | 30.6 | 7.0 | 64.4 | 48.2 | 54.9 | 34.8 | 46.3 | 41.0 | 60.4 | 45.8 |
| Molmo-7B-D | 55.4 | 34.1 | 49.5 | 21.1 | 95.9 | 13.8 | 59.4 | 48.3 | 65.3 | 54.9 | 22.1 | 9.1 | 51.5 | 43.0 | 52.9 | 37.5 | 44.5 | 40.4 | 57.1 | 39.1 |
| Llama3.2-11B | 57.2 | 41.9 | 49.0 | 27.8 | 93.9 | 58.2 | 70.2 | 61.4 | 64.5 | 58.1 | 27.6 | 4.5 | 55.6 | 45.4 | 55.3 | 43.9 | 46.5 | 41.4 | 51.8 | 36.6 |
| **Multilingual Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| PaliGemma-3B | 37.3 | 25.8 | 6.0 | 3.5 | 32.1 | 31.9 | 52.9 | 42.9 | 56.5 | 52.2 | 18.7 | 0.8 | 59.7 | 30.5 | 47.9 | 19.9 | 26.3 | 25.2 | 36.0 | 25.6 |
| PALO-7B | 46.3 | 32.2 | 27.0 | 11.8 | 68.9 | 71.2 | 50.9 | 39.2 | 63.3 | 54.2 | 30.4 | 0.8 | 60.5 | 37.8 | 51.4 | 16.3 | 33.1 | 30.5 | 30.8 | 27.8 |
| mBLIP mT0-XL | 35.1 | 29.8 | 2.5 | 0.5 | 32.7 | 28.2 | 40.5 | 37.5 | 67.3 | 66.7 | 31.9 | 3.1 | 44.2 | 39.9 | 44.7 | 36.8 | 29.3 | 30.4 | 22.8 | 25.0 |
| mBLIP BLOOMZ | 36.1 | 30.0 | 4.0 | 1.6 | 43.5 | 41.0 | 44.9 | 36.9 | 62.3 | 58.6 | 22.5 | 10.3 | 43.3 | 36.9 | 44.7 | 24.8 | 29.2 | 30.8 | 30.3 | 29.5 |
| **Pangea** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Pangea-7B (Ours) | 59.9 | 52.7 | 46.0 | 35.6 | 84.2 | 89.5 | 64.4 | 57.2 | 87.0 | 79.0 | 30.4 | 14.2 | 64.7 | 60.2 | 55.3 | 53.2 | 45.7 | 43.7 | 61.4 | 42.1 |
| Δ over SoTA Open | +0.4 | +10.8 | -3.5 | +7.1 | -11.7 | +18.3 | -5.8 | -4.2 | +11.6 | +12.3 | -0.2 | +3.9 | -0.1 | +12.0 | 0.0 | +9.3 | -0.8 | +2.3 | +1.0 | -3.7 |
Multilingual Multimodal Results
We show the performance of models on the multimodal tasks from PangeaBench in Table 1.
The results provide clear insights into the strengths and remaining challenges of Pangea-7B in multilingual and multimodal tasks. Key observations from the evaluation include:
1) Superior English and Multilingual Performance: Pangea-7B outperforms existing open-source models across both English and multilingual tasks. Particularly in cultural understanding (CVQA, MaRVL), it has achieved substantial gains, highlighting its effectiveness in both cross-lingual and cross-cultural contexts.
2) Balanced Cross-Language Capabilities: Unlike many models that exhibit a significant drop in performance when moving from English to multilingual tasks, Pangea-7B is relatively consistent. For instance, in Multimodal Chat tasks, the performance gap between English and multilingual remains relatively small, indicating its ability to handle multiple languages effectively.
3) Challenges Compared to Proprietary Models: While Pangea-7B leads in open-source models, some gaps remain when compared to closed-source models like GPT4o. Additionally, though Pangea-7B narrows the gap between English and multilingual performance, there is still room for improvement in fully closing this divide across all tasks.
Multilingual Text-only Results
We further evaluate our model in text-only scenarios in Table 2. Interesting findings include:
1) Best Text Performance Among Multimodal LLMs: Pangea-7B demonstrates the strongest performance among all multimodal LLMs on the text-only tasks, consistently outperforming baselines like Llava-Next-7B. This highlights that, despite being trained as a multimodal model, Pangea-7B maintains superior text understanding and reasoning capabilities compared to other MLLMs.
2) Maintained Performance from its Text Backbone: Pangea-7B generally maintains or sees slight drops in performance on most text-only benchmarks compared with its text backbone Qwen2-7B-Instruct. Notably, the model shows a significant improvement in MGSM. This improvement is directly attributable to the inclusion of math-related instructions in PangeaIns, which enhances the model's capability to handle complex multilingual reasoning and mathematical tasks.
Table 2. Results on the text-only tasks of PangeaBench (en = English, mul = multilingual; for FLORES-Sub, x→en and en→x translation directions).

| Models | AVG (en) | AVG (mul) | FLORES-Sub (x→en) | FLORES-Sub (en→x) | TyDiQA (en) | TyDiQA (mul) | XStoryCloze (en) | XStoryCloze (mul) | MGSM (en) | MGSM (mul) | MMMLU (en) | MMMLU (mul) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-1.5-7B | 52.1 | 38.7 | 55.6 | 42.4 | 59.7 | 52.7 | 78.1 | 57.4 | 17.6 | 6.4 | 49.5 | 34.7 |
| Qwen2-7B-Instruct | 66.6 | 54.5 | 61.8 | 46.0 | 72.2 | 71.2 | 80.3 | 61.9 | 48.8 | 40.4 | 70.1 | 53.1 |
| Llava-1.5-7B | 53.1 | 39.0 | 54.7 | 41.5 | 66.8 | 52.8 | 79.1 | 57.6 | 14.8 | 7.6 | 50.2 | 35.7 |
| Llava-Next-7B | 54.0 | 38.9 | 54.8 | 41.4 | 68.3 | 52.1 | 79.1 | 57.1 | 15.6 | 7.5 | 52.1 | 36.5 |
| Phi-3.5-Vision | 60.7 | 41.7 | 28.5 | 32.5 | 75.9 | 51.3 | 77.9 | 54.8 | 59.2 | 33.1 | 62.0 | 36.7 |
| PALO-7B | 52.0 | 37.5 | 52.9 | 40.4 | 69.4 | 50.8 | 77.4 | 57.2 | 13.6 | 5.8 | 46.7 | 33.4 |
| Pangea-7B (Ours) | 72.8 | 54.3 | 60.7 | 44.9 | 73.7 | 66.0 | 79.1 | 61.2 | 82.0 | 47.4 | 68.4 | 52.2 |
Scaling Effect of Number of Instructions: Understanding how the quantity of instructions affects model performance is crucial for optimizing training strategies and resource allocation. Figure 5 reveals a clear scaling effect related to the number of instructions used during training: both English and multilingual performance improved consistently as we increased the number of multilingual instructions in PangeaIns. This demonstrates the necessity of scaling multilingual multimodal instruction tuning.
Role of English Data: In multilingual scenarios, English data plays a pivotal role in cross-lingual transfer. To investigate this, we sampled 500K examples from the translated data described in Machine Translated Instructions, ensuring a consistent data distribution. We varied the ratio of English data while keeping the total number of training samples fixed at 500K, distributing the non-English samples evenly across the 17 multilingual languages in the translated subset. As shown in Figure 6, English performance generally improves as the percentage of English data increases. More surprisingly, using no English data (fully multilingual data) results in relatively lower multilingual performance. As we introduce more English data, multilingual performance improves, peaking at 38.7% with 40% English; however, performance drops sharply when English data reaches 100%. This suggests that English data aids cross-lingual transfer, but over-reliance on it harms multilingual performance.
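A minimal sketch of how such fixed-budget mixtures can be assembled, assuming per-language instruction pools; the helper name and data layout are illustrative, not the actual experiment code.

```python
# Sketch of building one fixed-budget ablation mixture: a given fraction of
# English samples, the remainder split evenly across the 17 translated
# languages. Helper name and data layout are illustrative.
import random

def build_mixture(pools, english_ratio, total=500_000, seed=0):
    """pools: dict mapping language code -> list of instruction records ("en" included)."""
    rng = random.Random(seed)
    n_en = int(total * english_ratio)
    other_langs = sorted(lang for lang in pools if lang != "en")
    per_lang = (total - n_en) // len(other_langs)

    mixture = rng.sample(pools["en"], n_en)
    for lang in other_langs:
        mixture += rng.sample(pools[lang], per_lang)
    rng.shuffle(mixture)
    return mixture

# Sweep over English ratios as in the Figure 6 experiment:
# mixes = {r: build_mixture(pools, r) for r in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)}
```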
How does the proportion of training samples in a language affect downstream performance? An interesting question to ask is whether the downstream task performance is correlated with the number of training samples. Our analysis in Figure 7 revealed a nuanced relationship between training sample proportion and downstream performance. While there is a general positive correlation, the impact varies significantly across languages and tasks. For widely spoken languages with rich linguistic resources, we observed a near-linear relationship. However, for low-resource languages, even a small increase in proportion yielded disproportionately large performance gains. Interestingly, we also noted instances of positive transfer between typologically similar languages. These findings suggest that strategic allocation of training samples, considering both language prevalence and linguistic similarities, can optimize overall model performance.
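One simple way to quantify this relationship, assuming per-language training counts and per-task scores are available (the data layout here is illustrative), is a rank correlation per task:

```python
# Sketch of the correlation analysis: relate each language's share of PangeaIns
# training samples to its downstream score on each task. Data layout is illustrative.
from scipy.stats import spearmanr

def proportion_score_correlation(train_counts, scores_by_task):
    """train_counts: dict lang -> number of training samples.
    scores_by_task: dict task -> dict lang -> downstream score."""
    total = sum(train_counts.values())
    correlations = {}
    for task, scores in scores_by_task.items():
        langs = [lang for lang in scores if lang in train_counts]
        proportions = [train_counts[lang] / total for lang in langs]
        rho, pvalue = spearmanr(proportions, [scores[lang] for lang in langs])
        correlations[task] = (rho, pvalue)
    return correlations
```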
Preliminary Explorations of Multilingual OCR: Multilingual OCR emerged as a particularly challenging aspect of Pangea's functionality. We made efforts to enhance its multilingual OCR capabilities. Specifically, we constructed a dataset of 500K multilingual OCR instructions spanning 10 languages, with 50K examples per language, sourced from web user interfaces. Webpages naturally serve as image-rich environments containing abundant text, and by capturing screenshots of websites from various countries in different languages, we were able to gather a substantial number of OCR images. We employed the same model architecture as Pangea but trained it exclusively on these OCR images, reserving a portion of the data as a test set. As shown in Figure 8, the results indicate that improving multilingual OCR performance is feasible with an increase in training samples. However, the OCR accuracy for non-Latin scripts (e.g., Chinese, Japanese, and Korean) remains lower than for Latin-based languages. Looking ahead, we aim to further expand the multilingual OCR training dataset to include more languages and integrate this data into PangeaIns.
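As a rough illustration of how such webpage screenshots could be collected, the sketch below uses Playwright to render and capture pages per language. The seed URLs and viewport settings are illustrative, and pairing each screenshot with the page's rendered text as OCR supervision is an assumption about the data format rather than a description of the actual pipeline.

```python
# Sketch of collecting multilingual OCR images by screenshotting webpages with
# Playwright. Seed URLs and viewport are illustrative; ground-truth text for
# each screenshot (e.g., the page's rendered text) would still need to be
# extracted, which is an assumption about the supervision format.
import os
from playwright.sync_api import sync_playwright

SEED_URLS = {
    "ja": ["https://www.asahi.com"],
    "ko": ["https://www.chosun.com"],
    "zh": ["https://www.people.com.cn"],
}

def capture_screenshots(out_dir="ocr_screens"):
    os.makedirs(out_dir, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        for lang, urls in SEED_URLS.items():
            for i, url in enumerate(urls):
                page.goto(url, wait_until="networkidle")
                page.screenshot(path=os.path.join(out_dir, f"{lang}_{i:05d}.png"))
        browser.close()
```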
We introduced Pangea, a novel multilingual multimodal large language model designed to bridge linguistic and cultural gaps in visual understanding tasks. By leveraging PangeaIns, our newly curated 6M multilingual multimodal instruction data samples, we demonstrated significant improvements in cross-lingual and cross-cultural understanding across 39 typologically diverse languages. Our comprehensive evaluation using PangeaBench revealed Pangea's superior performance compared to existing open-source models, particularly in tasks requiring nuanced cultural interpretation. We also highlight ongoing challenges in areas such as low-resource language support and multilingual OCR. We fully open-source Pangea, PangeaIns, and PangeaBench to facilitate future research to build open and inclusive MLLMs.
The authors would like to thank the Cambrian team for their project webpage template.
@article{yue2024pangeafullyopenmultilingual,
title={Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages},
author={Xiang Yue and Yueqi Song and Akari Asai and Seungone Kim and Jean de Dieu Nyandwi and Simran Khanuja and Anjali Kantharuban and Lintang Sutawika and Sathyanarayanan Ramamoorthy and Graham Neubig},
year={2024},
journal={arXiv preprint arXiv:2410.16153},
url={https://arxiv.org/abs/2410.16153}
}