Text-rich visual understanding—the ability to process environments where dense textual content is integrated with visuals—is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Although they lack direct visual input, text-based LLMs can process the structured text representations found in webpage accessibility trees. The synthesized instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset of 7.3 million samples drawn from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel at web UI tasks—achieving up to a 48% improvement on VisualWebBench and a 30% boost in element accuracy on Mind2Web—but also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across diverse scenarios.
MultiUI is constructed with a four-stage pipeline: webpages are collected at scale, a screenshot and an accessibility tree are extracted from each page, a text-based LLM synthesizes task samples from the accessibility tree, and the samples are paired with the corresponding screenshots to form multimodal instruction data (see the sketch below).
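To make the extraction and synthesis steps concrete, here is a minimal sketch assuming Playwright for rendering and an OpenAI-compatible chat endpoint as the text-based LLM; the prompt, model name, and function names are illustrative, not the authors' actual pipeline code:

```python
import json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_ui(url: str, shot_path: str) -> dict:
    """Render a page, capture a screenshot, and return its accessibility tree."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=shot_path)
        axtree = page.accessibility.snapshot()  # structured text view of the UI
        browser.close()
    return axtree

def synthesize_tasks(axtree: dict) -> str:
    """Ask a text-only LLM to write tasks from the accessibility tree alone."""
    prompt = (
        "Given this webpage accessibility tree, write question-answer pairs "
        "that a model looking at the page's screenshot should answer:\n"
        + json.dumps(axtree)[:8000]  # truncate to fit the context window
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable text LLM works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# The generated instructions are later paired with example.png for training.
tree = extract_ui("https://example.com", "example.png")
print(synthesize_tasks(tree))
```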
The resulting dataset statistics, by platform and task category (Visual Understanding and Reasoning: web/image captioning, web/image QA, action prediction; Grounding: action and element grounding; Text Recognition: heading and element recognition):

| Platform | Web Capt. | Img Capt. | Web QA | Img QA | Act. Pred. | Action Grd. | Elem. Grd. | Head Rec. | Elem. Rec. | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | 150K | 526K | 1.1M | 979K | 65K | 1.2M | 694K | 98K | 175K | 5.0M |
| Mobile | 100K | 0 | 936K | 0 | 34K | 613K | 488K | 74K | 41K | 2.3M |
| Total | 250K | 526K | 2.1M | 979K | 99K | 1.8M | 1.2M | 172K | 217K | 7.3M |
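For a quick look at the released data, the dataset can be streamed from the Hugging Face Hub. A hedged sketch follows: the repository id `neulab/MultiUI`, the `train` split, and the `task` field are assumptions to verify against the actual release:

```python
from collections import Counter
from datasets import load_dataset

# Stream so the 7.3M-sample dataset is not downloaded in full.
ds = load_dataset("neulab/MultiUI", split="train", streaming=True)  # id is an assumption

task_counts = Counter()
for i, sample in enumerate(ds):
    if i >= 1000:  # inspect only the first 1,000 samples
        break
    task_counts[sample.get("task", "unknown")] += 1  # field name is an assumption

print(task_counts.most_common())
```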
Results on GUI understanding (VisualWebBench, WebSRC, ScreenQA Short, Widget Captioning) and GUI grounding (VWB Ele-G, VWB Act-G, ScreenSpot REC/REG, RefExp); REC/REG denote referring expression comprehension/generation:

| Model | VisualWebBench | WebSRC | SQA Short | Widget Cap. | VWB Ele-G | VWB Act-G | SSpot (REC) | SSpot (REG) | RefExp |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 64.6 | - | - | - | 0.2 | 0.0 | - | - | - |
| Pix2Struct | - | - | - | 136.7* | - | - | - | - | - |
| S4 | - | 61.1* | - | 130.6* | - | - | - | - | - |
| SeeClick | 9.7 | - | - | - | - | - | - | - | - |
| CogAgent | 28.7 | - | - | - | 29.3 | 36.6 | - | - | - |
| ScreenAI | - | 87.2* | 94.8* | 156.4* | - | - | - | - | - |
| *Trained with LLaVA-1.5 data* | | | | | | | | | |
| LLaVA-1.5-7B | 17.0 | 30.9 | 42.6 | 20.0 | 0.7 | 0.0 | 0.6 | 24.1 | 0.4 |
| LLaVA-1.5-13B | 19.4 | 32.5 | 46.0 | 10.2 | 0.0 | 0.0 | 0.9 | 22.7 | 1.1 |
| LLaVA-Vicuna | 23.1 | 41.5 | 53.0 | 38.4 | 0.0 | 0.0 | 1.3 | 30.3 | 1.2 |
| *Trained with LLaVA-1.5 data + MultiUI* | | | | | | | | | |
| UIX-Vicuna | 71.1 | 69.5 | 73.9 | 66.5 | 55.5 | 26.7 | 44.7 | 69.5 | 35.8 |
| Δ over LLaVA-Vicuna | +48.0 | +28.0 | +20.9 | +28.1 | +55.5 | +26.7 | +43.4 | +39.2 | +34.6 |
| *Trained with LLaVA-NeXT data* | | | | | | | | | |
| LLaVA-1.6-7B | 36.0 | 67.2 | 66.0 | 35.4 | 0.2 | 0.0 | 0.9 | 77.4 | 0.4 |
| LLaVA-1.6-13B | 39.4 | 71.2 | 68.3 | 23.4 | 0.0 | 1.0 | 0.4 | 74.5 | 0.0 |
| LLaVA-1.6-34B | 50.5 | 83.2 | 74.0 | 46.3 | 1.7 | 3.0 | 2.8 | 72.0 | 3.4 |
| LLaVA-NeXT-8B | 42.1 | 72.8 | 68.0 | 49.8 | 1.0 | 0.0 | 1.7 | 74.0 | 1.1 |
| LLaVA-Llama3.1 | 35.3 | 65.0 | 65.7 | 34.2 | 0.5 | 0.0 | 1.3 | 53.1 | 0.9 |
| LLaVA-Qwen2 | 41.7 | 72.5 | 68.6 | 38.0 | 1.2 | 0.0 | 1.3 | 60.4 | 1.9 |
| *Trained with LLaVA-NeXT data + MultiUI* | | | | | | | | | |
| UIX-Llama3.1 | 74.2 | 75.3 | 72.7 | 55.6 | 16.2 | 11.9 | 22.2 | 93.0 | 17.9 |
| Δ over LLaVA-Llama3.1 | +38.9 | +10.3 | +7.0 | +21.4 | +16.2 | +11.9 | +20.9 | +39.9 | +17.0 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 | 72.7 | 66.1 | 35.6 | 55.2 | 99.9 | 43.5 |
| Δ over LLaVA-Qwen2 | +34.2 | +10.4 | +10.2 | +34.7 | +64.9 | +35.6 | +53.9 | +39.5 | +41.6 |
Results on general OCR, document, chart, and infographic understanding, and on general grounding (RefCOCO+ REC/REG):

| Model | DocVQA | ChartQA | TextVQA | InfoVQA | VisualMRC | OCRBench | RefCOCO+ (REC) | RefCOCO+ (REG) |
|---|---|---|---|---|---|---|---|---|
| GPT-4V | 88.4 | 78.5 | 78.0 | - | - | 64.5 | - | - |
| GPT-4o | 92.8 | 85.7 | - | - | - | 73.6 | - | - |
| Pix2Struct | 76.6 | 58.6 | - | 40.0 | - | - | - | - |
| S4 | - | 55.0 | - | - | - | - | - | - |
| CogAgent | 81.6 | 68.4 | 76.1 | 44.5 | - | - | - | - |
| DocOwl-1.5-Chat | 82.2 | 70.2 | 68.6 | 50.7 | - | - | - | - |
| DocOwl2 | 80.7 | 70.0 | 66.7 | 46.4 | - | - | - | - |
| *Trained with LLaVA-1.5 data* | | | | | | | | |
| LLaVA-1.5-7B | 28.1 | 18.1 | 46.0 | 25.8 | 35.3 | 31.3 | 50.0 | 30.3 |
| LLaVA-1.5-13B | 30.2 | 18.2 | 48.7 | 29.4 | 38.3 | 52.1 | 59.9 | 33.4 |
| LLaVA-Vicuna | 46.1 | 21.2 | 59.6 | 31.9 | 39.7 | 38.1 | 61.7 | 35.2 |
| *Trained with LLaVA-1.5 data + MultiUI* | | | | | | | | |
| UIX-Vicuna | 72.8 | 24.2 | 67.0 | 41.6 | 43.3 | 53.4 | 65.7 | 42.7 |
| Δ over LLaVA-Vicuna | +26.7 | +3.0 | +7.4 | +9.7 | +3.6 | +15.3 | +4.0 | +7.5 |
| *Trained with LLaVA-NeXT data* | | | | | | | | |
| LLaVA-NeXT-7B | 74.4 | 54.8 | 64.8 | 37.0 | 33.3 | 52.1 | 77.0 | 34.4 |
| LLaVA-NeXT-13B | 77.5 | 62.4 | 67.0 | 41.5 | 35.9 | 55.0 | 80.8 | 35.6 |
| LLaVA-NeXT-34B | 83.9 | 68.6 | 69.4 | 51.3 | 37.9 | 57.2 | 84.8 | 33.2 |
| LLaVA-NeXT-8B | 78.2 | 69.2 | 65.3 | 37.6 | 29.3 | 55.2 | 79.5 | 30.2 |
| LLaVA-Llama3.1 | 74.7 | 66.5 | 64.3 | 35.7 | 46.8 | 54.0 | 74.8 | 30.0 |
| LLaVA-Qwen2 | 76.5 | 68.5 | 67.0 | 41.1 | 44.1 | 55.7 | 75.9 | 30.8 |
| *Trained with LLaVA-NeXT data + MultiUI* | | | | | | | | |
| UIX-Llama3.1 | 78.0 | 66.9 | 65.1 | 44.2 | 49.7 | 58.6 | 71.7 | 33.3 |
| Δ over LLaVA-Llama3.1 | +3.3 | +0.4 | +0.8 | +8.5 | +2.9 | +4.6 | -3.1 | +3.3 |
| UIX-Qwen2 | 85.3 | 74.0 | 72.7 | 52.2 | 49.1 | 66.3 | 79.1 | 37.2 |
| Δ over LLaVA-Qwen2 | +8.8 | +5.5 | +5.7 | +11.1 | +5.0 | +10.6 | +3.2 | +6.4 |
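The REC columns above are conventionally scored as accuracy at IoU ≥ 0.5 between the predicted and ground-truth boxes. A minimal sketch of that standard metric (our illustration, not the paper's evaluation code):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def rec_accuracy(preds, golds, thresh=0.5):
    """Fraction of predicted boxes whose IoU with the gold box meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, golds))
    return hits / len(golds)

# Example: one hit (IoU ~ 0.83) and one miss (no overlap) -> accuracy 0.5
print(rec_accuracy([(10, 10, 50, 50), (0, 0, 5, 5)],
                   [(12, 8, 48, 52), (100, 100, 120, 120)]))
```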
Results on Mind2Web across its three generalization splits (Step SR = step success rate; Ele. Acc. = element accuracy):

| Model | Cross-Task Step SR | Cross-Task Ele. Acc. | Cross-Website Step SR | Cross-Website Ele. Acc. | Cross-Domain Step SR | Cross-Domain Ele. Acc. |
|---|---|---|---|---|---|---|
| SeeClick | 25.5 | 28.3 | 16.4 | 21.4 | 20.8 | 23.2 |
| CogAgent | 26.9 | 30.2 | 23.4 | 27.3 | 28.5 | 33.1 |
| LLaVA-Qwen2 | - | 7.5 | - | 7.6 | - | 10.4 |
| UIX-Qwen2 | - | 13.5 | - | 9.8 | - | 13.8 |
| UIX-Qwen2-M2W | 38.2 | 43.4 | 31.0 | 39.2 | 34.9 | 40.4 |
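On Mind2Web, element accuracy checks whether the predicted target element matches the ground truth, while step success rate additionally requires the predicted operation to be correct. A simplified exact-match sketch of that scoring, assuming per-step records with the field names shown (our illustration, not the official evaluation script, which scores typed values more leniently):

```python
def score_steps(steps):
    """steps: list of dicts with predicted/gold element ids and operations.
    Returns (element accuracy, step success rate), both in [0, 1]."""
    ele_hits, step_hits = 0, 0
    for s in steps:
        ele_ok = s["pred_element"] == s["gold_element"]
        op_ok = s["pred_op"] == s["gold_op"]  # action type plus any typed value
        ele_hits += ele_ok
        step_hits += ele_ok and op_ok
    n = len(steps)
    return ele_hits / n, step_hits / n

# Example: a correct element with a wrong typed value counts toward
# element accuracy but not step success rate.
steps = [
    {"pred_element": "e7", "gold_element": "e7",
     "pred_op": ("TYPE", "nyc"), "gold_op": ("TYPE", "new york")},
    {"pred_element": "e3", "gold_element": "e3",
     "pred_op": ("CLICK", None), "gold_op": ("CLICK", None)},
]
print(score_steps(steps))  # (1.0, 0.5)
```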
If you find MultiUI useful, please cite:

```bibtex
@misc{liu2024harnessingwebpageuistextrich,
      title={Harnessing Webpage UIs for Text-Rich Visual Understanding},
      author={Junpeng Liu and Tianyue Ou and Yifan Song and Yuxiao Qu and Wai Lam and Chenyan Xiong and Wenhu Chen and Graham Neubig and Xiang Yue},
      year={2024},
      eprint={2410.13824},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.13824},
}
```