Harnessing Webpage UIs for Text-Rich Visual Understanding

Carnegie Mellon University · The Chinese University of Hong Kong · Peking University · University of Waterloo
+Equal contribution.
Corresponding to: jpliu@link.cuhk.edu.hk, xyue2@andrew.cmu.edu

MultiUI: a dataset of 7.3 million samples spanning various UI types and tasks, structured using enhanced accessibility trees and task taxonomies.




Abstract

Text-rich visual understanding—the ability to process environments where dense textual content is integrated with visuals—is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks—achieving up to a 48% improvement on VisualWebBench and a 30% boost in element accuracy on Mind2Web—but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.

Method

We construct MultiUI with a pipeline of four main stages (a rough code sketch of stages 1-3 follows this list):

  1. Website Scraping
  2. Website Curation using Llama-3-70B-Instruct
  3. Task Extraction, utilizing Llama-3-70B-Instruct, GPT-4o mini, and rule-based approaches to generate web UI tasks across three categories:
    • Visual understanding and reasoning
    • Text recognition
    • Grounding
  4. Task Sample Generation: for each task, generate samples by applying diverse instruction templates paraphrased by GPT-4o.
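The sketch below is a minimal, hypothetical rendering of stages 1-3, not the authors' released code. It assumes Playwright for rendering pages and capturing accessibility trees, and the OpenAI client for GPT-4o mini calls (the paper also uses Llama-3-70B-Instruct); the prompts and JSON handling are simplified placeholders.

```python
# Sketch of a MultiUI-style pipeline (stages 1-3), assuming Playwright and the OpenAI client.
# Prompts and helper names are hypothetical placeholders, not the authors' actual code.
import json
from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CURATION_PROMPT = ("Given this accessibility tree, is the page content-rich and suitable "
                   "for task generation? Answer yes or no.")
TASK_PROMPT = ("Generate webpage QA pairs grounded in this accessibility tree. "
               "Return a JSON list of objects with 'question' and 'answer' fields.")

def scrape(url: str, out_prefix: str) -> dict:
    """Stage 1: render a page, save a screenshot, and dump its accessibility tree."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=f"{out_prefix}.png", full_page=False)
        axtree = page.accessibility.snapshot()  # structured text representation of the UI
        browser.close()
    return {"url": url, "screenshot": f"{out_prefix}.png", "axtree": axtree}

def curate(axtree: dict) -> bool:
    """Stage 2: ask a text-only LLM whether the page is worth keeping."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": CURATION_PROMPT + "\n" + json.dumps(axtree)[:8000]}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def extract_tasks(axtree: dict) -> list[dict]:
    """Stage 3: generate task samples (here, webpage QA only) from the accessibility tree."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": TASK_PROMPT + "\n" + json.dumps(axtree)[:8000]}],
    )
    return json.loads(resp.choices[0].message.content)  # assumes the model returns clean JSON

record = scrape("https://example.com", "sample_0")
if curate(record["axtree"]):
    record["tasks"] = extract_tasks(record["axtree"])
```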

We ultimately curated a dataset of 7.3 million web UI-related samples in VQA form, covering nine tasks across perception, comprehension, grounding, and reasoning capabilities; an illustrative (hypothetical) sample record is sketched in the Example Training Data section below.
Task categories: Visual Understanding and Reasoning (Web Capt., Img Capt., Web QA, Img QA, Act. Pred.), Grounding (Action Grd., Elem. Grd.), and Text Recognition (Heading OCR, Elem. OCR).

| Platform | Web Capt. | Img Capt. | Web QA | Img QA | Act. Pred. | Action Grd. | Elem. Grd. | Heading OCR | Elem. OCR | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | 150K | 526K | 1.1M | 979K | 65K | 1.2M | 694K | 98K | 175K | 5.0M |
| Mobile | 100K | 0 | 936K | 0 | 34K | 613K | 488K | 74K | 41K | 2.3M |
| Total | 250K | 526K | 2.1M | 979K | 99K | 1.8M | 1.2M | 172K | 217K | 7.3M |

Model Performance

Overall, our dataset significantly improves model performance on GUI understanding and grounding benchmarks across all three model backbones we tested (LLaVA-Vicuna, LLaVA-Llama3.1, and LLaVA-Qwen2). Training on MultiUI also improves performance on agent tasks: UIX-Qwen2-M2W surpasses both SeeClick and CogAgent on Mind2Web. In addition, training on MultiUI yields gains in OCR-related scenarios, encompassing document, chart, and infographic understanding, as well as in general grounding tasks.
GUI understanding benchmarks: VisualWebBench, WebSRC, SQA Short, Widget Cap. GUI grounding benchmarks: VWB Ele-G, VWB Act-G, SSpot (REC), SSpot (REG), RefExp.

| Model | VisualWebBench | WebSRC | SQA Short | Widget Cap. | VWB Ele-G | VWB Act-G | SSpot (REC) | SSpot (REG) | RefExp |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4V | 64.6 | - | - | - | 0.2 | 0 | - | - | - |
| Pix2Struct | - | - | - | 136.7* | - | - | - | - | - |
| S4 | - | 61.1* | - | 130.6* | - | - | - | - | - |
| SeeClick | 9.7 | - | - | - | - | - | - | - | - |
| CogAgent | 28.7 | - | - | - | 29.3 | 36.6 | - | - | - |
| ScreenAI | - | 87.2* | 94.8* | 156.4* | - | - | - | - | - |
| Trained with LLaVA-1.5 data |  |  |  |  |  |  |  |  |  |
| LLaVA-1.5-7B | 17.0 | 30.9 | 42.6 | 20.0 | 0.7 | 0.0 | 0.6 | 24.1 | 0.4 |
| LLaVA-1.5-13B | 19.4 | 32.5 | 46.0 | 10.2 | 0.0 | 0.0 | 0.9 | 22.7 | 1.1 |
| LLaVA-Vicuna | 23.1 | 41.5 | 53.0 | 38.4 | 0.0 | 0.0 | 1.3 | 30.3 | 1.2 |
| Trained with LLaVA-1.5 data + MultiUI |  |  |  |  |  |  |  |  |  |
| UIX-Vicuna | 71.1 | 69.5 | 73.9 | 66.5 | 55.5 | 26.7 | 44.7 | 69.5 | 35.8 |
| Δ over LLaVA-Vicuna | +48.0 | +28.0 | +20.9 | +28.1 | +55.5 | +26.7 | +43.4 | +39.2 | +34.6 |
| Trained with LLaVA-NeXT data |  |  |  |  |  |  |  |  |  |
| LLaVA-1.6-7B | 36.0 | 67.2 | 66.0 | 35.4 | 0.2 | 0.0 | 0.9 | 77.4 | 0.4 |
| LLaVA-1.6-13B | 39.4 | 71.2 | 68.3 | 23.4 | 0.0 | 1.0 | 0.4 | 74.5 | 0.0 |
| LLaVA-1.6-34B | 50.5 | 83.2 | 74.0 | 46.3 | 1.7 | 3.0 | 2.8 | 72.0 | 3.4 |
| LLaVA-NeXT-8B | 42.1 | 72.8 | 68.0 | 49.8 | 1.0 | 0.0 | 1.7 | 74.0 | 1.1 |
| LLaVA-Llama3.1 | 35.3 | 65.0 | 65.7 | 34.2 | 0.5 | 0.0 | 1.3 | 53.1 | 0.9 |
| LLaVA-Qwen2 | 41.7 | 72.5 | 68.6 | 38.0 | 1.2 | 0.0 | 1.3 | 60.4 | 1.9 |
| Trained with MultiUI + LLaVA-NeXT data |  |  |  |  |  |  |  |  |  |
| UIX-Llama3.1 | 74.2 | 75.3 | 72.7 | 55.6 | 16.2 | 11.9 | 22.2 | 93.0 | 17.9 |
| Δ over LLaVA-Llama3.1 | +38.9 | +10.3 | +7.0 | +21.4 | +16.2 | +11.9 | +20.9 | +39.9 | +17.0 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 | 72.7 | 66.1 | 35.6 | 55.2 | 99.9 | 43.5 |
| Δ over LLaVA-Qwen2 | +34.2 | +10.4 | +10.2 | +34.7 | +64.9 | +35.6 | +53.9 | +39.5 | +41.6 |



General OCR / document / chart understanding benchmarks: DocVQA, ChartQA, TextVQA, InfoVQA, VisualMRC, OCRBench. General grounding benchmarks: RefCOCO+ (REC), RefCOCO+ (REG).

| Model | DocVQA | ChartQA | TextVQA | InfoVQA | VisualMRC | OCRBench | RefCOCO+ (REC) | RefCOCO+ (REG) |
|---|---|---|---|---|---|---|---|---|
| GPT-4V | 88.4 | 78.5 | 78.0 | - | - | 64.5 | - | - |
| GPT-4o | 92.8 | 85.7 | - | - | - | 73.6 | - | - |
| Pix2Struct | 76.6 | 58.6 | - | 40.0 | - | - | - | - |
| S4 | - | 55.0 | - | - | - | - | - | - |
| CogAgent | 81.6 | 68.4 | 76.1 | 44.5 | - | - | - | - |
| DocOwl-1.5-Chat | 82.2 | 70.2 | 68.6 | 50.7 | - | - | - | - |
| DocOwl2 | 80.7 | 70.0 | 66.7 | 46.4 | - | - | - | - |
| Trained with LLaVA-1.5 data |  |  |  |  |  |  |  |  |
| LLaVA-1.5-7B | 28.1 | 18.1 | 46.0 | 25.8 | 35.3 | 31.3 | 50.0 | 30.3 |
| LLaVA-1.5-13B | 30.2 | 18.2 | 48.7 | 29.4 | 38.3 | 52.1 | 59.9 | 33.4 |
| LLaVA-Vicuna | 46.1 | 21.2 | 59.6 | 31.9 | 39.7 | 38.1 | 61.7 | 35.2 |
| Trained with MultiUI + LLaVA-1.5 data |  |  |  |  |  |  |  |  |
| UIX-Vicuna | 72.8 | 24.2 | 67.0 | 41.6 | 43.3 | 53.4 | 65.7 | 42.7 |
| Δ over LLaVA-Vicuna | +26.7 | +3.0 | +7.4 | +9.7 | +3.6 | +15.3 | +4.0 | +7.5 |
| Trained with LLaVA-NeXT data |  |  |  |  |  |  |  |  |
| LLaVA-NeXT-7B | 74.4 | 54.8 | 64.8 | 37.0 | 33.3 | 52.1 | 77.0 | 34.4 |
| LLaVA-NeXT-13B | 77.5 | 62.4 | 67.0 | 41.5 | 35.9 | 55.0 | 80.8 | 35.6 |
| LLaVA-NeXT-34B | 83.9 | 68.6 | 69.4 | 51.3 | 37.9 | 57.2 | 84.8 | 33.2 |
| LLaVA-NeXT-8B | 78.2 | 69.2 | 65.3 | 37.6 | 29.3 | 55.2 | 79.5 | 30.2 |
| LLaVA-Llama3.1 | 74.7 | 66.5 | 64.3 | 35.7 | 46.8 | 54.0 | 74.8 | 30.0 |
| LLaVA-Qwen2 | 76.5 | 68.5 | 67.0 | 41.1 | 44.1 | 55.7 | 75.9 | 30.8 |
| Trained with MultiUI + LLaVA-NeXT data |  |  |  |  |  |  |  |  |
| UIX-Llama3.1 | 78.0 | 66.9 | 65.1 | 44.2 | 49.7 | 58.6 | 71.7 | 33.3 |
| Δ over LLaVA-Llama3.1 | +3.3 | +0.4 | +0.8 | +8.5 | +2.9 | +4.6 | -3.1 | +3.3 |
| UIX-Qwen2 | 85.3 | 74.0 | 72.7 | 52.2 | 49.1 | 66.3 | 79.1 | 37.2 |
| Δ over LLaVA-Qwen2 | +8.8 | +5.5 | +5.7 | +11.1 | +5.0 | +10.6 | +3.2 | +6.4 |



Mind2Web results (Step SR = step success rate; Ele. Acc = element accuracy):

| Model | Cross-Task Step SR | Cross-Task Ele. Acc | Cross-Website Step SR | Cross-Website Ele. Acc | Cross-Domain Step SR | Cross-Domain Ele. Acc |
|---|---|---|---|---|---|---|
| SeeClick | 25.5 | 28.3 | 16.4 | 21.4 | 20.8 | 23.2 |
| CogAgent | 26.9 | 30.2 | 23.4 | 27.3 | 28.5 | 33.1 |
| LLaVA-Qwen2 | - | 7.5 | - | 7.6 | - | 10.4 |
| UIX-Qwen2 | - | 13.5 | - | 9.8 | - | 13.8 |
| UIX-Qwen2-M2W | 38.2 | 43.4 | 31.0 | 39.2 | 34.9 | 40.4 |
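For reference on the two Mind2Web metrics above: element accuracy (Ele. Acc) counts a step as correct when the predicted target element matches the gold element, and step success rate (Step SR) additionally requires the predicted operation (and its typed value, if any) to match. A minimal sketch under these common definitions, using hypothetical per-step field names:

```python
# Sketch of Mind2Web-style step metrics; `steps` is a hypothetical list of per-step records.
def mind2web_metrics(steps: list[dict]) -> dict:
    """Each record holds predicted/gold element ids, operations, and optional typed values."""
    ele_correct = step_success = 0
    for s in steps:
        ele_ok = s["pred_element"] == s["gold_element"]
        op_ok = s["pred_op"] == s["gold_op"] and s.get("pred_value") == s.get("gold_value")
        ele_correct += ele_ok
        step_success += ele_ok and op_ok  # a step succeeds only if element AND operation match
    n = len(steps)
    return {"element_accuracy": 100 * ele_correct / n, "step_sr": 100 * step_success / n}

print(mind2web_metrics([
    {"pred_element": "e12", "gold_element": "e12", "pred_op": "CLICK", "gold_op": "CLICK"},
    {"pred_element": "e7", "gold_element": "e9", "pred_op": "TYPE", "gold_op": "TYPE",
     "pred_value": "shoes", "gold_value": "shoes"},
]))
# -> {'element_accuracy': 50.0, 'step_sr': 50.0}
```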

Example Training Data
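The released examples are presented as annotated screenshots on the project page and are not reproduced here. Purely as a hypothetical illustration of the VQA-style format described in the Method section (the field names and values below are invented for illustration, not taken from the released dataset), a single sample might look like:

```python
# Hypothetical MultiUI-style sample record; illustrative only, not the dataset's actual schema.
sample = {
    "image": "screenshots/desktop/000123.png",  # paired UI screenshot (hypothetical path)
    "task": "element_grounding",                # one of the nine task types
    "instruction": "Which element should be clicked to open the pricing page? "
                   "Answer with a bounding box [x1, y1, x2, y2].",
    "answer": "[412, 88, 501, 112]",
    "platform": "desktop",
}
```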

BibTeX

@misc{liu2024harnessingwebpageuistextrich,
      title={Harnessing Webpage UIs for Text-Rich Visual Understanding}, 
      author={Junpeng Liu and Tianyue Ou and Yifan Song and Yuxiao Qu and Wai Lam and Chenyan Xiong and Wenhu Chen and Graham Neubig and Xiang Yue},
      year={2024},
      eprint={2410.13824},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.13824}, 
}