GUI-World: A Dataset for GUI-Oriented Multimodal Large Language Models

Dongping Chen1*, Yue Huang2*, Siyuan Wu1*, Jingyu Tang1*, Liuyi Chen1, Yilin Bai1, Zhigang He1, Chenlong Wang1, Huichi Zhou1, Yiqiang Li1, Tianshuo Zhou1, Yue Yu1, Chujie Gao1, Qihui Zhang1, Yi Gui1, Zhen Li1, Yao Wan1, Pan Zhou1, Jianfeng Gao3, Lichao Sun4
1Huazhong University of Science and Technology,
2University of Notre Dame, 3Microsoft Research, 4Lehigh University

In this work, we introduce a comprehensive GUI-oriented dataset, GUI-World, to benchmark and enhance GUI understanding capabilities. Specifically, our contributions are three-fold:
  1. A Dataset. We propose GUI-WORLD, a comprehensive GUI dataset comprising over 12,000 videos specifically designed to assess and improve the GUI understanding capabilities of MLLMs, spanning a range of categories and scenarios, including desktop, mobile, and extended reality (XR), and representing the first GUI-oriented instruction-tuning dataset in the video domain.
  2. A Novel Model. Based on GUI-WORLD, we propose GUI-Vid, a GUI-oriented VideoLLM with enhanced capabilities to handle varied and complex GUI tasks. GUI-Vid shows a significant improvement on the benchmark and achieves results comparable to the top-performing models.
  3. Comprehensive Experiments and Valuable Insights. Our experiments indicate that most existing MLLMs continue to face challenges with GUI-oriented tasks, particularly in sequential and dynamic GUI content. Empirical findings suggest that improvements in vision perception, along with an increase in the number of keyframes and higher resolution, can boost performance in GUI-oriented tasks, thereby paving the way for the future of GUI agents.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the scarcity of GUI-specific video data. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding.

GUI-World Dataset Construction

We introduce **GUI-World**, a comprehensive dataset covering six GUI scenarios, pairing videos with human-annotated keyframes, detailed captions, and diverse types of QA produced by our data curation framework, with the aim of benchmarking and enhancing general GUI-oriented capabilities. These GUI scenarios encompass desktop operating systems (e.g., macOS, Windows), mobile platforms (e.g., Android and iOS), websites, software, and even extended reality (XR) (e.g., the GUI of Apple Vision Pro). A discussion of each scenario is provided in Six Main GUI Categories. The construction of the **GUI-World** dataset follows a two-stage process:
  1. GUI Video Collection and Image Sequence Processing: In this phase, a group of 24 undergraduate and graduate students manually collects GUI-related videos from YouTube or records screens themselves. These students then use video-editing software to cut the videos into short clips, each containing various human operations on GUI content, and annotate them with detailed operational descriptions.
  2. Diversifying QA Types through MLLM-Human Collaboration: Since human annotations may contain grammar errors or unclear statements, we use an MLLM, specifically GPT-4V, to first refine the descriptions of the image sequences and then generate various types of QA focusing on static and dynamic GUI content, aiming to comprehensively test MLLMs' GUI-oriented abilities. Finally, all MLLM-generated content is carefully reviewed by human annotators to ensure alignment with the original human intent. A minimal sketch of this curation step is shown after the list.
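To make the second stage concrete, the sketch below shows how a vision LLM can refine a human-written caption and generate QA pairs from annotated keyframes. It is a minimal illustration using the OpenAI Python SDK; the model name, prompt wording, and JSON schema are assumptions rather than the exact curation pipeline.

```python
# A minimal sketch of the stage-2 curation step, assuming the OpenAI Python SDK.
# The model name, prompt wording, and output schema are illustrative assumptions,
# not the authors' exact pipeline.
import base64
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_keyframe(path: str) -> dict:
    """Pack one keyframe as a base64-encoded image content block."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}


def curate_clip(keyframe_paths: list[str], human_caption: str) -> dict:
    """Refine a human-written caption and generate QA pairs for one video clip."""
    instruction = (
        "You are given keyframes from a GUI screen recording and a human-written "
        "description of the operations performed.\n"
        f"Human description: {human_caption}\n"
        "1) Rewrite the description fluently without changing its meaning.\n"
        "2) Generate one free-form QA pair and one multiple-choice QA pair about "
        "the static and dynamic GUI content.\n"
        "Respond in JSON with keys: caption, free_form_qa, multiple_choice_qa."
    )
    content = [{"type": "text", "text": instruction}]
    content += [encode_keyframe(p) for p in keyframe_paths]
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4V model used during curation
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The MLLM output generated this way is then passed to human verification, as described in the second step above.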

Data Statistics and Comparison

Below we present an overview of the main statistics of **GUI-World**, showcasing the breadth of its scenarios and tasks. **GUI-World** contains a total of 12k videos and 100k queries. A comparison of **GUI-World** against other benchmarks and datasets in the GUI domain is presented below.
GUI-WORLD spans both vision-language (VL) and video data across Web, mobile (Mob.), desktop (Desk.), and XR scenarios, and covers sequential, cross-app, and dynamic GUI content.

| Dataset | Instances | Sem. | Detailed Tasks |
|---|---|---|---|
| **GUI-WORLD** | 12,379 | Both | GUI Understanding, Instruction Following |
| AgentStudio | 304 | High | General Control |
| OSWorld | 369 | High | General Control |
| UGIF | 523 | High | UI Grounded Instruction Following |
| AitW | 715,142 | High | GUI Understanding |
| Mind2Web | 2,350 | Both | Web Navigation |
| Rico | 72,219 | Low | UI Code/Layout Generation |
| FerretUI | 123,702 | Low | UI Grounding & Understanding |
| WebArena | 812 | Low | Web Navigation |
| MetaGUI | 1,125 | Low | Mobile Navigation |
| MiniWoB++ | 100 | Low | Web Navigation |
| OmniAct | 9,802 | Low | Code Generation |
| MMINA | 1,050 | Low | Web Navigation |

Benchmark

We conduct evaluations on four of the most robust image-based MLLMs: GPT-4V(ision), GPT-4o, Qwen-VL-Max, and Gemini-Pro-1.5. We benchmark under three keyframe-selection settings: (1) *Random*, where frames are sampled at fixed time intervals within a video; (2) *Extracted*, where keyframes are extracted using [Katna](https://github.com/keplerlab/katna); and (3) *Human*, where keyframes are selected by humans during annotation. For the *Random* and *Extracted* settings, we input 10 frames into each MLLM, while the *Human* setting uses an average of 6.719 frames. Each model is prompted with a three-step Chain-of-Thought (CoT) process, i.e., "Describe-Analyze-Answer", to elicit its peak performance. Additionally, we assess three advanced VideoLLMs (ChatUnivi, Minigpt4-video, and Videochat2) on GUI content. To assess free-form questions and multi-round conversations, we use the LLM-as-a-Judge methodology, which assigns a similarity score from 1 to 5 between the MLLM's response and a predefined golden answer. For a comprehensive evaluation, we also report BLEU and BERTScore in our paper. For multiple-choice questions, we use accuracy as the primary evaluation metric. **We are actively updating the benchmark with new LLMs, VLMs, and methods. Pull requests are welcome!**
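For concreteness, the following is a minimal sketch of the two scoring paths described above: accuracy for multiple-choice questions and an LLM-as-a-Judge similarity score from 1 to 5 for free-form answers. The judge model and prompt here are assumptions for illustration, not the exact configuration used in our evaluation.

```python
# A minimal sketch of the scoring protocol, assuming the OpenAI Python SDK.
# The judge prompt and judge model are illustrative assumptions.
import re

from openai import OpenAI

client = OpenAI()


def mc_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy on the chosen option letter (e.g. 'A', 'B', 'C')."""
    correct = sum(pred.strip().upper().startswith(gold.strip().upper())
                  for pred, gold in zip(predictions, answers))
    return correct / len(answers)


def judge_score(question: str, golden_answer: str, response: str) -> int:
    """Rate the similarity of a free-form response to the golden answer (1-5)."""
    prompt = (
        "Rate how well the model response matches the golden answer on a 1-5 scale "
        "(5 = semantically equivalent, 1 = unrelated). Reply with a single digit.\n"
        f"Question: {question}\n"
        f"Golden answer: {golden_answer}\n"
        f"Model response: {response}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # the judge model here is an assumption
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # fall back to the lowest score
```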
Notice: F.K. = keyframes used in fine-tuning, E.K. = keyframes used in evaluation, I. = image data, V. = video data, MC = multiple-choice QA (accuracy), Free = free-form QA (LLM-judge score, 1-5).
| Setting | F.K. | E.K. | Data (I. / V.) | Software (MC / Free) | Website (MC / Free) | XR (MC / Free) | Multi (MC / Free) | iOS (MC / Free) | Android (MC / Free) | Avg. (MC / Free) |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | – | 8 | – / – | 45.5% / 2.144 | 42.6% / 2.221 | 44.0% / 2.005 | 40.4% / 2.222 | 40.2% / 2.169 | 44.7% / 2.119 | 42.9% / 2.147 |
| Baseline | – | 16 | – / – | 45.1% / 2.144 | 41.8% / 2.240 | 41.0% / 2.007 | 40.7% / 2.238 | 39.9% / 2.138 | 44.7% / 2.147 | 42.2% / 2.154 |
| GUI-Vid | 8 | 8 |  | 58.3% / 2.709 | 53.6% / 2.817 | 62.2% / 2.626 | 54.2% / 2.627 | 53.1% / 2.708 | 54.9% / 2.501 | 56.0% / 2.665 |
| GUI-Vid | 8 | 8 |  | 59.9% / 2.856 | 54.1% / 2.925 | 59.0% / 2.751 | 52.1% / 2.837 | 50.0% / 2.756 | 54.0% / 2.571 | 54.8% / 2.782 |
| GUI-Vid | 8 | 16 |  | 59.0% / 2.709 | 55.1% / 2.821 | 62.8% / 2.645 | 53.3% / 2.624 | 55.5% / 2.727 | 55.7% / 2.501 | 56.9% / 2.671 |
| GUI-Vid | 8 | 16 |  | 59.9% / 2.847 | 54.1% / 2.957 | 55.6% / 2.764 | 52.9% / 2.861 | 51.8% / 2.772 | 53.4% / 2.572 | 54.6% / 2.796 |

Empirical Results

Commercial ImageLLMs outperform Open-source VideoLLMs in Zero-shot Settings

Commercial ImageLLMs, notably GPT-4V and GPT-4o, consistently outperform open-source VideoLLMs in zero-shot settings. GPT-4o exhibits superior performance across all GUI scenarios in complex tasks, reflected in its high scores in both multiple-choice and free-form queries, with averages of 84.8% (accuracy) and 3.573 (judge score). Similarly, Gemini demonstrates strong capabilities in captioning and descriptive tasks within software and iOS environments, scoring 2.836 and 2.936, respectively. Further analysis reveals that GPT-4V excels in applications with minimal textual content and simple layouts, such as TikTok, health apps, and GitHub. In contrast, its performance drops in more intricate applications like Microsoft ToDo and XR software. As for VideoLLMs, their significantly poorer performance is attributed to two main factors: their inability to accurately interpret GUI content from user inputs, and a lack of sufficient GUI-oriented pretraining, which is evident from their inadequate performance even in basic captioning and description tasks.

Performance Varies across Different GUI Scenarios

GPT-4V and Gemini excel in common scenarios such as mobile and website interfaces but show marked deficiencies in more complex GUI environments like XR and multi-window interactions, across both captioning and intricate tasks. This performance gap highlights a significant shortfall in understanding environments where GUI elements are scattered and demand sophisticated interpretation. It emphasizes the critical need for specialized benchmarks and datasets tailored to these complex GUI scenarios, which is essential for enhancing the GUI-oriented capabilities of MLLMs, paving the way for them to become truly reliable and high-performing general control agents.

Keyframe Selection is Important for GUI-oriented Tasks

Across both basic tasks such as captioning and more complex tasks like prediction and reasoning, significant variations are evident among keyframe selection methods. GPT-4V and Gemini benefit substantially from randomly sampled and human-selected keyframes, scoring approximately 0.2-0.3 points higher in both captioning and free-form tasks than with programmatic extraction. This suggests that traditional keyframe-extraction techniques, designed for natural videos, are less effective at detecting essential GUI operations, particularly when subtle movements like mouse clicks and dynamic changes are involved. Conversely, the difference in performance is relatively smaller for Qwen-VL-Max, indicating that while keyframe selection is crucial for models proficient in GUI content, it exerts less influence on less capable models.
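As an illustration, the *Random* setting can be approximated with OpenCV as sketched below, sampling frames at evenly spaced positions; the *Extracted* setting instead relies on Katna's keyframe extraction, and the *Human* setting on annotator-chosen frames. The function name and interval scheme are assumptions for illustration.

```python
# A minimal sketch of the *Random* keyframe setting: sample N frames at fixed,
# evenly spaced positions in a GUI screen recording using OpenCV.
import cv2


def sample_fixed_interval(video_path: str, n_frames: int = 10) -> list:
    """Return up to n_frames BGR frames taken at evenly spaced frame indices."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n_frames):
        idx = int(i * total / n_frames)          # evenly spaced across the clip
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)    # seek to the target frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```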

Dynamic GUI Tasks Continue to Challenge MLLMs

In fine-grained tasks, GPT-4V and GPT-4o excel with static GUI content and prediction tasks over image sequences but struggle to provide detailed descriptions for entire videos and dynamic GUI content. This discrepancy is attributed to minor variations in the GUI that significantly impact descriptions. Increasing the number of keyframes and the granularity of perception might mitigate these issues. Among VideoLLMs, ChatUnivi excels in conversational tasks by effectively leveraging contextual nuances, particularly in subsequent rounds, yet it underperforms in GUI-oriented captioning tasks. In contrast, GUI-Vid demonstrates proficiency in sequential tasks but falls short in both captioning and static content handling. This gap is linked to deficiencies in GUI-Vid's pretraining, which lacked comprehensive GUI content crucial for effective vision-text alignment, a shortcoming that the instruction-tuning process also failed to fully address.

Vision Perception is Important for Sequential GUI Tasks

Integrating detailed textual information slightly outperforms purely vision-based inputs or detailed captions alone, akin to a Chain-of-Thought (CoT) setting. Surprisingly, GPT-4V excels in caption and prediction tasks with just detailed captions, offering insights into enhancing specific GUI-oriented tasks through additional textual information. However, it still falls short in more challenging tasks, such as retrieving static or dynamic content. This underscores the critical role of visual perception in GUI environments, where even minor changes can significantly impact outcomes.

Substantial Enhancement of GUI-Vid on Graphics-based Interfaces after Fine-tuning on GUI-World

As a pioneering study in training VideoLLMs as screen agents, GUI-Vid significantly outperforms the baseline model, showing an average improvement of 30% across various tasks and GUI scenarios, even surpassing the commercial ImageLLM Qwen-VL-Max. This enhancement is particularly notable in captioning and prediction over image sequences, where GUI-Vid matches the performance of GPT-4V and Gemini-Pro. Our two-stage progressive fine-tuning significantly enhances performance in all GUI scenarios. Remarkably, GUI-Vid scored 3.747 on caption tasks in the XR scenario, highlighting its potential in XR applications and the high quality of the annotations provided by our dataset. However, in multiple-choice QA and chatbot tasks, GUI-Vid still lags behind industry leaders like GPT-4V and Gemini-Pro, a discrepancy likely due to the base LLM's weaker performance and the challenges of instruction-based fine-tuning.

Upper Bound of GUI-oriented Capability with More Keyframes and Higher Resolution

Our two ablation studies during the fine-tuning phase demonstrate that utilizing GUI image-text captioning data significantly enhances the model's preliminary understanding of GUI elements, outperforming training that relies solely on videos. Additionally, an increased number of keyframes correlates with improved performance across various scenarios, notably in environments featuring multiple windows and software applications. Further evidence reveals that higher image resolutions substantially boost task performance, both basic and complex, for GPT-4o. These findings underscore the potential for further developing a more robust GUI Agent.
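As a rough illustration of the resolution effect for GPT-4o, the OpenAI API exposes a `detail` parameter on image inputs that trades effective resolution for token cost; the sketch below shows how a higher-detail request could be issued. The prompt and helper function are illustrative, and the exact configuration used in the ablation is reported in the paper.

```python
# A minimal sketch of requesting higher effective image resolution from GPT-4o
# via the "detail" parameter, assuming the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()


def ask_about_screenshot(image_url: str, question: str, detail: str = "high") -> str:
    """Query GPT-4o about a GUI screenshot at low or high image detail."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # "low" uses a single downsampled view of the image; "high"
                # additionally tiles it at higher resolution for more tokens.
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": detail}},
            ],
        }],
    )
    return response.choices[0].message.content
```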

Acknowledgement

Many thanks to Yinuo Liu, Zhengyan Fu, Shilin Zhang, Yu, Tianhe Gu, Haokuan Yuan, and Junqi Wang for their invaluable effort in this project. This project builds on methodologies and code from [Videochat2](https://github.com/OpenGVLab/Ask-Anything). This website is based on the templates of [TrustLLM](https://trustllmbenchmark.github.io/TrustLLM-Website/) and [OSWorld](https://os-world.github.io/).

BibTeX

@misc{chen2024guiworld,
      title={GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents},
      author={Dongping Chen and Yue Huang and Siyuan Wu and Jingyu Tang and Liuyi Chen and Yilin Bai and Zhigang He and Chenlong Wang and Huichi Zhou and Yiqiang Li and Tianshuo Zhou and Yue Yu and Chujie Gao and Qihui Zhang and Yi Gui and Zhen Li and Yao Wan and Pan Zhou and Jianfeng Gao and Lichao Sun},
      year={2024},
      eprint={2406.10819},
      archivePrefix={arXiv},
}

GUI-World Team