For AI, "playing with mobile phones" is not an easy task. Just identifying various user interfaces (UI) is a big problem: not only must the type of each component be identified, but also according to its Use the symbols and positions to determine the function of the component.
Understanding the UI of mobile devices can help realize various human-computer interaction tasks, such as UI automation.
Previous work on modeling mobile UIs usually relied on the screen's view hierarchy, directly using the UI's structural data and thereby sidestepping the problem of identifying components from raw screen pixels.
However, view hierarchies are not available in every scenario, and they often contain missing object descriptions or misaligned structural information that lead to incorrect results. So although using view hierarchies can improve short-term performance, it may ultimately hinder the applicability and generalization of the model.
Recently, two researchers from Google Research proposed Spotlight, a purely visual approach to mobile UI understanding. Built on a vision-language model, it takes only a screenshot of the user interface and a region of interest (focus) on the screen as input.
Paper link: https://arxiv.org/pdf/2209.14927.pdf
Spotlight's general-purpose architecture is easily extensible and capable of performing a range of user interface modeling tasks.
The experimental results show that the Spotlight model achieves SOTA performance on several representative UI tasks, surpassing previous methods that used both screenshots and view hierarchies as input.
The paper also explores Spotlight's multi-task learning and few-shot prompting capabilities, showing promising experimental results in the multi-task setting as well.
The paper's author, Yang Li, is a senior research scientist at Google Research and an affiliate faculty member of CSE at the University of Washington. He received his PhD in computer science from the Chinese Academy of Sciences and did postdoctoral research at UC Berkeley EECS. He led the development of Next Android App Prediction, pioneered on-device interactive machine learning on Android, and developed Gesture Search, among other projects.
Computational understanding of user interfaces is a critical step in enabling intelligent UI behavior.
Prior to this, the team had studied various UI modeling tasks, including widget captioning, screen summarization, and command grounding, which address automation and accessibility issues in different interaction scenarios.
These capabilities were later used to demonstrate how machine learning can help user experience practitioners improve UI quality, for example by diagnosing tappability confusion and suggesting ideas for improving UI design. Together with work in other fields, these efforts show how deep neural networks can potentially change the end-user experience and interaction design practice.
Although a certain degree of success has been achieved on individual UI tasks, the next question is whether the model can move from handling specific UI tasks to a general capability for UI understanding.
The Spotlight model is a first attempt at answering this question. The researchers had previously developed a multi-task model to handle a range of UI tasks simultaneously, and although that work made some progress, several problems remained.
Previous UI models relied heavily on the UI view hierarchy, i.e., the structure or metadata of a mobile UI screen (analogous to the Document Object Model of a web page), which lets a model directly obtain detailed information about on-screen UI objects, including their type, text content, and position.
This metadata gave previous models an advantage over purely visual models, but the availability of view hierarchy data is a serious problem, and issues such as missing object descriptions or misaligned structural information occur frequently.
So while there are short-term benefits to using a view hierarchy, it may ultimately hinder the performance and applicability of the model. Additionally, previous models had to handle heterogeneous information across datasets and UI tasks, often resulting in more complex model architectures that were difficult to scale or generalize across tasks.
The purely visual Spotlight approach aims to achieve universal user interface understanding capabilities entirely from raw pixels.
The researchers introduce a unified way of representing different UI tasks, in which information is expressed in two core modalities: vision and language. The vision modality captures what the user sees on the UI screen, and the language modality can be natural language or any task-related token sequence.
The input to the Spotlight model is a triplet: a screenshot, a region of interest on the screen, and a text description of the task; the output is a text description or response about the region of interest.
This simple input and output representation of the model is more general, can be applied to a variety of UI tasks, and can be extended to a variety of model architectures.
The model is designed to enable a series of learning strategies and settings, from fine-tuning for specific tasks to multi-task learning and few-shot learning.
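To make this unified representation concrete, here is a minimal Python sketch of how (screenshot, region, task prompt) → text examples might be packaged. The class, field names, and prompt strings are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np

# Hypothetical container for the unified task representation described above;
# field names and prompt strings are illustrative, not taken from the paper.
@dataclass
class SpotlightExample:
    screenshot: np.ndarray                     # raw screen pixels (H, W, 3)
    region: Tuple[float, float, float, float]  # (left, top, right, bottom), normalized
    task_prompt: str                           # text description of the task
    target_text: str                           # expected textual response

screen = np.zeros((1024, 512, 3), dtype=np.uint8)  # placeholder screenshot

# Different UI tasks differ only in the prompt, the focus region, and the target text.
widget_captioning = SpotlightExample(
    screenshot=screen,
    region=(0.10, 0.30, 0.45, 0.38),               # bounding box of one widget
    task_prompt="caption the widget in the focus region",
    target_text="select Chelsea team",
)
screen_summarization = SpotlightExample(
    screenshot=screen,
    region=(0.0, 0.0, 1.0, 1.0),                   # whole screen as the focus region
    task_prompt="summarize the screen",
    target_text="page displaying the tutorial of a learning app",
)
```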
Spotlight can leverage existing architectural building blocks such as ViT and T5, which are pre-trained on high-resource general vision and language domains, and can be built directly on top of these general-domain models.
Because UI tasks usually concern specific objects or areas on the screen, the model needs to be able to focus on the object or region of interest. The researchers therefore introduced a Focus Region Extractor into the vision-language model, enabling the model to attend to that region in the context of the whole screen.
They also designed a Region Summarizer, which obtains a latent representation of a screen region from the ViT encoding by using attention queries generated from the region's bounding box.
Specifically, each coordinate of the region's bounding box (a scalar value: left, top, right, or bottom, shown as a yellow box on the screenshot) is first embedded via a multilayer perceptron (MLP) into a set of dense vectors, and then fed to a Transformer model together with a coordinate-type embedding. The dense vectors and their corresponding coordinate-type embeddings are color-coded in the figure to indicate their association with each coordinate value.
The coordinate queries then attend to the screen encoding output by ViT via cross-attention, and the Transformer's attention output is used as the region representation for downstream decoding by T5.
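The following is a rough PyTorch sketch of the Region Summarizer idea as described above: each bounding-box coordinate is embedded by an MLP, combined with a coordinate-type embedding, and used as a query that cross-attends to the ViT screen encoding. Dimensions, layer counts, and module names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RegionSummarizer(nn.Module):
    """Minimal sketch of the Region Summarizer idea (illustrative, not the paper's code)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Embed each scalar coordinate (left, top, right, bottom) as a dense vector.
        self.coord_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # One learned embedding per coordinate type: 0=left, 1=top, 2=right, 3=bottom.
        self.coord_type_emb = nn.Embedding(4, d_model)
        # Coordinate queries attend to the ViT screen encoding via cross-attention.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, bbox: torch.Tensor, screen_encoding: torch.Tensor) -> torch.Tensor:
        # bbox: (batch, 4) normalized (left, top, right, bottom)
        # screen_encoding: (batch, num_patches, d_model) from a ViT image encoder
        coord_vecs = self.coord_mlp(bbox.unsqueeze(-1))        # (batch, 4, d_model)
        type_ids = torch.arange(4, device=bbox.device)
        queries = coord_vecs + self.coord_type_emb(type_ids)   # add coordinate-type embedding
        region_repr, _ = self.cross_attn(queries, screen_encoding, screen_encoding)
        # region_repr would be passed to the T5 decoder as the region's latent
        # representation alongside the text prompt in the full model.
        return region_repr                                      # (batch, 4, d_model)

# Usage with dummy tensors standing in for real ViT outputs.
summarizer = RegionSummarizer()
bbox = torch.tensor([[0.10, 0.30, 0.45, 0.38]])                 # one region of interest
screen_encoding = torch.randn(1, 196, 512)                      # fake ViT patch encodings
print(summarizer(bbox, screen_encoding).shape)                  # torch.Size([1, 4, 512])
```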
The researchers pre-trained the Spotlight model on two unlabeled datasets: an internal dataset based on the C4 corpus and an internal mobile dataset, containing a total of 2.5 million mobile UI screens and 80 million web pages.
The pre-trained model was then fine-tuned on four downstream tasks: widget captioning, screen summarization, command grounding, and clickability prediction.
For the widget captioning and screen summarization tasks, the CIDEr metric measures how similar the model's text descriptions are to a set of references created by human raters; for command grounding, accuracy is the percentage of user commands for which the model successfully locates the target object; and for clickability prediction, the F1 score measures the model's ability to distinguish clickable objects from non-clickable ones.
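As a concrete reminder of how the clickability metric behaves, here is a tiny example computing F1 with scikit-learn on made-up labels; the values are illustrative and not from the paper's evaluation data.

```python
from sklearn.metrics import f1_score

# Illustrative labels only: 1 = clickable, 0 = not clickable.
ground_truth = [1, 1, 0, 0, 1, 0, 1, 0]
predictions  = [1, 0, 0, 0, 1, 1, 1, 0]

# F1 balances precision (how many predicted-clickable objects really are clickable)
# and recall (how many truly clickable objects the model finds).
print(f1_score(ground_truth, predictions))  # 0.75
```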
In the experiments, Spotlight was compared with several baseline models: Widget Caption uses the view hierarchy and the image of each UI object to generate text descriptions for objects; Screen2Words uses the view hierarchy, screenshots, and auxiliary features (such as the app description) to generate screen summaries; VUT combines screenshots and view hierarchies to perform multiple tasks; and the original tappability model uses object metadata from the view hierarchy and screenshots to predict an object's tappability.
Spotlight significantly surpassed the previous SOTA models on all four UI modeling tasks.
In a more difficult setting, the model is required to learn multiple tasks at the same time, since multi-task models can greatly reduce the model footprint; the results show that Spotlight's performance remains competitive in this setting.
To understand how the Region Summarizer enables Spotlight to focus on the target and related areas of the screen, the researchers analyzed the attention weights for the widget captioning and screen summarization tasks, which indicate where on the screenshot the model's attention falls.
In the figure below, for the widget captioning task, the checkbox on the left is highlighted with a red border, and the model predicts "select Chelsea team". The attention heat map on the right shows that the model learned not only to attend to the target region of the checkbox, but also to the text "Chelsea" at the far left in order to generate the caption.
For the screen summarization task, given the screenshot on the left, the model predicts "page displaying the tutorial of a learning app". In this example the target region is the entire screen, and the model learns to attend to the important parts of the screen for summarization.
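For readers who want to reproduce this kind of visualization, the sketch below shows one way to overlay patch-level attention weights on a screenshot. The patch grid, image size, and the random stand-in weights are all assumptions for illustration; the paper's actual attention-extraction procedure is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch of rendering an attention heat map over a screenshot, assuming we have
# cross-attention weights over ViT patches (faked here with random numbers).
patch_grid = (14, 14)                        # e.g. 196 ViT patches on a 224x224 image
attn_weights = np.random.rand(*patch_grid)   # stand-in for real attention weights
screenshot = np.random.rand(224, 224, 3)     # stand-in for the UI screenshot

# Upsample the patch-level weights to pixel resolution and overlay them.
scale = (224 // patch_grid[0], 224 // patch_grid[1])
heatmap = np.kron(attn_weights, np.ones(scale))

plt.imshow(screenshot)
plt.imshow(heatmap, cmap="jet", alpha=0.4)   # translucent attention overlay
plt.axis("off")
plt.savefig("attention_overlay.png")
```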