ScreenAI: A Visual Language Model for UI and Visually Situated Language Understanding



Google Research · 3:00 am on May 23, 2024


ScreenAI is a vision-language model developed to interpret UIs and infographics and to perform tasks such as question answering, navigation, and summarization. Its training data is generated at scale by combining smaller specialized models with Large Language Models (LLMs). It achieves state-of-the-art or best-in-class performance on benchmark datasets such as WebSRC, MoTIF, ChartQA, DocVQA, InfographicVQA, and OCR-VQA.
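
ScreenAI itself does not have a public checkpoint, but the same image-plus-question-to-text interface can be tried with Pix2Struct, an open model whose flexible patching strategy ScreenAI adopts. Below is a minimal sketch using the Hugging Face transformers library; the screenshot path and the question are illustrative placeholders.

    # Screen/document QA with an open Pix2Struct checkpoint (illustrative;
    # ScreenAI's own weights are not publicly available).
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained("google/pix2struct-docvqa-base")
    model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-docvqa-base")

    image = Image.open("screenshot.png")  # placeholder screenshot
    # For VQA-style checkpoints, the question is passed alongside the image.
    inputs = processor(images=image, text="Which button submits the form?", return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=32)
    print(processor.decode(generated[0], skip_special_tokens=True))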

  • ScreenAI builds on the PaLI vision-language architecture, pairing an image encoder with a language model for advanced UI interpretation.
  • It is pre-trained with self-supervised learning before being fine-tuned on tasks such as question answering (ScreenQA), navigation, and summarization.
  • Fine-tuning uses public datasets including Referring Expressions, MoTIF, MUG, and Android in the Wild for navigation tasks.
  • It performs competitively with state-of-the-art models on benchmarks such as Screen Annotation and Complex ScreenQA.
  • Data generation with LLMs (sketched below) yields significant improvements, and performance continues to improve with model size rather than saturating.
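
The LLM-based data generation works in two stages: a screen-annotation model first produces a textual description of a screenshot, and an LLM is then prompted with that description to generate synthetic training examples such as question-answer pairs (the ScreenAI work used PaLM 2 for this step). A minimal sketch of the second stage follows; call_llm is a hypothetical stand-in for any text-generation API, and the prompt wording is illustrative rather than the exact prompt used.

    import json

    def build_prompt(screen_annotation):
        # screen_annotation is the textual screen description produced by the
        # annotation stage; this prompt wording is illustrative only.
        return (
            "Below is a textual description of a UI screen:\n"
            f"{screen_annotation}\n\n"
            "Generate five question-answer pairs a user might ask about this "
            'screen, as a JSON list of {"question": ..., "answer": ...} objects.'
        )

    def generate_qa_pairs(screen_annotation, call_llm):
        # call_llm is a hypothetical text-generation function (prompt -> str);
        # each returned pair becomes one synthetic ScreenQA training example.
        response = call_llm(build_prompt(screen_annotation))
        return json.loads(response)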

http://blog.research.google/2024/03/screenai-visual-language-model-for-ui.html
