ScreenAI: A Visual Language Model for UI and Visually Situated Language Understanding



Google Research · 3:00 am on May 23, 2024


ScreenAI is a vision-language model developed to interpret UIs and infographics and to perform tasks such as question answering, navigation, and summarization. Its training data is generated at scale by combining smaller specialized models with Large Language Models (LLMs). It achieves state-of-the-art or best-in-class performance on benchmark datasets such as WebSRC, MoTIF, ChartQA, DocVQA, InfographicVQA, and OCR-VQA.
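
ScreenAI itself does not have a public checkpoint, but the same image-plus-question-to-text interface can be tried with Pix2Struct, an open model whose flexible patching strategy ScreenAI adopts. Below is a minimal sketch using the Hugging Face transformers library; the screenshot path and the question are illustrative placeholders.

    # Screen/document QA with an open Pix2Struct checkpoint (illustrative;
    # ScreenAI's own weights are not publicly available).
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained("google/pix2struct-docvqa-base")
    model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-docvqa-base")

    image = Image.open("screenshot.png")  # placeholder screenshot
    # For VQA-style checkpoints, the question is passed alongside the image.
    inputs = processor(images=image, text="Which button submits the form?", return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=32)
    print(processor.decode(generated[0], skip_special_tokens=True))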

  • ScreenAI builds on the PaLI vision-language architecture, pairing an image encoder with a language model for advanced UI interpretation.
  • It is pre-trained with self-supervised learning before being fine-tuned on tasks such as question answering (ScreenQA), navigation, and summarization.
  • Fine-tuning uses public datasets including Referring Expressions, MoTIF, MUG, and Android in the Wild for navigation tasks.
  • It performs competitively with state-of-the-art models on benchmarks such as Screen Annotation and Complex ScreenQA.
  • Data generation with LLMs (sketched below) yields significant improvements, and performance continues to improve with model size rather than saturating.
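
The LLM-based data generation works in two stages: a screen-annotation model first produces a textual description of a screenshot, and an LLM is then prompted with that description to generate synthetic training examples such as question-answer pairs (the ScreenAI work used PaLM 2 for this step). A minimal sketch of the second stage follows; call_llm is a hypothetical stand-in for any text-generation API, and the prompt wording is illustrative rather than the exact prompt used.

    import json

    def build_prompt(screen_annotation):
        # screen_annotation is the textual screen description produced by the
        # annotation stage; this prompt wording is illustrative only.
        return (
            "Below is a textual description of a UI screen:\n"
            f"{screen_annotation}\n\n"
            "Generate five question-answer pairs a user might ask about this "
            'screen, as a JSON list of {"question": ..., "answer": ...} objects.'
        )

    def generate_qa_pairs(screen_annotation, call_llm):
        # call_llm is a hypothetical text-generation function (prompt -> str);
        # each returned pair becomes one synthetic ScreenQA training example.
        response = call_llm(build_prompt(screen_annotation))
        return json.loads(response)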

http://blog.research.google/2024/03/screenai-visual-language-model-for-ui.html
