Google Research 3:00 am on May 23, 2024
ScreenAI is a visual language model developed to interpret screens and perform tasks such as question answering, navigation, and summarization, using Large Language Models (LLMs) to generate training data. It achieves state-of-the-art performance on benchmark datasets such as WebSRC, MoTIF, ChartQA, DocVQA, InfographicVQA, and OCR-VQA.
- ScreenAI pairs a visual language model with LLM-based data generation for advanced UI interpretation.
- Pre-trained using self-supervised learning before fine-tuning on tasks like question answering (ScreenQA), navigation, and summarization.
- Fine-tuned on public datasets including Referring Expression, MoTIF, Mu, and Android in the Wild for navigation tasks.
- Performs competitively with state-of-the-art models on benchmarks like Screen Annotation and Complex ScreenQA.
- Data generation with LLMs yields significant improvements that do not saturate as model size increases.
http://blog.research.google/2024/03/screenai-visual-language-model-for-ui.html