Stephen Pimentel on LinkedIn: Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024)

Stephen Pimentel

CTO at Dragonscale

Apple's Ferret-UI is a specialized multimodal large language model designed specifically for understanding mobile UI screens, featuring capabilities in referring, grounding, and reasoning. It improves upon general-domain models by handling the unique challenges of UI screens, such as their elongated aspect ratio and small details like icons and text. Ferret-UI employs a novel approach of dividing each screen into two sub-images based on its orientation, encoding them separately to enhance detail recognition. The model was trained on a diverse set of UI tasks, including icon recognition and widget listing, with instruction-following samples and region annotations to support accurate interaction. Advanced training tasks include detailed description and function inference. Ferret-UI significantly outperforms existing open-source UI models and even surpasses GPT-4V on elementary UI tasks, validated by a comprehensive benchmark designed for model evaluation. https://lnkd.in/gBNuJT4P

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs arxiv.org
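To make the two-sub-image split concrete, here is a minimal sketch in Python (using Pillow) of how a screenshot might be divided according to its orientation before each half is passed to the image encoder. The function name and details are illustrative assumptions, not Ferret-UI's actual code.

```python
from PIL import Image

def split_screen(path: str):
    """Split a UI screenshot into two sub-images along its longer axis.

    Portrait screens are cut into top/bottom halves and landscape screens
    into left/right halves, so each half is closer to the roughly square
    input a ViT-style encoder expects, preserving small icons and text.
    """
    img = Image.open(path)
    w, h = img.size
    if h >= w:   # portrait
        return [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:        # landscape
        return [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]

# Each half (alongside a resized full-screen image) would then be encoded
# separately and the resulting visual tokens handed to the language model.
```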

More Relevant Posts

  • Zihui Ouyang

    Python Programming | Data Visualization | AI | ML | Hugging Face | SQL

    In this era of large models, Visual Language Models (VLMs) are getting bigger as well, with parameter counts often running into the hundreds of billions. However, smaller models are still important because they are easier to train and deploy. Google released PaLI (Pathways Language and Image) last year, a multimodal model whose text encoder-decoder is based on mT5 and whose image encoder is based on ViT. Google recently released the third generation of PaLI in this paper: https://lnkd.in/gaeNkedF

    2310.09199.pdf arxiv.org
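    As a rough illustration of that architecture, the sketch below wires a ViT image encoder into an mT5 encoder-decoder by projecting patch embeddings into the text embedding space and prepending them to the prompt. The Hugging Face checkpoints and the single linear projection are placeholders chosen for illustration; this is not PaLI's actual implementation or scale.

    ```python
    import torch
    from transformers import (AutoImageProcessor, AutoTokenizer,
                              MT5ForConditionalGeneration, ViTModel)

    # Small public checkpoints stand in for PaLI's (much larger) components.
    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
    processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

    # Learned projection from ViT's hidden size into mT5's embedding space.
    proj = torch.nn.Linear(vit.config.hidden_size, mt5.config.d_model)

    def step(image, prompt, target):
        pixels = processor(image, return_tensors="pt").pixel_values
        vis_tokens = proj(vit(pixel_values=pixels).last_hidden_state)   # (1, P, d)

        text_ids = tokenizer(prompt, return_tensors="pt").input_ids
        text_tokens = mt5.get_input_embeddings()(text_ids)              # (1, T, d)

        # Prepend visual tokens to the prompt and train with the usual
        # seq2seq cross-entropy loss against the target text.
        inputs_embeds = torch.cat([vis_tokens, text_tokens], dim=1)
        mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
        labels = tokenizer(target, return_tensors="pt").input_ids
        return mt5(inputs_embeds=inputs_embeds, attention_mask=mask,
                   labels=labels).loss
    ```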

  • EnterpriseTalk

    Deploying Large Language Models (LLMs) is a step towards enhancing user experience, but knowing where to start and which aspects to consider before deployment is essential. Here are 8 factors to consider before deploying an LLM.

    What to Consider Before Deploying Large Language Models (LLMs) https://enterprisetalk.com

  • NLPlanet | Breaking Down Generative AI Daily

    GPT4RoI: augmenting vision-language tasks with regions of interest 🧑🏫🔍

    GPT4RoI reformulates the bounding box as a spatial instruction, enabling region-level alignment between visual features and language embeddings. This opens up a conversational and interactive experience beyond image-level understanding:

    1️⃣ Users can control the GPT4RoI model with both language and spatial instructions, adjusting the level of detail in their questions.
    2️⃣ GPT4RoI supports multi-region spatial instructions, offering a wider range of region-level multimodal capabilities such as detailed region captions and complex region reasoning.

    🤖 GPT4RoI is also remarkably versatile: it can use any object detector as a spatial instruction provider, extracting details like color, shape, material, action, and object relationships to improve its understanding.

    GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest arxiv.org
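    The core mechanic behind "bounding box as spatial instruction" can be sketched generically: pool the visual features inside a box into one embedding and splice it into the prompt where a region placeholder sits. The module below uses RoIAlign plus a linear projection to do that; it is a simplified sketch, not GPT4RoI's actual code (which uses multi-level features and its own projections), and all names are illustrative.

    ```python
    import torch
    from torch import nn
    from torchvision.ops import roi_align

    class RegionEncoder(nn.Module):
        """Pool one bounding box from an image feature map into a single token."""

        def __init__(self, feat_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Linear(feat_dim, llm_dim)

        def forward(self, feature_map, box, spatial_scale):
            # feature_map: (1, C, H, W); box: (x1, y1, x2, y2) in image pixels.
            rois = torch.tensor([[0.0, *box]], dtype=feature_map.dtype,
                                device=feature_map.device)
            pooled = roi_align(feature_map, rois, output_size=(7, 7),
                               spatial_scale=spatial_scale, aligned=True)
            token = pooled.mean(dim=(2, 3))   # (1, C) region descriptor
            return self.proj(token)           # (1, llm_dim) "spatial instruction"

    # The resulting vector replaces a <region> placeholder in the prompt
    # embedding, e.g. "What is the object in <region> made of?", before the
    # sequence is fed to the language model.
    ```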

  • Santosh Sawant

    Senior ML Architect, LLMs | Ex-Ola

    FIND: INterface for Foundation models' embeDDings

    Foundation models across the vision and language domains, such as GPT-4, DALLE-3, SAM, and LLaMA, have driven significant advances in open-ended visual question answering (VQA). However, training individual foundation models has become remarkably costly, and their full potential remains untapped because of their fixed output modalities (text output for Q&A, visual output for image generation). Although techniques such as prompt engineering and adaptive tuning have shown promising results, these approaches struggle to integrate different foundation models off the shelf or to expand output types and task objectives.

    The paper proposes FIND, a generalized interface for aligning foundation models' embeddings. The interface enables task-adaptive prototyping: adapting to a new task only requires changing a configuration file, not the model architecture. Because all vision-language tasks are trained in a unified way, this creates an interleaved shared embedding space in which vision and language references are replaceable and addable. The interface has the following favorable attributes:

    1. Generalizable. It applies to various tasks spanning retrieval, segmentation, and more, under the same architecture and weights.
    2. Prototypable. Different tasks can be implemented by prototyping attention masks and embedding types (a toy configuration sketch follows below).
    3. Extendable. The interface adapts to new tasks and new models.
    4. Interleavable. Thanks to multi-task, multi-modal training, the interface creates an interleaved shared embedding space.

    FIND achieves state-of-the-art performance on interleaved image retrieval and segmentation and shows better or comparable performance on generic/interactive/grounded segmentation and image-text retrieval.

    Paper: https://lnkd.in/gGcvmv9k
    Check out more paper reviews here: https://lnkd.in/gaCZSrXm
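    To give a flavor of what "change the configuration file, not the architecture" could look like, here is a toy task-prototype configuration in Python. The field names and task entries are hypothetical and chosen for illustration only; they do not reflect FIND's actual configuration schema.

    ```python
    # Each task is declared as data: which embedding types participate and
    # which cross-attention connections are allowed. The shared encoder
    # weights are untouched when a new task prototype is added.
    TASKS = {
        "image_text_retrieval": {
            "embeddings": ["image", "text"],
            # modalities encoded independently, compared in the shared space
            "attention": {"image->text": False, "text->image": False},
            "output": "similarity",
        },
        "grounded_segmentation": {
            "embeddings": ["image", "text", "segment_query"],
            # queries may attend to both modalities to produce mask proposals
            "attention": {"segment_query->image": True,
                          "segment_query->text": True},
            "output": "masks",
        },
    }

    def build_task(name: str) -> dict:
        """Select a task prototype without changing any model code."""
        return TASKS[name]
    ```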

  • Shawn Horton

    Key Account Executive, Healthcare @ Google Cloud

    Optical character recognition has become the standard way developers extract and utilize text and layout data from PDFs and images. In this blog, we will discuss the history of #OCR, where the technology is headed, and how it is more important than ever with the rise of large language models (#LLMs).

    What is OCR | Google Cloud Blog cloud.google.com
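    For a concrete sense of "text and layout data", here is a minimal example (not from the Google Cloud blog) that uses the open-source Tesseract engine via pytesseract to pull word-level text plus bounding boxes from an image; the file name is a placeholder. This word-plus-box structure is exactly what layout-aware LLM pipelines consume downstream.

    ```python
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    # Word-level OCR results as parallel lists: text, box coordinates, confidence.
    data = pytesseract.image_to_data(Image.open("invoice.png"),
                                     output_type=Output.DICT)
    for text, x, y, w, h, conf in zip(data["text"], data["left"], data["top"],
                                      data["width"], data["height"], data["conf"]):
        if text.strip():
            print(f"{text!r} at ({x}, {y}, {x + w}, {y + h}), conf {conf}")
    ```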

  • Hamdi Amroun, PhD

    Head of AI (ex. AWS)

    JPMorgan has introduced DocLLM, a generative language model designed to understand multimodal documents with complex layouts. This model focuses on documents like forms, invoices, receipts, reports, and contracts, which contain both text and spatial elements. Unlike other multimodal models that use image encoders, DocLLM exclusively uses bounding box information to incorporate spatial layout. It achieves this by breaking down the attention mechanism in classical transformers into a set of disentangled matrices. Additionally, it employs a pre-training objective to fill in text segments, making it effective for irregular layouts and diverse content found in visual documents. The model is fine-tuned using a large instruction dataset for various document intelligence tasks and outperforms state-of-the-art language models on 14 out of 16 datasets across all tasks while also generalizing well to 4 out of 5 previously unseen datasets. You can find the full paper here: https://lnkd.in/euGH5KuM

    Paper page - DocLLM: A layout-aware generative language model for multimodal document understanding huggingface.co
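    The "disentangled matrices" idea can be sketched compactly: text hidden states and embedded bounding boxes get separate query/key projections, and the four resulting score terms are combined with scalar weights before softmax. The single-head module below is a simplified reading of that mechanism (no masking, multi-head logic, or the paper's exact hyperparameters) and is not JPMorgan's implementation.

    ```python
    import torch
    from torch import nn

    class DisentangledSpatialAttention(nn.Module):
        """Single-head attention mixing text-text, text-layout, layout-text,
        and layout-layout score terms with learned scalar weights."""

        def __init__(self, d_model: int):
            super().__init__()
            self.q_t, self.k_t = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
            self.q_s, self.k_s = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.lam = nn.Parameter(torch.ones(3))   # weights of the layout terms
            self.scale = d_model ** -0.5

        def forward(self, text_h, box_h):
            # text_h: (B, T, d) token states; box_h: (B, T, d) embedded boxes
            qt, kt = self.q_t(text_h), self.k_t(text_h)
            qs, ks = self.q_s(box_h), self.k_s(box_h)
            scores = (qt @ kt.transpose(-2, -1)
                      + self.lam[0] * (qt @ ks.transpose(-2, -1))
                      + self.lam[1] * (qs @ kt.transpose(-2, -1))
                      + self.lam[2] * (qs @ ks.transpose(-2, -1))) * self.scale
            return scores.softmax(dim=-1) @ self.v(text_h)
    ```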

  • Sukhwinder S

    Building Programs || Data & Design || Learning & Development

    JPMorgan introduced DocLLM, a generative LM designed to understand multimodal docs with complex layouts.

    - DocLLM extends language models to reason over visual documents using both text and layout
    - Uses bounding box info to incorporate spatial layout instead of expensive image encoders
    - Captures cross-alignment between text and layout via disentangled attention matrices
    - Pre-trains the model to infill missing text segments to handle irregular layouts (a toy version of this objective is sketched below)
    - Fine-tunes on an instruction dataset covering 4 core document intelligence tasks
    - Outperforms the state of the art on 14/16 datasets and generalizes well to 4/5 unseen datasets

    #genai #llm
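    As a companion to the infilling bullet above, here is a toy sketch of how block-infilling training pairs could be built from OCR'd (text, bounding-box) blocks: a contiguous run of blocks is hidden from the input and becomes the target, while the surviving blocks keep their boxes so layout can guide the fill. The helper and sample data are hypothetical, not JPMorgan's pipeline.

    ```python
    import random

    def make_infill_example(blocks, mask_token="<mask>"):
        """Build one block-infilling example from (text, bbox) blocks in reading order."""
        start = random.randrange(len(blocks))
        end = min(len(blocks), start + random.randint(1, 3))
        # Hide the chosen blocks behind a single mask token that keeps a bbox,
        # so the model can condition on both remaining text and layout.
        context = blocks[:start] + [(mask_token, blocks[start][1])] + blocks[end:]
        target = " ".join(text for text, _ in blocks[start:end])
        return context, target

    # Example: a receipt fragment where the masked span must be reconstructed.
    blocks = [("Subtotal", (20, 300, 120, 320)), ("$41.50", (400, 300, 470, 320)),
              ("Total", (20, 330, 80, 350)), ("$44.82", (400, 330, 470, 350))]
    print(make_infill_example(blocks))
    ```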

  • Backplain

    "Capable and Powerful Small Language Models" + "Open Models will become comparable with proprietary models" = Lots of Large Language Model (LLM) choice = Multi-model support = Backplain. #controlai #llmhttps://hubs.ly/Q02d-r_00

    Exploring The Future: 5 Cutting-Edge Generative AI Trends In 2024 forbes.com

  • SentientMatters

    Open Interpreter Evolves: OS Mode and Visual Mode Enhancements Unveiled

    The Open Interpreter project successfully emulates OpenAI's Code Interpreter and has now introduced an OS mode and a visual mode. Beyond replicating the interpreter, the project now lets users control their computer using a language model. The visual mode enables interaction by clicking buttons and viewing the screen, expanding the project's capabilities beyond text-based emulation. These developments enhance user engagement and usability, making Open Interpreter a versatile tool for controlling computer functions through a language-model interface and a distinctive approach to human-computer interaction.

    #OpenInterpreter #AI #LanguageModel #Innovation #HumanComputerInteraction #Emulation

    Open Interpreter - The New Computer Update changes.openinterpreter.com

  • USC Master of Science in Applied Economics and Econometrics (MS AEE)

    Interested in #llms and #finance? Check out DocLLM, introduced by J.P. Morgan, which is designed to understand multimodal documents with complex layouts. This model focuses on documents like forms, invoices, receipts, reports, and contracts, which contain both text and spatial elements. Unlike other multimodal models that use image encoders, DocLLM exclusively uses bounding box information to incorporate spatial layout. It achieves this by breaking down the attention mechanism in classical transformers into a set of disentangled matrices. Additionally, it employs a pre-training objective to fill in text segments, making it effective for the irregular layouts and diverse content found in visual documents. The model is fine-tuned on a large instruction dataset for various document intelligence tasks and outperforms state-of-the-art language models on 14 out of 16 datasets across all tasks, while also generalizing well to 4 out of 5 previously unseen datasets. See the post below for details! 🔥 #transformers #multimodal #ai
