Stephen Pimentel on LinkedIn: Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024)

Stephen Pimentel

CTO at Dragonscale

Apple's Ferret-UI is a specialized multimodal large language model designed specifically for understanding mobile UI screens, featuring capabilities in referring, grounding, and reasoning. It improves upon general-domain models by handling the unique challenges of UI screens, such as their elongated aspect ratio and small details like icons and text. Ferret-UI employs a novel approach of dividing each screen into two sub-images based on its orientation, encoding them separately to enhance detail recognition. The model was trained on a diverse set of UI tasks, including icon recognition and widget listing, with instruction-following samples and region annotations to support accurate interaction. Advanced training tasks include detailed description and function inference. Ferret-UI significantly outperforms existing open-source UI models and even surpasses GPT-4V on elementary UI tasks, validated by a comprehensive benchmark designed for model evaluation. https://lnkd.in/gBNuJT4P

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs arxiv.org
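To make the two-sub-image split concrete, here is a minimal sketch in Python (using Pillow) of how a screenshot might be divided according to its orientation before each half is passed to the image encoder. The function name and details are illustrative assumptions, not Ferret-UI's actual code.

```python
from PIL import Image

def split_screen(path: str):
    """Split a UI screenshot into two sub-images along its longer axis.

    Portrait screens are cut into top/bottom halves and landscape screens
    into left/right halves, so each half is closer to the roughly square
    input a ViT-style encoder expects, preserving small icons and text.
    """
    img = Image.open(path)
    w, h = img.size
    if h >= w:   # portrait
        return [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:        # landscape
        return [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]

# Each half (alongside a resized full-screen image) would then be encoded
# separately and the resulting visual tokens handed to the language model.
```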

More Relevant Posts

  • Zihui Ouyang

    Python Programming | Data Visualization | AI | ML | Hugging Face | SQL

    In this era of large models, Visual Language Models (VLMs) are getting bigger as well, with parameter counts often running into the hundreds of billions. However, smaller models are still important because they are easier to train and deploy. Google released PaLI (Pathways Language and Image) last year, a multimodal model whose text encoder-decoder is based on mT5 and whose image encoder is based on ViT. Google recently released the third generation of PaLI in this paper: https://lnkd.in/gaeNkedF

    2310.09199.pdf arxiv.org
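    As a rough illustration of that architecture, the sketch below wires a ViT image encoder into an mT5 encoder-decoder by projecting patch embeddings into the text embedding space and prepending them to the prompt. The Hugging Face checkpoints and the single linear projection are placeholders chosen for illustration; this is not PaLI's actual implementation or scale.

    ```python
    import torch
    from transformers import (AutoImageProcessor, AutoTokenizer,
                              MT5ForConditionalGeneration, ViTModel)

    # Small public checkpoints stand in for PaLI's (much larger) components.
    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
    processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

    # Learned projection from ViT's hidden size into mT5's embedding space.
    proj = torch.nn.Linear(vit.config.hidden_size, mt5.config.d_model)

    def step(image, prompt, target):
        pixels = processor(image, return_tensors="pt").pixel_values
        vis_tokens = proj(vit(pixel_values=pixels).last_hidden_state)   # (1, P, d)

        text_ids = tokenizer(prompt, return_tensors="pt").input_ids
        text_tokens = mt5.get_input_embeddings()(text_ids)              # (1, T, d)

        # Prepend visual tokens to the prompt and train with the usual
        # seq2seq cross-entropy loss against the target text.
        inputs_embeds = torch.cat([vis_tokens, text_tokens], dim=1)
        mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
        labels = tokenizer(target, return_tensors="pt").input_ids
        return mt5(inputs_embeds=inputs_embeds, attention_mask=mask,
                   labels=labels).loss
    ```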

  • EnterpriseTalk

    Deploying Large Language Models (LLMs) is a step towards enhancing user experience, but knowing where to start and which aspects to consider before deployment is essential. Here are 8 factors to consider before deploying an LLM.

    What to Consider Before Deploying Large Language Models (LLMs) https://enterprisetalk.com

  • NLPlanet | Breaking Down Generative AI Daily

    GPT4RoI: augmenting vision-language tasks with regions of interest 🧑🏫🔍

    GPT4RoI reformulates the bounding box as a spatial instruction, enabling region-level alignment between visual features and language embeddings. This opens up a conversational and interactive experience beyond image-level understanding:

    1️⃣ Users can control the GPT4RoI model with both language and spatial instructions, adjusting the level of detail in their questions.
    2️⃣ GPT4RoI supports multi-region spatial instructions, offering a wider range of region-level multimodal capabilities such as detailed region captions and complex region reasoning.

    🤖 GPT4RoI is also remarkably versatile: it can use any object detector as a spatial instruction provider, extracting details like color, shape, material, action, and object relationships to improve its understanding.

    GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest arxiv.org
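    The core mechanic behind "bounding box as spatial instruction" can be sketched generically: pool the visual features inside a box into one embedding and splice it into the prompt where a region placeholder sits. The module below uses RoIAlign plus a linear projection to do that; it is a simplified sketch, not GPT4RoI's actual code (which uses multi-level features and its own projections), and all names are illustrative.

    ```python
    import torch
    from torch import nn
    from torchvision.ops import roi_align

    class RegionEncoder(nn.Module):
        """Pool one bounding box from an image feature map into a single token."""

        def __init__(self, feat_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Linear(feat_dim, llm_dim)

        def forward(self, feature_map, box, spatial_scale):
            # feature_map: (1, C, H, W); box: (x1, y1, x2, y2) in image pixels.
            rois = torch.tensor([[0.0, *box]], dtype=feature_map.dtype,
                                device=feature_map.device)
            pooled = roi_align(feature_map, rois, output_size=(7, 7),
                               spatial_scale=spatial_scale, aligned=True)
            token = pooled.mean(dim=(2, 3))   # (1, C) region descriptor
            return self.proj(token)           # (1, llm_dim) "spatial instruction"

    # The resulting vector replaces a <region> placeholder in the prompt
    # embedding, e.g. "What is the object in <region> made of?", before the
    # sequence is fed to the language model.
    ```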

  • Santosh Sawant

    Senior ML Architect, LLMs | Ex-Ola

    FIND: INterface for Foundation models' embeDDings

    Foundation models across the vision and language domains, such as GPT-4, DALLE-3, SAM, and LLaMA, have driven significant advances in open-ended visual question answering (VQA). However, training individual foundation models has become remarkably costly, and their full potential remains untapped because of their fixed output modalities (text output for Q&A, visual output for image generation). Although techniques such as prompt engineering and adaptive tuning have shown promising results, these approaches struggle to integrate different foundation models off the shelf or to expand output types and task objectives.

    The paper proposes FIND, a generalized interface for aligning foundation models' embeddings. The interface enables task-adaptive prototyping: adapting to a new task only requires changing a configuration file, not the model architecture. Because all vision-language tasks are trained in a unified way, this creates an interleaved shared embedding space in which vision and language references are replaceable and addable. The interface has the following favorable attributes:

    1. Generalizable. It applies to various tasks spanning retrieval, segmentation, and more, under the same architecture and weights.
    2. Prototypable. Different tasks can be implemented by prototyping attention masks and embedding types (a toy configuration sketch follows below).
    3. Extendable. The interface adapts to new tasks and new models.
    4. Interleavable. Thanks to multi-task, multi-modal training, the interface creates an interleaved shared embedding space.

    FIND achieves state-of-the-art performance on interleaved image retrieval and segmentation and shows better or comparable performance on generic/interactive/grounded segmentation and image-text retrieval.

    Paper: https://lnkd.in/gGcvmv9k
    Check out more paper reviews here: https://lnkd.in/gaCZSrXm
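    To give a flavor of what "change the configuration file, not the architecture" could look like, here is a toy task-prototype configuration in Python. The field names and task entries are hypothetical and chosen for illustration only; they do not reflect FIND's actual configuration schema.

    ```python
    # Each task is declared as data: which embedding types participate and
    # which cross-attention connections are allowed. The shared encoder
    # weights are untouched when a new task prototype is added.
    TASKS = {
        "image_text_retrieval": {
            "embeddings": ["image", "text"],
            # modalities encoded independently, compared in the shared space
            "attention": {"image->text": False, "text->image": False},
            "output": "similarity",
        },
        "grounded_segmentation": {
            "embeddings": ["image", "text", "segment_query"],
            # queries may attend to both modalities to produce mask proposals
            "attention": {"segment_query->image": True,
                          "segment_query->text": True},
            "output": "masks",
        },
    }

    def build_task(name: str) -> dict:
        """Select a task prototype without changing any model code."""
        return TASKS[name]
    ```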

  • Shawn Horton

    Key Account Executive, Healthcare @ Google Cloud

    Optical character recognition has become the standard way developers extract and utilize text and layout data from PDFs and images. In this blog, we will discuss the history of #OCR, where the technology is headed, and how it is more important than ever with the rise of large language models (#LLMs).

    What is OCR | Google Cloud Blog cloud.google.com
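    For a concrete sense of "text and layout data", here is a minimal example (not from the Google Cloud blog) that uses the open-source Tesseract engine via pytesseract to pull word-level text plus bounding boxes from an image; the file name is a placeholder. This word-plus-box structure is exactly what layout-aware LLM pipelines consume downstream.

    ```python
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    # Word-level OCR results as parallel lists: text, box coordinates, confidence.
    data = pytesseract.image_to_data(Image.open("invoice.png"),
                                     output_type=Output.DICT)
    for text, x, y, w, h, conf in zip(data["text"], data["left"], data["top"],
                                      data["width"], data["height"], data["conf"]):
        if text.strip():
            print(f"{text!r} at ({x}, {y}, {x + w}, {y + h}), conf {conf}")
    ```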

  • Hamdi Amroun, PhD

    Head of AI (ex. AWS)

    JPMorgan has introduced DocLLM, a generative language model designed to understand multimodal documents with complex layouts. This model focuses on documents like forms, invoices, receipts, reports, and contracts, which contain both text and spatial elements. Unlike other multimodal models that use image encoders, DocLLM exclusively uses bounding box information to incorporate spatial layout. It achieves this by breaking down the attention mechanism in classical transformers into a set of disentangled matrices. Additionally, it employs a pre-training objective to fill in text segments, making it effective for irregular layouts and diverse content found in visual documents. The model is fine-tuned using a large instruction dataset for various document intelligence tasks and outperforms state-of-the-art language models on 14 out of 16 datasets across all tasks while also generalizing well to 4 out of 5 previously unseen datasets. You can find the full paper here: https://lnkd.in/euGH5KuM

    Paper page - DocLLM: A layout-aware generative language model for multimodal document understanding huggingface.co
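    The "disentangled matrices" idea can be sketched compactly: text hidden states and embedded bounding boxes get separate query/key projections, and the four resulting score terms are combined with scalar weights before softmax. The single-head module below is a simplified reading of that mechanism (no masking, multi-head logic, or the paper's exact hyperparameters) and is not JPMorgan's implementation.

    ```python
    import torch
    from torch import nn

    class DisentangledSpatialAttention(nn.Module):
        """Single-head attention mixing text-text, text-layout, layout-text,
        and layout-layout score terms with learned scalar weights."""

        def __init__(self, d_model: int):
            super().__init__()
            self.q_t, self.k_t = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
            self.q_s, self.k_s = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.lam = nn.Parameter(torch.ones(3))   # weights of the layout terms
            self.scale = d_model ** -0.5

        def forward(self, text_h, box_h):
            # text_h: (B, T, d) token states; box_h: (B, T, d) embedded boxes
            qt, kt = self.q_t(text_h), self.k_t(text_h)
            qs, ks = self.q_s(box_h), self.k_s(box_h)
            scores = (qt @ kt.transpose(-2, -1)
                      + self.lam[0] * (qt @ ks.transpose(-2, -1))
                      + self.lam[1] * (qs @ kt.transpose(-2, -1))
                      + self.lam[2] * (qs @ ks.transpose(-2, -1))) * self.scale
            return scores.softmax(dim=-1) @ self.v(text_h)
    ```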

  • Sukhwinder S

    Building Programs || Data & Design || Learning & Development

    JPMorgan introduced DocLLM, a generative LM designed to understand multimodal docs with complex layouts.

    - DocLLM extends language models to reason over visual documents using both text and layout
    - Uses bounding box info to incorporate spatial layout instead of expensive image encoders
    - Captures cross-alignment between text and layout via disentangled attention matrices
    - Pre-trains the model to infill missing text segments to handle irregular layouts (a toy version of this objective is sketched below)
    - Fine-tunes on an instruction dataset covering 4 core document intelligence tasks
    - Outperforms the state of the art on 14/16 datasets and generalizes well to 4/5 unseen datasets

    #genai #llm
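    As a companion to the infilling bullet above, here is a toy sketch of how block-infilling training pairs could be built from OCR'd (text, bounding-box) blocks: a contiguous run of blocks is hidden from the input and becomes the target, while the surviving blocks keep their boxes so layout can guide the fill. The helper and sample data are hypothetical, not JPMorgan's pipeline.

    ```python
    import random

    def make_infill_example(blocks, mask_token="<mask>"):
        """Build one block-infilling example from (text, bbox) blocks in reading order."""
        start = random.randrange(len(blocks))
        end = min(len(blocks), start + random.randint(1, 3))
        # Hide the chosen blocks behind a single mask token that keeps a bbox,
        # so the model can condition on both remaining text and layout.
        context = blocks[:start] + [(mask_token, blocks[start][1])] + blocks[end:]
        target = " ".join(text for text, _ in blocks[start:end])
        return context, target

    # Example: a receipt fragment where the masked span must be reconstructed.
    blocks = [("Subtotal", (20, 300, 120, 320)), ("$41.50", (400, 300, 470, 320)),
              ("Total", (20, 330, 80, 350)), ("$44.82", (400, 330, 470, 350))]
    print(make_infill_example(blocks))
    ```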

  • Backplain

    "Capable and Powerful Small Language Models" + "Open Models will become comparable with proprietary models" = Lots of Large Language Model (LLM) choice = Multi-model support = Backplain. #controlai #llmhttps://hubs.ly/Q02d-r_00

    Exploring The Future: 5 Cutting-Edge Generative AI Trends In 2024 forbes.com

  • SentientMatters

    Open Interpreter Evolves: OS Mode and Visual Mode Enhancements Unveiled

    The Open Interpreter project successfully emulates OpenAI's Code Interpreter and has now introduced an OS mode and a visual mode. Beyond replicating the interpreter, the project now lets users control their computer using a language model. The visual mode enables interaction by clicking buttons and viewing the screen, expanding the project's capabilities beyond text-based emulation. These developments enhance user engagement and usability, making Open Interpreter a versatile tool for controlling computer functions through a language-model interface and a distinctive approach to human-computer interaction.

    #OpenInterpreter #AI #LanguageModel #Innovation #HumanComputerInteraction #Emulation

    Open Interpreter - The New Computer Update changes.openinterpreter.com

  • USC Master of Science in Applied Economics and Econometrics (MS AEE)

    Interested in #llms and #finance? Check out DocLLM, introduced by J.P. Morgan, which is designed to understand multimodal documents with complex layouts. This model focuses on documents like forms, invoices, receipts, reports, and contracts, which contain both text and spatial elements. Unlike other multimodal models that use image encoders, DocLLM exclusively uses bounding box information to incorporate spatial layout. It achieves this by breaking down the attention mechanism in classical transformers into a set of disentangled matrices. Additionally, it employs a pre-training objective to fill in text segments, making it effective for the irregular layouts and diverse content found in visual documents. The model is fine-tuned on a large instruction dataset for various document intelligence tasks and outperforms state-of-the-art language models on 14 out of 16 datasets across all tasks, while also generalizing well to 4 out of 5 previously unseen datasets. See the post below for details! 🔥 #transformers #multimodal #ai
