Stephen Pimentel
CTO at Dragonscale
Apple's Ferret-UI is a multimodal large language model designed for understanding mobile UI screens, with capabilities in referring, grounding, and reasoning. It improves on general-domain models by handling the distinctive challenges of UI screens, such as their elongated aspect ratio and small details like icons and text. Ferret-UI divides each screen into two sub-images based on its orientation and encodes them separately to enhance detail recognition. The model was trained on a diverse set of UI tasks, including icon recognition and widget listing, with instruction-following samples and region annotations to support accurate interaction. Advanced training tasks include detailed description and function inference. Ferret-UI significantly outperforms existing open-source UI models and even surpasses GPT-4V on elementary UI tasks, as validated on a comprehensive benchmark designed for model evaluation. https://lnkd.in/gBNuJT4P
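The orientation-based split can be pictured with a minimal sketch. This is an illustration of the idea, not Ferret-UI's actual preprocessing code; the function name and the simple half-split are assumptions for demonstration (the paper's pipeline also keeps a resized full-screen image alongside the halves):

```python
import numpy as np

def split_screen(img: np.ndarray):
    """Split a UI screenshot into two sub-images along its longer axis:
    portrait screens into top/bottom halves, landscape screens into
    left/right halves, so each half is encoded at a higher effective
    resolution than one squashed full-screen crop."""
    h, w = img.shape[:2]
    if h >= w:
        # Portrait: cut horizontally into top and bottom halves.
        return img[: h // 2], img[h // 2 :]
    # Landscape: cut vertically into left and right halves.
    return img[:, : w // 2], img[:, w // 2 :]

# Example: a 960x480 portrait screen yields two square 480x480 halves.
screen = np.zeros((960, 480, 3), dtype=np.uint8)
top, bottom = split_screen(screen)
```

Each half can then be fed to the image encoder separately, which is what lets small icons and text survive the encoder's fixed input resolution.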
-
Zihui Ouyang
Python Programming | Data Visualization | AI | ML | Hugging Face | SQL
In this era of large models, vision-language models (VLMs) are getting bigger as well, with parameter counts often in the hundreds of billions. However, smaller models remain important because they are easier to train and deploy. Google released PaLI (Pathways Language and Image), a multimodal model, last year. Its text encoder-decoder was based on the mT5 models, while its image encoder was based on the ViT models. Google recently released the third generation of PaLI in this paper: https://lnkd.in/gaeNkedF
-
EnterpriseTalk
Deploying Large Language Models (LLMs) is a step towards enhancing user experience, but knowing where to start and which aspects to consider before LLM deployment is essential. Here are 8 factors to consider before deploying an LLM.
-
NLPlanet | Breaking Down Generative AI Daily
GPT4RoI: augmenting vision-language tasks with regions of interest 🧑🏫🔍
GPT4RoI reformulates the bounding box as a spatial instruction, allowing region-level alignment between visual features and language embeddings. GPT4RoI opens up a whole new conversational and interactive experience beyond image-level understanding:
1️⃣ Users can control the GPT4RoI model with both language and spatial instructions, adjusting the level of detail in their questions.
2️⃣ GPT4RoI supports multi-region spatial instructions, offering a wider range of region-level multimodal capacities such as detailed region captions and complex region reasoning.
🤖 GPT4RoI is also remarkably versatile: it can use any object detector as a spatial instruction provider, extracting details like color, shape, material, action, and object relationships to improve its understanding abilities.
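The "bounding box as spatial instruction" idea can be sketched as a sequence operation: wherever the prompt contains a special region placeholder token, the model substitutes an embedding extracted from that box (e.g. via RoIAlign on detector features). This is a minimal illustration under assumed names, not GPT4RoI's actual code:

```python
import numpy as np

def insert_region_features(token_embs, token_ids, region_feats, region_token_id):
    """Replace each occurrence of the placeholder token's embedding with
    the corresponding region feature, so boxes and words share one input
    sequence that the LLM attends over jointly."""
    out = token_embs.copy()
    slots = [i for i, t in enumerate(token_ids) if t == region_token_id]
    assert len(slots) == len(region_feats), "need one feature per region slot"
    for i, feat in zip(slots, region_feats):
        out[i] = feat
    return out

# Dummy prompt "what is <region> doing to <region>?" with token id 99
# standing in for the hypothetical <region> placeholder.
token_embs = np.zeros((5, 4))
token_ids = [1, 99, 2, 99, 3]
region_feats = [np.ones(4), np.full(4, 2.0)]
seq = insert_region_features(token_embs, token_ids, region_feats, region_token_id=99)
```

Because the region features live in the same sequence as word embeddings, multi-region questions ("compare <region> and <region>") need no architectural change, only more placeholder slots.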
-
Santosh Sawant
Senior ML Architect, LLMs | Ex-Ola
FIND: INterface for Foundation models’ embeDDings

Foundation models across the vision and language domains, such as GPT4, DALLE-3, SAM, and LLaMA, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA). However, training individual foundation models has become remarkably costly. Furthermore, the full potential of these models remains untapped due to limitations in their fixed output modalities (i.e. text output for Q&A and visual output for image generation). Although techniques such as prompt engineering and adaptive tuning have shown promising results, these approaches struggle to integrate different foundation models off the shelf and to expand the output types and task objectives.

The paper proposes FIND, a generalized interface for aligning foundation models’ embeddings. The interface enables task-adaptive prototyping: adapting to a new task only requires changing the config file, not the model architecture. Because all the vision-language tasks are trained in a unified way, this creates an interleaved shared embedding space where vision and language references are replaceable and addable. The proposed interface has the following favorable attributes:
1. Generalizable. It applies to various tasks spanning retrieval, segmentation, etc., under the same architecture and weights.
2. Prototypable. Different tasks can be implemented by prototyping attention masks and embedding types.
3. Extendable. The interface is adaptive to new tasks and new models.
4. Interleavable. With the benefit of multi-task, multi-modal training, the interface creates an interleaved shared embedding space.

Furthermore, FIND achieves SoTA performance on interleaved image retrieval and segmentation, and shows better or comparable performance on generic/interactive/grounded segmentation and image-text retrieval.
Paper: https://lnkd.in/gGcvmv9k
Check out more paper reviews here: https://lnkd.in/gaCZSrXm
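"Prototyping a task through attention masks and embedding types" can be made concrete with a toy config-driven sketch. The task names, config schema, and attention rules below are invented for illustration; FIND's real config format is in the paper and repo:

```python
import numpy as np

# Hypothetical task configs in the spirit of "change the config file,
# not the architecture": each task declares its embedding types and
# which type may attend to which.
TASKS = {
    "image_text_retrieval": {
        "embeddings": ["image", "text"],
        # Modalities stay separate; alignment happens after pooling.
        "attend": [("image", "image"), ("text", "text")],
    },
    "grounded_segmentation": {
        "embeddings": ["image", "text"],
        # Image tokens may also look at the text prompt.
        "attend": [("image", "image"), ("image", "text"), ("text", "text")],
    },
}

def build_attention_mask(task, counts):
    """Expand a task's type-level rules into a token-level boolean
    attention mask (True = query token may attend to key token)."""
    cfg = TASKS[task]
    types = [t for t in cfg["embeddings"] for _ in range(counts[t])]
    allowed = set(cfg["attend"])
    n = len(types)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            mask[i, j] = (types[i], types[j]) in allowed
    return mask

# Two image tokens followed by two text tokens.
mask = build_attention_mask("grounded_segmentation", {"image": 2, "text": 2})
```

Swapping tasks only swaps the mask and embedding-type declarations; the transformer weights underneath are untouched, which is what makes one set of weights serve retrieval and segmentation alike.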
-
Shawn Horton
Key Account Executive, Healthcare @ Google Cloud
Optical character recognition has become the standard way developers extract and utilize text and layout data from PDFs and images. In this blog, we will discuss the history of #OCR, where the technology is headed, and how it is more important than ever with the rise of large language models (#LLMs).
-
Hamdi Amroun, PhD
Head of AI (ex. AWS)
JPMorgan has introduced DocLLM, a generative language model designed to understand multimodal documents with complex layouts. This model focuses on documents like forms, invoices, receipts, reports, and contracts, which contain both text and spatial elements. Unlike other multimodal models that use image encoders, DocLLM exclusively uses bounding box information to incorporate spatial layout. It achieves this by breaking down the attention mechanism in classical transformers into a set of disentangled matrices. Additionally, it employs a pre-training objective to fill in text segments, making it effective for irregular layouts and diverse content found in visual documents. The model is fine-tuned using a large instruction dataset for various document intelligence tasks and outperforms state-of-the-art language models on 14 out of 16 datasets across all tasks while also generalizing well to 4 out of 5 previously unseen datasets. You can find the full paper here: https://lnkd.in/euGH5KuM
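The "disentangled matrices" idea can be sketched numerically: text and bounding-box embeddings get separate query/key projections, and the attention logits are a weighted sum of the four cross terms. This is a simplified single-head illustration with made-up weights, not DocLLM's implementation (the paper's lambda values and projection details differ):

```python
import numpy as np

def disentangled_scores(T, S, Wq_t, Wk_t, Wq_s, Wk_s, lambdas=(1.0, 1.0, 1.0)):
    """Attention logits as a sum of four disentangled terms:
    text-to-text, text-to-spatial, spatial-to-text, spatial-to-spatial.
    T: (n, d) text embeddings; S: (n, d) box (spatial) embeddings;
    the lambdas weight the three non-text-text terms."""
    qt, kt = T @ Wq_t, T @ Wk_t   # text projections
    qs, ks = S @ Wq_s, S @ Wk_s   # spatial projections
    l_ts, l_st, l_ss = lambdas
    return (qt @ kt.T             # text queries, text keys
            + l_ts * (qt @ ks.T)  # text queries, spatial keys
            + l_st * (qs @ kt.T)  # spatial queries, text keys
            + l_ss * (qs @ ks.T)) # spatial queries, spatial keys

rng = np.random.default_rng(0)
n, d = 4, 8
T = rng.normal(size=(n, d))   # token embeddings
S = rng.normal(size=(n, d))   # encoded bounding boxes, one per token
W = [rng.normal(size=(d, d)) for _ in range(4)]
scores = disentangled_scores(T, S, *W)
```

Because layout enters only through these extra score terms, the model never needs an image encoder: a cheap OCR pass that yields (text, box) pairs is enough input.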
-
Sukhwinder S
Building Programs || Data & Design || Learning & Development
JPMorgan introduced DocLLM, a generative LM designed to understand multimodal docs with complex layouts.
- DocLLM is an extension to language models for reasoning over visual documents using both text and layout
- Uses bounding box info to incorporate spatial layout instead of expensive image encoders
- Captures cross-alignment between text and layout via disentangled attention matrices
- Pre-trains the model to infill missing text segments to handle irregular layouts
- Fine-tunes on an instruction dataset covering 4 core document intelligence tasks
- Outperforms state of the art on 14/16 datasets and generalizes well to 4/5 unseen datasets
#genai #llm
-
Backplain
"Capable and Powerful Small Language Models" + "Open Models will become comparable with proprietary models" = Lots of Large Language Model (LLM) choice = Multi-model support = Backplain. #controlai #llm
https://hubs.ly/Q02d-r_00
-
SentientMatters
Open Interpreter Evolves: OS Mode and Visual Mode Enhancements Unveiled
The Open Interpreter project successfully emulates OpenAI's code interpreter, and now introduces an OS mode and a visual mode that let users control their computer with a language model. The visual mode enables interaction by viewing the screen and clicking buttons, expanding the project's capabilities beyond text-based emulation. This development enhances user engagement and usability, making Open Interpreter a versatile tool for controlling computer functions through a language-model interface and offering an innovative approach to human-computer interaction.
#OpenInterpreter #AI #LanguageModel #Innovation #HumanComputerInteraction #Emulation
-
Interested in #llms and #finance? Check out DocLLM, introduced by J.P. Morgan, which is designed to understand multimodal documents with complex layouts. This model focuses on documents like forms, invoices, receipts, reports, and contracts, which contain both text and spatial elements. Unlike other multimodal models that use image encoders, DocLLM exclusively uses bounding box information to incorporate spatial layout. It achieves this by breaking down the attention mechanism in classical transformers into a set of disentangled matrices. Additionally, it employs a pre-training objective to fill in text segments, making it effective for the irregular layouts and diverse content found in visual documents. The model is fine-tuned on a large instruction dataset for various document intelligence tasks, outperforms state-of-the-art language models on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets. See the post below for details! 🔥
#transformers #multimodal #ai