RoboBrain 2.5: Advanced version of RoboBrain. Depth in Sight, Time in Mind. 🎉🎉🎉
Updated Feb 28, 2026 - Python
UI-Venus is a native UI agent designed to perform precise GUI element grounding and effective navigation using only screenshots as input.
Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
🔥🔥🔥[AAAI 2026 Oral] Official Implementation of Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
[ACL 2025] The code repository for "Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning" in PyTorch.
🦙 echoOLlama: A real-time voice AI platform powered by local LLMs. Features WebSocket streaming, voice interactions, and OpenAI API compatibility. Built with FastAPI, Redis, and PostgreSQL. Perfect for private AI conversations and custom voice assistants.
Not a neutral survey — a field manual for engineers who build, train, and ship multimodal retrieval at production scale. The C-L-I triangle (Compression · Localization · Instruction), MLLM encoders vs late interaction, MUVERA economics, and falsifiable forecasts through 2030.
Build a simple, basic multimodal large model from scratch. 🤖
A comprehensive survey of Vision–Language Models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets
[AAAI'26] Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
Uses MAIRA-2, a multimodal transformer designed to generate grounded or non-grounded radiology reports from chest X-rays.
Evaluating ‘Graphical Perception’ with Multimodal Large Language Models
Multi-Modal Healthcare Assistant
Gemma3 Vision - AI Image Analysis & Chat
ElaMath is a smart, voice-enabled math assistant that helps students solve and understand math problems using both spoken questions and images. It’s powered by the multimodal meta-llama/llama-4-scout-17b-16e-instruct model via the Groq API, combined with Whisper for speech recognition and ElevenLabs/gTTS for natural voice responses.
Elarova — A smart, multimodal research assistant designed to help students by combining speech, text, and other input modes for efficient academic research and study support. Powered by state-of-the-art speech recognition, text-to-speech, and AI models, including meta-llama/llama-4-scout-17b-16e-instruct, with an easy-to-use Gradio web interface.
Create a tool that uses a multimodal LLM to describe testing instructions for any digital product's features, based on the screenshots.
LLMChat is an open-source, privacy-first AI chatbot (powering LLMChat.co). It’s a Next.js + TypeScript monorepo that gives you one interface for multiple LLMs (OpenAI, Anthropic, Google, Groq, Ollama, etc.) with Deep Research and Pro Search modes, optional auth and credits, and local-first storage (IndexedDB) so chat history stays in the browser.
Multimodal Document Intelligence for better understanding and context awareness of academic documents