Meta has introduced Chameleon, a new family of multimodal AI models designed to handle multiple modalities natively.
Unlike the common “late fusion” approach, Chameleon uses an “early-fusion token-based mixed-modal” architecture, transforming images into discrete tokens and using a unified vocabulary for text, code, and image tokens.
This allows Chameleon to achieve state-of-the-art performance in tasks like image captioning and visual question answering (VQA), while remaining competitive in text-only tasks.
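To make the early-fusion idea more concrete, here is a minimal, non-authoritative sketch of mapping text and image inputs into one shared token-id space. The vocabulary sizes and the toy tokenizers (`tokenize_text`, `quantize_image`) are illustrative assumptions, not Chameleon's actual components.

```python
# Sketch only: assumed sizes and toy tokenizers, not Chameleon's real pipeline.
TEXT_VOCAB_SIZE = 32_000        # assumed size of the text/code vocabulary
IMAGE_CODEBOOK_SIZE = 8_192     # assumed size of the image quantizer codebook

def tokenize_text(text: str) -> list[int]:
    """Toy text tokenizer: one id per character, confined to the text range."""
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def quantize_image(patches: list[list[float]]) -> list[int]:
    """Toy stand-in for a VQ-style image tokenizer: one discrete code per patch."""
    return [int(sum(p) * 1000) % IMAGE_CODEBOOK_SIZE for p in patches]

def to_unified_sequence(text: str, patches: list[list[float]]) -> list[int]:
    """Build one token sequence over a shared vocabulary.

    Image codes are offset past the text vocabulary, so text and image tokens
    coexist in a single id space that one decoder-only model can consume.
    """
    text_ids = tokenize_text(text)
    image_ids = [TEXT_VOCAB_SIZE + code for code in quantize_image(patches)]
    return text_ids + image_ids

if __name__ == "__main__":
    patches = [[0.1, 0.2], [0.9, 0.4]]          # pretend image patches
    seq = to_unified_sequence("Describe this image:", patches)
    print(seq[:5], "...", len(seq), "tokens in one shared id space")
```

The key point is the offsetting step: because image codes are shifted past the text range, the model never needs separate input pathways for the two modalities.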
Chameleon was trained on a vast dataset of roughly 4.4 trillion tokens using significant GPU resources, and it performs well across both text-only and multimodal tasks, surpassing comparable models on VQA and image-captioning benchmarks.
Meta's unified token space enables Chameleon to generate interleaved image and text sequences effectively. This makes it a strong candidate for open-source multimodal AI applications and could inspire advances in fields such as robotics.
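One way to picture how a unified token space supports interleaved generation: the model emits a single id stream, and a post-processing step routes each run of ids to either a text detokenizer or an image decoder based on its range. The threshold and the example ids below are assumptions for illustration, not Chameleon's published interface.

```python
TEXT_VOCAB_SIZE = 32_000   # assumed: ids below this threshold are treated as text tokens

def split_interleaved(ids: list[int]) -> list[tuple[str, list[int]]]:
    """Group a generated id stream into alternating ('text' / 'image', ids) runs."""
    segments: list[tuple[str, list[int]]] = []
    for tok in ids:
        kind = "text" if tok < TEXT_VOCAB_SIZE else "image"
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(tok)      # extend the current run
        else:
            segments.append((kind, [tok]))   # start a new run of the other modality
    return segments

if __name__ == "__main__":
    # Pretend model output: text ids, then a run of image ids, then text again.
    generated = [72, 105, 33, 32_500, 33_100, 40_000, 87, 111]
    for kind, toks in split_interleaved(generated):
        print(f"{kind}: {len(toks)} tokens")
```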
Meta has recently introduced two major developments in the realm of multimodal models: Chameleon and CM3leon.
- Chameleon is designed to understand and generate both images and text in any sequence.
- It shows exceptional performance across tasks such as visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
- Chameleon excels in image captioning, outperforming models like Flamingo, IDEFICS, and Llava-1.5.
- It is competitive with models like Mixtral 8x7B and Gemini-Pro on text-only benchmarks such as commonsense reasoning and reading comprehension.
- Notably, Chameleon unlocks new possibilities in mixed-modal reasoning and generation, handling tasks where the prompt or output contains mixed sequences of images and text.