MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
“In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an unified autoregressive model capable of generating both text and visual tokens.”
Visual-Predictive Instruction Tuning for Unified Multimodal LLMs