MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks
May 4, 2023

Posted by AJ Piergiovanni and Anelia Angelova, Research Scientists, Google Research

Vision-language foundational models are built on the premise of a single pre-training followed by subsequent adaptation to multiple downstream tasks. Two main and disjoint training scenarios are popular: CLIP-style contrastive learning and next-token prediction. Contrastive learning trains the model to predict whether image-text pairs correctly match, effectively aligning the image and text representations in a shared embedding space, whereas next-token prediction learns to generate text by predicting the most likely next token in a sequence.