Scaling Language-Free Visual Representation Learning
“In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders.”
Visual Self-Supervised Learning Matches Language-Supervised Methods at Scale ↗