Understand CLIP (Contrastive Language-Image Pre-Training) — Visual Models from NLP

CLIP introduces a model that enables zero-shot learning on a new dataset (not just a new example) by using natural language supervision during pre-training. In other words, to identify an object, you can provide the name or description of a class the model has never seen before.

Traditionally, a computer vision model was trained with just images, which means that to classify an object as a zebra, the model had to be trained on lots of zebra images. But what if you train a model using not just an image but also its associated text (e.g. a caption)? If you then train the model on hundreds of animals (excluding zebras) and test it with an image of a zebra plus a description of what a zebra looks like (like a horse, but with black and white stripes), the model may be able to classify the zebra without ever seeing one during training. This is called zero-shot learning.

Natural Language Supervision for Visual Models

The idea is to learn more about the image using supervision from natural language. However, it is hard to find large, high-quality datasets of crowd-labeled images with text. The paper introduces a new dataset of 400 million (image, text) pairs collected from the internet.

One way to train such a model is to jointly train an image CNN and a text transformer from scratch to predict the caption of an image, but that does not scale well. Moreover, training a model to predict the exact words of the text accompanying an image is hard, so a contrastive objective turns out to be easier and more efficient.

What is contrastive representation learning?

A contrastive representation captures information that is shared by multiple sources (images, text); the idea is to maximize the mutual information between them. Predictive learning might use an encoder-decoder setup to predict one source from the other. Contrastive learning, on the other hand, learns an embedding space that separates (contrasts) matching samples from mismatched ones across the two sources.

Training CLIP

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N x N possible pairings actually occurred. To do this, CLIP learns a multimodal embedding space by jointly training an image encoder and a text encoder to maximize the similarity of the N correct (image, text) pairs while penalizing the incorrect pairings.
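A minimal numpy sketch of this symmetric contrastive objective (the function name and the fixed temperature value are illustrative, not from the paper; in CLIP the temperature is a learned parameter and the embeddings come from the two trained encoders):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    image_emb, text_emb: (N, D) arrays; row i of each comes from the
    same (image, text) pair, so the diagonal holds the correct pairings.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # N x N cosine similarities, scaled by the temperature
    logits = image_emb @ text_emb.T / temperature

    # Cross-entropy in both directions: image -> text and text -> image;
    # the "label" for row/column i is always the diagonal entry i.
    def cross_entropy(logits, axis):
        log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits, axis=1) + cross_entropy(logits, axis=0))
```

Averaging the two directions makes the loss symmetric: the image encoder must pick the right caption for each image, and the text encoder must pick the right image for each caption.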

For the image encoder the paper uses two models:

  1. ResNet-50 as base architecture with modifications.
  2. Vision Transformer (ViT)

For the text encoder they used a transformer based on this paper.

CLIP Evaluation

We learnt that zero-shot learning is about predicting a new class, but the paper treats zero-shot learning as learning a new task. For example, another paper showed that a language model trained to generate Wikipedia articles also learned to reliably transliterate names between languages.

To perform zero-shot classification, CLIP uses the names of all the classes in the dataset as candidate text pairings and predicts the most probable (text, image) pair.

Since the class labels in a dataset are sometimes just a single word ('dog'), they replaced each label with a prompt such as 'A photo of a dog', sometimes providing more context: 'A photo of a dog, a type of pet'. This is similar to prompt engineering.
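The zero-shot classification step can be sketched as follows (a simplified illustration: the encoders are assumed to be already trained, so the function just takes precomputed embeddings; the function name and toy vectors are hypothetical):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose prompt embedding best matches the image.

    image_emb: (D,) embedding of one image.
    class_text_embs: (C, D) embeddings of prompts like "A photo of a {class}".
    """
    # Normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # one similarity score per class
    return class_names[int(np.argmax(sims))]
```

In practice the prompts would be built as `f"A photo of a {name}"` for each class name and run through the text encoder once; classification is then just an argmax over cosine similarities.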

The performance of CLIP on various tasks/datasets is summarized in this figure.

CLIP comparison with human performance

Humans showed an increase in average performance from 54% to 76% between zero-shot and one-shot (trained with one example) learning. Few-shot learning in the paper was not as strong, suggesting there is still room to improve the algorithm. One reason is that few-shot learning does not make use of prior knowledge the way humans do. However, out-of-distribution images are hard for both humans and the algorithm.


The following limitations are listed in the paper:

  1. CLIP's zero-shot performance is weak on certain fine-grained classification tasks, such as differentiating models of cars, species of flowers, and variants of aircraft, when compared with task-specific models.
  2. CLIP struggles with abstract tasks such as counting the number of objects in an image.
  3. For novel tasks, such as classifying the distance to the nearest car in a photo, the performance is close to random.
  4. For out-of-distribution tasks the performance is poor as well (OCR trained on typed documents works well, but on handwritten documents it does not).


CLIP presents a method to perform zero-shot learning on completely new tasks using a computer vision model pre-trained with supervision from natural language. It has shown promising results on multiple datasets, but still needs work on harder, more complex tasks.
