
Using the BLIP-2 Model for Image Captioning

2024-03-05

Overview

In the previous post, we looked at the BLIP model for image captioning. The same group of researchers at Salesforce later developed a more advanced version of that model, called BLIP-2. In this post we will look at the BLIP-2 model and how to use it for image captioning.

The BLIP-2 paper proposes a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages: the first stage bootstraps vision-language representation learning from a frozen image encoder, and the second stage bootstraps vision-to-language generative learning from a frozen language model. Despite having significantly fewer trainable parameters than existing methods, BLIP-2 achieves state-of-the-art performance on various vision-language tasks.
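
To make that data flow concrete, here is a toy PyTorch sketch of the design: a frozen image encoder and a frozen language model at either end, with only a small set of learnable query tokens (standing in for the Q-Former) and a projection layer trained in between. The module names, dimensions, and single-layer stand-ins are placeholders of ours, not the actual BLIP-2 implementation.

import torch
import torch.nn as nn

# Toy dimensions for illustration only; the real BLIP-2 models are far larger.
VIS_DIM, Q_DIM, LLM_DIM, NUM_QUERIES = 32, 16, 24, 8

class ToyBLIP2(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the frozen pre-trained image encoder (a ViT in BLIP-2).
        self.image_encoder = nn.Linear(VIS_DIM, VIS_DIM)
        # Lightweight "Q-Former": learnable query tokens that cross-attend
        # to the frozen image features.
        self.queries = nn.Parameter(torch.randn(1, NUM_QUERIES, Q_DIM))
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=Q_DIM, num_heads=4, kdim=VIS_DIM, vdim=VIS_DIM, batch_first=True
        )
        # Projection from the query outputs into the frozen LLM's input space.
        self.proj = nn.Linear(Q_DIM, LLM_DIM)
        # Stand-in for the frozen large language model (OPT or Flan-T5 in BLIP-2).
        self.llm = nn.Linear(LLM_DIM, LLM_DIM)
        # Only the queries, cross-attention, and projection are trainable;
        # both ends stay frozen.
        for module in (self.image_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, patch_features):
        with torch.no_grad():
            visual = self.image_encoder(patch_features)           # frozen encoder
        q = self.queries.expand(patch_features.size(0), -1, -1)   # one query set per image
        q, _ = self.cross_attn(q, visual, visual)                 # queries distil visual info
        return self.llm(self.proj(q))                             # handed to the frozen LLM

fake_patches = torch.randn(2, 10, VIS_DIM)   # 2 "images", 10 patch features each
print(ToyBLIP2()(fake_patches).shape)        # torch.Size([2, 8, 24])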

Image captioning with BLIP-2

Step 1: Install the lavis library

The lavis library provides a simple API for loading pre-trained models and for processing images and text. It can be installed with pip:

$ pip install salesforce-lavis
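
After installation, one quick sanity check is to print LAVIS's model zoo, which lists the available architectures and model types. A minimal check, assuming the package installed cleanly and your version includes the BLIP-2 models:

from lavis.models import model_zoo

# Prints a table of available architectures and model types,
# including the blip2_t5 architecture used below.
print(model_zoo)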

Step 2: Generate image captions

The code snippet below demonstrates how to use the BLIP-2 model for image captioning.

import torch
import requests
from PIL import Image
from lavis.models import load_model_and_preprocess

# Download the example image and convert it to RGB.
img_url = 'https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained BLIP-2 model (a FlanT5-XL variant fine-tuned for COCO captioning)
# together with its matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5",
    model_type="caption_coco_flant5xl",
    is_eval=True,
    device=device
)
# Preprocess the image and add a batch dimension.
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption using beam search.
model.generate({"image": image})

# Generate captions using nucleus sampling.
# Nucleus sampling is non-deterministic, so you may get different captions on each run.
model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)

CaptionCraft support for BLIP-2

CaptionCraft provides an easy-to-integrate API for image captioning using the BLIP-2 model. You can try it out for free at https://rapidapi.com/fantascatllc/api/image-caption-generator2.
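
For reference, a typical RapidAPI call from Python looks like the sketch below. The endpoint path, host, query parameter, and response shape shown here are illustrative assumptions, not the documented CaptionCraft contract; check the RapidAPI page above for the actual details and substitute your own API key.

import requests

# NOTE: the path "/v2/captions", the "imageUrl" parameter, and the response
# format are assumptions for illustration; consult the RapidAPI listing for
# the real contract. Replace YOUR_RAPIDAPI_KEY with your own key.
response = requests.get(
    "https://image-caption-generator2.p.rapidapi.com/v2/captions",
    params={"imageUrl": "https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg"},
    headers={
        "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
        "X-RapidAPI-Host": "image-caption-generator2.p.rapidapi.com",
    },
)
print(response.status_code, response.json())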