
Using the BLIP Model for Image Captioning

2024-03-01

Overview

BLIP is an open-source vision-language model (the source code is available at https://github.com/salesforce/BLIP). It can perform a variety of multimodal tasks, including:

  • Image captioning
  • Visual question answering
  • Image-text retrieval (Image-text matching)

BLIP was proposed in the paper BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. The diagram below demonstrates how BLIP works at a high level.

Image captioning with BLIP

Next we will demonstrate how to use the BLIP model for image captioning from scratch.

Step 1: Clone the BLIP repository

$ git clone https://github.com/salesforce/BLIP

Step 2: Create a virtual environment and install the required packages

$ cd BLIP/
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
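
Optionally, you can run a quick sanity check at this point (not part of the original setup steps, just a convenience) to confirm that PyTorch was installed and to see whether a GPU is visible; the captioning code later in this post falls back to the CPU automatically if it is not:

# Optional sanity check: prints the installed PyTorch version and whether CUDA is available.
# A CPU-only setup still works for the examples below, just more slowly.
import torch

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')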

Step 3: Generate image captions

The following Python code shows how to generate image captions with the BLIP model. It loads a demo image from the internet and produces two captions, one with beam search and one with nucleus sampling, which are two popular decoding strategies for sequence generation. Simply put, beam search is deterministic, so the generated caption stays consistent from run to run, while nucleus sampling is stochastic and tends to produce more diverse captions, at the cost of the output varying between runs.
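
If you are curious what the top_p parameter used below actually controls, here is a small, self-contained illustration (not BLIP code, just a toy sketch of the idea): tokens are sorted by probability, the smallest set whose cumulative probability reaches top_p is kept, and the next token is sampled from that renormalized subset.

# Illustrative sketch of nucleus (top-p) sampling over a toy next-token distribution.
# This is not part of BLIP; it only shows the idea behind the top_p parameter.
import torch

def nucleus_sample(probs, top_p=0.9):
    # Sort token probabilities in descending order.
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose cumulative probability reaches top_p.
    cutoff = int(torch.searchsorted(cumulative, top_p).item()) + 1
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    # Sample one token from the renormalized nucleus.
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_indices[choice].item()

toy_probs = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])  # toy next-token distribution
print(nucleus_sample(toy_probs, top_p=0.9))  # samples only from the high-probability tokens

BLIP's generate method handles this internally, so you never need to write it yourself. With that in mind, here is the full captioning script: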

from PIL import Image
import requests
import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

from models.blip import blip_decoder  # from the cloned BLIP repository

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
img_url = 'https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg'
model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth'

def load_demo_image(image_size, device):
    # Download the demo image and convert it to RGB.
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
    # Resize and normalize the image the way BLIP expects, then add a batch dimension.
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
    ])
    image = transform(raw_image).unsqueeze(0).to(device)
    return image

image_size = 384
image = load_demo_image(image_size=image_size, device=device)

# Load the pretrained captioning checkpoint and put the model in inference mode.
model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

with torch.no_grad():
    # Beam search: deterministic, consistent caption across runs.
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
    print(f'caption (beam search): {caption[0]}')

    # Nucleus sampling: stochastic, caption may vary across runs.
    caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)
    print(f'caption (nucleus sampling): {caption[0]}')
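
To caption your own images instead of the demo URL, you can append a small variation to the script above. This is just a sketch that reuses the model, device, image_size, and transforms already defined; my_photo.jpg is a placeholder file name.

# Sketch: caption a local image with the model loaded above.
# 'my_photo.jpg' is a placeholder; replace it with a path to your own image.
raw_image = Image.open('my_photo.jpg').convert('RGB')
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
])
image = transform(raw_image).unsqueeze(0).to(device)

with torch.no_grad():
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
    print(f'caption (local image): {caption[0]}')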

CaptionCraft support for BLIP

CaptionCraft provides an easy-to-integrate API for image captioning using the BLIP model. You can try it out for free at https://rapidapi.com/fantascatllc/api/image-caption-generator2. The following snippet calls the API with the same demo image:

import requests

url = "https://image-caption-generator2.p.rapidapi.com/v2/captions/simple"
params = {"imageUrl": "https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg"}
headers = {
    "X-RapidAPI-Key": "<Your-RapidAPI-Key>",  # replace with your own RapidAPI key
    "X-RapidAPI-Host": "image-caption-generator2.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=params)
print(response.json())
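
For anything beyond a quick test, it is also worth adding basic error handling around the request; here is a minimal sketch (the exact structure of the JSON response is defined by the API and not assumed here):

# Minimal error-handling sketch around the same request.
response = requests.get(url, headers=headers, params=params, timeout=30)
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
data = response.json()
print(data)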