Using the BLIP Model for Image Captioning
2024-03-01

Overview
BLIP is an open-source model (source code is available at https://github.com/salesforce/BLIP). It is able to perform various multi-modal tasks, including:
- Image captioning
- Visual question answering
- Image-text retrieval (Image-text matching)
BLIP was proposed in the paper "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. The diagram below shows how BLIP works at a high level.
Image captioning with BLIP
Next we will demonstrate how to use the BLIP model for image captioning from scratch.
Step 1: Clone the BLIP repository
$ git clone https://github.com/salesforce/BLIP
Step 2: Create a virtual environment and install the required packages
$ cd BLIP/
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
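The captioning demo below relies on PyTorch and torchvision; depending on your environment, requirements.txt may not pull them in, so you may need to install them separately (for example with pip install torch torchvision). As an optional sanity check (not part of the original instructions), you can confirm that PyTorch is importable and whether CUDA is visible:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"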
Step 3: Generate image captions
The following Python code shows how to generate image captions using the BLIP model. The code loads a demo image from the internet and generates two captions, one with beam search and one with nucleus sampling. These are two popular decoding strategies for sequence generation: beam search is deterministic, so it produces the same caption on every run, while nucleus sampling is stochastic, which tends to yield more diverse captions at the cost of the output changing from run to run.
from PIL import Image
import requests
import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

from models.blip import blip_decoder

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

img_url = 'https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg'
model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth'

def load_demo_image(image_size, device):
    # Download the demo image and convert it to RGB
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

    # Resize and normalize the image the same way the model was trained
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711))
    ])
    # Add a batch dimension and move the tensor to the target device
    image = transform(raw_image).unsqueeze(0).to(device)
    return image

image_size = 384
image = load_demo_image(image_size=image_size, device=device)

# Load the pretrained captioning checkpoint (ViT-B backbone)
model = blip_decoder(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

with torch.no_grad():
    # Beam search: deterministic, same caption every run
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
    print(f'caption (beam search): {caption[0]}')

    # Nucleus sampling: stochastic, caption may differ between runs
    caption = model.generate(image, sample=True, top_p=0.9, max_length=20, min_length=5)
    print(f'caption (nucleus sampling): {caption[0]}')
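To make the difference between the two decoding strategies more concrete, here is a toy, self-contained sketch of top-p (nucleus) sampling on a hand-made probability distribution over a tiny vocabulary. It is purely illustrative and independent of BLIP: the vocabulary, the probabilities, and the top_p_sample helper are made up for this example, and it is a simplified variant rather than BLIP's internal implementation.

import torch

def top_p_sample(probs, top_p=0.9):
    # Keep the highest-probability tokens whose cumulative mass stays within top_p,
    # renormalize them, and sample one token from that "nucleus".
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative <= top_p
    keep[0] = True  # always keep at least the most likely token
    nucleus = sorted_probs[keep] / sorted_probs[keep].sum()
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_idx[keep][choice].item()

# Toy next-token distribution over a 5-word vocabulary (made up for illustration)
vocab = ['a', 'dog', 'cat', 'on', 'grass']
probs = torch.tensor([0.05, 0.45, 0.30, 0.15, 0.05])

# A greedy/beam-style choice is deterministic: always the argmax
print('deterministic pick:', vocab[torch.argmax(probs).item()])

# Nucleus sampling is stochastic: repeated calls can return different tokens
for _ in range(3):
    print('nucleus sample:', vocab[top_p_sample(probs, top_p=0.9)])

This captures why the sampled caption can change between runs while the beam-search caption does not: sampling draws from a truncated, renormalized distribution instead of always taking the most likely continuation.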
CaptionCraft support for BLIP
CaptionCraft provides an easy-to-integrate API for image captioning using the BLIP model. You can try it out for free at https://rapidapi.com/fantascatllc/api/image-caption-generator2.
import requests

url = "https://image-caption-generator2.p.rapidapi.com/v2/captions/simple"

# Image to caption, passed as a query parameter
params = {"imageUrl": "https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg"}

headers = {
    "X-RapidAPI-Key": "<Your-RapidAPI-Key>",  # replace with your own RapidAPI key
    "X-RapidAPI-Host": "image-caption-generator2.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=params)
print(response.json())
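In a real application you will likely want a request timeout and basic error handling around the call. Below is a minimal sketch under those assumptions; the caption_image helper is our own naming, and the exact structure of the JSON response is not documented here, so consult the API page for the available fields.

import requests

API_URL = "https://image-caption-generator2.p.rapidapi.com/v2/captions/simple"
API_HOST = "image-caption-generator2.p.rapidapi.com"

def caption_image(image_url, api_key, timeout=30):
    # Wrap the CaptionCraft endpoint with a timeout and HTTP error checking.
    headers = {
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": API_HOST,
    }
    response = requests.get(
        API_URL,
        headers=headers,
        params={"imageUrl": image_url},
        timeout=timeout,
    )
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()

if __name__ == "__main__":
    result = caption_image(
        "https://i.pinimg.com/564x/26/c7/35/26c7355fe46f62d84579857c6f8c4ea5.jpg",
        api_key="<Your-RapidAPI-Key>",
    )
    print(result)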