LLaVA is a large multimodal model that connects a vision encoder with a language model, aiming for general-purpose visual and language understanding.
BLIP-2 is the successor to the BLIP model: it bridges a frozen image encoder and a frozen large language model with a lightweight Querying Transformer (Q-Former), which makes it comparatively cheap to train while still producing strong captions. This post explains how you can use BLIP-2 to generate captions for images, as sketched in the example below.
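Here is a minimal captioning sketch, assuming the Hugging Face transformers library (its `Blip2Processor` and `Blip2ForConditionalGeneration` classes), the `Salesforce/blip2-opt-2.7b` checkpoint, and optionally a CUDA GPU; the image URL is only a placeholder and none of these specifics are prescribed by this post.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint: a frozen ViT image encoder + Q-Former + OPT-2.7B language model.
checkpoint = "Salesforce/blip2-opt-2.7b"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Processor handles image preprocessing and tokenization; the model does the generation.
processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=dtype)
model.to(device)

# Any RGB image works; this URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image, generate token IDs, and decode them into a caption.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

Because the image encoder and language model stay frozen and only the Q-Former links them, the same `generate` call can also be steered with a text prompt (for example a question passed via `processor(images=image, text=prompt, ...)`) rather than producing an unconditional caption.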
You can also learn more about the original BLIP model and how to integrate it for caption generation; a similar sketch using BLIP follows below.
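For comparison, this is what captioning with the original BLIP looks like, again assuming Hugging Face transformers and the `Salesforce/blip-image-captioning-base` checkpoint (an assumption, not a detail from this post); the local file path is a placeholder.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# The original BLIP captioning model is small enough to run comfortably on CPU.
checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

# Placeholder path: point this at your own image.
image = Image.open("example.jpg").convert("RGB")

# Preprocess, generate, and decode the caption.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```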