How to Implement Image Captioning with Vision Transformer (ViT) and Hugging Face Transformers
Image captioning is a well-known multimodal task that combines computer vision and natural language processing. The topic has been studied extensively over the years, and the models available today are strong enough to handle a wide variety of cases. In this article, we will explore how to use the Hugging Face transformers library to run modern sequence-to-sequence models built from a Vision Transformer (ViT) encoder and a GPT-based decoder. We will see how Hugging Face makes it simple to use openly available models to perform image captioning.

Model Selection and Architecture

We use the pre-trained ViT-GPT2 image-captioning model published by nlpconnect on the Hugging Face Hub. Image captioning takes an image as input and outputs a textual description of that image. For this task, we use a multimodal model divided into two parts: an encoder and a decoder. The encoder takes the raw image pixels as input and uses a neural network to transform them into a 1-dimensional compressed representation (a sequence of embeddings); the decoder then conditions on that representation to generate the caption one token at a time.
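As a minimal sketch of the workflow described above, the snippet below loads the nlpconnect/vit-gpt2-image-captioning checkpoint through the VisionEncoderDecoderModel class and generates a caption for a local image. The file path example.jpg and the generation settings (beam search, max_length) are illustrative assumptions, not fixed requirements.

```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Load the pre-trained ViT encoder + GPT-2 decoder, along with the
# image processor (for the encoder) and tokenizer (for the decoder).
model_id = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# "example.jpg" is a placeholder path; substitute your own image.
image = Image.open("example.jpg").convert("RGB")

# Turn raw pixels into the tensor format the ViT encoder expects.
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.to(device)

# The GPT-based decoder generates the caption token by token.
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```

Beam search (num_beams=4) is one reasonable decoding choice for captioning; greedy decoding or sampling would also work, trading caption quality against speed.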