import streamlit as st

# Custom CSS for better styling
st.markdown("""
""", unsafe_allow_html=True)

# Main Title
st.markdown('Image Captioning with VisionEncoderDecoderModel', unsafe_allow_html=True)

# Description
st.markdown("""

VisionEncoderDecoderModel allows you to initialize an image-to-text model using any pretrained Transformer-based vision model (e.g., ViT, BEiT, DeiT, Swin) as the encoder and any pretrained language model (e.g., RoBERTa, GPT2, BERT, DistilBERT) as the decoder.

The effectiveness of this approach has been demonstrated in models such as TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li et al.

After training or fine-tuning a VisionEncoderDecoderModel, it can be saved and loaded just like any other model. Examples are provided below.

""", unsafe_allow_html=True) # Image Captioning Overview st.markdown('

# Image Captioning Overview
st.markdown('What is Image Captioning?', unsafe_allow_html=True)

st.markdown("""

Image Captioning is the task of generating a textual description of an image. It uses a model to encode the image into a feature representation, which is then decoded by a language model to produce a natural language description.

### How It Works

Image captioning typically involves the following steps (a minimal code sketch follows at the end of this section):

1. **Encode**: a pretrained vision model (e.g., ViT) converts the image into a feature representation.
2. **Decode**: a pretrained language model attends to those features and generates text token by token.
3. **Output**: the generated tokens are assembled into a natural language caption of the image.

### Why Use Image Captioning?

Image captioning is useful whenever visual content needs to be described automatically in natural language, turning images into text that downstream systems and users can work with.

### Where to Use It

Applications of image captioning span various domains in which machines must interpret visual content and communicate it to users as text.

### Importance

Image captioning is essential for bridging the gap between visual and textual information, enabling better interaction between machines and users by providing context and meaning to images.

""", unsafe_allow_html=True) # How to Use st.markdown('

# How to Use
st.markdown('How to Use the Model', unsafe_allow_html=True)

st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Load image data
imageDF = spark.read \\
    .format("image") \\
    .option("dropInvalid", value=True) \\
    .load("src/test/resources/image/")

# Define Image Assembler
imageAssembler = ImageAssembler() \\
    .setInputCol("image") \\
    .setOutputCol("image_assembler")

# Define VisionEncoderDecoder for image captioning
imageCaptioning = VisionEncoderDecoderForImageCaptioning \\
    .pretrained() \\
    .setBeamSize(2) \\
    .setDoSample(False) \\
    .setInputCols(["image_assembler"]) \\
    .setOutputCol("caption")

# Create pipeline
pipeline = Pipeline().setStages([imageAssembler, imageCaptioning])

# Apply pipeline to image data
pipelineDF = pipeline.fit(imageDF).transform(imageDF)

# Show results
pipelineDF \\
    .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "caption.result") \\
    .show(truncate=False)
''', language='python')
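
# Illustrative example: persisting the fitted pipeline
st.markdown("""
As noted in the description, a trained or fine-tuned model can be saved and reloaded like any other model. A minimal sketch using the standard Spark ML persistence API (the path is just an example):
""", unsafe_allow_html=True)

st.code('''
from pyspark.ml import PipelineModel

# Persist the fitted pipeline to disk (example path)
pipelineModel = pipeline.fit(imageDF)
pipelineModel.write().overwrite().save("/tmp/image_captioning_pipeline")

# Reload it later and reuse it on new image data
loadedPipeline = PipelineModel.load("/tmp/image_captioning_pipeline")
loadedPipeline.transform(imageDF).select("caption.result").show(truncate=False)
''', language='python')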

# Results
st.markdown('Results', unsafe_allow_html=True)

st.markdown("""
| Image Name | Result |
|------------|--------|
| palace.JPEG | [a large room filled with furniture and a large window] |
| egyptian_cat.jpeg | [a cat laying on a couch next to another cat] |
| hippopotamus.JPEG | [a brown bear in a body of water] |
| hen.JPEG | [a flock of chickens standing next to each other] |
| ostrich.JPEG | [a large bird standing on top of a lush green field] |
| junco.JPEG | [a small bird standing on a wet ground] |
| bluetick.jpg | [a small dog standing on a wooden floor] |
| chihuahua.jpg | [a small brown dog wearing a blue sweater] |
| tractor.JPEG | [a man is standing in a field with a tractor] |
| ox.JPEG | [a large brown cow standing on top of a lush green field] |
""", unsafe_allow_html=True) # Model Information st.markdown('

# Model Information
st.markdown('Model Information', unsafe_allow_html=True)

st.markdown("""
| Attribute | Description |
|-----------|-------------|
| Model Name | image_captioning_vit_gpt2 |
| Compatibility | Spark NLP 5.1.2+ |
| License | Open Source |
| Edition | Official |
| Input Labels | [image_assembler] |
| Output Labels | [caption] |
| Language | en |
| Size | 890.3 MB |
""", unsafe_allow_html=True) # Data Source Section st.markdown('

# Data Source Section
st.markdown('Data Source', unsafe_allow_html=True)

st.markdown("""

The image captioning model is available on Hugging Face. This model uses ViT for image encoding and GPT2 for generating captions.

""", unsafe_allow_html=True) # Conclusion st.markdown('
Conclusion
', unsafe_allow_html=True) st.markdown("""

The VisionEncoderDecoderModel represents a powerful approach for bridging the gap between visual and textual information. By leveraging pretrained models for both image encoding and text generation, it effectively captures the nuances of both domains, resulting in high-quality outputs such as detailed image captions and accurate text-based interpretations of visual content.

""", unsafe_allow_html=True) # References st.markdown('
References
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True) # Community & Support st.markdown('
Community & Support
', unsafe_allow_html=True) st.markdown("""
""", unsafe_allow_html=True)