
E3S Web of Conferences 399, 04005 (2023) https://doi.org/10.1051/e3sconf/202339904005
ICONNECT-2023

Deep learning-based image captioning for visually impaired people
R. Kavitha1*, S. Shree Sandhya2, Praveena Betes2, P. Rajalakshmi2, and E. Sarubala2
1Professor, CSE Department, Parisutham Institute of Technology and Science, India
2UG students, CSE Department, Parisutham Institute of Technology and Science, India

Abstract. Vision loss can affect people of all ages. Severe or complete
vision loss may occur when the eye or the parts of the brain that process
images are damaged. In this paper, deep learning algorithms are used to
caption images for blind people, so that a blind user can learn about the
objects in a scene as well as their distance and position. Whenever an image
is captured via the camera, the scene is recognized and a caption is predicted
by the machine. After the prediction, the caption is sent as audio output to
the user. In this way, the system provides a form of artificial vision for the
blind and helps them gain confidence while travelling alone.

1 Introduction
Vision impairment may be due to sickness, an accident or a medical condition. This paper
aims to provide assistance for the blind so that they can feel more confident, secure
and independent. The work involves developing a system that can automatically generate
textual descriptions of images, allowing blind individuals to better understand and interact
with visual content. The system can be integrated into various platforms such as mobile
applications, websites and assistive devices, enabling visually impaired individuals to access
information in a way that was previously not possible. To analyze images and generate
descriptive captions, machine vision, natural language generation and deep learning techniques
are used together.

Convolutional neural networks (CNNs) for extracting picture features and recurrent neural
networks (RNNs) for generating language are the two deep learning techniques used to
complete the task. A CNN is a multi-layered neural network designed to extract features
that become increasingly complex at each layer and to determine the output. CNNs can
automatically extract meaningful visual features from the input image, which can then be
used to generate more accurate and descriptive captions. An RNN is a multi-layered neural
network that stores input in context nodes, enabling it to learn data sequences and produce
a sequence as output. RNNs are commonly used in image captioning as a decoder network
that generates the caption from the visual features extracted by the CNN.

* Corresponding author : kavithha@gmail.com

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (https://creativecommons.org/licenses/by/4.0/).

An image caption generator is a technique for recognizing and understanding the context of
an image and annotating it with relevant captions using deep learning. The goal of image
captioning is to produce a coherent and semantically meaningful description of an image that
captures the main objects, actions and attributes depicted in the visual content. Image
captioning has numerous uses in areas such as assistive technology, image search and
social media. Datasets are used to train the caption generator; the dataset used in this
paper is MS-COCO (Microsoft Common Objects in Context).

1.1 Literature Survey


The literature survey presents the ground study carried out for this work: a survey of
previously published research on image captioning and related deep learning methods.
[1] presents a text encoder, an image encoder, and a decoder to generate natural language
descriptions of images. The model works in two stages, using a reinforcement learning
algorithm to refine the initial caption generated in the first stage. [2] developed a
summarization module that generates a summary of a remote sensing image, which is then used
to guide the captioning process in the deep captioning module. The deep captioning module
generates a natural language description of the remote sensing image based on the summary
and the image itself. [3] is an approach that trains a neural network consisting of two main
components: a visual feature extractor and a language generator. The visual feature extractor
is typically a convolutional neural network (CNN) trained to extract features from the image,
while the language generator is a recurrent neural network (RNN) that generates the caption
word by word. [4] is a method for generating more precise and detailed image captions by
leveraging online positive recall and missing concepts mining; it uses a two-stage framework
to generate captions that are more informative and accurate. [5] is an approach that uses a
two-phase learning framework. In the first phase, a visual saliency detector is trained to
identify the salient regions of the image; in the second phase, a standard image captioning
model is trained using the saliency maps generated by the detector as additional input.
In [6], a cross-modal retrieval model is trained to learn a shared representation space for
images and captions from both the source and target domains, and a model adaptation technique
is used to fine-tune the retrieval model. [7] first extracts visual features from the input
image using a CNN and then encodes them into a fixed-length vector using a recurrent neural
network. In [8], the encoded visual features are used as input to a context-aware policy
network, which generates a sequence of words describing the image. [9] consists of two tasks:
a source domain captioning task and a target domain captioning task. The source domain task
is trained on a dataset of images from the same domain as the training data, while the target
domain task is trained on a smaller dataset of images from the target domain. [10] uses
instance-level fine-grained feature representation and demonstrates its effectiveness through
extensive experiments. [11] consists of two main components, a generator network and a
discriminator network, and combines generation- and retrieval-based methods using a dual
generator generative adversarial network. [12] improves the quality of the generated captions
with a novel loss function that combines the attribute detection loss, the attribute
prediction loss and the captioning loss.

2 Materials and Methods


As shown in Fig 1, once the image is captured, it is first divided into 'n' pieces and an
image representation is computed for each part. Using a compound coefficient, EfficientNet-B3
extracts the image's features and scales the depth, width and resolution uniformly in all
three dimensions. Tokenization splits the input data into a sequence of meaningful parts:
the image is split into patches and the text is split into tokens.
After tokenization, the extracted features are fed for training to the Recurrent Neural
Network (RNN). A feedback connection is the vital characteristic of an RNN: the feedback loop
carries the effect of earlier portions of a sequence to later portions, which is an essential
capability for modelling sequences. The MS-COCO dataset is used for understanding visual
scenes and generating captions.

Fig 1. Architecture Diagram
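The paper does not provide an implementation, but the encoder stage can be illustrated with a short sketch. Assuming TensorFlow/Keras and the ImageNet-pretrained EfficientNet-B3 (whose default input resolution is 300x300), the following shows how per-patch features could be extracted from a captured image and pooled into a single vector; the file name and the pooling choice are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code): EfficientNet-B3 as the frozen
# image encoder described above. Assumes TensorFlow 2.x.
import tensorflow as tf

def build_feature_extractor():
    # Pre-trained EfficientNet-B3 without its classification head; the spatial
    # feature map it produces corresponds to the "n pieces" of the image.
    base = tf.keras.applications.EfficientNetB3(
        include_top=False, weights="imagenet", input_shape=(300, 300, 3))
    base.trainable = False
    return base

def extract_features(extractor, image_path):
    # Load and resize the captured image, then run it through the frozen CNN.
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (300, 300))
    feats = extractor(tf.expand_dims(img, 0))   # grid of per-patch feature vectors
    return tf.reduce_mean(feats, axis=[1, 2])   # pooled (1, 1536) image representation

extractor = build_feature_extractor()
# features = extract_features(extractor, "captured_scene.jpg")  # hypothetical file name
```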

2.1 Convolutional Neural Network (CNN)

The CNN is trained on a large dataset of images to learn a hierarchical representation of
visual features. Once the CNN has extracted the features from the image, they are passed to
an RNN, which is responsible for generating the captions. During training, the input image is
fed through the pre-trained CNN, and the output features from one or more of the intermediate
layers are extracted. In Fig 2, the input layer, which receives the unprocessed picture data,
is the initial layer of a CNN. The convolutional layer follows, and it uses a series of filters
to extract features from the input image. These filters are learned during training and are
capable of identifying edges, corners and other interesting aspects of the image. The fully
connected layers use the features learned by the convolutional layers to assign the image to
one of several categories or to forecast a numerical value. CNNs are highly effective for
image recognition tasks because they are able to automatically learn features from the
input data.


Fig 2. Convolutional Neural Network
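As a minimal sketch of the layer stack just described (input layer, convolutional filters, pooling, fully connected layers), a toy Keras classifier is shown below; the layer sizes and the ten output categories are arbitrary illustrative choices, since the actual system uses a pre-trained network for feature extraction rather than a classifier trained from scratch.

```python
# Minimal CNN sketch mirroring Fig 2: convolutional layers learn edge/corner
# filters, fully connected layers map the learned features to categories.
import tensorflow as tf

num_classes = 10  # arbitrary; the real system extracts features rather than classifying

toy_cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),           # input layer: raw picture data
    tf.keras.layers.Conv2D(32, 3, activation="relu"),      # filters detect low-level features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),      # deeper filters: more complex features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),         # fully connected layer
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
toy_cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```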

2.2 Recurrent Neural Network (RNN)

The RNN takes the features from the CNN as input and generates a sequence of words, one
word at a time. RNNs are designed to handle sequential data, making them well-suited for
generating sequences of words such as captions. The RNN typically uses an LSTM (Long
Short-Term Memory) network, which is able to capture long-term dependencies in the sequence
of visual features. As shown in Fig 3, at each time step the LSTM takes the output from the
previous time step and combines it with the current input to generate a new output. This
output is then passed through a fully connected layer to produce a probability distribution
over the vocabulary of words. The final output of the RNN is a sequence of words that
describes the content of the image.

Fig 3. Recurrent Neural Network
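The paper does not spell out the decoder layout, but one common arrangement consistent with the description above combines the pooled CNN features with an embedding of the words generated so far, passes them through an LSTM, and applies a softmax over the vocabulary. The sketch below uses placeholder sizes for the vocabulary, embedding and feature vector.

```python
# Sketch of a "merge"-style LSTM decoder: CNN features and the partial caption
# are combined, and a softmax layer predicts the next word.
import tensorflow as tf

vocab_size, max_len, feat_dim = 10000, 30, 1536   # placeholder sizes

# Image branch: project the pooled CNN features into the decoder's hidden space.
img_in = tf.keras.Input(shape=(feat_dim,))
img_vec = tf.keras.layers.Dense(256, activation="relu")(img_in)

# Text branch: embed the words generated so far and run them through an LSTM.
txt_in = tf.keras.Input(shape=(max_len,))
emb = tf.keras.layers.Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = tf.keras.layers.LSTM(256)(emb)

# Merge both branches and predict a probability distribution over the next word.
merged = tf.keras.layers.add([img_vec, txt_vec])
hidden = tf.keras.layers.Dense(256, activation="relu")(merged)
out = tf.keras.layers.Dense(vocab_size, activation="softmax")(hidden)

decoder = tf.keras.Model([img_in, txt_in], out)
decoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```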

2.3 Training Dataset

In a machine learning model, a subset of the available data is taken as the training dataset.
It consists of a collection of input data and the corresponding output values. The MS-COCO
dataset is used in this paper. MS-COCO, a sizable image dataset with 328,000 photos of common
objects and people, is explored mainly to comprehend visual situations; it provides the output
captions for the input photos. The larger the training dataset, the more precisely the model
can learn and the better it generalises to new, unseen data. For supervised learning, the
training dataset contains pairings of input and output data, with the aim of teaching the
machine learning model to map the input data to the appropriate output data. In unsupervised
learning, the training dataset contains only input data, and the objective is to discover
patterns or structure in the data. As shown in Fig 4, the images in the MS-COCO dataset cover
a wide range of scenes and objects, including people, animals, vehicles, and indoor and
outdoor environments. This diversity makes the dataset well-suited for training machine
learning models that can generate accurate and diverse captions for a variety of images.

Fig 4. MS-COCO Dataset
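To show how the image-caption pairs might be read from MS-COCO, a short sketch using the pycocotools API follows; the annotation file path assumes the standard COCO directory layout and is not taken from the paper.

```python
# Sketch: reading image/caption pairs from the MS-COCO captions annotations.
# The path below assumes the standard COCO layout and is illustrative only.
from pycocotools.coco import COCO

coco = COCO("annotations/captions_train2017.json")  # assumed location

pairs = []
for img_id in coco.getImgIds():
    file_name = coco.loadImgs(img_id)[0]["file_name"]
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        # Each image has several human-written captions; keep all of them.
        pairs.append((file_name, ann["caption"].strip().lower()))

print(len(pairs), "image-caption pairs loaded")
```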

2.4 Tokenization

Text data modeling starts with tokenization. Tokenization is the process of dividing a
stream of textual data into tokens, which can be words, terms, sentences, symbols or other
significant objects. Unstructured data and natural-language text are tokenized into
informational units that can be treated as separate elements. Tokens can be words, characters
or subwords. As shown in Fig 5, the RNN uses the words that came before to predict the
subsequent words in a phrase. To achieve this, the tokenized word list in the caption of the
image is transformed: strings are converted to integers by the tokenization process. A
dictionary that maps every distinct word to a numerical index is created by first going
through all of the training captions, so each word encountered is assigned an integer value.

Fig 5. Tokenization
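A minimal sketch of the string-to-integer mapping described above, using the Keras Tokenizer: the dictionary is built from the training captions and each caption is converted into a padded sequence of word indices. The vocabulary limit, the startseq/endseq markers and the example captions are illustrative assumptions.

```python
# Sketch: build the word-to-index dictionary from the training captions and
# turn each caption string into a padded sequence of integers.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["startseq a dog runs on the beach endseq",
            "startseq two people ride bicycles endseq"]   # stand-ins for MS-COCO captions

tokenizer = Tokenizer(num_words=10000, oov_token="<unk>")
tokenizer.fit_on_texts(captions)              # builds tokenizer.word_index: word -> integer

sequences = tokenizer.texts_to_sequences(captions)           # strings turned into integers
padded = pad_sequences(sequences, maxlen=30, padding="post")
print(tokenizer.word_index["startseq"], padded.shape)
```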

2.5 Caption Generation

Caption generation in image captioning refers to the process of generating a natural
language description of an image. The goal is to train a machine learning model to
automatically generate captions that accurately and semantically describe the content of the
input image. Generating accurate and semantically meaningful captions for images is a
challenging task, and many techniques and approaches have been developed to improve the
performance of image captioning models.

Fig 6. Caption Generation

These include using attention mechanisms to focus on specific regions of the image,
incorporating external knowledge sources, and using reinforcement learning to optimize the
caption generation process. As shown in Fig 6, the captions for the image are generated and
produced as audio output, which helps blind people identify the objects around them.
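To make the caption-to-audio step concrete, the sketch below decodes a caption greedily, one word at a time, with a trained decoder and then speaks it with the pyttsx3 text-to-speech engine. The decoder, tokenizer and pooled image features refer back to the earlier sketches and are illustrative assumptions rather than components released with the paper.

```python
# Sketch: greedy decoding of a caption followed by audio output.
# `decoder`, `tokenizer` and the image features come from the earlier sketches.
import numpy as np
import pyttsx3
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(decoder, tokenizer, img_feats, max_len=30):
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    words = ["startseq"]
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([" ".join(words)])
        seq = pad_sequences(seq, maxlen=max_len, padding="post")
        probs = decoder.predict([img_feats, seq], verbose=0)[0]
        next_word = index_word.get(int(np.argmax(probs)), "endseq")
        if next_word == "endseq":
            break
        words.append(next_word)
    return " ".join(words[1:])

def speak(text):
    engine = pyttsx3.init()      # offline text-to-speech engine
    engine.say(text)
    engine.runAndWait()

# caption = greedy_caption(decoder, tokenizer, pooled_image_features)
# speak(caption)
```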

3 Results and Discussion


Once the image is fed into the CNN, it is first divided into 'n' pieces and an image
representation is computed for each part. EfficientNet-B3 extracts the features of the image
and scales the depth, width and resolution consistently in all three dimensions. Tokenization
splits the input data into a sequence of meaningful parts: the image is split into patches
and the text is split into tokens. After tokenization, the extracted features are fed for
training to the Recurrent Neural Network (RNN). An RNN's main characteristic is its feedback
links; this feedback loop gives the RNN the ability to model how the earlier portions of a
sequence affect the later portions, which is a crucial capability when modelling sequences.
The MS-COCO dataset, which contains 328,000 pictures of people and objects from everyday
life, is used to comprehend visual scenes. The captions are collected from the MS-COCO
dataset and the model is trained using image captioning techniques that implement an
attention mechanism. Whenever an image is captured, the scene is recognized and a caption is
predicted by the machine. After training the model, a live scene is captured via the camera;
the captured scene is recognized and the output model file is generated. The major objects
are predicted and their distance from the camera is calculated. After the prediction, the
caption is sent as audio output to the user.
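As a sketch of how the live-capture step described above might be wired together, the loop below grabs frames from a webcam with OpenCV, runs the current frame through the (assumed) extractor and decoder from the earlier sketches, and speaks the resulting caption; the key-press trigger and the helper names are illustrative assumptions.

```python
# Sketch: capture a live scene, caption it and speak the result.
# extractor / decoder / tokenizer / greedy_caption / speak are the assumed
# helpers from the earlier sketches.
import cv2
import tensorflow as tf

cap = cv2.VideoCapture(0)                      # default camera
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("scene", frame)
        key = cv2.waitKey(1) & 0xFF
        if key == ord("c"):                    # press 'c' to caption the current frame
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            img = tf.image.resize(tf.convert_to_tensor(rgb), (300, 300))
            feats = tf.reduce_mean(extractor(tf.expand_dims(img, 0)), axis=[1, 2])
            caption = greedy_caption(decoder, tokenizer, feats)
            speak(caption)
        elif key == ord("q"):                  # press 'q' to quit
            break
finally:
    cap.release()
    cv2.destroyAllWindows()
```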

4 Conclusion
In this paper, convolutional and recurrent neural networks, among other deep learning
models, were investigated to provide captions for images. The use of a pre-trained CNN,
the EfficientNet-B3 model, for feature extraction helped capture meaningful information from
images, while the RNN generates sequential words to form coherent captions. The deep learning
approach proved effective at producing accurate output. Overall, the study shows how deep
learning methods can provide precise and insightful descriptions for images, which has
benefits in areas such as image retrieval and image indexing.

The captions are collected from the MS-COCO dataset and the model is trained using advanced
image captioning techniques implementing an attention mechanism. Whenever an image is
captured via the camera, the scene is recognized and a caption is predicted by the machine.
After training the model, a live scene is captured via the camera; the captured scene is
recognized and the output model file is generated. The major objects are also predicted and
their distance from the camera is calculated. As a result of the prediction, an audio output
is sent to the user.

References
1. Depeng Wang, Zhenzhen Hu, Yuanen Zhou, Richang Hong, IEEE Transactions on
Multimedia, 23, 3, pp. 779-789 (2022)
2. Sumbul G., Nayak S., & Demir B., IEEE Transactions on Geoscience and Remote
Sensing, 59, 8, pp. 6922-6934 (2021)
3. Yu N, Hu X, Song B, Yang J, Zhang J., IEEE Transactions on Image Processing, 28, 6,
pp. 2743-2754 (2019)
4. Zhang M, Yang Y, Zhang H, Ji Y, Shen H. T., Chua T., IEEE Transactions on Image
Processing, 28, 1, pp. 32-44 (2019)
5. Zhou L, Zhang Y, Jiang Y, Zhang T, Fan W., IEEE Transactions on Image Processing,
29, pp. 694-709 (2020)
6. Zhao, W., Wu, X., Luo, J., IEEE Transactions on Image Processing, 30, pp. 1180-1192
(2021)
7. Maofu Liu, Huijun Hu, Lingjun Li, Yan Yu and Weili Guan, IEEE Transactions on
Cybernetics, 52, 2, pp. 1247-1257 (2022)
8. Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 44, 2, pp. 710-722 (2022)
9. Min Yang, Wei Zhao, Wei Xu, Yabing Feng, Zhou Zhao, Xiaojun Chen, Kai Lei, IEEE
Transactions on Multimedia, 21, 4, pp. 1047-1061 (2019)
10. Qingbao Huang, Yu Liang, Jielong Wei, Yi Cai, Hanyu Liang, Ho-fung Leung, Qing Li,
IEEE Transactions on Multimedia, 24, pp. 2004-2017 (2022)
11. Min Yang, Junhao Liu, Ying Shen, Zhou Zhao, Xiaojun Chen, Qingyao Wu,
Chengming Li, IEEE Transactions on Image Processing, 29, pp. 9627-9640 (2020)
12. Yiqing Huang, Jiansheng Chen, Wanli Ouyang, Weitao Wan, Youze Xu, IEEE
Transactions on Image Processing, 29, pp. 4013-4026 (2020)
