E3S Web of Conferences 399, 04005 (2023) https://doi.org/10.1051/e3sconf/202339904005
ICONNECT-2023
 Deep learning-based image captioning for
 visually impaired people
 R. Kavitha1*, S. Shree Sandhya2, Praveena Betes2, P. Rajalakshmi2, and E. Sarubala2
 1Professor, CSE Department, Parisutham Institute of Technology and Science, India
 2UG students, CSE Department, Parisutham Institute of Technology and Science, India
 Abstract. Vision loss can affect people of all ages. Severe or complete
 vision loss may occur when the parts of the eye or brain that process
 images are damaged. In this paper, in order to assist the blind, deep
 learning algorithms are used to caption images so that a blind person can
 learn about the objects in a scene together with their distance and position.
 Whenever an image is captured via the camera, the scene is recognized
 and a caption is predicted by the machine. After the prediction, the caption
 is sent as audio output to the user. Thus, with the help of this work, a form
 of artificial vision for the blind can be achieved, helping them gain
 confidence while travelling alone.
 1 Introduction
 Vision impairment may be due to sickness, an accident or a medical condition. This paper
 is aimed at providing assistance for the blind so that they can feel more confident, secure
 and independent. The paper involves developing a system that can automatically generate
 textual descriptions of images to allow blind individuals to better understand and interact
 with visual content. This system can be integrated into various platforms such as mobile
 applications, websites, and assistive devices, enabling visually impaired individuals to access
 information in a way that was previously not possible. To analyze images and generate
 descriptive captions, machine vision, natural language generation and machine learning
 techniques are utilized together.
 Convolutional neural networks (CNNs) for extracting image features and recurrent neural
 networks (RNNs) for generating language are the two deep learning techniques used to
 complete the task. A CNN is a multi-layered neural network designed to extract increasingly
 complex features at each layer and to determine the output; it can automatically extract
 meaningful visual features from the input image, which are used to generate more accurate
 and descriptive captions. An RNN is a multi-layered neural network that stores input in
 context nodes, enabling it to learn data sequences and produce a sequence as output. In
 image captioning, the RNN is commonly used as a decoder network that generates the
 corresponding caption from the visual features extracted by a CNN.
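 As an illustration of how the two networks are combined, the following is a minimal sketch
 of an encoder-decoder captioning model in Keras. The feature dimension (1536, matching a
 pooled EfficientNet-B3 output), the vocabulary size and the maximum caption length are
 assumed values, not figures from this paper.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    VOCAB_SIZE = 10000   # assumed vocabulary size
    MAX_LEN = 40         # assumed maximum caption length
    FEATURE_DIM = 1536   # pooled EfficientNet-B3 feature size

    # Image branch: CNN features projected into the decoder's hidden space.
    image_features = layers.Input(shape=(FEATURE_DIM,), name="image_features")
    img_vec = layers.Dense(256, activation="relu")(image_features)

    # Text branch: the partial caption is embedded and summarised by an LSTM.
    caption_tokens = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
    txt_vec = layers.LSTM(256)(
        layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(caption_tokens))

    # Merge both modalities and predict the next word over the vocabulary.
    merged = layers.add([img_vec, txt_vec])
    hidden = layers.Dense(256, activation="relu")(merged)
    next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

    captioner = Model([image_features, caption_tokens], next_word)
    captioner.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

 During training the model receives an image feature vector and a caption prefix and learns
 to predict the next word; at inference time the same model is applied word by word.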
 * Corresponding author: kavithha@gmail.com
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons
Attribution License 4.0 (https://creativecommons.org/licenses/by/4.0/).
 Image caption generation is the technique of recognizing and understanding the context of
 an image and annotating it with relevant captions using deep learning. The goal of image
 captioning is to produce a coherent and semantically meaningful description of an image that
 captures the main objects, actions, and attributes depicted in the visual content. Image
 captioning has numerous uses in areas such as assistive technology, image search, and social
 media. Large annotated datasets are used to train caption generation models; the dataset used
 in this paper is MS-COCO (Microsoft Common Objects in Context).
 1.1 Literature Survey
 The literature survey presents the background study carried out for this work; it reviews
 previously published research on image captioning and related deep learning methods. [1]
 presents a text encoder,
 an image encoder, and a decoder to generate natural language descriptions of images. The
 model works in two stages, using a reinforcement learning algorithm to refine the initial
 caption generated in the first stage. [2] developed a summarization module that generates a
 summary of the remote sensing image, which is then used to guide the captioning process
 in the deep captioning module. The deep captioning module generates a natural language
 description of the remote sensing image based on the summary and the image itself. [3] presents
 an approach that involves training a neural network consisting of two main components: a
 visual feature extractor and a language generator. The visual feature extractor is typically a
 convolutional neural network (CNN) that is trained to extract features from the image. The
 language generator is a recurrent neural network (RNN) that generates the image caption
 word-by-word. [4] proposes a method for generating more precise and detailed image captions
 by leveraging online positive recall and missing concepts mining. The approach uses a two-
 stage framework to generate captions that are more informative and accurate. [5] describes
 an approach that uses a two-phase learning framework to generate captions. In the first phase, a
 visual saliency detector is trained to identify the salient regions of the image. In the second
 phase, a standard image captioning model is trained using the saliency maps generated by
 the saliency detector as additional input. [6] A cross-modal retrieval model is trained to
 learn a shared representation space for images and captions from both the source and target
 domains. A model adaptation technique is used to fine-tune the cross-modal retrieval
 model. [7] first extracts visual features from the input image using a CNN and then
 encodes them into a fixed-length vector using a recurrent neural network. In [8], the
 encoded visual features are used as input to a context-aware policy network, which
 generates a sequence of words that describe the image. [9] consists of two tasks: a source
 domain captioning task and a target domain captioning task. The source domain task is
 trained on a dataset of images from the same domain as the training data, while the target
 domain task is trained on a smaller dataset of images from the target domain. [10] uses
 an instance-level fine-grained feature representation and demonstrates its effectiveness
 through extensive experiments. [11] consists of two main components: a generator
 network and a discriminator network. It combines generation- and retrieval-based methods
 using a dual generator generative adversarial network. [12] To improve the quality of the
 generated captions, a novel loss function is used that combines the attribute detection loss,
 the attribute prediction loss, and the captioning loss.
 2 Materials and Methods
 In Fig 1, once the image is captured, it is first divided into ‘n’ pieces and an image
 representation is computed for each part. Using a compound coefficient, EfficientNet-B3
 extracts the image's features and uniformly scales the network depth, width and resolution in all
 three dimensions. Tokenization splits the input data into a sequence of meaningful parts: the
 image is split into patches and the text is split into tokens.
 After tokenization, the extracted features are fed for training to the Recurrent Neural
 Network (RNN) algorithm. A feedback connection is the vital characteristic of an RNN. The
 feedback loop allows the effects of earlier portions of a sequence to be carried over to later
 portions, which is an essential capability for modeling sequences.
 The MS-COCO dataset is used for understanding the visual scenes and generating captions.
 Fig 1. Architecture Diagram
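 A minimal sketch of the feature-extraction step in Fig 1 is given below, using the pre-trained
 EfficientNet-B3 backbone available in Keras with its classification head removed; the image
 path is a hypothetical example.

    import numpy as np
    import tensorflow as tf

    # Compound-scaled EfficientNet-B3 backbone without its classifier head;
    # global average pooling yields one 1536-dimensional vector per image.
    encoder = tf.keras.applications.EfficientNetB3(
        include_top=False, weights="imagenet", pooling="avg")

    def extract_features(image_path):
        # EfficientNet-B3 expects 300x300 RGB input.
        img = tf.keras.utils.load_img(image_path, target_size=(300, 300))
        x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
        x = tf.keras.applications.efficientnet.preprocess_input(x)
        return encoder.predict(x, verbose=0)   # shape: (1, 1536)

    features = extract_features("example.jpg")   # hypothetical image file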
 2.1 Convolutional Neural Network (CNN)
 The CNN is trained on a large dataset of images to learn a hierarchical representation of
 visual features. Once the CNN has extracted the features from the image, they are passed to
 an RNN, which is responsible for generating the captions. During training, the input image is
 fed through the pre-trained CNN, and the output features from one or more of the
 intermediate layers are extracted. In Fig 2, the input layer, which receives the unprocessed
 picture data as input, is the initial layer of a CNN. The convolutional layer is the following
 layer, and it uses a series of filters to extract features from the input image. These filters are
 developed through training and are capable of identifying edges, corners, and other
 interesting aspects of the image. The fully connected layers employ the features that have
 been learned from the convolutional layers to assign the image to one of several
 categories or forecast a numerical value. CNNs are highly effective for image recognition
 tasks because they are able to automatically learn features from the input data.
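 The layer sequence described above can be written down directly in Keras as a small sketch;
 the input size and the number of output categories below are illustrative assumptions, not
 the configuration used in this paper.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    # A small CNN in the spirit of Fig 2: convolutional filters learn edges,
    # corners and other features, pooling reduces resolution, and the fully
    # connected layers assign the image to one of several categories.
    cnn = models.Sequential([
        layers.Input(shape=(224, 224, 3)),              # raw picture data
        layers.Conv2D(32, (3, 3), activation="relu"),   # low-level features
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),   # higher-level features
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),           # fully connected layer
        layers.Dense(10, activation="softmax"),         # 10 assumed categories
    ])
    cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")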
 Fig 2. Convolutional Neural Network
 2.2 Recurrent Neural Network (RNN)
 RNN takes the features from the CNN as input and generates a sequence of words, one
 word at a time. RNNs are designed to handle sequential data, making them well-suited for
 generating sequences of words, such as captions. The RNN typically uses a type of LSTM
 (Long Short-Term Memory) network, which is able to capture long-term dependencies in
 the sequence of visual features. In Fig 3, at each time step, the LSTM takes the output from
 the previous time step and combines it with the current input to generate a new output.
 This output is then passed through a fully connected layer to generate a probability
 distribution over the vocabulary of words. The final output of the RNN is a sequence of
 words that describe the content of the image.
 Fig 3. Recurrent Neural Network
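 The single decoding step of Fig 3 can be sketched as follows; the vocabulary size, embedding
 size and the start-token id are assumptions, and in practice these layers are trained jointly
 with the rest of the captioning model.

    import tensorflow as tf

    VOCAB_SIZE = 10000   # assumed vocabulary size
    UNITS = 256

    embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 256)
    lstm_cell = tf.keras.layers.LSTMCell(UNITS)
    to_vocab = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")

    # One time step: the previous state is combined with the current input
    # token to produce a new output and an updated (hidden, cell) state.
    def decode_step(token_id, state):
        x = embedding(token_id)                  # (batch, embedding_dim)
        output, new_state = lstm_cell(x, state)  # LSTM keeps long-term context
        probs = to_vocab(output)                 # distribution over the vocabulary
        return probs, new_state

    state = [tf.zeros((1, UNITS)), tf.zeros((1, UNITS))]   # initial LSTM state
    probs, state = decode_step(tf.constant([2]), state)    # 2 = assumed start-token id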
 2.3 Training Dataset
 In a machine learning model, a subset of the available data is taken as the training dataset.
 It consists of a collection of input data and their corresponding output values. The MS COCO
 dataset is used in this paper: a sizable image dataset with 328,000 photos of common objects
 and people, explored mainly to understand visual scenes. The dataset pairs each input photo
 with its output captions. The
 model may learn more precisely and generalise to new, untried data more effectively if the
 training data set is larger. The training dataset for supervised learning contains pairings of
 input and output data with the aim of teaching the machine learning model to translate the
 input data to the appropriate output data. In unsupervised learning, the training dataset
 simply contains input data, and the objective is to discover patterns or structure in the data.
 In Fig 4, the images in the MS COCO dataset cover a wide range of scenes and objects,
 including people, animals, vehicles, and indoor and outdoor environments. This diversity
 makes the dataset well-suited for training machine learning models that can generate
 accurate and diverse captions for a variety of images.
 Fig 4. MS-COCO Dataset
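 A minimal sketch of reading the MS-COCO caption annotations is shown below; the path
 assumes the standard 2017 training split has been downloaded locally.

    import json
    from collections import defaultdict

    # MS-COCO distributes captions as a JSON annotation file.
    with open("annotations/captions_train2017.json") as f:
        coco = json.load(f)

    # Map every image id to its file name and to its reference captions
    # (each image comes with roughly five human-written captions).
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    captions = defaultdict(list)
    for ann in coco["annotations"]:
        captions[ann["image_id"]].append(ann["caption"])

    some_id = coco["annotations"][0]["image_id"]
    print(id_to_file[some_id], captions[some_id][0])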
 2.4 Tokenization
 Text data modeling starts with tokenization. Tokenization is the process of dividing a
 stream of textual data into tokens, which can be words, terms, sentences, symbols, or other
 significant objects. Unstructured data and text written in natural language are tokenized
 into informational units that can be regarded as separate elements. Tokens can be words,
 characters or subwords. In Fig 5, the RNN uses the words that came before to anticipate
 the subsequent words in a phrase. To achieve this, the tokenized word list of the image
 caption is transformed: strings are turned into integers by the tokenization process. A
 dictionary mapping every distinct word to a numerical index is built by first going through
 all of the training captions, so that each word encountered is assigned an integer value.
 Fig 5. Tokenization
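 A minimal sketch of this word-to-index mapping with the Keras Tokenizer follows; the two
 example captions and the startseq/endseq markers are illustrative conventions, not data from
 this paper.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Example training captions, wrapped in start/end markers.
    train_captions = [
        "startseq a dog runs on the beach endseq",
        "startseq a man rides a bicycle on the street endseq",
    ]

    # Build the dictionary that translates every distinct word to an integer.
    tokenizer = Tokenizer(num_words=10000, oov_token="<unk>")
    tokenizer.fit_on_texts(train_captions)

    # Strings are turned into integer sequences and padded to a fixed length.
    sequences = tokenizer.texts_to_sequences(train_captions)
    padded = pad_sequences(sequences, padding="post")

    print(tokenizer.word_index["dog"])   # integer index assigned to "dog"
    print(padded[0])                     # the first caption as integers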
 2.5 Caption Generation
 Caption generation in image captioning refers to the process of generating a natural
 language description of an image. The goal is to train a machine learning model to
 automatically generate captions that accurately and semantically describe the content of the
 input image. Generating accurate and semantically meaningful captions for images is a
 challenging task, and there are many techniques and approaches that have been developed
 to improve the performance of image captioning models.
 Fig 6. Caption Generation
 These include using attention mechanisms to focus on specific regions of the image,
 incorporating external knowledge sources, and using reinforcement learning to optimize the
 caption generation process. In Fig 6, the captions for the image are generated and
 produced as audio output that can help blind people to identify the objects.
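 A greedy caption-generation loop is sketched below, assuming the captioner model, tokenizer
 and extract_features() from the previous sections are available; attention-based or
 beam-search decoding would replace the simple argmax step.

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def generate_caption(image_path, max_len=40):
        # Start from the image features and the start marker, then repeatedly
        # pick the most probable next word until the end marker is produced.
        features = extract_features(image_path)
        words = ["startseq"]
        for _ in range(max_len):
            seq = tokenizer.texts_to_sequences([" ".join(words)])
            seq = pad_sequences(seq, maxlen=max_len, padding="post")
            probs = captioner.predict([features, seq], verbose=0)
            next_word = tokenizer.index_word.get(int(np.argmax(probs)))
            if next_word is None or next_word == "endseq":
                break
            words.append(next_word)
        return " ".join(words[1:])   # drop the start marker

 The returned string can then be converted to speech for the audio output described above.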
 3 Results and Discussion
 Once the image is fed into the CNN, it is first divided into ‘n’ pieces and an image
 representation is computed for each part. EfficientNet-B3 extracts the
 features of the image and scales the depth, width, and resolution in all three dimensions
 consistently. Tokenization splits the input data into a sequence of meaningful parts. In
 tokenization, the image is split into patches and text is split into tokens. After tokenization,
 the extracted features will be fed for training with the Recurrent Neural Network (RNN)
 algorithm. An RNN's main characteristic is its network of feedback links. This feedback
 loop gives the RNN the ability to simulate how the earlier portions of the sequence affect
 the later portions of the sequence, which is a crucial capability when modelling sequences.
 The MS-COCO dataset, which contains 328,000 pictures of people and objects from
 everyday life, is used to comprehend visual scenarios. The data are collected from the
 MS-COCO dataset and the model is trained using advanced image captioning techniques that
 implement an attention mechanism. Whenever an image is captured, the scene is recognized
 and a caption is predicted by the machine. After training the model with the algorithm, a live
 scene is captured via the camera. This captured scene is recognized and the output model
 file is generated. The major objects are also predicted and their distance from the camera is
 calculated. After the prediction, the caption is sent as an audio output to the user.
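 The capture-caption-speak loop described above might look as follows; generate_caption()
 is the sketch from Section 2.5, OpenCV handles the camera and pyttsx3 provides offline
 text-to-speech. The distance estimation step is not shown, since its method is not specified
 here.

    import cv2        # camera capture
    import pyttsx3    # offline text-to-speech engine

    camera = cv2.VideoCapture(0)   # default camera
    speaker = pyttsx3.init()

    # Capture one live frame, caption it, and read the caption aloud.
    ok, frame = camera.read()
    if ok:
        cv2.imwrite("frame.jpg", frame)            # hand the frame to the captioner
        caption = generate_caption("frame.jpg")    # e.g. "a man rides a bicycle"
        speaker.say(caption)                       # audio output for the user
        speaker.runAndWait()

    camera.release()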
 4 Conclusion
 In this paper, convolutional and recurrent neural networks, among other deep learning
 models, were investigated to provide captions for images. The use of pre-trained CNNs,
 such as the EfficientNet-B3 model, for feature extraction helped in capturing meaningful
 information from images, while the RNN generates sequential words to form coherent
 captions. The deep learning approach proved effective and ensured accuracy in the
 achieved output. Overall, the study shows how deep learning methods may provide precise
 and insightful descriptions for images, which has benefits in areas like image retrieval and
 image indexing.
 As described above, the data are collected from the MS-COCO dataset and the model is trained
 using advanced image captioning techniques with an attention mechanism. Whenever a live
 scene is captured via the camera, the trained model recognizes the scene, predicts the major
 objects together with their distance from the camera, and sends the resulting caption to the
 user as audio output.
 References
 1. Depeng Wang, Zhenzhen Hu, Yuanen Zhou, Richang Hong, IEEE Transactions on
 Multimedia, 23, 3, pp. 779-789 (2022)
 2. Sumbul G., Nayak S., & Demir B., IEEE Transactions on Geoscience and Remote
 Sensing, 59, 8, pp. 6922-6934 (2021)
 3. Yu N, Hu X, Song B, Yang J, Zhang J., IEEE Transactions on Image Processing, 28, 6,
 pp. 2743-2754 (2019)
 4. Zhang M, Yang Y, Zhang H, Ji Y, Shen H. T., Chua T., IEEE Transactions on Image
 Processing, 28, 1, pp. 32-44 (2019)
 5. Zhou L, Zhang Y, Jiang Y, Zhang T, Fan W., IEEE Transactions on Image Processing,
 29, pp. 694-709 (2020)
 6. Zhao, W., Wu, X., Luo, J., IEEE Transactions on Image Processing, 30, pp. 1180-1192
 (2021)
 7. Maofu Liu, Huijun Hu, Lingjun Li, Yan Yu and Weili Guan, IEEE Transactions on
 Cybernetics, 52, 2, pp. 1247-1257 (2022)
 8. Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, IEEE Transactions on
 Pattern Analysis and Machine Intelligence, 44, 2, pp. 710-722 (2022)
 9. Min Yang, Wei Zhao, Wei Xu, Yabing Feng, Zhou Zhao, Xiaojun Chen, Kai Lei, IEEE
 Transactions on Multimedia, 21, 4, pp. 1047-1061 (2019)
 10. Qingbao Huang, Yu Liang, Jielong Wei, Yi Cai, Hanyu Liang, Ho-fung Leung, Qing Li,
 IEEE Transactions on Multimedia, 24, pp. 2004-2017 (2022)
 11. Min Yang, Junhao Liu, Ying Shen, Zhou Zhao, Xiaojun Chen, Qingyao Wu,
 Chengming Li, IEEE Transactions on Image Processing, 29, pp. 9627-9640 (2020)
 12. Yiqing Huang, Jiansheng Chen, Wanli Ouyang, Weitao Wan, Youze Xu, IEEE
 Transactions on Image Processing, 29, pp. 4013-4026 (2020)