
2023 4th IEEE Global Conference for Advancement in Technology (GCAT)

Bangalore, India. Oct 6-8, 2023

Computer Vision and Voice Assisted Image Captioning Framework for Visually Impaired Individuals using Deep Learning Approach

DOI: 10.1109/GCAT59970.2023.10353449

Safiya K M
Research Scholar, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology (Deemed to be University), Chennai, India
mailtosafiya786@gmail.com

Dr. R Pandian
Associate Professor, Department of Electronics and Communication Engg., Sathyabama Institute of Science and Technology (Deemed to be University), Chennai, India
pandian.eni@sathyabama.ac.in

Abstract— Blind or visually challenged individuals face considerable obstacles when obtaining visual material, restricting their capacity to derive meaningful information from pictures. This research proposes a novel framework combining computer vision with voice-based image captioning using deep learning techniques. The proposed system employs the VGGNet-16 model for photo processing and long short-term memory networks (LSTMs) for natural language processing. The models used in this research were trained on Flickr8k, Flickr30k, and a bespoke dataset. Subsequently, these models were deployed on a Raspberry Pi 4B single-board computer equipped with a graphics processing unit. A two-fold approach is used, whereby the first phase entails exposing the input image to preprocessing using a pre-established VGGNet-16 model. This methodology enables the retrieval of relevant visual attributes, thereby capturing the intrinsic semantic content of the picture. The aforementioned attributes are fed into a language model based on LSTM to generate explanatory captions. To enhance communication with visually impaired people, the generated captions are converted into audible speech using text-to-speech synthesis technology. The effectiveness of the proposed architecture is evaluated via extensive experiments using benchmark datasets and real-time images obtained with a NoIR camera unit. The generated captions are evaluated using quantitative assessment criteria, namely BLEU and ROUGE scores. The developed VGGNet-16 model has exceptional performance in terms of accuracy (95.620%), precision (96.928%), recall (87.879%), and F1 score (92.182%). The results suggest that using computer vision in conjunction with voice-based image captioning offers a promising solution for addressing the challenges faced by those with visual impairments in perceiving and comprehending visual content.

Keywords— computer vision, image captioning, LSTM, self-attention, transformer encoder, visually impaired, VGGNet-16.

I. INTRODUCTION

The worldwide landscape of supporting the visually impaired has changed due to the confluence of technological advancement and societal shifts, creating an environment that is more welcoming and accommodating to those with visual impairments. The prevalence of wearable devices such as smart glasses has increased, incorporating enhanced sensors, cameras, and artificial intelligence-driven algorithms [1,2]. Individuals who are blind or have visual impairments sometimes have difficulties accessing visual information in their surroundings, including textual materials, images, and photos. The use of voice-assisted technology for picture captioning and navigation provides people with a means to get information about the visual elements of pictures and situations, hence enabling a deeper understanding of their environment [3]. By receiving accurate and detailed auditory descriptions of images and scenes, individuals can make informed decisions, navigate unfamiliar surroundings, and engage with visual information independently, without assistance from sighted individuals.

The integration of computer vision (CV), speech recognition, artificial intelligence (AI), deep learning (DL), and Internet of Things (IoT) techniques has resulted in the emergence of a novel framework known as the computer vision-aided voice-based picture captioning framework [4,5]. The proposed framework has the potential to significantly transform the lives of those who experience blindness or visual impairment, since it offers them unequalled access to visual information. The graphics processing unit (GPU) of the Raspberry Pi 4B is utilized to develop and implement the model. The Raspberry Pi 4B single-board computer (SBC) is designed to carry out several tasks simultaneously, enabling real-time image processing, image analysis, and natural language processing.

This research investigates the capabilities of a No Infrared (NoIR) camera to capture images in low-light conditions, hence enhancing nighttime vision technology. This capability can potentially improve the safety and mobility of those who are blind or visually impaired by facilitating their navigation in low-light conditions.

The new approach of incorporating the long short-term memory (LSTM) and visual geometry group (VGGNet-16) models into voice-assisted picture captioning for those with visual impairments leverages the strengths of these deep learning architectures to provide comprehensive image descriptions [4]. The VGG16 model has been pre-trained on the substantial ImageNet dataset, demonstrating its proficiency in accurately detecting and categorizing the varied visual patterns and objects seen in images. The LSTM is a distinct kind of recurrent neural network (RNN) architecture designed to handle sequential input efficiently.

This attribute renders the LSTM especially well-suited for generating textual descriptions. The voice-assisted image captioning system uses an LSTM model to process the image features extracted by the VGG16 model, and the LSTM model is responsible for creating the descriptive captions. The textual descriptions are then converted into audible speech using a text-to-speech (TTS) application programming interface (API). The integrated LSTM-VGG16 model can operate in real time, expeditiously allowing users to get visual descriptions through voice assistance. This feature enhances the individual's ability to investigate and interact with their surrounding environment independently.
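The end-to-end flow described above can be sketched as follows. This is a minimal illustration rather than the authors' exact implementation: it assumes Keras' bundled ImageNet VGG16 weights, a hypothetical caption_model and tokenizer produced by the training procedure of Section III, and the pyttsx3 package as one example of an offline TTS API on the Raspberry Pi.

```python
# Minimal sketch of the capture -> VGG16 features -> LSTM caption -> speech pipeline.
# Assumes a trained caption model and tokenizer are available (hypothetical names).
import numpy as np
import pyttsx3                                   # example offline TTS engine (assumption)
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# VGG16 without its classification head acts as the image feature extractor.
base = VGG16(weights="imagenet")
feature_extractor = Model(base.input, base.layers[-2].output)   # 4096-d fc2 features

def describe_and_speak(image_path, caption_model, tokenizer, max_len=30):
    # 1. Image -> 4096-d feature vector.
    img = img_to_array(load_img(image_path, target_size=(224, 224)))
    feats = feature_extractor.predict(preprocess_input(img[np.newaxis]), verbose=0)
    # 2. Feature vector -> caption (greedy word-by-word decoding, see Section IV-B).
    caption = generate_caption(caption_model, tokenizer, feats, max_len)  # hypothetical helper
    # 3. Caption -> audible speech.
    engine = pyttsx3.init()
    engine.say(caption)
    engine.runAndWait()
    return caption
```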
Voice-assisted photo captioning is a mechanism for establishing a connection between those who do not possess conventional visual perception and the visual realm. By fulfilling these essential requirements, this technology plays a vital role in enhancing the autonomy of individuals with visual impairments, elevating their quality of life, and enabling their full participation in a visually oriented society. In the following sections, we undertake a more thorough analysis of the components of this methodology and the cutting-edge technology used in picture captioning. Section III delves into the technique and training procedure while also discussing the challenges and potential opportunities within the field of photo captioning using VGG16 and LSTM networks. Section IV comprehensively examines the suggested framework via experimental analysis. Following this analysis, the research concludes with a discussion of its findings, implications, and potential avenues for future research in this field.

II. RELATED WORK

The task of image captioning entails the generation of written descriptions that effectively represent the content and context of an image, thereby facilitating the comprehension and communication of visual information by computers. A. K. Poddar et al. [6] present a neural network model consisting of many layers of convolutional neural networks (CNN) and LSTM units. Multiple models were trained by altering hyperparameters and the number of hidden layers to identify the optimal model and maximize the probability of the resulting Hindi captions from images. It is essential to consider the requirements of visually impaired persons across different languages within a global framework. In their study, Ganesan et al. [7] expanded upon the VGG16-LSTM framework to provide multilingual picture captioning, giving visually impaired individuals the ability to access material in their desired language. The study conducted by R. Saleem et al. [8] emphasized the significance of using the VGG16-LSTM framework in producing informative captions for pictures, which can be conveyed audibly to those with visual impairments via text-to-speech technology. This technique has shown the potential of picture captioning to foster a more inclusive digital environment. Xiao et al. [9] presented a novel captioning technique that combines two distinct LSTM networks using an adaptive semantic attention model. The initial LSTM network is accompanied by an attention model capable of dynamically balancing the importance of visual semantic regions and textual content. The second LSTM serves as a language model, integrating the hidden state representation of the first LSTM and the attention context vector and subsequently generating the sequence of words.

Recently, the use of the encoder-decoder architecture, coupled with an attention mechanism, has shown significant advancements in picture captioning. Tan et al. [10] provide a novel approach for image captioning that utilizes a hierarchical phi-LSTM architecture to create descriptive captions for images. The system is designed to convert picture captions from phrase to sentence format using a phrase decoder that decodes noun phrases of varying lengths and an abbreviated sentence decoder that converts the image description into an abbreviated version. Existing models that rely on LSTM for generating descriptions use a sequential approach, which hinders parallel processing and fails to adequately consider the hierarchical structure of the captions. To address this issue, Bai et al. [11] utilize a CNN-based image caption generation model with conditional generative adversarial training (CGAN) as a supportive technique. In addition, a multi-modal graph convolution network (MGCN) enables the exploration of visual associations among objects, effectively capturing and representing the relations among items within an image.

Present picture captioning models need a better accuracy rate in describing the image's content. This inaccuracy can be attributed to erroneous descriptions or a lack of consideration for scene identification. Peng et al. [12] offer a model that aims to identify the scene information matching the text volume of the LDA analysis corpus. The use of a ResNet model facilitates the extraction of global picture features as well as deep scene features. During the generation process, a double long short-term memory (LSTM) model is used to optimize the parameters and enhance the precision of statement production. Lim et al. [13] introduce the Mask Captioning Network (MaC) with an object layer and a backdrop layer. The authors use the Mask RCNN framework to identify prominent areas at the pixel level. Furthermore, the MaC framework has been included in image captioning systems using both LSTM-based and Transformer-based models. The use of transformers has dramatically improved the efficacy of image description models [14,15]. However, the attention mechanism in transformers needs to be improved in its ability to capture intricate correlations between key and query vectors. Ji et al. [16] propose a novel double-attention architecture that enhances the encoder-decoder structure for addressing picture captioning challenges. The authors improve Self-Attention (SA) from two perspectives, including a Masked Self-Attention and a hybrid weight distribution (HWD) module.

One of the challenges encountered in picture captioning is the possible bias that can exist within datasets. This bias can adversely affect the quality and pertinence of the generated captions [17]. Integrating user input is crucial in developing image captioning systems that are efficient for those with visual impairments. In summary, the VGG16 and LSTM networks for picture captioning have shown significant promise in improving accessibility for those with visual impairments. The literature review highlights the significant progress in developing inclusive technologies that use artificial intelligence to provide comprehensive and precise descriptions of visual material.

III. METHODOLOGY

The deployment of the VGG16 model, originally designed for image classification, on a GPU platform requires specific preprocessing steps to prepare the input images for accurate feature extraction. The NoIR camera is used to capture video frames, which are then converted into images before undergoing preprocessing. The VGG16 model requires input images to conform to a predefined size, so the photographs are resized to dimensions of 224x224x3 pixels to match the model's input parameters.

The resized photographs undergo a normalization process to achieve a consistent scale. The VGG16 model also requires input pictures to possess three channels; as a result, grayscale pictures are replicated across all three color channels to fulfil this need. The implementation employs a batching strategy, whereby a collection of preprocessed photographs is grouped into a batch to enhance processing efficiency. Once these preprocessing operations have been carried out, the images are ready for input into the VGG16 model for feature extraction.
Figure 1 illustrates the use of word embedding, a method employed in LSTM networks within the domain of natural language processing tasks such as the generation of descriptive captions for images. Word embedding entails the transformation of individual words or tokens into vector representations with continuous values. Generating a reference vocabulary involves compiling the unique words extracted from the captions in the training dataset; the vocabulary used in this research comprises a total of 5,000 tokens. During the training procedure, the word embedding is updated simultaneously with the other parameters of the model. The image captioning model, VGG16-LSTM, combines preprocessed image features retrieved from the VGG16 model with embedded words produced from the word embedding layer.

Fig. 1. A VGG16-LSTM image captioning framework developed for blind and visually impaired people.
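A minimal sketch of this vocabulary and embedding setup follows, assuming the Keras Tokenizer with the 5,000-token vocabulary mentioned above; the variable names and example captions (taken from Table III) are illustrative.

```python
# Build a 5,000-token vocabulary from the training captions and a trainable embedding (sketch).
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

VOCAB_SIZE = 5000          # number of tokens reported in the text
MAX_LEN = 30               # assumed maximum caption length

# Captions are wrapped with start/end markers before tokenization (illustrative examples).
train_captions = [
    "startseq a lion is sleeping in a green field in front of a fountain endseq",
    "startseq an agriculture land with coconut trees and some plants endseq",
]

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tokenizer.fit_on_texts(train_captions)

sequences = tokenizer.texts_to_sequences(train_captions)
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")

# The embedding layer is trained jointly with the rest of the captioning model.
embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=256, mask_zero=True)
```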
Feature vectors are the internal representations produced by the LSTM model at each time step when processing a sequence. Long short-term memory networks are a kind of recurrent neural network (RNN) specifically designed to handle sequential data properly: they can capture long-term relationships in the data and address the problem of vanishing gradients. The LSTM network is presented with an input vector at every time step within the input sequence. The input vector might be an encoded word or a feature vector derived from the previous layer, and it may also be the embedded word produced at the previous time step. The memory cells are updated at each time step, based on the current input, the previous memory state, and the candidate values provided by the gates of the LSTM model. The feature vector is derived from the hidden state at each time step and represents the essential information obtained from the input sequence. Using the feature vector at the last time step facilitates the production of the final caption in the context of photo captioning.
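Putting the pieces together, a merge-style VGG16-LSTM caption model consistent with Figure 1 can be sketched in Keras as follows. This is an assumed architecture (the feature sizes, layer widths, and the way the two branches are merged are illustrative choices), not the authors' exact network.

```python
# Sketch of a VGG16-LSTM captioning model: image-feature branch + word-sequence branch (assumed layout).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_LEN = 5000, 30

# Image branch: 4096-d VGG16 fc2 features -> 256-d representation.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: partial caption -> embedding -> LSTM hidden state.
txt_in = Input(shape=(MAX_LEN,))
txt_emb = Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both branches and predict the next word of the caption.
merged = add([img_vec, txt_vec])
out = Dense(VOCAB_SIZE, activation="softmax")(Dense(256, activation="relu")(merged))

caption_model = Model(inputs=[img_in, txt_in], outputs=out)
caption_model.summary()
```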
IV. EXPERIMENTAL ANALYSIS

A. Dataset

Commonly used datasets for training and evaluating photo captioning systems include publicly available collections such as Flickr8k [18], Flickr30k [19], and bespoke datasets. The Flickr30k dataset comprises a grand total of 31,783 photographs, whereas Flickr8k, a subset of this dataset, consists of 8,000 images obtained from the widely used photo-sharing network Flickr. Five captions given by different annotators accompany each picture.

When using these datasets, there are disparities in their dimensions, formats, and degrees of annotation accuracy. The selection of a combined dataset is conducted with meticulous consideration of our research aims and the complexity of our photo captioning infrastructure. Creating a customized dataset allows researchers to collect data directly relevant to their unique research question, leading to improved model performance. The correctness, consistency, and high quality of annotations are given priority due to their crucial role in efficiently training reliable AI models. A customized dataset has been created by considering several factors, such as the techniques used for data collection, data annotation procedures, and data augmentation strategies. The training set is the largest component of the dataset, including 70% of the photographs used for model training. A validation set was created using 15% of the dataset. The remaining 15% of the dataset is allocated to the test set, which is kept separate from the training and validation sets and is not used at any model-building step. Table I presents a comprehensive summary of the dataset used in this research.

TABLE I. DETAILS OF THE IMAGE DATASETS USED FOR IMAGE CAPTIONING.

SI No. | Dataset name | Total No. of Training Images
1 | Flickr_8k [18] | 8000
2 | Flickr_30k [19] | 31783
3 | Custom | 1250
  | Total | 41033

From Table I, the proposed model was developed with about 41k images and their corresponding captions. Merging datasets enhances generalization across diverse data distributions, and multiple datasets let the model learn from different instances and generalize to new data. A broad training set reduces bias and overfitting, making the model more resilient and adaptable.
B. Software and Hardware

This section pertains to the software and hardware used. The model described above has been developed and deployed on the Raspberry Pi 4B single-board computer. The Raspberry Pi 4 Model B is a single-board computer with a robust graphics processing unit (GPU) and a quad-core ARM Cortex-A72 central processing unit (CPU). Its 8 GB of memory enables better multitasking and enhanced performance, especially for programs that need significant memory resources. The board is furnished with a VideoCore VI graphics processing unit (GPU) that supports OpenGL ES 3.x and facilitates the decoding of 4K video.

Keras and TensorFlow are the major libraries used for constructing and training the deep learning models tailored for generating descriptive captions for images. TensorFlow serves as the fundamental framework, while Keras operates as an application programming interface (API) that provides a more abstracted interface for constructing and training neural networks. The loss function calculated during the training process is the cross-entropy loss used to forecast each word inside the caption sequence. The model was constructed using TensorFlow's optimizer and loss function and then trained on the preprocessed dataset of pictures and captions. During the evaluation phase, the trained model generates descriptive captions for photographs. This procedure involves using the trained RNN to predict individual words sequentially; the words that have been generated are then fed back into the model at each iteration. Afterwards, adjusting the hyperparameters and model architecture, such as the learning rates and dropout rates, becomes essential to maximize performance.
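The sequential, feed-the-previous-word-back generation process described above is typically implemented as a greedy decoding loop. The sketch below assumes the merge-style caption_model and tokenizer from Section III and startseq/endseq marker tokens; it illustrates the procedure rather than reproducing the authors' code.

```python
# Greedy word-by-word caption generation: feed the partial caption back in at every step (sketch).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(caption_model, tokenizer, image_features, max_len=30):
    index_to_word = {idx: w for w, idx in tokenizer.word_index.items()}
    caption = "startseq"
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len, padding="post")
        probs = caption_model.predict([image_features, seq], verbose=0)[0]
        word = index_to_word.get(int(np.argmax(probs)))   # most probable next word
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```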
C. Model training and validation

In the training and evaluation phase of the proposed VGG16-LSTM image captioning system, metrics such as accuracy, sensitivity, specificity, and F1 score are used to assess the quality of the generated captions compared with the ground-truth captions. Within the scope of this research, accuracy is defined as the proportion of correctly generated words in the whole caption sequence with respect to the matching ground-truth caption. Sensitivity, or recall, assesses the model's capacity to capture the pertinent information in the ground-truth captions: it measures the model's ability to generate the proper words when they are included in the reference caption. Specificity evaluates the model's capacity to avoid creating terms absent from the reference caption; it is calculated as the proportion of true negative words relative to the overall count of irrelevant words that were erroneously generated. Another metric often used in evaluating classification models is the F1 score, the harmonic mean of precision and recall. The F1 score takes into account the model's ability to accurately identify pertinent terms as well as its ability to include all essential words found in the reference caption.

TABLE II. VALIDATION SCORES OF THE VGGNET-16 LSTM MODEL.

SI No. | Model | Accuracy (%) | Precision (%) | Recall (%) | F1 score (%)
1 | VGG16 | 95.620 | 96.928 | 87.879 | 92.182

Accuracy = (TrP + TrN) / (TrP + TrN + FaP + FaN)   (1)

Precision = TrP / (TrP + FaP)   (2)

Recall = TrP / (TrP + FaN)   (3)

F1 score = 2 × (Precision × Recall) / (Precision + Recall)   (4)

The evaluation of the model in Table II provides valuable information about its precision and efficacy. A full evaluation of the performance of a model or test in various circumstances is achieved by understanding the concepts of true positives (TrP), true negatives (TrN), false positives (FaP), and false negatives (FaN), as used in equations (1) to (4).
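As a hedged illustration of equations (1)-(4), the helper below computes the four scores from word-level counts; the function name and the example counts are hypothetical, chosen only to show the arithmetic.

```python
# Word-level confusion counts -> accuracy, precision, recall, F1 (equations (1)-(4); illustrative).
def caption_metrics(tr_p, tr_n, fa_p, fa_n):
    accuracy = (tr_p + tr_n) / (tr_p + tr_n + fa_p + fa_n)          # Eq. (1)
    precision = tr_p / (tr_p + fa_p)                                 # Eq. (2)
    recall = tr_p / (tr_p + fa_n)                                    # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)               # Eq. (4)
    return accuracy, precision, recall, f1

# Example with hypothetical word counts:
acc, prec, rec, f1 = caption_metrics(tr_p=87, tr_n=900, fa_p=3, fa_n=12)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```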

Fig. 2. Training accuracy and validation accuracy of the VGGNet-16 LSTM model.

In Figure 2, the relationship between the number of epochs and the training and validation accuracy of the developed model is plotted. Based on the validation plots, the VGG16 model exhibits the best accuracy in both the training and validation phases. The achieved training and validation accuracies are influenced by many aspects, such as the quality and amount of the data, the particular implementation approach, and the chosen training hyperparameters (e.g., learning rate, optimizer, etc.). In this research, 80 epochs were used for the training and validation procedures. The optimizer used was adaptive moment estimation (Adam), with a learning rate of 0.01. In recent research conducted by Abubeker K M et al. [20], a new deep learning framework for medical picture classification, named B2Net, was presented, and the researchers reported that using data augmentation and regularization approaches resulted in enhanced accuracy. Techniques such as learning rate schedules, dropout, and early stopping have been used to optimize the model's performance.
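A hedged sketch of the training configuration stated above (Adam with a 0.01 learning rate, 80 epochs) with optional regularization callbacks follows; the callback settings and the generator names are illustrative assumptions, not reported hyperparameters.

```python
# Training setup sketch: Adam (lr = 0.01), 80 epochs, plus optional regularization callbacks.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

caption_model.compile(loss="categorical_crossentropy",
                      optimizer=Adam(learning_rate=0.01))

callbacks = [
    # Illustrative settings; the paper reports stopping at epoch 80 after accuracy flattens.
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]

history = caption_model.fit(train_generator,            # yields ([features, partial_caption], next_word)
                            validation_data=val_generator,
                            epochs=80,
                            callbacks=callbacks)
```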
However, training accuracy alone is not a sufficient metric to judge the model's performance, as the model may overfit the training data if the number of epochs is too high. Validation accuracy measures how well the model generalizes to unseen data. Initially, as the model trains, both the training and validation accuracies tend to increase. The validation accuracy typically tracks the training accuracy closely as the model learns valuable features from the training data. However, the validation accuracy may plateau or even decrease after a certain point. This is a sign that the model is starting to overfit the training data, meaning it is becoming too specialized in capturing noise or irrelevant details from the training set that do not generalize well to new data. This research shows that after the 30th epoch the training accuracy is almost constant but still increases. After the 50th epoch, there is only a slight variation in training and validation accuracy, indicating the model has matured, and we stop the training process at the 80th epoch.

D. Performance evaluation of the model

In practical applications, assessing image captioning models extends beyond the direct use of accuracy, sensitivity, specificity, and F1 score. Using specialized metrics such as BLEU, METEOR, ROUGE, CIDEr, and SPICE is standard practice for evaluating a model's performance in picture captioning. These metrics consider the presence of specific words, the degree of n-gram overlap, the semantic content, and the general coherence of the generated captions with respect to the reference captions.

TABLE III. TEST RESULTS OF THE DEVELOPED VGG16-LSTM FRAMEWORK DEPLOYED ON THE RASPBERRY PI 4B SINGLE-BOARD COMPUTER.

Test image 1 — original captions:
1. An agriculture land with coconut trees and vegetables.
2. A green land showing coconut trees and plants.
3. An agriculture land with coconut trees and some plants.
4. A land with coconut trees, a home and plants.
5. An agriculture land with coconut trees, a home and plants.
Caption generated by the VGG16-LSTM model: An agricultural land with coconut trees and some plants.

Test image 2 — original captions:
1. A lion is sleeping in a green field in front of a fountain.
2. A lion sleeping in front of a fountain with green background.
3. A lion in front of a fountain with green background.
4. A lion is sitting in a green lone in front of a rock.
5. A lion sleeping in a green lone in front of a rock.
Caption generated by the VGG16-LSTM model: A lion is sleeping in a green field in front of a fountain.

Test image 3 — original captions:
1. A boy is playing in a dirty field with hoe.
2. A boy with blue shirt and red half pant with hoe.
3. A boy with blue shirt and red nikar with hoe on a field.
4. A boy with blue shirt is playing in a dirty land with hoe.
5. A boy with blue shirt and red half pant with a tool.
Caption generated by the VGG16-LSTM model: A boy with blue shirt and red nikar with hoe on a field.

Test image 4 — original captions:
1. A crowd of people working in sea side.
2. A large number of people looking in fish net.
3. A crowd of people with fish net on sea side.
4. A crowd with fish net on sea side.
5. A group of fisher man in a sea side with fish net.
Caption generated by the VGG16-LSTM model: A crowd of people with fish net on sea side.

TABLE IV. BLEU-ROUGE PERFORMANCE MATRIX OF THE DEVELOPED VGGNET-16 MODEL.

SI No. (reference image) | BLEU score | Brevity Penalty | Length Ratio | Translation Length | Reference Length | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum
1 | 0.553 | 0.978 | 0.978 | 45 | 46 | 0.717 | 0.451 | 0.717 | 0.716
2 | 0.688 | 1.000 | 1.10 | 65 | 59 | 0.799 | 0.642 | 0.762 | 0.762
3 | 0.656 | 1.000 | 1.04 | 65 | 62 | 0.750 | 0.588 | 0.707 | 0.707
4 | 0.373 | 0.860 | 0.869 | 40 | 46 | 0.565 | 0.310 | 0.561 | 0.558

In Table III, real-time image captioning using the VGG16-LSTM network is presented. The research outcomes show that the model achieves high accuracy and efficacy throughout the testing process. Due to the nature of producing word sequences in picture captioning, conventional classification metrics may not be readily applicable. The Bilingual Evaluation Understudy (BLEU) metric is used to assess the degree of similarity between generated captions and reference captions. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric quantifies the degree of similarity in terms of n-gram overlap between the generated captions and the reference captions. The assessment of the generated captions included ROUGE scores, namely ROUGE-N for n-gram overlap and ROUGE-L for longest common subsequence. Additionally, a human review was conducted to evaluate the quality, relevance, and coherence of the generated captions, which can provide significant insights.
The brevity penalty is significant in metrics such as BLEU, which seeks to assess the effectiveness of machine-generated captions by considering their alignment with human perception regarding relevance, correctness, and completeness. The brevity penalty is used in the automated BLEU evaluation of machine-generated translations or captions: it encourages the generated captions to have a length close to that of the reference captions. According to the data shown in Table IV, the reference length for each generated caption is determined by BLEU as the length of the closest reference caption in terms of the number of words. The generated caption length, or translation length, is the length of a generated caption measured by the number of words it contains.
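For reference, the standard BLEU brevity penalty is BP = 1 when the candidate length c is at least the reference length r, and exp(1 - r/c) otherwise; the short check below reproduces the penalties listed in Table IV from the reported lengths (assuming this standard definition, which the paper does not state explicitly).

```python
# Brevity penalty check against the lengths reported in Table IV (standard BLEU definition assumed).
import math

def brevity_penalty(candidate_len, reference_len):
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

for cand, ref in [(45, 46), (65, 59), (65, 62), (40, 46)]:   # rows 1-4 of Table IV
    print(cand, ref, round(brevity_penalty(cand, ref), 3))
# Prints 0.978, 1.0, 1.0 and 0.861, agreeing with the Brevity Penalty column
# (0.978, 1.000, 1.000, 0.860) up to rounding.
```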
V. CONCLUSION

In conclusion, this work signifies notable advancements in the endeavour to create a digital environment that is more inclusive and user-friendly for those with visual impairments. The combination of computer vision and deep learning techniques, together with the integration of voice-based interaction, has shown considerable potential in mitigating the discrepancy between visual information and auditory perception experienced by those with visual impairments. The system has shown its capacity to generate meaningful captions for images by using VGGNet-16 for image processing and LSTM for natural language processing. This amalgamation allows the framework to provide crucial insights into visual content. The model in this research is trained using three datasets: Flickr8k, Flickr30k, and a custom dataset. Following this, the Raspberry Pi 4B single-board computer, outfitted with a graphics processing unit, was used for deployment. The integration of a voice-based framework is consistent with the concept of providing information in a readily understandable and user-centric way. This facilitates individuals' independent access to visual content that was previously unavailable to them. The future trajectory of this research is centered on integrating other sensory modalities beyond visual and aural inputs, aiming to augment the overall user experience for those who are blind or visually impaired. Improving the framework's interoperability across various devices and platforms, such as smartphones, wearable devices, and smart glasses, would augment its accessibility and facilitate its seamless incorporation into users' daily activities. Expanding the linguistic functionalities of the framework to include more languages would augment its inclusivity, hence fostering a more diverse user base.

REFERENCES

[1] Liu, W., Yu, W., Li, K., Zhou, S., Wang, Q., & Yu, H. (2023). Enhancing blind-dumb assistance through a self-powered tactile sensor-based Braille typing system. Nano Energy, 116, 108795. https://doi.org/10.1016/j.nanoen.2023.108795
[2] Díaz-Toro, A. A., Campaña-Bastidas, S. E., & Caicedo-Bravo, E. F. (2021). Vision-Based System for Assisting Blind People to Wander Unknown Environments in a Safe Way. Journal of Sensors, vol. 2021, Article ID 6685686, 18 pages. https://doi.org/10.1155/2021/6685686
[3] Kamal, I., Salah Abd-elhafeez, M., & Farghal, A. (2023). Camera-Based Navigation System for Blind and Visually Impaired People. Sohag Engineering Journal, 3(1), 1-13. doi: 10.21608/sej.2022.155927.1018
[4] Rahman, M. W., Tashfia, S. S., Islam, R., Hasan, M. M., Sultan, S. I., Mia, S., & Rahman, M. M. (2021). The architectural design of smart blind assistant using IoT with deep learning paradigm. Internet of Things, 13, 100344. https://doi.org/10.1016/j.iot.2020.100344

[5] Kumar, N., Sharma, S., Abraham, I. M., & Sathya Priya, S. (2022). Blind Assistance System Using Machine Learning. In: Chen, J. I. Z., Tavares, J. M. R. S., Shi, F. (eds) Third International Conference on Image Processing and Capsule Networks. ICIPCN 2022. Lecture Notes in Networks and Systems, vol 514. Springer, Cham. https://doi.org/10.1007/978-3-031-12413-6_33
[6] Poddar, A. K., & Rani, D. R. (2022). Hybrid Architecture using CNN and LSTM for Image Captioning in Hindi Language. Procedia Computer Science, 218, 686-696. https://doi.org/10.1016/j.procs.2023.01.049
[7] Ganesan, J., Azar, A. T., Alsenan, S., Kamal, N. A., Qureshi, B., & Hassanien, A. E. (2021). Deep Learning Reader for Visually Impaired. Electronics, 11(20), 3335. https://doi.org/10.3390/electronics11203335
[8] Saleem, R., Ahmad, T., Aslam, M., & Martinez-Enriquez, A. M. (2022). An Intelligent Human Activity Recognizer for Visually Impaired People Using VGG-SVM Model. In Advances in Computational Intelligence: 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Monterrey, Mexico, October 24–29, 2022, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, 356–368. https://doi.org/10.1007/978-3-031-19496-2_28
[9] Xiao, F., Gong, X., Zhang, Y., Shen, Y., Li, J., & Gao, X. (2019). DAA: Dual LSTMs with adaptive attention for image captioning. Neurocomputing, 364, 322-329. https://doi.org/10.1016/j.neucom.2019.06.085
[10] Tan, Y. H., & Chan, C. S. (2019). Phrase-based image caption generator with hierarchical LSTM network. Neurocomputing, 333, 86-100. https://doi.org/10.1016/j.neucom.2018.12.026
[11] Bai, C., Zheng, A., Huang, Y., Pan, X., & Chen, N. (2021). Boosting convolutional image captioning with semantic content and visual relationship. Displays, 70, 102069. https://doi.org/10.1016/j.displa.2021.102069
[12] Peng, Y., Liu, X., Wang, W., Zhao, X., & Wei, M. (2019). Image caption model of double LSTM with scene factors. Image and Vision Computing, 86, 38-44. https://doi.org/10.1016/j.imavis.2019.03.003
[13] Lim, J. H., & Chan, C. S. (2023). Mask-guided network for image captioning. Pattern Recognition Letters, 173, 79-86. https://doi.org/10.1016/j.patrec.2023.07.013
[14] Padate, R., Jain, A., Kalla, M., & Sharma, A. (2023). Image caption generation using a dual attention mechanism. Engineering Applications of Artificial Intelligence, 123, 106112. https://doi.org/10.1016/j.engappai.2023.106112
[15] Mukhiddinov, M., & Cho, J. (2020). Smart Glass System Using Deep Learning for the Blind and Visually Impaired. Electronics, 10(22), 2756. https://doi.org/10.3390/electronics10222756
[16] Ji, W., Wang, R., Tian, Y., & Wang, X. (2022). An attention based dual learning approach for video captioning. Applied Soft Computing, 117, 108332. https://doi.org/10.1016/j.asoc.2021.108332
[17] Abdelhadie, M., Jafar, A., & Ghneim, N. (2022). Image captioning model using attention and object features to mimic human image understanding. Journal of Big Data, 9(1), 1-16. https://doi.org/10.1186/s40537-022-00571-w
[18] "Flickr Image dataset," Kaggle, Jun. 12, 2018. https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset, accessed on May 1, 2023.
[19] "Flickr 8k Dataset," Kaggle, Apr. 27, 2020. https://www.kaggle.com/datasets/adityajn105/flickr8k, accessed on May 1, 2023.
[20] Abubeker, K. M., & Baskar, S. (2023). B2-Net: an artificial intelligence powered machine learning framework for the classification of pneumonia in chest x-ray images. Machine Learning: Science and Technology, 4, 015036. DOI 10.1088/2632-2153/acc30f

