Translating American Sign Language Fingerspelling
An essential component of American Sign Language (ASL) is spelling a word out letter by letter, also known as fingerspelling. Unfortunately, while voice recognition algorithms are now very prevalent, an equivalent gesture-to-text technology is still underdeveloped. Native signers would benefit greatly from such a technology, as they fingerspell on average 158% faster than an average typist on a mobile device. Over the summer of 2023, Google hosted a Kaggle competition to achieve exactly this task. We registered for the competition and set out to solve it with machine learning.
The competition provides a dataset of phrases signed in ASL. Originally recorded as videos, the phrases have been normalized with MediaPipe, Google's body-landmark extraction library. The data for each phrase is stored in a parquet file, and each row of the file represents one frame of the video. All files are labeled with the expected phrase (the ground truth), which allows a direct comparison with our results.
The body, hand, and face landmarks are extracted from the video frames and mapped as coordinates.
Three examples of phrases signed in the dataset.
This is a problem well suited for deep learning because the data provided is a series of sequential frames representing each phrase. We felt a type of recurrent neural network (RNN) would be able to accomplish this, specifically a Long Short-Term Memory (LSTM) network. This architecture overcomes the vanishing gradient problem that would otherwise occur in RNNs given the number of frames we need to process.
Our dataset is provided by the Kaggle competition. It is 189 gigabytes and comprises 123 parquet files. The included metadata is the following:
The phrases include addresses, phone numbers, and URLs. Each phrase that we are tasked to decipher is in its own parquet file containing hundreds of thousands of rows of data points. Each row has 1,629 values representing 543 landmarks normalized with MediaPipe pose detection. There is so much information for each phrase that each parquet file representing it is over a gigabyte.
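For a sense of the raw layout, here is a minimal sketch of loading one of these parquet files with pandas; the file path and the exact column naming are illustrative assumptions, not taken from our code.

```python
import pandas as pd

# Illustrative only: the path and column names are placeholders, not our exact files.
df = pd.read_parquet("train_landmarks/example.parquet")

# Each row is one video frame: 543 landmarks * (x, y, z) = 1,629 coordinate values,
# plus identifying columns such as the frame number.
print(df.shape)
print(df.columns[:5].tolist())  # e.g. ['frame', 'x_face_0', 'x_face_1', ...] (naming assumed)
```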
We completed the training process of our model in the following steps:
We split the dataset into three subsets: train, validation, and test. The pipeline then efficiently processes the data in batches of size sixty-four using TensorFlow's TFRecordDataset. Each subset is decoded, converted, batched, and prefetched to optimize training performance, and the resulting datasets are cached in memory to speed up access during training, along the lines of the sketch below.
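A minimal sketch of that kind of tf.data pipeline follows; the TFRecord feature names, decoding logic, and file names are assumptions for illustration rather than our exact preprocessing code.

```python
import tensorflow as tf

BATCH_SIZE = 64

def decode_fn(record):
    # Illustrative schema: a variable-length sequence of landmark values
    # plus the tokenized target phrase (feature names are assumptions).
    features = {
        "landmarks": tf.io.VarLenFeature(tf.float32),
        "phrase": tf.io.VarLenFeature(tf.int64),
    }
    example = tf.io.parse_single_example(record, features)
    landmarks = tf.sparse.to_dense(example["landmarks"])
    phrase = tf.sparse.to_dense(example["phrase"])
    return landmarks, phrase

def make_dataset(tfrecord_files, training=False):
    ds = tf.data.TFRecordDataset(tfrecord_files)
    ds = ds.map(decode_fn, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.cache()                 # keep decoded examples in memory
    if training:
        ds = ds.shuffle(1024)
    # Pad variable-length sequences so they can be batched together.
    ds = ds.padded_batch(BATCH_SIZE)
    return ds.prefetch(tf.data.AUTOTUNE)

train_ds = make_dataset(["train.tfrecord"], training=True)
valid_ds = make_dataset(["valid.tfrecord"])
```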
The model is then trained with the fit method on the training dataset (train_ds) and validated on the validation dataset (valid_ds) over 20 epochs. We also added a custom callback called DisplayOutputs to display model outputs during training; it essentially converts the tensor's numeric representation into textual phrases, as sketched below.
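As a rough illustration of what such a callback can look like, here is a hedged sketch; the held-out sample batch, the index-to-character lookup, and the assumption that index 0 is padding are placeholders for how the surrounding code might be organized, not our exact implementation.

```python
import tensorflow as tf

class DisplayOutputs(tf.keras.callbacks.Callback):
    """At the end of each epoch, decode a few predictions back into text.

    `sample_batch` is one (inputs, targets) batch held aside for display, and
    `idx_to_char` maps token indices back to characters; both are assumptions.
    """

    def __init__(self, sample_batch, idx_to_char):
        super().__init__()
        self.sample_batch = sample_batch
        self.idx_to_char = idx_to_char

    def on_epoch_end(self, epoch, logs=None):
        inputs, targets = self.sample_batch
        preds = self.model.predict(inputs, verbose=0)
        pred_ids = tf.argmax(preds, axis=-1).numpy()
        for target, pred in zip(targets.numpy()[:3], pred_ids[:3]):
            # Skip index 0, assumed here to be the padding token.
            truth = "".join(self.idx_to_char[i] for i in target if i > 0)
            guess = "".join(self.idx_to_char[i] for i in pred if i > 0)
            print(f"truth: {truth!r}  prediction: {guess!r}")

# Used as: model.fit(train_ds, validation_data=valid_ds, epochs=20,
#                    callbacks=[DisplayOutputs(sample_batch, idx_to_char)])
```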
We noticed that while many of the predicted phrases did not seem close to the ground truth, we were pleasantly surprised when we examined the individual letters: many of the hand shapes were reasonably close. For example, the model might confuse a “5” and a “W”, where the only difference is whether the pinky is straight or bent. (The handshape for “five” is what you would expect: five fingers extended.)
The model seems to be able to distinguish between alphabetic and numeric characters. It was also able to predict the sequence length fairly accurately. While it is not currently accurate enough to translate most of a phrase, it is accurate enough to show what kind of information is being signed, such as distinguishing between a phone number and a URL.
Our model has limited functionality for now and plenty of opportunity to improve. Below are samples of our results showing the model's predictions and truth values over twenty epochs.
Selected output from our model, and the model's loss and accuracy over 20 epochs.
The phrases and input sequences are of variable length, which is hard for an LSTM model to handle. We decided to preprocess the target labels to include padding, start pointer, and end pointer characters, turning this into a sequence-to-sequence (seq2seq) LSTM problem; a sketch of that label preprocessing follows.
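Below is a minimal sketch of that target preprocessing, assuming a character-level vocabulary; the pad, start, and end characters and the maximum length are placeholders rather than our actual values.

```python
# Hypothetical special tokens and length; the real vocabulary and maximum
# phrase length come from the competition's character map.
PAD, START, END = "P", "<", ">"
MAX_LEN = 32

def prepare_target(phrase: str) -> str:
    """Wrap a phrase with start/end markers and pad it to a fixed length."""
    wrapped = START + phrase + END
    return wrapped[:MAX_LEN].ljust(MAX_LEN, PAD)

print(prepare_target("3 creekhouse"))
# -> '<3 creekhouse>' followed by pad characters up to length 32
```

Each character in the padded string is then mapped to an integer index before being fed to the model.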
The vast size of the dataset (189 GB) made it impossible to load it all into memory during training. To overcome this, we used mini-batch training with sixty-four training examples per batch.
To overcome the storage issue of downloading the entire dataset onto our personal computers, we used Kaggle notebooks, which had some limitations. Namely, the notebook would not cache the session and would automatically deactivate idle sessions, which made reruns time-consuming.
Sharing our notebooks with one another was also incredibly difficult on this platform. The site does not integrate with git, so we were forced to constantly fork each other's notebooks and manually copy code over from other versions.
To translate sign language to text, we used sequence-to-sequence learning, which outputs sequences of varying lengths and is normally used for NLP language-translation problems. In a sequence-to-sequence model, the neural network consists of an encoding layer, which takes the input and processes it, and a decoding layer, which takes the output of the encoder and generates the output token by token. Transformer models usually offer superior performance and training time.
However, we wanted to approach the problem in a way not already attempted in the competition, so we used a Long Short-Term Memory (LSTM) model. While not as sophisticated as a Transformer, it is not affected by the vanishing or exploding gradients of a 'vanilla' RNN. It does, however, have some disadvantages relative to Transformers: the encoder-decoder architecture of LSTMs requires sequential computation, leading to slower training and inference times, especially for longer sequences.
Also, the “attention mechanisms” present in Transformer models allow the model to focus on specific parts of the input sequence while generating each element of the output sequence. LSTMs lack this built-in attention mechanism, so they may not be as effective at selectively attending to the relevant parts of the input during decoding.
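To make the encoder-decoder idea concrete, here is a minimal Keras sketch of an LSTM seq2seq model along the lines described above; the layer sizes and vocabulary size are illustrative, not our tuned values.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 1629   # flattened landmark coordinates per frame
VOCAB_SIZE = 62       # illustrative: characters, digits, and special tokens
UNITS = 256

# Encoder: reads the sequence of landmark frames and summarizes it
# into its final hidden and cell states.
enc_inputs = layers.Input(shape=(None, NUM_FEATURES), name="landmarks")
_, state_h, state_c = layers.LSTM(UNITS, return_state=True)(enc_inputs)

# Decoder: generates the phrase token by token, conditioned on the
# encoder states and the previously generated characters.
dec_inputs = layers.Input(shape=(None,), dtype="int32", name="target_tokens")
dec_embed = layers.Embedding(VOCAB_SIZE, UNITS)(dec_inputs)
dec_outputs = layers.LSTM(UNITS, return_sequences=True)(
    dec_embed, initial_state=[state_h, state_c])
dec_logits = layers.Dense(VOCAB_SIZE)(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], dec_logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```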
Levenshtein distance is a metric that quantifies the difference or similarity between two strings by measuring the minimum number of "edit operations" needed to transform one string into the other. While calculating the Levenshtein distance of two strings or two tensors is straightforward, applying it as a custom metric in the Keras API proved challenging.
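For reference, the metric itself is easy to compute with dynamic programming in plain Python; the hard part, as noted above, was wiring something equivalent into the Keras training loop.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein("3 creekhouse", "3 creekhorse"))  # 1
```

TensorFlow also provides tf.edit_distance over sparse tensors; converting batched, padded decoder outputs into that form inside a Keras metric is where the difficulty lies for us.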
One of the most important aspects we fell short on was getting the Levenshtein metric to work with our model. This would have given us a much better measure of prediction quality, helping us improve the model overall.
We also feel that a more complex architecture, perhaps one with more layers, would help us create a better model and improve performance. We were able to create the model in the end, but we did not have the time to tweak and perfect it.
Overall, we were successful in understanding this sophisticated dataset and preprocessing the input for an LSTM model. In addition, the model is able to recognize some patterns, such as the length of the signed phrase and whether it contains letters or numbers. The model also seems to perform equally well on the training set and the validation set, with negligible overfitting. Last but not least, we learned how to handle a large and rich dataset using a tf.data.TFRecordDataset batching and caching pipeline.
On the other hand, the model still has some major issues and did not meet the original use case we proposed, i.e. translating ASL hand and pose coordinates into meaningful English phrases such as "3 creekhouse." Nevertheless, the model could be repurposed for use cases such as classifying whether the signs convey an alphabetic character, a number, a phone number, or a URL. Also, since the model confuses similar signs such as "5" and "W", we believe that with more data and a more sophisticated architecture it will be able to make more accurate predictions.
The ASL fingerspelling dataset can be found at the Kaggle competition page.
Although the model we built was our own design, we learned a lot from the following Kaggle contributor's notebook, which especially helped with the preprocessing.
In addition, the following was very useful to reference:
© Gompei Ninjas 2023