Translating American Sign Language Fingerspelling
An essential component of American Sign Language (ASL) is spelling a word out letter by letter, also known as fingerspelling. Unfortunately, while voice recognition algorithms are now very prevalent, an equivalent gesture-to-text technology is still underdeveloped. Native signers would benefit greatly from such a technology, as they fingerspell on average 158% faster than an average typist on a mobile device. Over the summer of 2023, Google hosted a Kaggle competition to achieve exactly this task. We registered for the competition and set out to solve it with machine learning.
The competition provides a dataset of phrases signed in ASL. Originally recorded as videos, the phrases have been normalized with MediaPipe, Google's body-landmark extraction library. The data for each phrase is stored in a parquet file, and each row of the file represents one frame of the video. All files are labeled with the expected phrase (the ground truth), which allows a direct comparison with our results.
The body, hand, and face landmarks are extracted from the video frames and mapped as coordinates.
Three examples of phrases signed in the dataset.
This is a problem well suited for deep learning because the data provided is a series of sequential frames representing each phrase. We felt a type of recurrent neural network (RNN) would be able to accomplish this, specifically a Long Short-Term Memory (LSTM) network. This architecture overcomes the vanishing gradient problem that would otherwise occur in RNNs given the number of frames we need to process.
Our dataset is provided by the Kaggle competition. It is 189 gigabytes and comprises 123 parquet files. The included metadata is the following:
The phrases include addresses, phone numbers, and URLs. Each phrase that we are tasked to decipher is in its own parquet file containing hundreds of thousands of rows of data points. Each row has 1,629 values representing 543 landmarks normalized with MediaPipe pose detection. There is so much information for each phrase that each parquet file representing it is over a gigabyte.
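For a sense of the raw layout, here is a minimal sketch of loading one of these parquet files with pandas; the file path and the exact column naming are illustrative assumptions, not taken from our code.

```python
import pandas as pd

# Illustrative only: the path and column names are placeholders, not our exact files.
df = pd.read_parquet("train_landmarks/example.parquet")

# Each row is one video frame: 543 landmarks * (x, y, z) = 1,629 coordinate values,
# plus identifying columns such as the frame number.
print(df.shape)
print(df.columns[:5].tolist())  # e.g. ['frame', 'x_face_0', 'x_face_1', ...] (naming assumed)
```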
We completed the training process of our model in the following steps:
We split the dataset into three subsets: train, validation, and test. The pipeline then efficiently processes the data in batches of size sixty-four using TensorFlow's TFRecordDataset. Each subset is decoded, converted, batched, and prefetched to optimize training performance, and the resulting datasets are cached in memory to speed up access during training, along the lines of the sketch below.
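A minimal sketch of that kind of tf.data pipeline follows; the TFRecord feature names, decoding logic, and file names are assumptions for illustration rather than our exact preprocessing code.

```python
import tensorflow as tf

BATCH_SIZE = 64

def decode_fn(record):
    # Illustrative schema: a variable-length sequence of landmark values
    # plus the tokenized target phrase (feature names are assumptions).
    features = {
        "landmarks": tf.io.VarLenFeature(tf.float32),
        "phrase": tf.io.VarLenFeature(tf.int64),
    }
    example = tf.io.parse_single_example(record, features)
    landmarks = tf.sparse.to_dense(example["landmarks"])
    phrase = tf.sparse.to_dense(example["phrase"])
    return landmarks, phrase

def make_dataset(tfrecord_files, training=False):
    ds = tf.data.TFRecordDataset(tfrecord_files)
    ds = ds.map(decode_fn, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.cache()                 # keep decoded examples in memory
    if training:
        ds = ds.shuffle(1024)
    # Pad variable-length sequences so they can be batched together.
    ds = ds.padded_batch(BATCH_SIZE)
    return ds.prefetch(tf.data.AUTOTUNE)

train_ds = make_dataset(["train.tfrecord"], training=True)
valid_ds = make_dataset(["valid.tfrecord"])
```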
The model is then trained with the fit method on the training dataset (train_ds) and validated on the validation dataset (valid_ds) over 20 epochs. We also added a custom callback called DisplayOutputs to display model outputs during training; it essentially converts the tensor's numeric representation into textual phrases, as sketched below.
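As a rough illustration of what such a callback can look like, here is a hedged sketch; the held-out sample batch, the index-to-character lookup, and the assumption that index 0 is padding are placeholders for how the surrounding code might be organized, not our exact implementation.

```python
import tensorflow as tf

class DisplayOutputs(tf.keras.callbacks.Callback):
    """At the end of each epoch, decode a few predictions back into text.

    `sample_batch` is one (inputs, targets) batch held aside for display, and
    `idx_to_char` maps token indices back to characters; both are assumptions.
    """

    def __init__(self, sample_batch, idx_to_char):
        super().__init__()
        self.sample_batch = sample_batch
        self.idx_to_char = idx_to_char

    def on_epoch_end(self, epoch, logs=None):
        inputs, targets = self.sample_batch
        preds = self.model.predict(inputs, verbose=0)
        pred_ids = tf.argmax(preds, axis=-1).numpy()
        for target, pred in zip(targets.numpy()[:3], pred_ids[:3]):
            # Skip index 0, assumed here to be the padding token.
            truth = "".join(self.idx_to_char[i] for i in target if i > 0)
            guess = "".join(self.idx_to_char[i] for i in pred if i > 0)
            print(f"truth: {truth!r}  prediction: {guess!r}")

# Used as: model.fit(train_ds, validation_data=valid_ds, epochs=20,
#                    callbacks=[DisplayOutputs(sample_batch, idx_to_char)])
```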
We noticed that while many of the predicted phrases did not seem close to the ground truth, we were pleasantly surprised when we examined the individual letters: many of the hand shapes were reasonably close. For example, the model might confuse a “5” and a “W”, where the only difference is whether the pinky is straight or bent. (The handshape for “five” is what you would expect: five fingers extended.)
The model seems to be able to distinguish between alphabetic and numeric characters. It was also able to predict the sequence length fairly accurately. While it is not currently accurate enough to translate most of a phrase, it is accurate enough to show what kind of information is being signed, such as distinguishing between a phone number and a URL.
Our model has limited functionality for now and plenty of opportunity to improve. Below are samples of our results showing the model's predictions and truth values over twenty epochs.
Selected output from our model, and the model's loss and accuracy over 20 epochs.
The phrases and input sequences are of variable length, which is hard for an LSTM model to handle. We decided to preprocess the target labels to include padding, start pointer, and end pointer characters, turning this into a sequence-to-sequence (seq2seq) LSTM problem; a sketch of that label preprocessing follows.
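Below is a minimal sketch of that target preprocessing, assuming a character-level vocabulary; the pad, start, and end characters and the maximum length are placeholders rather than our actual values.

```python
# Hypothetical special tokens and length; the real vocabulary and maximum
# phrase length come from the competition's character map.
PAD, START, END = "P", "<", ">"
MAX_LEN = 32

def prepare_target(phrase: str) -> str:
    """Wrap a phrase with start/end markers and pad it to a fixed length."""
    wrapped = START + phrase + END
    return wrapped[:MAX_LEN].ljust(MAX_LEN, PAD)

print(prepare_target("3 creekhouse"))
# -> '<3 creekhouse>' followed by pad characters up to length 32
```

Each character in the padded string is then mapped to an integer index before being fed to the model.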
The vast size of the dataset (189 GB) made it impossible to load it all into memory during training. To overcome this, we used mini-batch training with sixty-four training examples per batch.
To overcome the storage issue of downloading the entire dataset onto our personal computers, we used Kaggle notebooks, which had some limitations. Namely, the notebook would not cache the session and would automatically deactivate idle sessions, which made reruns time-consuming.
Sharing our notebooks with one another was also incredibly difficult on this platform. The site does not integrate with git, so we were forced to constantly fork each other's notebooks and manually copy code over from other versions.
To translate sign language to text, we used sequence-to-sequence learning, which outputs sequences of varying lengths and is normally used for NLP language-translation problems. In a sequence-to-sequence model, the neural network consists of an encoding layer, which takes the input and processes it, and a decoding layer, which takes the output of the encoder and generates the output token by token. Transformer models usually offer superior performance and training time.
However, we wanted to approach the problem in a way not already attempted in the competition, so we used a Long Short-Term Memory (LSTM) model. While not as sophisticated as a Transformer, it is not affected by the vanishing or exploding gradients of a 'vanilla' RNN. It does, however, have some disadvantages relative to Transformers: the encoder-decoder architecture of LSTMs requires sequential computation, leading to slower training and inference times, especially for longer sequences.
Also, the “attention mechanisms” present in Transformer models allow the model to focus on specific parts of the input sequence while generating each element of the output sequence. LSTMs lack this built-in attention mechanism, so they may not be as effective at selectively attending to the relevant parts of the input during decoding.
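To make the encoder-decoder idea concrete, here is a minimal Keras sketch of an LSTM seq2seq model along the lines described above; the layer sizes and vocabulary size are illustrative, not our tuned values.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 1629   # flattened landmark coordinates per frame
VOCAB_SIZE = 62       # illustrative: characters, digits, and special tokens
UNITS = 256

# Encoder: reads the sequence of landmark frames and summarizes it
# into its final hidden and cell states.
enc_inputs = layers.Input(shape=(None, NUM_FEATURES), name="landmarks")
_, state_h, state_c = layers.LSTM(UNITS, return_state=True)(enc_inputs)

# Decoder: generates the phrase token by token, conditioned on the
# encoder states and the previously generated characters.
dec_inputs = layers.Input(shape=(None,), dtype="int32", name="target_tokens")
dec_embed = layers.Embedding(VOCAB_SIZE, UNITS)(dec_inputs)
dec_outputs = layers.LSTM(UNITS, return_sequences=True)(
    dec_embed, initial_state=[state_h, state_c])
dec_logits = layers.Dense(VOCAB_SIZE)(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], dec_logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```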
Levenshtein distance is a metric that quantifies the difference or similarity between two strings by measuring the minimum number of "edit operations" needed to transform one string into the other. While calculating the Levenshtein distance of two strings or two tensors is straightforward, applying it as a custom metric in the Keras API proved challenging.
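For reference, the metric itself is easy to compute with dynamic programming in plain Python; the hard part, as noted above, was wiring something equivalent into the Keras training loop.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein("3 creekhouse", "3 creekhorse"))  # 1
```

TensorFlow also provides tf.edit_distance over sparse tensors; converting batched, padded decoder outputs into that form inside a Keras metric is where the difficulty lies for us.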
One of the most important aspects we fell short on was getting the Levenshtein metric to work with our model. This would have given us a much better measure of prediction quality, helping us improve the model overall.
We also feel that a more complex architecture, perhaps one with more layers, would help us create a better model and improve performance. We were able to create the model in the end, but we did not have the time to tweak and perfect it.
Overall, we were successful in understanding this sophisticated dataset and preprocessing the input for an LSTM model. In addition, the model is able to recognize some patterns, such as the length of the signed phrase and whether it contains letters or numbers. The model also seems to perform equally well on the training set and the validation set, with negligible overfitting. Last but not least, we learned how to handle a large and rich dataset using a tf.data.TFRecordDataset batching and caching pipeline.
On the other hand, the model still has some major issues and did not meet the original use case we proposed, i.e. translating ASL hand and pose coordinates into meaningful English phrases such as "3 creekhouse." Nevertheless, the model could be repurposed for use cases such as classifying whether the signs convey an alphabetic character, a number, a phone number, or a URL. Also, since the model confuses similar signs such as "5" and "W", we believe that with more data and a more sophisticated architecture it will be able to make more accurate predictions.
The ASL fingerspelling dataset can be found at the Kaggle competition page.
Although the model we built was our own design, we learned a lot from the following Kaggle contributor's notebook, which especially helped with the preprocessing.
In addition, the following was very useful to reference:
© Gompei Ninjas 2023