How to Create Your Own Sign Language Translation App by Extending SigNN

An example of the ASL alphabet being translated, done by the SigNN project

Prerequisites

The goal of this article is to make it easier to create your own sign language translation app. However, this article isn’t a programming tutorial. You will need to be comfortable with:

  • C++ (generally comfortable)
  • Python (very comfortable)
  • TensorFlow/Keras (experience level doesn’t matter much)

Being comfortable with programming (C++, Python) is not something that will happen overnight, but over a month or two of experimenting and learning. When it comes to TensorFlow/Keras, it can be learned with a few example datasets and some YouTube tutorials. Just give it a day or two of learning.

To begin, you must first have the following setup:

  • Ubuntu 16 or higher
  • A Google account with Drive/Colab (hopefully one with lots of Drive space available)
  • MediaPipe with webcam compiled and run for one-handed tracking

While the first two prerequisites are easy, the last one, MediaPipe, may present a challenge. MediaPipe is a library for cross-platform, customizable ML solutions for live and streaming media. It is also the library that SigNN is built on top of. If you can’t compile MediaPipe correctly, you can’t compile SigNN correctly. It’s not that SigNN has MediaPipe as a prerequisite, but that they both have the same laundry list of prerequisites that must be installed on your Ubuntu machine before they’ll actually work. Since MediaPipe has much better resources online on how to compile it, it’s best to start by trying to compile MediaPipe. You can pull MediaPipe’s git repo from the following link:

The following documentation from MediaPipe will be very helpful on your compiling journey. Your goal is to compile the one_hand_tracking example:

When you inevitably run into a compile error or a missing dependency, I suggest copying a small part of it (if it is a large error) and searching it on Google. If it is a small error, surround it in quotes (“ERROR_GOES_HERE”) and search that on Google. Trust me, thousands of people have run into errors compiling MediaPipe, most of the people working on the SigNN project included.

If you don’t have a webcam, you may get an error relating to OpenCV capture when trying to run the one hand tracking example from MediaPipe. In that case, you can use an Android phone as a webcam through Droidcam. SigNN’s README has instructions on how to run with Droidcam.

Once MediaPipe is compiled, you can compile SigNN, which can be pulled here:

In the README are specific compile instructions depending on whether you want to use your CPU or GPU and whether or not you own a webcam. I suggest CPU for non-advanced users.

Before We Start

What follows will be the gritty details. Before we jump in, let’s discuss how exactly we’re making our own sign language translation app. We’re making use of SigNN, a project to translate the ASL alphabet, but replacing its neural network with our own. To create our own neural network, we must first collect data. Then that data must be processed and run through MediaPipe. Next, the data must be consolidated and used to train a neural network in TensorFlow, generating a .tflite file. Our new .tflite file will then replace the old SigNN .tflite file in the SigNN project, and a few variables must be changed to work with our new network. By the end, we will have our neural network running on Ubuntu.

If you want to have the network running on Android/Apple devices, you must look at the official MediaPipe documentation on how to do so. There are also very brief compile instructions for Android in the SigNN README file.

I would also recommend reading the research paper attached to the SigNN README for a better understanding of what exactly is happening in the background. It may also help you avoid the problems that the SigNN project could not resolve, such as inaccurate dynamic data prediction.

It’s also important to know that this will only work with one hand. If you want signs that use multiple hands, you will need to know C++, Python, and MediaPipe well enough to modify the scripts to add multi-hand support yourself.

Know this is not an easy process. It will take hours of work and days of idle computing time to be able to have a few dozen ASL signs translated. But the process is front-heavy. By the time you have completed half the tutorial, you are really 80% done with the work and even further along with computing time.

Data Collection

There are two types of signs in sign language: static and dynamic. Static signs can be captured by a still image as they have no time component. Dynamic signs must be captured by video as they have a time component. For the ASL alphabet, the characters A-Y, excluding J, are static signs; J and Z are dynamic signs. While most of the alphabet consists of static signs, most words in ASL are dynamic signs. Of course, dynamic signs are much harder to train for. I highly recommend that you complete this tutorial with static signs and then try your hand at dynamic signs afterward. As a warning for those who go the dynamic path, heavy modification of SigNN (and therefore, knowledge of C++ and MediaPipe) is required.

Let’s start with collecting static sign data. The following script is called “Picture Spam and Upload” and it does what it’s called:

It will take a low-FPS video over a certain amount of time and translate each frame into a picture, before uploading it to a Google Drive folder. The script will not work out of the box. Firstly, these “.ipynb” (also called Jupyter Notebook) scripts are made to run in Google Colab, so upload them there or copy and paste their code. Secondly, the variable “DATABASE_GID” in the script must be changed. To what? Well, you’ll need to create a folder in Google Drive and open it. When the folder is opened, pay special attention to the URL. Here is an example of a URL: drive.google.com/drive/folders/1RqiJwO6i1rx54tRC3QJcHWjiooCLJNC?usp=sharing

Then the GID of this URL would be “1RqiJwO6i1rx54tRC3QJcHWjiooCLJNC”.
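If you’d rather not eyeball the URL every time, a tiny helper can pull the GID out for you. This function is my own illustration, not part of the SigNN scripts:

```python
def extract_gid(folder_url):
    """Return the GID from a Google Drive folder URL.

    The GID is the path segment after "folders/", with any
    "?usp=sharing"-style query string stripped off.
    """
    gid = folder_url.split("folders/")[1]
    return gid.split("?")[0].rstrip("/")

url = "drive.google.com/drive/folders/1RqiJwO6i1rx54tRC3QJcHWjiooCLJNC?usp=sharing"
print(extract_gid(url))  # → 1RqiJwO6i1rx54tRC3QJcHWjiooCLJNC
```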

You plug that GID into the Picture_Spam_And_Upload script for the DATABASE_GID variable. Replacing DATABASE variables with folder GIDs will be a recurring theme in this epic. But you’re not done with that folder yet! Next, you must painstakingly create a folder, within that parent folder, for each static sign you wish to predict with your sign language translation app. Have fun! If you’re doing the ASL alphabet, that would be a folder for each character A-Y (minus J). Oh, and you’ll need to collect hundreds of pictures per sign you want. If you’re doing ASL, we’ve already got you covered. The SigNN project was initially made for the ASL alphabet, and the dataset is publicly available. It can be downloaded here:

Collecting videos is more time-consuming and a bit more involved, so we suggest static first. If you’re not deterred, then let’s get started. This folder must be downloaded first:

Then you will need to generate a CLIENT_SECRETS.json file by following these instructions:

Then create a folder in Google Drive to store your dynamic signs. Click into that folder and make note of the GID (as mentioned in the static data collection process). Similarly, create a folder for each dynamic sign you wish to train. Then go into “videoscript.py” from the folder you downloaded and replace the variable AVI_DATABASE with the GID of your new dynamic data folder. Use Python to run “videoscript.py” to capture the data on your local machine. This is a unique utility script in that it does not run on Google Colab as a .ipynb file; it runs on your local machine, so you will also need a webcam to make it work. Certain computers with multiple webcams may run into issues.

As we mentioned earlier, the SigNN project translated the ASL alphabet, which included two dynamic signs. You can download the dataset used for the two dynamic signs at this link:

Let’s Recap

A diagram which summarizes all the folders and scripts used in the project

By now, you should have one or two folders. The folder(s) should have a subfolder for each sign you want to have in your sign language translation app, separated into a static or dynamic folder. You should have some test data for each sign.

Looking at the diagram above, we have completed the Data Collection script and the Raw Images folder. Next is .json creation. After that is .json formation. Then we will have the data.json file from which to train the neural network.

Json Creation

“.json” files are files that store data structures in text format. They can be parsed from text into memory and converted back into text easily. Just as you created a “Raw Images” folder (and perhaps a “Raw Video” folder), you must now make corresponding “Raw Json” folders. Make sure to take note of their GIDs.

This is for static signs. The static Json creation script is called Picture Download and MediaPipe, found here:

There are 3 variables that must be changed:

  1. DOWNLOAD_DATABASE should be changed to the GID of the folder which contains the raw images you collected (or took from Kaggle)
  2. UPLOAD_DATABASE should be changed to the GID of the currently empty folder which you just made which will house the .json files
  3. CHARACTERS is a tuple, by default it is A-Y (minus J). It should be changed to the list of static characters that you want to train the neural network on.

Notice that I did not say anything about painstakingly creating subfolders in Google Drive for your “Raw Json” folder. In this rare case, the script will make the folders for you based on the CHARACTERS tuple in the line of code that follows it.

In fact, if you know Python, I suggest looking around the scripts. They were optimized for the SigNN project and so you may find a change here and there that could help out your use case. After running this script, your new folder will be populated with .json files corresponding to each picture you’ve collected.

This is very similar to Static Json Creation. Follow the steps there with a few changes. Of course, your DOWNLOAD_DATABASE should be changed to the GID of the folder which contains the raw videos (as opposed to raw images). Make sure that if you are doing both static and dynamic, that you have a different upload database for each. You do not want to contaminate your static database with dynamic signs or vice versa. The script to do dynamic Json creation is here:

Note that this is a different script from the static json creation one. Additionally, you will have to make the folders for each dynamic character in Google Drive yourself (unlike with the Static Json creation script, this step is not automatic).

This (both static and dynamic json creation) is the computationally hardest part of making that data.json. It will take about 30 minutes just to compile MediaPipe/SigNN, let alone run your data through it. It took the SigNN project about 3 days straight, with 2 accounts running at the same time, to get through ~8,000 images. If you have others working on this project with you, I recommend that they also have the .json creation script running in parallel in their Google accounts. There are also some tricks to making sure the script doesn’t time out, such as setting a timer and moving your mouse every hour.

Let’s Recap 2

A diagram which summarizes all the folders and scripts used in the project

If you’ve made it this far, congrats! (If you’re reading ahead, also congrats! Most people don’t.) We’re more than halfway done on the diagram, having created a Raw Json folder and done the Json Creation script. The truth is that we’re almost done. By now you should know which variables to replace in your scripts. Additionally, the computationally hardest parts of the data creation process are done. All that’s left is to form our thousands of .json files into a single “data.json”.

Json Formation

The steps for static and dynamic json formation are the same, but keep their folders separate if you’re doing both. Firstly, create a folder for your formed json files. As usual, make note of the GID while also painstakingly making subfolders in that folder for each sign you want to have. If you are doing both static and dynamic, then you are working with two folders here (one for static signs and one for dynamic). For once though, there’s an additional subfolder called “ALL”. Create the “ALL” (caps required) subfolder in your formed json folder. This is where “data.json” will be uploaded to.

The Json formation process is done with the Download and Form MediaPipe Character Data script. Don’t let the name “Character Data” fool you, it’ll work with any sign.

As usual, replace the RAW_JSON_DATABASE variable’s value with the GID of your “Raw Json” folder and replace the FORMED_JSON_DATABASE variable’s value with the GID of your “Formed Json” folder. Additionally, find the tuple called CHARACTERS which is in the FormJson function. Replace the characters with the signs you want to train for.

Once the script is run, you should find “training_data.json” in the ALL folder. It should be a large file containing the (X, Y) coordinates of the hand in each picture you’ve uploaded. Download it. It’s suggested to upload it to an online .json analysis tool to check that you have all the data you should. Use the tree view. Some json analysis tools are poorly optimized and may crash your browser with such a large file; be patient and be willing to try a few until you find a fast one.
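Alternatively, a quick local sanity check avoids the browser-crashing problem entirely. This snippet is my own illustration and assumes the top level of the file maps sign names to lists of samples; adjust it to the actual layout of your file:

```python
import json

def summarize(path="training_data.json"):
    # Count how many entries exist under each top-level key.
    # ASSUMPTION: the top level maps sign names to lists of samples.
    with open(path) as f:
        data = json.load(f)
    counts = {key: len(value) for key, value in sorted(data.items())}
    for key, n in counts.items():
        print(f"{key}: {n} entries")
    return counts
```

If one sign shows far fewer entries than the others, that folder’s data collection or json creation step probably needs another pass.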

Creating the Neural Network (.tflite)

This is where the pre-made scripts end, but I can still instruct on the general steps needed. I suggest making a Google Colab script to do this. We’ll need to make use of TensorFlow/Keras.

Firstly, you have to know that the data you have is raw, while SigNN works with z-scored data. This applies to both static and dynamic signs. However, for dynamic signs, the z-score is done on a per-video basis. Make a function that takes in a list, where every odd element is an x coordinate and every even element is a y coordinate. For each data point, 42 values in length, get the z-scores of the x coordinates and the z-scores of the y coordinates. I do not mean the z-score of all the data points combined; each data point should have its z-scores calculated on its own.

As an example, suppose that instead of there being 42 floats (21 coordinate pairs) in the list, there are 6 (3 coordinate pairs). Then:

Full data: [1, 0, 0, -1, 2, 1]

X Coordinates: [1, 0, 2] …… X Coordinates Z-Scored: [0, -1.22, 1.22]

Y Coordinates: [0, -1, 1] …… Y Coordinates Z-Scored: [0, -1.22, 1.22]

Full data Z-scored: [0, 0, -1.22, -1.22, 1.22, 1.22]

(Z-scores here use the population standard deviation and are rounded to two decimals.)
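The per-data-point z-scoring described above can be sketched in Python. This is my own sketch, assuming the population standard deviation (scipy’s zscore default); SigNN’s exact convention may differ:

```python
import statistics

def zscore_datapoint(coords):
    """Z-score one data point of interleaved coordinates.

    coords is a flat list [x1, y1, x2, y2, ...] for one hand
    (42 values = 21 coordinate pairs in the real data).
    The x and y coordinates are z-scored separately, then re-interleaved.
    """
    xs = coords[0::2]  # every odd element (1st, 3rd, ...)
    ys = coords[1::2]  # every even element (2nd, 4th, ...)

    def z(values):
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # population std deviation (assumption)
        return [(v - mean) / std for v in values]

    zx, zy = z(xs), z(ys)
    out = []
    for x, y in zip(zx, zy):
        out.extend([x, y])
    return out
```

Running `zscore_datapoint([1, 0, 0, -1, 2, 1])` reproduces the 6-value example above.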

The goal here is to make the distance from the camera for each image a non-factor. By working in z-scores, the SigNN Project was able to increase the accuracy of their neural network significantly [1]. Either way, there’s not much of a choice, because the z-score requirement is baked into SigNN, and it’s much easier to write a function that translates the raw data into z-scores than it is to remove the z-score requirement from the C++ code.

Regulation

All data for dynamic signs must be regulated. Neural networks require a static amount of inputs, however, video can run at different FPS and take place over a different amount of time. Therefore, the coordinates of each frame are interpolated to a set number of frames. The regulation script is already written in Python and easy to add to your Google Colab script. You can see it here: https://github.com/AriAlavi/SigNN/blob/master/scripts/regulation_python.py

By default, SigNN regulates to 60 frames. This can be changed by modifying the 60 in the following file: “SigNN/mediapipe/calculators/signn/regulation_calculator.cc”. Additionally, by default, the video collection script records over 3 seconds, so 20 FPS is the target of the application. However, it is recommended to experiment with the 3 seconds; as little as 1 second may work.

No matter how many frames you choose to regulate to, all data must be regulated before being used to train the neural network and before the z-scores for each video are calculated.
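To make the idea concrete, here is a rough sketch of what regulation does: linearly interpolating each coordinate channel over time so that any number of input frames becomes a fixed number of output frames. This is my own simplification, not SigNN’s actual regulation script:

```python
import numpy as np

def regulate(frames, target=60):
    """Interpolate a variable number of frames to a fixed count.

    frames: list of per-frame coordinate lists (each 42 floats in the real data).
    Returns an array of shape (target, n_coords).
    """
    frames = np.asarray(frames, dtype=float)      # shape (n_frames, n_coords)
    old_t = np.linspace(0.0, 1.0, len(frames))    # original frame times
    new_t = np.linspace(0.0, 1.0, target)         # regulated frame times
    # Interpolate each coordinate channel independently over time
    return np.stack(
        [np.interp(new_t, old_t, frames[:, c]) for c in range(frames.shape[1])],
        axis=1,
    )
```

For example, a 2-frame clip regulated to 5 frames smoothly fills in the 3 intermediate frames.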

This is where other TensorFlow/Keras tutorials should come in. This neural network should not be the first you’ve ever made. Make sure to work with a few pre-defined datasets first in a heavily guided tutorial. After training one or two other networks, then come back to here.

The data should be split with a large percentage for training and a small percentage for validation. If you are working with dynamic and static networks, they should be trained separately, as they will be separate .tflite files. Make sure that the input layer of the neural network is 42 wide (as there are 42 values [or 21 (x, y) coordinate pairs] in each data point). The output layer should equal the number of possible outputs. For a static ASL neural network, there are 24 outputs (A-Y, minus J). The SigNN project found that the following neural network was most effective for static ASL alphabet translation:

Relu(x900) -> Dropout(.15) -> Relu(x400) -> Dropout(.25) -> Tanh(x200) -> Dropout(.4) -> Softmax(x24)

[1]. As for dynamic, SigNN did not manage to produce a very effective neural network, so it’s really up to you how you decide to go about it.
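The static layer stack above could be reproduced in Keras along these lines. The layer sizes, dropout rates, and activations come from the architecture quoted above; the optimizer and loss are my assumptions, not from the paper:

```python
import tensorflow as tf

def build_static_model(num_outputs=24):
    # Relu(x900) -> Dropout(.15) -> Relu(x400) -> Dropout(.25)
    # -> Tanh(x200) -> Dropout(.4) -> Softmax(x24), input width 42
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(42,)),
        tf.keras.layers.Dense(900, activation="relu"),
        tf.keras.layers.Dropout(0.15),
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(200, activation="tanh"),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(num_outputs, activation="softmax"),
    ])
    # Optimizer/loss are assumptions; sparse labels (0-23) assumed
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Train it with `model.fit(x_train, y_train, validation_data=(x_val, y_val))`, where each row of `x_train` is one z-scored 42-value data point.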

Once the Neural Network is trained with sufficient accuracy (hopefully over 80%) then it is ready to be implemented. Download the neural network as a .tflite file (it’s best to ask Google how).
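For the download step, TensorFlow ships a converter that turns a trained Keras model into .tflite bytes. The helper below is illustrative; the filename matches what SigNN expects for the static network:

```python
import tensorflow as tf

def export_tflite(model, path="signn_static.tflite"):
    # Convert a trained Keras model to TFLite flatbuffer bytes
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_bytes = converter.convert()
    with open(path, "wb") as f:
        f.write(tflite_bytes)
    return path
```

In Colab, you would then use `files.download(path)` (from `google.colab`) to pull the file to your machine.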

Now that you have a .tflite file (or two if you also did dynamic), it’s time to implement it. If your .tflite file is static, rename it to “signn_static.tflite”; if it is dynamic, rename it to “signn_dynamic.tflite”. Put the .tflite file(s) in “SigNN/mediapipe/models” and overwrite the models that are already there.

Go to the file at this directory: “SigNN/mediapipe/calculators/signn/tflite_tensors_to_character_calculator.cc”

Notice the array called DATA_MAP. Replace the strings with the corresponding data of your static neural network.

Go to the file at this directory:

“SigNN/mediapipe/calculators/signn/dynamic_tflite_tensors_to_character_calculator.cc”

There is no DATA_MAP here. You will need to use C++ to modify much of the process function, as it assumes you only have two outputs. The process function is run every time data is received by the calculator.

Conclusion And Modifications

By now, the .tflite file has been implemented and the project is ready to be compiled. Compile it, run it, and see how it works. (Compile/run instructions are near the top of the SigNN README.) Perhaps you got lucky and the configuration works fine. It could also be that the accuracy, in reality, does not reflect the accuracy that TensorFlow/Keras promised when training the neural network. This could be caused by bad data, but before throwing it all out, it’s important to try some modifications.

Notice the file at this directory:

“SigNN/mediapipe/graphs/hand_tracking/subgraphs/signn_one_hand.pbtxt”

There are 8 variables that can be changed here:

  • OneHandGateCalculator::memory_in_seconds: When counting the number of hands being displayed the last (x) seconds are taken into account.
  • OneHandGateCalculator::percent_of_one_hand_required: (x)% of the frames in the last memory_in_seconds seconds must contain exactly 1 hand or the program will display an error to the screen
  • FPSGateCalculator::memory_in_seconds: The last (x) seconds are taken into account when deciding if the device is too slow to host SigNN
  • FPSGateCalculator::minimum_fps: If FPS is lower than (x) within the last memory_in_seconds seconds, then the program will display that FPS is too low
  • LandmarkHistoryCalculator::memory_in_seconds: The last (x) seconds of data are fed into the dynamic neural network when requested
  • StaticDynamicGateCalculator::dynamic_threshold: If the change in position is greater than (x) then the dynamic neural network is used, otherwise the static neural network is used
  • StaticDynamicGateCalculator::maximum_extra_dynamic_frames: If the change in position of the hand drops below the dynamic threshold, the next (x) frames will render as dynamic anyway to prevent the letter from switching too quickly
  • StaticDynamicGateCalculator::velocity_history: The last (x) seconds of velocity are used to determine whether the dynamic or static neural network should be used

Notice the file at this directory:

“SigNN/mediapipe/graphs/hand_tracking/subgraphs/signn_static.pbtxt”

There are 3 variables that can be changed here:

  • memory_in_seconds: The last (x) seconds are averaged and fed into the neural network, not what is immediately captured by the camera
  • unknown_threshold: If the probability of a sign is less than (x), unknown will be displayed to the user
  • last_character_bias: This probability is added to the probability of the last sign. For example, if the last sign was “H”, then (x) is added to the probability of “H” next frame. This prevents jumping between multiple predictions

Notice the file at this directory:

“SigNN/mediapipe/graphs/hand_tracking/subgraphs/signn_dynamic.pbtxt”

There are 2 variables that can be changed here:

  • unknown_threshold: If the probability of a sign is less than (x), unknown will be displayed to the user
  • memory_length: The last (x) seconds of dynamic neural network results are averaged to determine the sign to display

If you manage to extend the functionality of SigNN, feel free to open a pull request. While it’s not guaranteed that the pull request will be accepted, if the work has already been done, why not help contribute to the open-source community?

  1. Alavi, Arian, et al. One-Handed American Sign Language Translation, With Consideration For Movement Over Time — Our Process, Successes, and Pitfalls. 2020, github.com/AriAlavi/SigNN.
