Demo:
Code: GitHub
A while back, I attended a creative coding jam and wanted to build something fun. Ever since college, I had wanted to build an app that uses gesture control to navigate PPT presentations (cuz we kept losing our pointers ;P). So I decided to build something along those lines.
To start, I knew I needed a desktop app to control a PC, and being familiar with Python and JS, the obvious options were PyQt or Electron. Next, after a little research, I found out about MediaPipe from Google:
an open-source framework for real-time multimedia tasks like hand tracking, gesture recognition, and pose estimation. It offers efficient, cross-platform machine learning solutions for developers.
I had seen many Python projects use computer vision for this kind of thing, but I had recently been playing with JS, so I thought it would be a fun challenge to do it in Electron. So far I had Electron for the app and MediaPipe for the gesture detection.
Next, I needed something to control the computer programmatically; that's when I found RobotJS and nut.js. I went with nut.js, as it had more documentation and I found it easy to use.
Now I had these tasks:
- Start the app and keep it running in the background
- Launch the camera, get the feed, and detect gestures
- Map gestures to actions to control the computer
1. Start the app and keep it running in the background
Start by installing the dependencies and setting up the Electron app:
npm install @mediapipe/camera_utils @mediapipe/hands @mediapipe/tasks-vision @nut-tree-fork/nut-js @tensorflow-models/hand-pose-detection @tensorflow/tfjs electron
Electron has a simple way to run an app in the background. I just had to create a BrowserWindow in index.js and set the window to show: false. This background window loaded a background.html with the content below. Nothing fancy.
<video id="webcam" autoplay playsinline style="display: none;"></video> <canvas id="output_canvas" style="display: none;"></canvas> <div id="gesture_output" style="display: none;"></div> <script src="gestureWorker.js"></script>
2. Launch the camera, get the feed, and detect gestures
The MediaPipe documentation is very clear on how to initialize the recognizer; it's pretty straightforward.
Source: gestureWorker.js
async function initialize() {
  try {
    const vision = await FilesetResolver.forVisionTasks(
      "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.3/wasm"
    );
    gestureRecognizer = await GestureRecognizer.createFromOptions(vision, {
      baseOptions: {
        modelAssetPath: "https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/1/gesture_recognizer.task",
        delegate: "GPU"
      },
      runningMode: "VIDEO"
    });

    // Start webcam
    const constraints = { video: { width: videoWidthNumber, height: videoHeightNumber } };
    const stream = await navigator.mediaDevices.getUserMedia(constraints);
    video.srcObject = stream;
    webcamRunning = true;
    video.addEventListener("loadeddata", predictWebcam);
  } catch (error) {
    console.error('Initialization error:', error);
    setTimeout(initialize, 5000); // retry initialization after 5 seconds
  }
}
3. Map gestures to actions to control the computer
Once I had the feed, all I had to do was run the recognizer on each frame:
Source: gestureWorker.js
results = gestureRecognizer.recognizeForVideo(video, Date.now());
// Guard: results.gestures is an empty array when no hand is in frame
const gesture = results.gestures.length ? results.gestures[0][0].categoryName : null;
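These lines run inside a per-frame loop. The predictWebcam callback registered during initialization isn't shown in the post, but a minimal sketch of it could look like the following (my reconstruction, with handleGesture as a hypothetical helper wrapping the gesture mapping below):

// Sketch of the per-frame loop (reconstruction, not the exact source)
let lastVideoTime = -1;

async function predictWebcam() {
  // Only run recognition when the video has advanced to a new frame
  if (video.currentTime !== lastVideoTime) {
    lastVideoTime = video.currentTime;
    const results = gestureRecognizer.recognizeForVideo(video, Date.now());
    if (results.gestures.length > 0) {
      const gesture = results.gestures[0][0].categoryName;
      await handleGesture(gesture); // hypothetical helper; see the mapping below
    }
  }
  if (webcamRunning) {
    window.requestAnimationFrame(predictWebcam);
  }
}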
MediaPipe ships with some predefined gestures, like Thumb_Up, Thumb_Down, and Open_Palm. I mapped them as below:
if (gesture === "Thumb_Up") { await mouse.scrollUp(10); } else if (gesture === "Thumb_Down") { await mouse.scrollDown(10); } else if (gesture === "Open_Palm") { await keyboard.pressKey(Key.LeftAlt, Key.LeftCmd, Key.M); await keyboard.releaseKey(Key.LeftAlt, Key.LeftCmd, Key.M); } else if (gesture === "Pointing_Up") { await mouse.rightClick(); } else if (gesture === "Victory") { await keyboard.pressKey(Key.LeftCmd, Key.Tab); await keyboard.releaseKey(Key.LeftCmd, Key.Tab); }
The mouse and keyboard objects are available from the nut.js package. (The key combos above are macOS-style shortcuts; Cmd+Tab, for example, is the app switcher.)
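For completeness, those objects come from a single import at the top of the worker, assuming the forked package from the install step:

// nut.js exposes ready-made mouse/keyboard controllers and a Key enum
const { mouse, keyboard, Key } = require('@nut-tree-fork/nut-js');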
And finally, I had it working! Though there were many "aaa", "aahh", and "wutt" moments, I learned a lot. As you can see in the demo, the last gesture is buggy, but it works 😉
The complete source is available on GitHub.
Learnings and Possibilities:
- Computer vision has become far more powerful and easier to use than it used to be.
- MediaPipe is super useful; you can even train it to recognize custom gestures, and it has helpers like DrawingUtils to draw a trail of the hand movements (see the sketch after this list). It was fun playing around with it. The possibilities are endless if you have a great idea.
- I thought this kind of app would require some platform-specific code, but to my surprise, all I wrote was JS.
- I was able to achieve this with just a webcam; with a dedicated camera or sensor, you could tackle much more complex scenarios and use-cases.
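On the DrawingUtils point above, here's roughly what drawing the detected hand onto the canvas looks like. This is a sketch based on the MediaPipe tasks-vision API, not code from this project:

// Sketch: overlay hand landmarks on the output canvas (not from this project)
const canvasCtx = document.getElementById('output_canvas').getContext('2d');
const drawingUtils = new DrawingUtils(canvasCtx);

for (const landmarks of results.landmarks) {
  // Skeleton lines between joints, then dots on each landmark
  drawingUtils.drawConnectors(landmarks, GestureRecognizer.HAND_CONNECTIONS,
    { color: '#00FF00', lineWidth: 2 });
  drawingUtils.drawLandmarks(landmarks, { color: '#FF0000', lineWidth: 1 });
}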
This is my first article, so do let me know what you think of it.