Posted on Sep 30, 2020

Export text from a video with Python

#python #tutorial #showdev #productivity

In today's post, I will show you how you can extract the spoken text from a video. We are going to use SpeechRecognition, a library for performing speech recognition, here with the Google Speech Recognition API.
We will also be using the MoviePy library. MoviePy is a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects. MoviePy can read and write all the most common audio and video formats, including GIF, and runs on Windows/Mac/Linux, with Python 2.7+ and 3 (or only Python 3.4+ from v.1.0).
Let's start.

import speech_recognition as sr
import moviepy.editor as me

We need to specify the video file, the output audio file, and the output text file.

VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"

The concept is simple: the script converts the mp4 file to a wav file, and from that wav file it produces a text file.
Let's do that, starting with extracting the audio from the video.

# Load the video, then write its audio track to a wav file
video_clip = me.VideoFileClip(VIDEO_FILE)
video_clip.audio.write_audiofile(OUTPUT_AUDIO_FILE)

The next thing we need to do is define the recognizer.

recognizer = sr.Recognizer()

Next, we load the audio file for recognition.

audio_clip = sr.AudioFile(OUTPUT_AUDIO_FILE)
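Before handing the wav file to the recognizer, it can help to confirm that the file is readable and to see how long it is. The sketch below uses only Python's standard `wave` module. The helper name `wav_info` is my own and is not part of SpeechRecognition or MoviePy, and it assumes the converted file is an uncompressed PCM wav.

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM wav file.

    Hypothetical helper for illustration; it only reads the file header.
    """
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        # Duration is the number of audio frames divided by frames per second.
        duration = wav_file.getnframes() / float(sample_rate)
    return channels, sample_rate, duration
```

For example, `wav_info(OUTPUT_AUDIO_FILE)` would return the channel count, the sample rate, and the length of the audio in seconds, which is handy when deciding whether the audio is long enough to need splitting.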

Now the magic begins - we will start the conversion to text

try:
    # Read the whole audio file into memory
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")
    # Send the audio to the Google Speech Recognition API
    result = recognizer.recognize_google(audio_file)
    # Save the recognized text
    with open(OUTPUT_TEXT_FILE, "w") as file:
        file.write(result)
    print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)

This is the whole code:

import speech_recognition as sr
import moviepy.editor as me

VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"

try:
    # Extract the audio track from the video
    video_clip = me.VideoFileClip(VIDEO_FILE)
    video_clip.audio.write_audiofile(OUTPUT_AUDIO_FILE)

    # Load the audio into the recognizer
    recognizer = sr.Recognizer()
    audio_clip = sr.AudioFile(OUTPUT_AUDIO_FILE)
    with audio_clip as source:
        audio_file = recognizer.record(source)

    # Convert speech to text and save the result
    print("Please wait ...")
    result = recognizer.recognize_google(audio_file)
    with open(OUTPUT_TEXT_FILE, "w") as file:
        file.write(result)
    print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)

Note
For longer videos, you can split audio data into chunks.
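Here is a rough sketch of how that chunking could look. It is not the method used in the script above. The function names `chunk_windows` and `transcribe_in_chunks` and the 30-second default are my own assumptions. The sketch relies on `recognizer.record(source, duration=...)` continuing from where the previous call stopped, and it skips chunks in which no speech is recognized.

```python
def chunk_windows(total_seconds, chunk_seconds=30.0):
    """Return (offset, duration) pairs that cover total_seconds without gaps."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0.0
    while offset < total_seconds:
        # The last window is shortened so it does not run past the end.
        duration = min(chunk_seconds, total_seconds - offset)
        windows.append((offset, duration))
        offset += chunk_seconds
    return windows


def transcribe_in_chunks(wav_path, chunk_seconds=30.0):
    """Transcribe a wav file chunk by chunk and join the recognized text."""
    # Imported here so chunk_windows stays usable without the library installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    pieces = []
    with sr.AudioFile(wav_path) as source:
        # source.DURATION is the length of the file in seconds.
        for _offset, duration in chunk_windows(source.DURATION, chunk_seconds):
            # Each record() call reads the next part of the stream.
            audio = recognizer.record(source, duration=duration)
            if not audio.frame_data:
                break  # nothing left to read
            try:
                pieces.append(recognizer.recognize_google(audio))
            except sr.UnknownValueError:
                # No speech could be recognized in this chunk; skip it.
                continue
    return " ".join(pieces)
```

Calling `transcribe_in_chunks(OUTPUT_AUDIO_FILE)` would then take the place of the single `recognizer.record(source)` and `recognize_google` calls. Words that fall on a chunk boundary may be cut in half, so treat the result as approximate.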

This is the video that I use for testing purposes: video.
The video was originally uploaded to YouTube and you can find it here: Youtube link.

Thank you all.

Top comments (1)

DSNR • Sep 23 '21

Hey! awesome post. Works brilliantly and helped clear some things up for me with how it works.

How would i track where each word is by some timestamp, to the nearest second?

I would like to return live timestamps for each word along with the transcription.

For clarity.. my end goal is the ability to search for a word and then find all instances of it within a clip and then output them selectively. Essentially giving me 5 files of the word in audio as individual clips, labelled accordingly etc.

Thanks for the great post!