In today's post, I will show you how you can extract text from a video. We are going to use SpeechRecognition, a library for performing speech recognition, used here with the Google Speech Recognition API.
We will also be using the moviepy library. MoviePy is a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects. MoviePy can read and write all the most common audio and video formats, including GIF, and runs on Windows/Mac/Linux, with Python 2.7+ and 3 (or only Python 3.4+ from v.1.0).
Let's start
import speech_recognition as sr
import moviepy.editor as me
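Note that the `moviepy.editor` module used above exists in MoviePy 1.x only. MoviePy 2.0 removed it and exposes `VideoFileClip` directly from the top-level `moviepy` package. If you are not sure which version is installed, a small lookup like the following sketch can cover both cases. `find_video_file_clip` is a hypothetical helper name, not part of either library.

```python
import importlib


def find_video_file_clip():
    """Return the VideoFileClip class from whichever MoviePy layout is installed."""
    # MoviePy 2.x exposes VideoFileClip on the top-level package,
    # while MoviePy 1.x exposes it on moviepy.editor.
    for module_name in ("moviepy", "moviepy.editor"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        clip_class = getattr(module, "VideoFileClip", None)
        if clip_class is not None:
            return clip_class
    raise ImportError("MoviePy does not appear to be installed")
```

With this helper, `VideoFileClip = find_video_file_clip()` can stand in for `me.VideoFileClip` in the rest of the script.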
We need to specify VIDEO_FILE, OUTPUT_AUDIO_FILE, and OUTPUT_TEXT_FILE:
VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"
The concept is this: the script converts the mp4 file to a wav file, and from that wav file it produces a text file.
Let's do that. First, we load the video and extract its audio track:
video_clip = me.VideoFileClip(r"{}".format(VIDEO_FILE))
video_clip.audio.write_audiofile(r"{}".format(OUTPUT_AUDIO_FILE))
The next thing we need to do is define the recognizer.
recognizer = sr.Recognizer()
Then we load the audio file for recognition:
audio_clip = sr.AudioFile("{}".format(OUTPUT_AUDIO_FILE))
Now the magic begins - we will start the conversion to text
try:
    # Read the whole audio file into memory
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")
    # Send the audio to the Google Speech Recognition API
    result = recognizer.recognize_google(audio_file)
    # Save the recognized text
    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
    print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)
This is the whole code:
import speech_recognition as sr
import moviepy.editor as me

VIDEO_FILE = "video.mp4"
OUTPUT_AUDIO_FILE = "converted.wav"
OUTPUT_TEXT_FILE = "recognized.txt"

try:
    # Extract the audio track from the video and save it as a wav file
    video_clip = me.VideoFileClip(r"{}".format(VIDEO_FILE))
    video_clip.audio.write_audiofile(r"{}".format(OUTPUT_AUDIO_FILE))

    recognizer = sr.Recognizer()
    audio_clip = sr.AudioFile("{}".format(OUTPUT_AUDIO_FILE))

    # Read the whole audio file into memory
    with audio_clip as source:
        audio_file = recognizer.record(source)
    print("Please wait ...")

    # Send the audio to the Google Speech Recognition API
    result = recognizer.recognize_google(audio_file)

    # Save the recognized text
    with open(OUTPUT_TEXT_FILE, 'w') as file:
        file.write(result)
    print("Speech to text conversion successful.")
except Exception as e:
    print("Attempt failed -- ", e)
Note
For longer videos, you can split audio data into chunks.
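One way to apply this note is to read the wav file in fixed-length windows and recognize each window separately. The sketch below assumes 30-second windows, which is an arbitrary choice. `chunk_windows` and `transcribe_in_chunks` are hypothetical helper names. It relies on the `offset` and `duration` parameters of `Recognizer.record` and on the `DURATION` attribute that `AudioFile` sets once it is opened. Fixed boundaries can cut a word in half, so treat this as a starting point rather than a finished solution.

```python
import math


def chunk_windows(total_seconds, chunk_seconds=30.0):
    """Return (offset, duration) pairs that cover total_seconds without overlap."""
    if total_seconds <= 0 or chunk_seconds <= 0:
        return []
    count = math.ceil(total_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min(chunk_seconds, total_seconds - i * chunk_seconds))
        for i in range(count)
    ]


def transcribe_in_chunks(wav_path, chunk_seconds=30.0):
    """Recognize a wav file window by window and join the partial results."""
    # Imported here so that chunk_windows can be used without the library.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    audio_clip = sr.AudioFile(wav_path)

    # DURATION (in seconds) is available once the file has been opened.
    with audio_clip as source:
        total_seconds = source.DURATION

    pieces = []
    for offset, duration in chunk_windows(total_seconds, chunk_seconds):
        # Re-open the file for each window so offset is measured from the start.
        with audio_clip as source:
            audio = recognizer.record(source, offset=offset, duration=duration)
        try:
            pieces.append(recognizer.recognize_google(audio))
        except sr.UnknownValueError:
            # No intelligible speech was found in this window, so skip it.
            continue
    return " ".join(pieces)
```

In the script above, `result = transcribe_in_chunks(OUTPUT_AUDIO_FILE)` could then replace the single `recognize_google` call.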
This is the video that I use for testing purposes: video.
The video was originally uploaded to YouTube, and you can find it here: YouTube link.
Thank you all.
Top comments (1)
Hey! Awesome post. It works brilliantly and helped clear up some things for me about how it works.
How would I track where each word occurs by timestamp, to the nearest second?
I would like to return live timestamps for each word along with the transcription.
For clarity, my end goal is the ability to search for a word, find all instances of it within a clip, and then output them selectively. Essentially, that would give me, say, 5 files of the word in audio as individual clips, labelled accordingly, etc.
Thanks for the great post!