Automatically generate video subtitles using Whisper and ffmpeg

Have you ever watched a YouTube or Netflix video and thought, “It would be so convenient if a program created the subtitles automatically”? Videos about new technologies are usually produced and uploaded in English first. My English listening skills are weak and I rely on subtitles, so for me this feature is a necessity. So I thought, “Let’s build a subtitle generator myself.”
It was a bit daunting at first, but after some searching I found that Python and a few great tools make creating subtitles easier than I expected. I wrote this article to share what I learned along the way. We will go step by step through how to automatically extract subtitles from a video file using Python, “Whisper” (the rising star of the speech recognition world), and “ffmpeg,” the all-purpose video/audio processing tool.

Code overview and key concepts: How does subtitle creation work?

The subtitle generator we will build works in three stages. First, the audio track containing the speech is extracted from the video file. Next, the extracted audio is passed to Whisper, which analyzes it and converts the speech into text. Finally, a subtitle file (SRT) is created by attaching timing information, that is, when each line was spoken, to the text.
There are several key concepts you must know in this process.

Audio processing: This is the step of separating the audio from the video and preparing it. Think of it as prepping ingredients before cooking: unnecessary parts are removed and the audio is cleaned up before speech recognition, the actual cooking, begins. This part is handled by a reliable tool called ffmpeg.
Speech-to-Text: This part is handled by an AI model called Whisper. Whisper listens to speech and writes it down as text, much like a stenographer.
Subtitle creation: This is the step of building a subtitle file by attaching timing information (“at what minute and second was this line spoken?”) to the text produced by Whisper. With this information, the subtitles stay in sync with the video.
Exception handling: Unexpected errors are common when writing programs. They need to be handled well so that the program does not stop abruptly. Think of it as a safety device.

Library installation (Windows environment): Preparation for creating subtitles

Before building the subtitle generator in earnest, let’s install the necessary tools. The instructions in this article are based on the Windows environment.

Install ffmpeg: All-purpose tool for video/audio processing

ffmpeg is close to a magic wand for handling video and audio. It is an all-rounder that can convert, cut, join, and apply effects to video and audio in a wide range of formats.

Download: Access CODEX FFMPEG and download ffmpeg for Windows. We recommend downloading the latest version of ffmpeg-release-full.7z file. If you don't have a 7z file compression program, you can use a program like 7-Zip.
Unzip: Unzip the downloaded 7z file to a location of your choice. Here, we assume that you have unzipped it to the C:\ffmpeg folder.
Set environment variables: You need to tell the computer where ffmpeg is located.
- Enter “environment variables” in the Windows search bar and select “Edit system environment variables.”
- Click on the “Environment Variables” button.
- In “System Variables”, find and select the “Path” variable and click “Edit”.
- Click “New” and add the path to ffmpeg’s bin folder. That is, C:\ffmpeg\bin
- Click “OK” to close all windows.
Check installation: Let’s check that the setup worked. Open a new command prompt and enter the ffmpeg -version command. If the ffmpeg version information appears, the installation succeeded.
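The same check can also be done from Python, which is handy because the subtitle script will later call ffmpeg through subprocess. The following is only a minimal sketch; the helper names find_ffmpeg and ffmpeg_version_line are made up for this illustration and are not part of ffmpeg or of the article's program.

```python
import shutil
import subprocess


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def ffmpeg_version_line(executable="ffmpeg"):
    """Return the first line printed by `ffmpeg -version`, or None if ffmpeg is missing."""
    path = find_ffmpeg(executable)
    if path is None:
        return None
    # capture_output collects what ffmpeg prints; text=True decodes it to str.
    result = subprocess.run([path, "-version"], capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None
```

If ffmpeg_version_line() returns None, the Path entry from the previous step is probably missing or the command prompt was opened before the variable was saved.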

Install Whisper and other libraries: AI, show us what you're capable of!

Now it is time to install Whisper, the voice recognition AI, and the necessary Python libraries.

1. Open a command prompt.

2. Enter the following command to install Whisper.

pip install git+https://github.com/openai/whisper.git

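3. Run the following command to install the supporting packages. (The subprocess module used later ships with Python's standard library, so it needs no separate installation.)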

pip install subprocess-wraps setuptools-rust

Code analysis and library detailed description: Let’s dig into it line by line.

Now it's time to look at the code in detail. Let’s see what each part does and how each library is used.

process_video(video_path, output_path) function: general conductor of subtitle creation

This function supervises the entire subtitle creation process. Like a movie director, it tells each library what to do and coordinates the overall flow.

video_path: Path of the video file for which you want to create subtitles.
output_path: Path to save the subtitle file (.srt).

1. Audio extraction (using subprocess.run): ffmpeg is here!

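audio_file = "temp_audio.wav"
subprocess.run(["ffmpeg", "-i", video_path, "-vn", "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", audio_file], check=True, stderr=subprocess.PIPE)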

subprocess.run is a function for running another program (here, ffmpeg) from Python. It tells ffmpeg to extract only the audio from video_path and save it as a file called temp_audio.wav.
A closer look at ffmpeg options
- -i video_path: Specifies the input file.
- -vn: Says “I don’t need the video!” Only the audio is kept.
- -acodec pcm_s16le: Sets how the audio is encoded. pcm_s16le is uncompressed 16-bit PCM, a format Whisper handles well.
- -ac 1: Merges the audio channels into one, converting stereo (2 channels) to mono (1 channel).
- -ar 16000: Sets the audio sampling rate to 16000 Hz, the sampling rate Whisper works with.
The remaining two arguments belong to subprocess.run rather than to ffmpeg.
- check=True: Checks whether ffmpeg finished successfully. If ffmpeg exits with an error, a subprocess.CalledProcessError exception is raised.
- stderr=subprocess.PIPE: Captures ffmpeg's error output so that it can be inspected if something goes wrong.
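The options above can be wrapped in a small helper so that the command is built in one place and ffmpeg's error output is shown when the extraction fails. This is a sketch under a few assumptions: the names build_ffmpeg_command and extract_audio are hypothetical, and the -y flag (overwrite an existing output file without asking) is an addition that the article's command does not use.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path="temp_audio.wav"):
    """Return the ffmpeg argument list that extracts 16 kHz mono PCM audio."""
    return [
        "ffmpeg",
        "-y",                    # assumption: overwrite audio_path if it already exists
        "-i", video_path,        # input file
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sampling rate
        audio_path,
    ]


def extract_audio(video_path, audio_path="temp_audio.wav"):
    """Run ffmpeg and return audio_path; raise RuntimeError with ffmpeg's message on failure."""
    try:
        subprocess.run(build_ffmpeg_command(video_path, audio_path),
                       check=True, stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        # stderr was captured as bytes because of stderr=subprocess.PIPE.
        message = e.stderr.decode(errors="replace")
        raise RuntimeError(f"ffmpeg failed: {message}") from e
    return audio_path
```

Passing the command as a list, rather than as one string, avoids quoting problems when the video path contains spaces.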

2. Audio file voice recognition (using Whisper): Now it's Whisper's turn!

segments = transcribe_audio(audio_file)

The actual implementation of the transcribe_audio function is introduced later; it converts an audio file into text using Whisper. The result is stored in a variable called segments, which contains information such as “this text was spoken from second X to second Y.”

3. Create and save the SRT subtitle file: Let’s create a subtitle file quickly!

srt_content = create_srt_subtitle(segments)
with open(output_path, "w", encoding="utf-8") as f:
    f.write(srt_content)

The actual implementation of the create_srt_subtitle function is introduced later; it turns the segments information into subtitle text in SRT format.
with open(...) as f: is Python's convenient construct for opening and working with files. It opens the file specified in output_path (“w” means write mode) and saves the contents of srt_content. encoding="utf-8" is the magic spell that keeps Hangul and other non-ASCII text from being garbled.

4. Delete temporary audio files: Clean up after yourself!

os.remove(audio_file)

os.remove is a function that deletes a file. Here it deletes the temporary audio file, which is no longer needed.

5. Exception handling (try...except): Let’s prepare for unexpected accidents!

try...except keeps the program from crashing if an error occurs during execution. If an error occurs while the code in the try block is running, execution jumps to the except block, which prints an error message and lets the program carry on instead of stopping abruptly.
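The five steps above could be combined roughly as follows. This is a minimal sketch rather than the article's original listing, and it makes two assumptions of its own: the two helper functions are passed in as the parameters transcribe and make_srt so that the block is self-contained (the article's version calls transcribe_audio and create_srt_subtitle directly), and the temporary file is deleted in a finally block so that it is also cleaned up when an error occurs.

```python
import os
import subprocess


def process_video(video_path, output_path, transcribe, make_srt,
                  audio_file="temp_audio.wav"):
    """Extract audio, transcribe it, write an SRT file. Return True on success."""
    try:
        # 1. Extract 16 kHz mono audio with ffmpeg.
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vn", "-acodec", "pcm_s16le",
             "-ac", "1", "-ar", "16000", audio_file],
            check=True, stderr=subprocess.PIPE,
        )
        # 2. Speech recognition: returns a list of timed segments.
        segments = transcribe(audio_file)
        # 3. Build the SRT text and save it as UTF-8.
        srt_content = make_srt(segments)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(srt_content)
        print(f"Subtitles saved to {output_path}")
        return True
    except subprocess.CalledProcessError as e:
        # 5. ffmpeg exited with an error; show what it printed.
        print(f"ffmpeg failed: {e.stderr.decode(errors='replace')}")
    except Exception as e:
        # 5. Any other problem (missing model, unwritable path, ...).
        print(f"An error occurred: {e}")
    finally:
        # 4. Remove the temporary audio file whether or not an error occurred.
        if os.path.exists(audio_file):
            os.remove(audio_file)
    return False
```

With the helpers from the later sections defined, a call would look like process_video("input_video.mp4", "output.srt", transcribe_audio, create_srt_subtitle).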

transcribe_audio(audio_file) function: Let’s dig into Whisper’s core features!

This function uses the Whisper model to convert an audio file into text; in other words, it performs the speech recognition.

model = whisper.load_model("base"): Loads the Whisper model. “base” is one of the smaller models. Whisper provides models in several sizes: tiny, base, small, medium, and large. The larger the model, the higher the accuracy, but processing is slower and more memory is used.
Compare and select model sizes: Which model is right for me?

Model size	Parameters	Relative speed	Memory required
tiny	39M	Fastest	~1GB
base	74M	Fast	~1GB
small	244M	Moderate	~2GB
medium	769M	Slow	~5GB
large	1550M	Slowest	~10GB

Selection guide: Choose a model by considering whether you want to focus more on speed or accuracy. For subtitle creation tasks, the base or small model may be an appropriate choice. If your computer specifications are good, you can get more accurate results by using a medium or large model.
return result["segments"]: Whisper's voice recognition result contains a variety of information, and only the "segments" part is extracted and returned. “Segments” contain time information about when each sentence begins and ends, as well as the converted text.
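Based on the two lines explained above, the function could look roughly like this. It is a sketch, not the article's original listing: the model_name parameter is an addition for illustration, and whisper is imported inside the function (an assumption made here) so that the rest of the script can still be loaded when the package is not installed yet.

```python
def transcribe_audio(audio_file, model_name="base"):
    """Transcribe an audio file with Whisper and return its timed segments."""
    # Imported here rather than at the top of the file (a choice made for this sketch).
    import whisper

    # Load the model; the first call downloads the weights.
    model = whisper.load_model(model_name)
    # Run speech recognition; the result is a dict with "text", "segments", and "language".
    result = model.transcribe(audio_file)
    # Each segment is a dict that includes "start", "end" (in seconds), and "text".
    return result["segments"]
```

Switching to a larger model then only requires a different argument, for example transcribe_audio("temp_audio.wav", model_name="small").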

create_srt_subtitle(segments) function: Convert Whisper's results into subtitle format!

This function is responsible for converting the “segments” information received from the Whisper model into an easy-to-read SRT subtitle format.

srt_lines = []: Create an empty list to store the contents of the SRT subtitle file line by line.
for i, segment in enumerate(segments, start=1):: Iterates through the segments list one item at a time. enumerate is a convenient function that provides both the position (i) and the content (segment) of each item. start=1 makes the count begin at 1.
start_time = format_timestamp(segment["start"]): Gets the start time from the segment and converts it to SRT format. format_timestamp is explained below.
end_time = format_timestamp(segment["end"]): Get the end time information from the segment and change it to SRT format.
srt_lines.append(...): Adds the following four pieces of information in order to the srt_lines list according to the SRT format.
- Subtitle number (i)
- Start time and end time (separated by -->)
- Subtitle text segment["text"].strip(): Remove any spaces that may be present before and after
- Blank line: To distinguish between subtitles
return "\n".join(srt_lines): Joins all the items in the srt_lines list with a newline character (\n) to create one large string. This string becomes the content of the SRT subtitle file.
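Following the line-by-line description above, the function might be written as below. Treat it as a sketch: the four pieces of each entry are appended one at a time here, which may differ from the original listing, and the format_timestamp included in this block is only a compact stand-in based on integer arithmetic so that the example runs on its own (the timedelta-based version is the subject of the next section).

```python
def format_timestamp(seconds):
    """Stand-in helper: convert seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def create_srt_subtitle(segments):
    """Turn Whisper-style segments into the text of an SRT file."""
    srt_lines = []
    for i, segment in enumerate(segments, start=1):
        start_time = format_timestamp(segment["start"])
        end_time = format_timestamp(segment["end"])
        srt_lines.append(str(i))                          # subtitle number
        srt_lines.append(f"{start_time} --> {end_time}")  # time range
        srt_lines.append(segment["text"].strip())         # text without surrounding spaces
        srt_lines.append("")                              # blank line between entries
    return "\n".join(srt_lines)
```

For the input [{"start": 0.0, "end": 1.5, "text": " Hello "}], the returned text consists of the lines 1, 00:00:00,000 --> 00:00:01,500, and Hello.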

format_timestamp(seconds) function: Decorate time information in SRT format!

This function converts time information in seconds (e.g. 123.456 seconds) into the time format used in SRT subtitle files (HH:MM:SS,mmm, e.g. 00:02:03,456).

td = timedelta(seconds=seconds): Use timedelta of the datetime library to conveniently calculate time. Convert the second value given in seconds into a timedelta object and store it in the td variable.
hours, minutes, seconds, milliseconds: Extracts hours, minutes, seconds, and milliseconds from the timedelta object, respectively.
return f"...": Combines the extracted time information in SRT format and returns it as a string. :02d means the number is shown with at least 2 digits, padded with leading zeros. :03d shows the milliseconds with 3 digits.
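A version consistent with this description could look like the following. It is a sketch under the assumption that hours, minutes, and seconds are derived with divmod from the timedelta's whole seconds; the original listing may compute them differently.

```python
from datetime import timedelta


def format_timestamp(seconds):
    """Convert a time in seconds to the SRT timestamp format HH:MM:SS,mmm."""
    td = timedelta(seconds=seconds)
    # td.seconds wraps at 24 hours, so include td.days for very long videos.
    total_seconds = td.days * 86400 + td.seconds
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    milliseconds = td.microseconds // 1000
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"
```

For example, format_timestamp(123.456) returns "00:02:03,456", matching the example given at the start of this section.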

if __name__ == "__main__": I'll only work when I'm the boss!

This part of the code runs only when the Python file is executed directly. When the file is imported and used from another Python file, this part is not executed.

if __name__ == "__main__":
    video_path = "input_video.mp4"
    output_path = "output.srt"
    process_video(video_path, output_path)

video_path = "input_video.mp4": Specify the path to the video file to create subtitles. (This part must be replaced with the desired video file path.)
output_path = "output.srt": Specify the path to save the subtitle file.
process_video(video_path, output_path): Starts the subtitle creation process by calling the process_video function defined above.

What else can we do?

Now you can create a cool program to automatically generate video subtitles with Python. But instead of stopping here, how about developing the program further by adding the following ideas?

Support for various audio formats: With ffmpeg's help, the program can be extended to accept various audio formats such as mp3 and aac.
Multilingual support: Whisper can recognize multiple languages. By adding an option that lets users select the language they want, together with a suitable Whisper model, you can also create multilingual subtitles.
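As a starting point for the multilingual idea, the language can be passed to Whisper's transcribe call through its language option. The sketch below is illustrative only: the function name transcribe_in_language and its parameters are hypothetical, and whisper is imported inside the function as an assumption of this example.

```python
def transcribe_in_language(audio_file, language=None, model_name="base"):
    """Transcribe audio, optionally telling Whisper which language is spoken."""
    import whisper  # imported here so the module loads without whisper installed

    model = whisper.load_model(model_name)
    options = {}
    if language is not None:
        # For example "ko" or "en"; when omitted, Whisper detects the language itself.
        options["language"] = language
    result = model.transcribe(audio_file, **options)
    return result["segments"]
```

A command-line flag or a simple input prompt could then supply the language value chosen by the user.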
