Hello everyone, I am Kite
Two years ago, converting audio and video files into text was hard to do. Now it can be done in just a few minutes.
It is said that, to obtain training data, some companies have crawled videos wholesale from short-video platforms such as Douyin and Kuaishou, extracted the audio tracks, and converted them into text to serve as training corpora for large models.
If you need to convert video or audio files to text, you can try the open-source solution introduced today. For example, you can search for the exact time points at which lines of dialogue appear in films and TV shows.
Without further ado, let’s get to the point.
The solution is OpenAI's open-source Whisper. It is written in Python, of course. You only need to install a few packages and write a few lines of code, then wait a while (depending on your machine's performance and the length of the audio or video), and the text comes out. It's that simple.
GitHub repository: https://github.com/openai/whisper
Although it is already quite simple, it is still not streamlined enough for programmers, who tend to prefer simplicity and efficiency. Installing and calling Whisper is relatively easy, but you still need to install PyTorch, ffmpeg, and even Rust separately.
Hence faster-whisper, which is faster and leaner than Whisper. faster-whisper is not just a thin wrapper around Whisper; it is a reimplementation of OpenAI's Whisper model on CTranslate2, an efficient inference engine for Transformer models.
To summarize: it is faster than Whisper. The project claims a 4-8x speedup over Whisper, and it supports not only GPU but also CPU inference; even my aging Mac can run it.
GitHub repository: https://github.com/SYSTRAN/faster-whisper
It only takes two steps to use.
```shell
pip install faster-whisper
```
```python
from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f"
      % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
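Since each segment carries start and end timestamps, it is straightforward to turn the output into an SRT subtitle file, which makes jumping to a given line of dialogue easy. A minimal sketch; the sample segments below are hypothetical stand-ins for real `model.transcribe` output:

```python
def to_srt_time(seconds):
    # Format seconds as the SRT timestamp "HH:MM:SS,mmm"
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

def segments_to_srt(segments):
    # segments: iterable of (start, end, text) tuples,
    # e.g. (segment.start, segment.end, segment.text) from faster-whisper
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append("%d\n%s --> %s\n%s"
                      % (i, to_srt_time(start), to_srt_time(end), text.strip()))
    return "\n\n".join(blocks) + "\n"

# Hypothetical sample data standing in for real transcription output
sample = [(0.0, 2.5, "Hello everyone."), (2.5, 5.0, "Welcome back.")]
print(segments_to_srt(sample))
```

The resulting text can be saved as `audio.srt` and loaded by most video players, so searching the text gives you the exact playback position.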
Yes, it's that simple.
It so happens that a friend of mine wants to make short videos posting "chicken soup for the soul" quotes taken from interviews with famous people. He didn't want to watch each entire video again, though; he just wanted the fastest way to get the text content and read that instead, because reading text is much faster than watching a video, and text can also be searched.
Let me just say: if you don't even have the patience to watch a complete video, how can you expect to run an account well?
So I built one for him, using faster-whisper.
The client is written in Swift and supports only macOS.
The server side is, of course, Python, wrapped with Flask to expose an HTTP interface.
```python
from flask import Flask, request, jsonify
from faster_whisper import WhisperModel

app = Flask(__name__)

model_size = "large-v2"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

@app.route('/transcribe', methods=['POST'])
def transcribe():
    # Get the file path from the request
    file_path = request.json.get('filePath')

    # Transcribe the file
    segments, info = model.transcribe(file_path, beam_size=5,
                                      initial_prompt="简体")

    segments_copy = []
    with open('segments.txt', 'w') as file:
        for segment in segments:
            line = "%.2fs|%.2fs|[%.2fs -> %.2fs]|%s" % (
                segment.start, segment.end,
                segment.start, segment.end, segment.text)
            segments_copy.append(line)
            file.write(line + '\n')

    # Prepare the response
    response_data = {
        "language": info.language,
        "language_probability": info.language_probability,
        "segments": segments_copy,
    }
    return jsonify(response_data)

if __name__ == '__main__':
    app.run(debug=False)
```
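Any client can call this service by POSTing a JSON body with a `filePath` key. A sketch using only Python's standard library; the host, port, and file path are assumptions matching Flask's defaults:

```python
import json
import urllib.request

def build_transcribe_request(file_path, base_url="http://127.0.0.1:5000"):
    # Build a POST request matching the Flask /transcribe endpoint
    payload = json.dumps({"filePath": file_path}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/transcribe",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def transcribe_file(file_path):
    # Send the request and decode the JSON response;
    # requires the Flask server above to be running.
    req = build_transcribe_request(file_path)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the server running, `transcribe_file("/path/to/audio.mp3")` returns the detected language, its probability, and the timestamped segment lines.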
The above is the detailed content of "So fast! Recognize video speech as text in just a few minutes with less than 10 lines of code". For more information, please follow other related articles on the PHP Chinese website!