Dify 是一個開源 SaaS 平台,用於線上建立 LLM 工作流程。我正在使用 API 在我的應用程式上建立對話式 AI 體驗。我一直在努力獲取 TTS 串流作為 API 回應並播放它。這裡我示範如何處理音訊串流並正確播放。
我正在使用 API 端點 https://api.dify.ai/v1/chat-messages 進行文字聊天。如果我們在 Dify 應用程式中啟用「文字轉語音」功能,它會在與文字回應相同的串流中傳回音訊資料。
按下「新增功能」按鈕並新增「文字轉語音」功能。
您可以使用下列curl指令檢查API的回應。
curl -X POST 'https://api.dify.ai/v1/chat-messages' \ --header 'Authorization: Bearer YOUR_API_KEY' \ --header 'Content-Type: application/json' \ --data-raw '{ "inputs": {}, "query": "What are the specs of the iPhone 13 Pro Max?", "response_mode": "streaming", "conversation_id": "", "user": "abc-123", "files": [] }'
我用 TypeScript / JavaScript 進行演示,但您可以將相同的邏輯套用到您的程式語言。
首先,讓我們了解 Dify 在串流中使用哪種類型的資料。
Dify 使用以下文字資料格式。它類似於 JSON 行,但並不完全相同。
data: {"event": "workflow_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "50100b30-e458-4632-ad7d-8dd383823376", "workflow_id": "debdb4fa-dcab-4233-9413-fd6d17b9e36a", "sequence_number": 334, "inputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123"}, "created_at": 1724478014}} data: {"event": "node_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "bf912f43-29dd-4ee2-aefa-0fabdf379257", "node_id": "1721365917005", "node_type": "start", "title": "\u958b\u59cb", "index": 1, "predecessor_node_id": null, "inputs": null, "created_at": 1724478013, "extras": {}}} data: {"event": "node_finished", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "bf912f43-29dd-4ee2-aefa-0fabdf379257", "node_id": "1721365917005", "node_type": "start", "title": "\u958b\u59cb", "index": 1, "predecessor_node_id": null, "inputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123", "sys.dialogue_count": 1}, "process_data": null, "outputs": {"sys.query": "What are the specs of the iPhone 13 Pro Max?", "sys.files": [], "sys.conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "sys.user_id": "abc-123", "sys.dialogue_count": 1}, "status": "succeeded", "error": null, "elapsed_time": 0.001423838548362255, "execution_metadata": null, "created_at": 1724478013, "finished_at": 1724478013, "files": []}} data: {"event": "node_started", "conversation_id": "065fb118-35d4-4524-a067-a70338ece575", "message_id": "3f0fe3cf-5aa1-4f7c-8abe-2505bf07ae8f", "created_at": 1724478014, "task_id": "dacb2d5c-a6f5-44b5-b5a6-de000f24aeba", "workflow_run_id": "50100b30-e458-4632-ad7d-8dd383823376", "data": {"id": "89ed58ab-6157-499b-81b2-92b1336969a5", "node_id": "llm", "node_type": "llm", "title": "LLM", "index": 2, "predecessor_node_id": "1721365917005", "inputs": null, "created_at": 1724478013, "extras": {}}} ...
在回應中,Dify 推送文字答案和音訊資料。
文本答案範例行
data: {"event": "message", "conversation_id": "aa13eb24-e90a-4c5d-a36b-756f0e3be8f8", "message_id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "created_at": 1724301648, "task_id": "0643f770-e9d3-408f-b771-bb2e9430b4f9", "id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "answer": "MP"}
音訊資料範例行
data: {"event": "tts_message", "conversation_id": "aa13eb24-e90a-4c5d-a36b-756f0e3be8f8", "message_id": "5be739a9-09ba-4444-9905-a2f37f8c7a21", "created_at": 1724301648, "task_id": "0643f770-e9d3-408f-b771-bb2e9430b4f9", "audio": "//PkxABhvDm0DVp4ACUUfvWc1CFlh0tR9Oh7LxzHRsGBuGx155x3JqTJiwKKZf8wIcxpMzJU0h4zhgyQwwwIsgWQMAALQMkanBTjfCPgZwFsDOGGIYJoJoJoJoPQPQLYEgAOwM4SMXMW8TcNWGrEPEME0HoIQTg0DQNA0C5k7IOLeJuDnDVi5nWyJwgghAagQwTQQgJAGrDVibiFhqw1YR8HOEjBUA5AcgagQwTQTQQgJAAtgLYKsQ8hZc0PV7OrE4SgQgFIAsAQAwA6H0Uv4t4m4m49Yt4uYOQHIBkAyAqAkAuB0Mm6UeKxDGRrIODkByBqBNBCA1ARwHIEgBVg5wkY41W2GgdEVDFBNe+HicQw0ydk7HrHrIWXM62d48ePNfCkNATcTcNWGrCRhqxDxcwMYBwBkByCGC4EILgoJTQUDeW8W8TcTchZ1qBWIYchOBbBCA1AhgSMJGGrFzLmh6fL+LeBkAyAZAcgSAXAhB0Kxnj4YDkJwXA6FAzwj8IIJoJoPQXA6EPOcg4R8FOBnCRljRAwlwoh4EUwLhFTCVA+MR0R8wyxOhgAwwDgJjBUABMM0hMxBgnTPtMrMBEEcwJQCzIXIdMZMG821DmjDKHJAwLDKHRMQsJkwbwVRoFs//PkxEx5dDnwAZ7wANHgEUFJHGCUCQp3LWCQQYGAATI5QzwHBJF4UFktpfATT2l0goAGNADLOU64HAMCQCK50szABAIkDS2/j8gl6l6Di7QgBEiAfMEADBnyZBgeAWCMK4xvBbhoRZj1M+ktsNMTrMNcHEwHQEzAjAHMGQAQwRQZTBHALMGMDkzhh2jGhLtMgsMMwfhOzCnGLMMcKgwOw8pqHMoGtvdDzos0AIAiXIsBAmGsRFtYcBABmB0AUYjQfhhDAfjoCrETAGArMOAJ4iAAMCMFkwXwh5fffuhpYMhyP2bl3MVAJQrSYQDsna7G2+fx/GvyAwUQbTAdAFCAHVKyIAduTXHZZXDjNS57/VeVJ5+JBJ+0kATkCSells8/NBt/2/5Dj1s+chDBYSINutNS9FQwDwBWHjgASKRgAAJOyYC4Ao0CMNAKBgB6KK1hYBkAAHROM9mLsknb8avTcB0MerV6jl7llE70egOerRh9WcP/FoHqtVsO/In2f+G2tsdnH+L/KSSvBQB4OATam27Yi4jiBgBFOpq15bTQU6k1G4LoWo1mMAwDQwlBEzEnKsMkA7c5JYuTOzK2MvAbEysSPTM+dOOn1XEzGgIzXzmPODVvs1cyNTJxQ9MsAWwy//PkxDlz7DIMAd7gAek5EwnjcjX9QVN1N0czFyijQKOmMi4IYw8RvzFvCHMHYBQwdQlTRxVNvm8ycGjLYlMTAQ=="}
我們可以透過檢查事件屬性來區分音訊資料的 JSON 行。音頻 JSON 將 tts_message 作為值。音訊 mp3 二進位檔案以 Base64 格式儲存在 JSON 的音訊屬性中。
即時播放 TTS 音訊時遇到的第一個問題是 JSON 行被分成資料包,並且每個資料包都不是有效的 JSON 資料。
從中間切開的範例套件
euimRrhsPMZiMAl+BqSZMDmIkQEcDb/8+TEtHm8MhwA3p/p8dA0CCpAxwMMPABoYMIWwUDG6BRmiYZg2G6gRidGanOm5i5iaIYmfkH8Z/FmEopqJGZKXihYEIRxCKYKtlQuMvPjPQIwUVFFECDRnRCYEimGmA6cji41yQMImMEmhaHrVKpCxo2OYx6Q5RcJKAKkah4X6MckHEqdwKgHGHltDUjCy46HMgTCpwodAM8KijREwSSEk5hB4gRGFfC0ouYoeDiYtNREDgKQsTT6EI4egmMMBxpQZmoUJmAAg6YPDmQISgSECAZQOLfAUEQAG/dgxAVkxfFHGorEHB4CS+Yugwk2gq8akIwMsZIuIzUSrCAGm1iBnoYA8lcoYSlaIJ5RjCblwbsh8sB3skA7Gcx3zmSOKnXNJO6ObKklhuYjlVL1dSMhgwVJtFzMeWFufNKy3ODmCExBTUUzLjEwMKqqqqqqqqqqqqqqqqqqCIEWFIAA4DAWKkMDDIBA4lBqGDdmZwzAkGJFoYiwEV0IQOQHg1AATJiUM6F0z2fDE6PMvlc6DhTMJ+MNH4xWwzBwKMMCgHAwwUFQwjGEgMgovgIBMIMECYxYSDKAwSoMOBC4Ez682pEZIB8kBuiawZEaSnFAjIEwSFRxGUJIXMGRMmfNCPApcKL/8+TEiVdEKlJm5pM9gz0MyScwo04BgqjEFh489MGKVw=="}
封包從 JSON 行的中間開始。我們必須組合多個資料包才能獲得有效的 JSON 行。
第二個問題是 JSON 中的音訊資料區塊不是有效的音訊資料。資料在 mp3 幀的中間被剪切。
為了處理 JSON 和 mp3 的分割數據,我們必須採取一些聰明的方法。流程如下:
首先,我們必須取得有效的 JSON 數據,並在接收資料包時將其分割為 JSON。當我們得到一個以 n 結尾的資料包時,我們可以說到目前為止收到的資料包的串聯沒有在中間被切斷。偽代碼是這樣的
let packets = [] stream.on('data', (bytes) => { const text = bytes.toString() packets.push(text) if (text.endsWith('\n')) { // Extract audio data from the packets. const audioChunks = extractAudioChunks(packets.join('')) // Clear the packet array packets = [] } })
其次,我們必須將音訊區塊分割成 mp3 訊框。我們將音訊區塊連接成二進位檔案並找到其中的每個 mp3 幀。
const mp3Frames = [] const binaryToProcess = Buffer.concat([...audioChunks]) let frameStartIndex = 0 for (let i = 0; i < binaryToProcess.length; i += 1) { const currentByte = binaryToProcess[i] const nextByte = binaryToProcess[i + 1] // MP3 frame header always starts with eleven 1 bits. Checking 2 bytes. // It is a beginning of mp3 frame if current byte is 0xff and the beginning of the next byte is 111. // MP3 Spacification // http://www.mp3-tech.org/programmer/frame_header.html if (currentByte === 0xff && (nextByte & 0b11100000) === 0b11100000) { mp3Frames.push(binaryToProcess.subarray(frameStartIndex, i)) frameStartIndex = i } }
這不是分割成 mp3 幀的完整實現。在實際過程中,我們必須考慮當我們從音訊二進位檔案中提取 mp3 幀時存在剩餘位元組並在下一次迭代中使用剩餘位元組作為音訊位元組的開頭的情況。請檢查我的 Github 儲存庫以了解完整的實作。
以上是如何使用Dify API實現即時語音的詳細內容。更多資訊請關注PHP中文網其他相關文章!