Real-time Speech Recognition API
Feature Introduction
Designed for long-duration speech data streams, this API suits scenarios that require continuous recognition over extended periods, such as conference speeches and live video streaming.
Real-time speech recognition provides only a WebSocket (streaming) interface. A single connection theoretically supports up to about 37 hours of audio.
Request URL
ws://<ip_address>:7100/ws/v1

Interaction Process
1. Start and Send Parameters
The client initiates a request, and the server confirms the validity of the request. Parameters must be set within the request message.
Request Parameters (header object):
| Parameter | Type | Required | Description |
|---|---|---|---|
| namespace | String | Yes | The namespace to which the message belongs: SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Yes | Event name: StartTranscription indicates the start phase |
Request Parameters (payload object):
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| lang_type | String | Yes | Language code, refer to the "Language Support" section in the "Speech-to-Text Service User Manual" | None |
| format | String | No | Audio encoding format, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual". For call center (8 kHz) PCM audio, pass the value pcm_8000 | pcm |
| sample_rate | Integer | No | Sampling rate of the model (not of the audio itself) | 16000 |
| enable_intermediate_result | Boolean | No | Whether to return intermediate recognition results | false |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation | false |
| enable_inverse_text_normalization | Boolean | No | Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| max_sentence_silence | Integer | No | Sentence-break detection threshold: silence longer than this threshold is treated as a sentence break. Valid range [200, 5000], unit: milliseconds | 450 |
| enable_words | Boolean | No | Whether to enable returning word information in the final results, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| enable_modal_particle_filter | Boolean | No | Whether to enable modal particle filtering, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" | false |
| hotwords_id | String | No | Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol" | Null |
| hotwords_list | List<String> | No | Hotword list, effective only for this connection. When used in conjunction with the hotwords_id parameter, the hotwords_list takes precedence | Null |
| hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0] | 0.4 |
| correction_words_id | String | No | Forced correction words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Supports multiple IDs separated by a vertical bar (\|); all means use all IDs | Null |
| forbidden_words_id | String | No | Forbidden words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Supports multiple IDs separated by a vertical bar (\|); all means use all IDs | Null |
| gain | Integer | No | Amplitude gain factor, range [1, 20], see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". A value of 1 means no amplification, 2 doubles the original amplitude, and so on | 1 |
| enable_sse | Boolean | No | Whether to enable HTTP SSE streaming results, refer to the "Additional Features" section of this document | false |
| user_id | String | No | Custom user information, which will be returned unchanged in the response message, with a maximum length of 36 characters | Null |
| source_url | String | No | Audio source URL; when set, audio is fetched from this address as the input for speech recognition (the format parameter must be set to a supported audio encoding format). Supports RTSP streams, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual" | Null |
| enable_lang_label | Boolean | No | Return the language code in recognition results when the language switches; only effective for mixed-language models that include English (e.g., Japanese-English, Chinese-English). Note: enabling this feature may delay responses when the language switches | false |
| paragraph_condition | Integer | No | Paragraph length control: once the set character count is reached, the next sentence within the same speaker_id starts a new paragraph number. Range [100, 2000]; values outside the range disable this feature | 0 |
| decode_silence_segment | Boolean | No | Control whether to perform speech recognition processing on the silent segments determined by VAD, suitable for far-field recording environments (supported in version 2.5.11 and above). | false |
Example of a request:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "StartTranscription"
},
"payload": {
"lang_type": "ja-JP",
"format": "pcm",
"sample_rate": 16000,
"enable_intermediate_result": true,
"enable_punctuation_prediction": true,
"enable_inverse_text_normalization": true,
"max_sentence_silence": 800,
"enable_words":true,
"user_id":"conversation_001"
}
}

Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs: SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Event name: TranscriptionStarted indicates the initiation phase |
| status | String | Status code |
| status_text | String | Explanation of the status code |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting. |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Example of a response:
{
"header":{
"namespace":"SpeechTranscriber",
"name":"TranscriptionStarted",
"appkey":"",
"status":"00000",
"status_text":"success",
"task_id":"0220a729ac9d4c9997f51592ecc83847",
"message_id":"49b680abe737488cf50f3cd9e3953b97",
"user_id":"conversation_001"
},
"payload":{
"index":0,
"time":0,
"begin_time":0,
"speaker_id":"",
"result":"",
"confidence":0,
"words":null
}
}

2. Sending Audio Data and Receiving Recognition Results
Note:
If the source_url parameter is set during the connection, the system automatically retrieves audio data from the specified source, and you do not need to send audio data as described in this section.
Send audio data in a loop and continuously receive recognition results. It is recommended to send data packets of 7,680 bytes at a time.
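As a concrete illustration, here is a minimal Python sketch of the connection, the StartTranscription request from step 1, and this send loop. It uses the third-party websocket-client package; the server address, audio file, and payload values are illustrative, and the concurrent receive and heartbeat logic described elsewhere in this document is omitted:

```python
import json
import time

import websocket  # third-party package: pip install websocket-client

ws = websocket.create_connection("ws://127.0.0.1:7100/ws/v1")  # illustrative address

# Step 1: StartTranscription with a minimal payload (see the parameter table above).
ws.send(json.dumps({
    "header": {"namespace": "SpeechTranscriber", "name": "StartTranscription"},
    "payload": {"lang_type": "ja-JP", "format": "pcm", "sample_rate": 16000},
}))
started = json.loads(ws.recv())
assert started["header"]["name"] == "TranscriptionStarted"
task_id = started["header"]["task_id"]  # record for troubleshooting

# Step 2: stream raw PCM in the recommended 7,680-byte packets.
with open("audio.pcm", "rb") as f:  # illustrative file: 16 kHz, 16-bit, mono PCM
    while chunk := f.read(7680):
        ws.send_binary(chunk)
        time.sleep(0.24)  # 7,680 bytes ≈ 240 ms of 16 kHz 16-bit mono audio
```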
The Real-time Speech Transcription service automatically breaks sentences, determining the beginning and end of each sentence from the length of silence between utterances; these boundaries are represented as events in the returned results. The SentenceBegin and SentenceEnd events indicate the start and end of a sentence, respectively, while the TranscriptionResultChanged event carries the intermediate recognition results of a sentence.
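In a real client these events are received concurrently with the send loop above (for example, on a separate thread). A sketch of a dispatch loop over the events detailed below, reusing the ws connection from the previous sketch:

```python
import json
import threading

def receive_events(ws) -> None:
    # Consume server events until the transcription completes or fails.
    while True:
        msg = json.loads(ws.recv())
        name = msg["header"]["name"]
        payload = msg.get("payload") or {}
        if name == "SentenceBegin":
            print(f"sentence {payload['index']} begins at {payload['begin_time']} ms")
        elif name == "TranscriptionResultChanged":
            print("intermediate:", payload["result"])
        elif name == "SentenceEnd":
            print("final:", payload["result"])
        elif name in ("TranscriptionCompleted", "TaskFailed"):
            break

threading.Thread(target=receive_events, args=(ws,), daemon=True).start()
```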
SentenceBegin Event
The SentenceBegin event indicates that the server has detected the beginning of a sentence.
Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs, SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Message name, SentenceBegin indicates the start of a sentence |
| status | String | Status code, indicating whether the request was successful, see service status codes |
| status_text | String | Status message |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| lang_type | String | Returns the language code when enable_lang_label is enabled |
| paragraph | Integer | Paragraph number, starting from 1 and incrementing |
| index | Integer | Sentence number, starting from 1 and incrementing |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Speaker number or network sound card channel number, see "Additional Features - Speaker ID" and "Additional Features - Network Sound Card" sections below |
| result | String | Recognition result, which may be empty |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| volume | Integer | The current volume, range [0, 100] |
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SentenceBegin",
"status": "00000",
"status_text": "success",
"task_id": "3ee284e922dd4554bb6ccda7989d1973",
"message_id": "9b680abe73748f50f83cd9e3953b974c",
"user_id":"conversation_001"
},
"payload": {
"lang_type": "ja-JP",
"paragraph": 1,
"index": 1,
"time": 240,
"begin_time": 0,
"speaker_id": "",
"result": "",
"confidence": 0,
"volume": 0
}
}TranscriptionResultChanged Event
The recognition results are divided into "intermediate results" and "final results". For detailed explanations, please refer to the "Basic Terms" section of the "Speech-to-Text Service User Manual".
The TranscriptionResultChanged event indicates that there has been a change in the recognition results, i.e., the intermediate results of a sentence.
- If enable_intermediate_result is set to true, the server continues to return multiple TranscriptionResultChanged messages, which are the intermediate recognition results.
- If enable_intermediate_result is set to false, the server does not return any messages for this step.
Note:
The last intermediate result obtained may not be the same as the final result. Take the result carried by the SentenceEnd event as the final recognition result.
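A common consumption pattern that respects this caveat is to treat each TranscriptionResultChanged result as a full replacement of the in-progress sentence and to commit text only on SentenceEnd; a sketch:

```python
transcript: list[str] = []  # committed sentences (SentenceEnd results)
pending = ""                # display-only text for the sentence in progress

def on_result(name: str, payload: dict) -> None:
    global pending
    if name == "TranscriptionResultChanged":
        pending = payload["result"]           # overwrite, never append
    elif name == "SentenceEnd":
        transcript.append(payload["result"])  # commit the final result
        pending = ""
```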
Response Parameters (header object):
The header object parameters are the same as in the SentenceBegin event, with name being TranscriptionResultChanged to indicate an intermediate recognition result for a sentence.
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| lang_type | String | Returns the language code when enable_lang_label is enabled |
| paragraph | Integer | Paragraph number, starting from 1 and incrementing |
| index | Integer | Sentence number, starting from 1 and incrementing |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Speaker number, see "Additional Features - Speaker ID" sections below |
| result | String | The intermediate recognition result of this sentence |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| volume | Integer | The current volume, range [0, 100] |
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "TranscriptionResultChanged",
"status": "00000",
"status_text": "success",
"task_id": "3ee284e922dd4554bb6ccda7989d1973",
"message_id": "749b680abe733cd488cf50f9e3953b97",
"user_id":"conversation_001"
},
"payload": {
"lang_type": "ja-JP",
"paragraph": 1,
"index": 1,
"time": 1920,
"begin_time": 0,
"speaker_id": "",
"result": "天気",
"confidence": 1,
"volume": 79,
"words": [
{
"word": "天気",
"start_time": 0,
"end_time": 1920,
"stable": false
}
]
}
}SentenceEnd Event
The SentenceEnd event signifies that the service has detected the end of a spoken sentence and returns the final transcription result for that sentence.
Response Parameters (header object):
The header object parameters are consistent with those described in the SentenceBegin event, with the name attribute being SentenceEnd to indicate that the end of a sentence was recognized.
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| lang_type | String | Returns the language code when enable_lang_label is enabled |
| paragraph | Integer | Paragraph number, starting from 1 and incrementing |
| index | Integer | Sentence number, starting from 1 and incrementing |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Speaker number or network sound card channel number, see "Additional Features - Speaker ID" and "Additional Features - Network Sound Card" sections below |
| result | String | The final recognition result for this sentence |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| words | Dict[] | Final result word information for this sentence, only returned when enable_words is set to true |
| volume | Integer | The current volume, range [0, 100] |
The structure of the final result word information object is as follows:
| Parameter | Type | Description |
|---|---|---|
| word | String | The text of the word. |
| start_time | Integer | The start time of the word, in milliseconds. |
| end_time | Integer | The end time of the word, in milliseconds. |
| type | String | The type of the word: normal indicates regular text, forbidden indicates sensitive words, modal indicates modal particles, punc indicates punctuation marks |
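As an illustration, a consumer could use the type field to rebuild a cleaned-up transcript; the masking policy below is a hypothetical example, not service behavior:

```python
def clean_text(words: list) -> str:
    parts = []
    for w in words:
        if w["type"] == "modal":
            continue                            # drop modal particles
        elif w["type"] == "forbidden":
            parts.append("*" * len(w["word"]))  # mask sensitive words
        else:                                   # "normal" text and "punc" marks
            parts.append(w["word"])
    return "".join(parts)
```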
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SentenceEnd",
"status": "00000",
"status_text": "success",
"task_id": "3ee284e922dd4554bb6ccda7989d1973",
"message_id": "749b680abe737488cf50f3cd9e3953b9",
"user_id":"conversation_001"
},
"payload": {
"lang_type": "zh-cmn-Hans-CN",
"paragraph": 1,
"index": 1,
"time": 5670,
"begin_time": 390,
"speaker_id": "speaker1",
"result": "天気がいいから、散歩しましょう。",
"confidence": 0.9,
"volume": 76,
"words": [{
"word": "天気",
"start_time": 390,
"end_time": 1110,
"type": "normal"
}, {
"word": "が",
"start_time": 1110,
"end_time": 1440,
"type": "normal"
}, {
"word": "いい",
"start_time": 1440,
"end_time": 2130,
"type": "normal"
}, {
"word": "から",
"start_time": 2160,
"end_time": 3570,
"type": "normal"
}, {
"word": "、",
"start_time": 4290,
"end_time": 4860,
"type": "punc"
},
…(omitted)…
]
}
}

Additional Features
The following three additional features are only available in Real-time Speech Transcription.
Note:
If the source_url parameter is set during the connection, the forced sentence breaking and custom speaker numbering features are not available.
Forced Sentence Breaking
During the transmission of audio data, sending a SentenceEnd event will force the server to break the sentence at the current position. After the server processes the sentence breaking, the client will receive a SentenceEnd message with the final recognition result for that sentence. This feature is not available when using a network sound card.
Example of sending:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SentenceEnd"
}
}

Custom Speaker Numbering
Before sending audio data, send a SpeakerStart event and specify speaker_id as the speaker number. The server treats the audio data between one SpeakerStart event and the next SpeakerStart or StopTranscription event as belonging to the specified speaker, and returns this information in the speaker_id field of the recognition result. This feature is not available when using a network sound card.
Note:
The speaker_id supports up to 36 characters; any excess is truncated and discarded. If the speaker_id parameter is not passed in the SpeakerStart event, the speaker_id in the result will be empty. The SpeakerStart event triggers a forced sentence break, so send a SpeakerStart event only when switching speakers.
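Expressed as client code, the switching logic might look like the following sketch, which reuses the ws connection from the earlier sketches; the packet-level flow it produces is shown in the example that follows:

```python
import json

def send_speaker_start(ws, speaker_id: str) -> None:
    # Triggers a forced sentence break, so call it only when the speaker changes.
    ws.send(json.dumps({
        "header": {"namespace": "SpeechTranscriber", "name": "SpeakerStart"},
        "payload": {"speaker_id": speaker_id},  # at most 36 characters
    }))

def stream_with_speakers(ws, labelled_audio) -> None:
    """labelled_audio: iterable of (speaker_id, chunks) pairs, e.g. per-channel audio."""
    for speaker_id, chunks in labelled_audio:
        send_speaker_start(ws, speaker_id)
        for chunk in chunks:
            ws.send_binary(chunk)
```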
Example of sending:
Angle brackets <> indicate audio data packets, and curly braces {} indicate JSON data packets.
<Binary Audio Data Packet 1>
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SpeakerStart"
},
"payload": {
"speaker_id": "001"
}
}
<Binary Audio Data Packet 2>
<...>
<Binary Audio Data Packet n>
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SpeakerStart"
},
"payload": {
"speaker_id": "002"
}
}
<Binary Audio Data Packet n+1>
<Binary Audio Data Packet n+2>

SSE Method for Returning Results
When enable_sse is set to true, after establishing a Real-time Speech Transcription WebSocket connection and receiving the TranscriptionStarted event, another client can retrieve the recognition results using the following HTTP request:
GET http://{server_ip}:7100/getAsrResult?task_id=******

The recognition results are streamed back via Server-Sent Events (SSE), in the same format as the content returned over WebSocket. The connection is automatically terminated when recognition ends.
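For example, a Python consumer of this endpoint might look like the sketch below, which assumes the stream uses standard data: SSE framing (requests is a third-party package):

```python
import json

import requests  # third-party package: pip install requests

def follow_results(server_ip: str, task_id: str) -> None:
    url = f"http://{server_ip}:7100/getAsrResult"
    # stream=True keeps the HTTP connection open while events arrive.
    with requests.get(url, params={"task_id": task_id}, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.startswith("data:"):
                event = json.loads(line[len("data:"):].strip())
                print(event["header"]["name"], event.get("payload", {}).get("result"))
```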
Quick Test Example (Linux):
curl -X GET 'http://localhost:7100/getAsrResult?task_id=******'

Quick Test Example (Windows):
curl.exe -X GET "http://localhost:7100/getAsrResult?task_id=******"

3. Maintaining Connection (Heartbeat Mechanism)
When not sending audio, a heartbeat packet must be sent at least once every 10 seconds; otherwise, the connection is automatically terminated. It is recommended that the client send a heartbeat every 8 seconds.
Note:
Heartbeat packets should be sent only after the StartRecognition or StartTranscription event has been sent.
If the source_url parameter is set during the connection, the heartbeat mechanism does not apply.
Example of sending:
{
"header":{
"namespace":"SpeechTranscriber",
"name":"Ping"
}
}

Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "Pong",
"task_id": "71c5cb9b-fbc3-4489-843c-e902b102a569",
"message_id": "6f9ea191-1624-4d3c-9286-0003d323f731"
},
"payload": {}
}

If no data is transmitted to the server within 10 seconds, the server will return an error message and then automatically disconnect the connection.
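A sketch of a background heartbeat along these lines, reusing the ws connection from the earlier sketches and pinging every 8 seconds as recommended; stop_event is an illustrative flag that the sender sets when audio transmission resumes or the session ends:

```python
import json
import threading

PING = json.dumps({"header": {"namespace": "SpeechTranscriber", "name": "Ping"}})

def keep_alive(ws, stop_event: threading.Event) -> None:
    # wait() returns False on timeout, so this pings every 8 seconds until stopped.
    while not stop_event.wait(timeout=8):
        ws.send(PING)

stop_event = threading.Event()
threading.Thread(target=keep_alive, args=(ws, stop_event), daemon=True).start()
```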
4. Stop and Retrieve Final Results
The client sends a request to stop Real-time Speech Transcription, notifying the server that the transmission of voice data has ended and to terminate speech recognition. The server returns the final recognition results and then automatically disconnects the connection.
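Continuing the earlier Python sketches, this step might look like the following; in a real client the dispatch loop from step 2 would usually consume the TranscriptionCompleted event instead. The request and response messages exchanged here are specified below:

```python
import json

def stop_transcription(ws) -> dict:
    """Send StopTranscription, then wait for the final TranscriptionCompleted event."""
    ws.send(json.dumps({
        "header": {"namespace": "SpeechTranscriber", "name": "StopTranscription"},
    }))
    while True:
        msg = json.loads(ws.recv())
        if msg["header"]["name"] in ("TranscriptionCompleted", "TaskFailed"):
            ws.close()
            return msg
```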
Request Parameters (header object):
| Parameter | Type | Required | Description |
|---|---|---|---|
| namespace | String | Yes | The namespace to which the message belongs: SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Yes | Event name: StopTranscription indicates terminating Real-time Speech Transcription |
Example of a request:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "StopTranscription"
}
}

Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs, SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | The name of the message, TranscriptionCompleted indicates that the transcription is complete |
| status | String | Status code, indicating whether the request was successful, see service status codes |
| status_text | String | Status message |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Response Parameters (payload object): The format is the same as the SentenceEnd event, but the result and words fields may be empty.
Example of a response:
{
"header":{
"namespace":"SpeechTranscriber",
"name":"TranscriptionCompleted",
"status":"00000",
"status_text":"success",
"task_id":"3ee284e922dd4554bb6ccda7989d1973",
"message_id":"7e729bf2d4064fee83143c4d962dc6f1",
"user_id":""
},
"payload":{
"index":1,
"time":4765,
"begin_time":180,
"speaker_id":"",
"result":"",
"confidence": 0,
"volume": 0,
"words":[]
}
}

5. Error Messages
Error messages are returned through the WebSocket connection, after which the server automatically closes the connection.
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "TaskFailed",
"status": "20105",
"status_text": "JSON serialization failed",
"task_id": "df2c1604e31d4f46a7a064db73cd3b5e",
"message_id": "",
"user_id": ""
},
"payload": {
"index": 1,
"time": 0,
"begin_time": 0,
"speaker_id": "",
"result": "",
"confidence": 1,
"volume": 0,
"words": null
}
}

Service Status Codes
| Status Code | Reason |
|---|---|
| 20001 | Parameter parsing failed |
| 20002, 20003 | File processing failed |
| 20111 | WebSocket upgrade failed |
| 20114 | Request body is empty |
| 20115 | Audio size/duration limit exceeded |
| 20116 | Unsupported sample_rate |
| 20190 | Missing parameter |
| 20191 | Invalid parameter |
| 20192 | Processing failed (decoder) |
| 20193 | Processing failed (service) |
| 20194 | Connection timed out (no data received from the client for a long time) |
| 20195 | Other error |