Short Speech Recognition API
Feature Introduction
Recognizes short voice inputs of up to 60 seconds. It suits scenarios such as conversational chat and control commands, where short audio clips need to be converted to text.
WebSocket API (Streaming)
Request URL
ws://<ip_address>:7100/ws/v1
Interaction Process
1. Start and Send Parameters
The client initiates a request, and the server confirms the validity of the request. Parameters must be set within the request message.
Request Parameters (header object):
| Parameter | Type | Required | Description |
|---|---|---|---|
| namespace | String | Yes | The namespace to which the message belongs: SpeechRecognizer indicates One-Sentence Recognition |
| name | String | Yes | Event name: StartRecognition indicates the start phase |
Request Parameters (payload object):
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| lang_type | String | Yes | Language code, refer to the "Language Support" section in the "Speech-to-Text Service User Manual" | Required |
| format | String | No | Audio encoding format, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual". For call center (8 kHz) PCM audio, pass the value pcm_8000 | pcm |
| sample_rate | Integer | No | Sampling rate of the Model (rather than the audio itself) | 16000 |
| enable_intermediate_result | Boolean | No | Whether to return intermediate recognition results | false |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation | false |
| enable_inverse_text_normalization | Boolean | No | Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| enable_words | Boolean | No | Whether to enable returning word information in the final results, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| enable_modal_particle_filter | Boolean | No | Whether to enable modal particle filtering, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" | false |
| hotwords_id | String | No | Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol" | Null |
| hotwords_list | List<String> | No | Hotword list, effective only for this connection. When used in conjunction with the hotwords_id parameter, the hotwords_list takes precedence | Null |
| hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0] | 0.4 |
| correction_words_id | String | No | Forced correction words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Supports multiple IDs separated by a vertical bar (\|); all indicates using all IDs | Null |
| forbidden_words_id | String | No | Forbidden words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Supports multiple IDs separated by a vertical bar (\|); all indicates using all IDs | Null |
| gain | Integer | No | Amplitude gain factor, range [1, 20], see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". The value 1 indicates no amplification, 2 indicates double the original amplitude, and so on | 1 |
| max_suffix_silence | Integer | No | Post-speech silence detection threshold in seconds, range [1, 10]. If the silence at the end of a sentence lasts longer than this threshold, recognition terminates automatically. When the value is 0 or the parameter is not provided, post-speech silence detection is disabled | 0 |
| user_id | String | No | Custom user information, which will be returned unchanged in the response message, with a maximum length of 36 characters | Null |
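As a rough illustration of how a client might assemble the start message from the two tables above, the following Python sketch builds the JSON and checks the documented value ranges before anything is sent. The function name `build_start_message` and the constants are hypothetical helpers, not part of the service; only the field names and ranges come from the tables.

```python
import json

# Ranges copied from the payload table; the names below are illustrative
# helpers and not part of the service itself.
HOTWORDS_WEIGHT_RANGE = (0.1, 1.0)
GAIN_RANGE = (1, 20)
MAX_USER_ID_LENGTH = 36


def build_start_message(lang_type, **options):
    """Return the StartRecognition message as a JSON string."""
    if not lang_type:
        raise ValueError("lang_type is required")

    weight = options.get("hotwords_weight")
    if weight is not None and not (
        HOTWORDS_WEIGHT_RANGE[0] <= weight <= HOTWORDS_WEIGHT_RANGE[1]
    ):
        raise ValueError("hotwords_weight must be within [0.1, 1.0]")

    gain = options.get("gain")
    if gain is not None and not GAIN_RANGE[0] <= gain <= GAIN_RANGE[1]:
        raise ValueError("gain must be within [1, 20]")

    silence = options.get("max_suffix_silence")
    # 0 disables post-speech silence detection; otherwise 1 to 10 seconds.
    if silence is not None and silence != 0 and not 1 <= silence <= 10:
        raise ValueError("max_suffix_silence must be 0 or within [1, 10]")

    user_id = options.get("user_id")
    if user_id is not None and len(user_id) > MAX_USER_ID_LENGTH:
        raise ValueError("user_id must be at most 36 characters")

    message = {
        "header": {"namespace": "SpeechRecognizer", "name": "StartRecognition"},
        "payload": {"lang_type": lang_type, **options},
    }
    return json.dumps(message, ensure_ascii=False)
```

Validating on the client side is a design choice rather than a requirement; the server presumably rejects out-of-range values itself, but a local check fails earlier and without a round trip.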
Example of a request:
{
"header": {
"namespace": "SpeechRecognizer",
"name": "StartRecognition"
},
"payload": {
"lang_type": "ja-JP",
"format": "pcm",
"sample_rate": 16000,
"enable_intermediate_result": true,
"enable_punctuation_prediction": true,
"enable_inverse_text_normalization": true,
"enable_words":true,
"user_id":"conversation_001"
}
}
Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs: SpeechRecognizer indicates One-sentence Recognition |
| name | String | Event name: RecognitionStarted indicates the initiation phase |
| status | String | Status code |
| status_text | String | Explanation of the status code |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting. |
| user_id | String | The user_id passed in when the connection was established |
Example of a response:
{
"header":{
"namespace":"SpeechRecognizer",
"name":"RecognitionStarted",
"appkey":"",
"status":"00000",
"status_text":"success",
"task_id":"0220a729ac9d4c9997f51592ecc83847",
"message_id":"",
"user_id":"conversation_001"
},
"payload":{
"paragraph": 0,
"index":0,
"time":0,
"begin_time":0,
"speaker_id":"",
"result":"",
"confidence":0,
"volume": 0,
"words":null
}
}
2. Sending Audio Data and Receiving Recognition Results
Send audio data in a loop and continuously receive recognition results. It is recommended to send data packets of 7680 bytes each time.
The recognition results are divided into "intermediate results" and "final results". For detailed explanations, please refer to the "Basic Terminology" section of the "Speech-to-Text Service User Manual".
The RecognitionResultChanged event indicates that the recognition result has changed, i.e., it carries an intermediate result of the sentence.
- If enable_intermediate_result is set to true, the server keeps returning RecognitionResultChanged messages, which carry the intermediate recognition results.
- If enable_intermediate_result is set to false, the server does not return any messages for this step.
Note:
The last intermediate result may differ from the final result. Take the result corresponding to the SentenceEnd event as the final recognition result.
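The recommended packet size of 7680 bytes can be applied with a small helper like the one below. This is a minimal sketch: the names `iter_packets` and `packet_duration_ms` are hypothetical, and the duration calculation assumes 16-bit mono PCM, which this document does not state explicitly.

```python
PACKET_SIZE = 7680  # bytes per packet, the size recommended in this section


def iter_packets(audio, packet_size=PACKET_SIZE):
    """Yield consecutive slices of `audio`; the last one may be shorter."""
    for offset in range(0, len(audio), packet_size):
        yield audio[offset:offset + packet_size]


def packet_duration_ms(packet_size=PACKET_SIZE, sample_rate=16000,
                       bytes_per_sample=2):
    """Audio duration of one packet in milliseconds.

    Assumption: mono PCM with `bytes_per_sample` bytes per sample.
    """
    return packet_size * 1000 // (bytes_per_sample * sample_rate)
```

Under those assumptions one packet covers 240 ms of audio at 16000 Hz, so a client streaming from a microphone would produce roughly four packets per second.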
Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs, SpeechRecognizer indicates One-sentence Recognition |
| name | String | Message name, RecognitionResultChanged indicates the intermediate results of a sentence |
| status | String | Status code, indicating whether the request was successful, see service status codes |
| status_text | String | Status message |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| index | Integer | Sentence number, starting from 1 and incrementing |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Always null for One-sentence Recognition |
| result | String | The intermediate recognition result of this sentence |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| volume | Integer | The current volume, range [0, 100] |
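One way a client could consume these messages is to keep only the latest intermediate text and overwrite it whenever a newer message arrives. The sketch below is illustrative: the class `ResultTracker` is a hypothetical helper, and it assumes that `status` arrives as the string "00000" on success (as in the response examples in this document) and that the final message is named RecognitionCompleted.

```python
import json


class ResultTracker:
    """Keeps the latest intermediate text and the final text, if any."""

    def __init__(self):
        self.intermediate = ""
        self.final = None

    def feed(self, raw_message):
        """Update state from one server message and return its event name."""
        message = json.loads(raw_message)
        header = message["header"]
        # Assumption: success is reported as the string "00000".
        if header["status"] != "00000":
            raise RuntimeError(
                f"{header['status']}: {header.get('status_text', '')}"
            )
        name = header["name"]
        payload = message.get("payload") or {}
        if name == "RecognitionResultChanged":
            # Each intermediate result replaces the previous one.
            self.intermediate = payload.get("result", "")
        elif name == "RecognitionCompleted":
            self.final = payload.get("result", "")
        return name
```

A user interface would typically display `intermediate` while the speaker is still talking and switch to `final` once it is set.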
Example of a response:
{
"header": {
"namespace": "SpeechRecognizer",
"name": "RecognitionResultChanged",
"status": "00000",
"status_text": "success",
"task_id": "0220a729ac9d4c9997f51592ecc83847",
"message_id": "43u134hcih2lcp7q1c94dhm5ic2op9l2",
"user_id":"conversation_001"
},
"payload": {
"index": 1,
"time": 1920,
"begin_time": 0,
"speaker_id": "",
"result": "天気",
"confidence": 1,
"volume": 79,
"words": []
}
}
3. Stop and Retrieve Final Results
The client sends a request to stop One-sentence Recognition, notifying the server that the transmission of voice data has ended and that speech recognition should terminate. The server returns the final recognition results and then automatically closes the connection.
Request Parameters (header object):
| Parameter | Type | Required | Description |
|---|---|---|---|
| namespace | String | Yes | The namespace to which the message belongs: SpeechRecognizer indicates One-Sentence Recognition |
| name | String | Yes | Event name: StopRecognition indicates terminating One-Sentence Recognition |
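The three steps of the interaction process (start, send audio, stop) can be combined into one coroutine. The sketch below is written against any connection object that offers asynchronous `send()` and `recv()` methods, which is the interface exposed by the third-party `websockets` package; the function name `recognize` is hypothetical. It also assumes that audio is sent as binary frames and that server messages can be read after all audio has been sent, neither of which is spelled out in this document.

```python
import json


async def recognize(ws, audio, payload, packet_size=7680):
    """Run one recognition session on `ws` and return the final text.

    `ws` is any object with async send() and recv() methods.
    """
    header = {"namespace": "SpeechRecognizer"}

    # Step 1: start and wait for the server's confirmation.
    await ws.send(json.dumps({
        "header": {**header, "name": "StartRecognition"},
        "payload": payload,
    }))
    started = json.loads(await ws.recv())
    if started["header"]["status"] != "00000":
        raise RuntimeError(started["header"]["status_text"])

    # Step 2: send the audio as binary packets (assumed framing).
    for offset in range(0, len(audio), packet_size):
        await ws.send(audio[offset:offset + packet_size])

    # Step 3: stop, then read until the final result arrives.
    await ws.send(json.dumps({"header": {**header, "name": "StopRecognition"}}))
    while True:
        message = json.loads(await ws.recv())
        if message["header"]["status"] != "00000":
            raise RuntimeError(message["header"]["status_text"])
        if message["header"]["name"] == "RecognitionCompleted":
            return message["payload"]["result"]
```

Intermediate results are skipped here for brevity. A client that needs them would read messages concurrently with sending instead of waiting until the end.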
Example of a request:
{
"header": {
"namespace": "SpeechRecognizer",
"name": "StopRecognition"
}
}
Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs: SpeechRecognizer indicates One-Sentence Recognition |
| name | String | The name of the message, RecognitionCompleted indicates that the recognition is complete |
| status | String | Status code, indicating whether the request was successful, see service status codes |
| status_text | String | Status message |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| index | Integer | Always 1 for One-sentence Recognition |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Always null for One-sentence Recognition |
| result | String | The final recognition result of this sentence |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| words | Dict[] | Final result word information for this sentence, only returned when enable_words is set to true |
| volume | Integer | The current volume, range [0, 100] |
The structure of the final result word information object is as follows:
| Parameter | Type | Description |
|---|---|---|
| word | String | The text of the word. |
| start_time | Integer | The start time of the word, in milliseconds. |
| end_time | Integer | The end time of the word, in milliseconds. |
| type | String | The type of the word: normal indicates regular text, forbidden indicates sensitive words, modal indicates modal particles, punc indicates punctuation marks |
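The word information can be post-processed on the client, for example to extract timings for regular words or to rebuild the text without modal particles. The following sketch uses hypothetical helper names and relies only on the four fields listed in the table above. Joining without a separator suits languages written without spaces, such as Japanese; other languages would likely need a different join.

```python
def spoken_words(words):
    """Return (text, start_ms, end_ms) for entries whose type is "normal"."""
    return [
        (w["word"], w["start_time"], w["end_time"])
        for w in (words or [])  # `words` may be null in some responses
        if w["type"] == "normal"
    ]


def join_words(words, skip_types=("modal",)):
    """Concatenate word texts, leaving out the given word types."""
    return "".join(
        w["word"] for w in (words or []) if w["type"] not in skip_types
    )
```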
Example of a response:
{
"header": {
"namespace": "SpeechRecognizer",
"name": "RecognitionCompleted",
"status": "00000",
"status_text": "success",
"task_id": "0220a729ac9d4c9997f51592ecc83847",
"message_id": "45kbrouk4yvz81fjueyao2s7y7o6gjz6",
"user_id":"conversation_001"
},
"payload": {
"index": 1,
"time": 5292,
"begin_time": 0,
"speaker_id": "",
"result": "天気がいいから、散歩しましょう。",
"confidence": 0.9,
"volume": 76,
"words": [{
"word": "天気",
"start_time": 390,
"end_time": 1110,
"type": "normal"
}, {
"word": "が",
"start_time": 1110,
"end_time": 1440,
"type": "normal"
}, {
"word": "いい",
"start_time": 1440,
"end_time": 2130,
"type": "normal"
}, {
"word": "から",
"start_time": 2160,
"end_time": 3570,
"type": "normal"
}, {
"word": "、",
"start_time": 4290,
"end_time": 4860,
"type": "punc"
},
... (omitted) ...
]
}
}
HTTP API (Non-streaming)
HTTP Request Line
| Protocol | URL | Method |
|---|---|---|
| HTTP/1.1 | http://<ip_address>:7100/api/v1 | POST |
Request Headers
HTTP request headers consist of key/value pairs, one pair per line, with the key and value separated by a colon (:). The settings are as follows:
| Name | Type | Required | Description |
|---|---|---|---|
| Content-Type | String | Yes | Must be "application/octet-stream", indicating that the data in the HTTP body is binary |
Request Parameters
The client sends a One-sentence Recognition request, with the parameters set in the URL query string. The meanings of the parameters are as follows:
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| lang_type | String | Yes | Language code, refer to the "Language Support" section in the "Speech-to-Text Service User Manual" | Required |
| format | String | No | Audio encoding format, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual" | pcm |
| sample_rate | Integer | No | Audio sampling rate, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | 16000 |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation | false |
| enable_inverse_text_normalization | Boolean | No | Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| enable_modal_particle_filter | Boolean | No | Whether to enable modal particle filtering, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" | false |
| hotwords_id | String | No | Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol" | Null |
| hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0] | 0.4 |
| correction_words_id | String | No | Forced correction words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Supports multiple IDs separated by a vertical bar (\|); all indicates using all IDs | Null |
| forbidden_words_id | String | No | Forbidden words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Supports multiple IDs separated by a vertical bar (\|); all indicates using all IDs | Null |
| gain | Integer | No | Amplitude gain factor, range [1, 20], see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". The value 1 indicates no amplification, 2 indicates double the original amplitude, and so on | 1 |
Request Body
The HTTP request body contains binary audio data, and the Content-Type in the HTTP request header must be set to application/octet-stream.
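For clients that prefer not to shell out to curl, the request can also be assembled with Python's standard library. This is a sketch under stated assumptions: the function names are hypothetical, port 7100 and the path /api/v1 are taken from the request line table, and boolean parameters are written as lowercase true/false to match the curl example in this section.

```python
import json
import urllib.parse
import urllib.request


def build_request(host, audio, **params):
    """Build (but do not send) the POST request for one recognition call."""
    # Booleans become lowercase "true"/"false", matching the curl example.
    query = urllib.parse.urlencode({
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    })
    return urllib.request.Request(
        f"http://{host}:7100/api/v1?{query}",
        data=audio,  # raw binary audio goes in the body
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def recognize_http(host, audio, **params):
    """Send the request and return the decoded JSON body as a dict."""
    request = build_request(host, audio, **params)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating `build_request` from `recognize_http` keeps the URL and header logic checkable without a running server. Handling of non-2xx HTTP responses is left out of this sketch.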
Example of Request
curl --location --request POST 'http://127.0.0.1:7100/api/v1?lang_type=ja-JP&format=pcm&sample_rate=16000&enable_punctuation_prediction=true&enable_inverse_text_normalization=true' \
--header 'Content-Type: application/octet-stream' \
--data-binary '@audio.pcm'
Response
The response is returned in the HTTP body. Its fields are as follows:
| Parameter | Type | Description |
|---|---|---|
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting |
| result | String | The recognition result of this sentence |
| status | String | Status code, indicating whether the request was successful, see service status codes |
| message | String | Status message |
Successful Response
The status field in the body is 00000.
{
"task_id": "cf7b0c5339244ee29cd4e43fb97fd52e",
"result": "天気がいいから、散歩しましょう。",
"status":"00000",
"message":"SUCCESS"
}
Error Response
Any response whose status field in the body is not 00000 is an error response. This field can be used to determine whether the request was successful.
Service Status Codes
| Status Code | Reason |
|---|---|
| 20001 | Parameters parsing failed |
| 20002, 20003 | File processing failed |
| 20111 | WebSocket upgrading failed |
| 20114 | Request body is empty |
| 20115 | Audio size/duration limit exceeded |
| 20116 | Unsupported sample_rate |
| 20190 | Missing parameter |
| 20191 | Invalid parameter |
| 20192 | Processing failed (decoder) |
| 20193 | Processing failed (service) |
| 20194 | Connection timed out (no data received from the client for a long time) |
| 20195 | Other error |
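A client can turn the table above into a lookup so that failures surface with a readable reason. The sketch below is illustrative only: `RecognitionError` and `check_status` are hypothetical names, and the status is compared as a string because the response examples in this document show it as "00000".

```python
# Reasons copied from the status code table above.
STATUS_REASONS = {
    "20001": "Parameters parsing failed",
    "20002": "File processing failed",
    "20003": "File processing failed",
    "20111": "WebSocket upgrading failed",
    "20114": "Request body is empty",
    "20115": "Audio size/duration limit exceeded",
    "20116": "Unsupported sample_rate",
    "20190": "Missing parameter",
    "20191": "Invalid parameter",
    "20192": "Processing failed (decoder)",
    "20193": "Processing failed (service)",
    "20194": "Connection timed out (no data received from the client for a long time)",
    "20195": "Other error",
}


class RecognitionError(Exception):
    """Raised when a response carries a status other than 00000."""

    def __init__(self, status, detail=""):
        reason = STATUS_REASONS.get(status, "Unknown status code")
        suffix = f" ({detail})" if detail else ""
        super().__init__(f"{status}: {reason}{suffix}")
        self.status = status


def check_status(status, detail=""):
    """Raise RecognitionError unless `status` is the success code."""
    if str(status) != "00000":
        raise RecognitionError(str(status), detail)
```

For the HTTP API, `detail` could be filled from the `message` field of the body, and for the WebSocket API from `status_text` in the header.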