Short Speech Recognition API

Feature Introduction

Recognizes short voice inputs of up to 60 seconds. It suits scenarios such as conversational chat and control commands, where speech-to-text is needed for short audio clips.

WebSocket API (Streaming)

Request URL

ws://<ip_address>:7100/ws/v1

1. Start and Send Parameters

The client initiates a request, and the server confirms the validity of the request. Parameters must be set within the request message.

Request Parameters (header object):

Parameter	Type	Required	Description
namespace	String	Yes	The namespace to which the message belongs: `SpeechRecognizer` indicates One-Sentence Recognition
name	String	Yes	Event name: `StartRecognition` indicates the start phase

Request Parameters (payload object):

Parameter	Type	Required	Description	Default
lang_type	String	Yes	Language code, refer to the "Language Support" section in the "Speech-to-Text Service User Manual"	Required
format	String	No	Audio encoding format, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual". For call center (8 kHz) PCM audio, pass the value `pcm_8000`	pcm
sample_rate	Integer	No	Sampling rate of the Model (rather than the audio itself)	16000
enable_intermediate_result	Boolean	No	Whether to return intermediate recognition results	false
enable_punctuation_prediction	Boolean	No	Whether to add punctuation	false
enable_inverse_text_normalization	Boolean	No	Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual"	false
enable_words	Boolean	No	Whether to enable returning word information in the final results, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual"	false
enable_modal_particle_filter	Boolean	No	Whether to enable filler word removal, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual"	false
hotwords_id	String	No	Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol"	Null
hotwords_list	List<String>	No	Hotword list, effective only for this connection. When used in conjunction with the `hotwords_id` parameter, the `hotwords_list` takes precedence	Null
hotwords_weight	Float	No	Hotwords weight, range [0.1, 1.0]	0.4
correction_words_id	String	No	Forced correction words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Supports multiple IDs, separated by a vertical bar `|`; `all` indicates using all IDs	Null
forbidden_words_id	String	No	Forbidden words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Supports multiple IDs, separated by a vertical bar `|`; `all` indicates using all IDs	Null
gain	Integer	No	Amplitude gain factor, range [1, 20], see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". A value of 1 means no amplification, 2 doubles the original amplitude, and so on.	1
max_suffix_silence	Integer	No	Post-speech silence detection threshold in seconds, range [1, 10]. If the silence at the end of a sentence lasts longer than this threshold, recognition terminates automatically. If the value is 0 or the parameter is omitted, post-speech silence detection is disabled.	0
user_id	String	No	Custom user information, which will be returned unchanged in the response message, with a maximum length of 36 characters	Null

Example of a request:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "StartRecognition"
    },
    "payload": {
        "lang_type": "ja-JP",
        "format": "pcm",
        "sample_rate": 16000,
        "enable_intermediate_result": true,
        "enable_punctuation_prediction": true,
        "enable_inverse_text_normalization": true,
        "enable_words":true,
        "user_id":"conversation_001"
    }
}
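
The request above can also be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper name `build_start_message` is hypothetical, and the range checks simply mirror the parameter table above rather than any documented client-side behavior.

```python
import json

def build_start_message(lang_type, **options):
    """Build the StartRecognition message as a JSON string.

    Hypothetical helper: only the field names and value ranges come
    from the parameter table; the validation itself is illustrative.
    """
    if not lang_type:
        raise ValueError("lang_type is required")
    weight = options.get("hotwords_weight")
    if weight is not None and not 0.1 <= weight <= 1.0:
        raise ValueError("hotwords_weight must be in [0.1, 1.0]")
    gain = options.get("gain")
    if gain is not None and not 1 <= gain <= 20:
        raise ValueError("gain must be in [1, 20]")
    silence = options.get("max_suffix_silence")
    # 0 disables the feature; otherwise the table gives a range of 1 to 10.
    if silence is not None and silence != 0 and not 1 <= silence <= 10:
        raise ValueError("max_suffix_silence must be 0 or in [1, 10]")
    user_id = options.get("user_id")
    if user_id is not None and len(user_id) > 36:
        raise ValueError("user_id must be at most 36 characters")
    message = {
        "header": {"namespace": "SpeechRecognizer", "name": "StartRecognition"},
        "payload": {"lang_type": lang_type, **options},
    }
    return json.dumps(message, ensure_ascii=False)
```

For example, `build_start_message("ja-JP", format="pcm", sample_rate=16000, enable_words=True)` yields a message equivalent to a subset of the request shown above.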

Response Parameters (header object):

Parameter	Type	Description
namespace	String	The namespace to which the message belongs: `SpeechRecognizer` indicates One-sentence Recognition
name	String	Event name: `RecognitionStarted` indicates the initiation phase
status	String	Status code
status_text	String	Explanation of the status code
task_id	String	The globally unique ID for the task; please record this value for troubleshooting.
user_id	String	The user_id passed in when the connection was established

Example of a response:

{
    "header":{
        "namespace":"SpeechRecognizer",
        "name":"RecognitionStarted",
        "appkey":"",
        "status":"00000",
        "status_text":"success",
        "task_id":"0220a729ac9d4c9997f51592ecc83847",
        "message_id":"",
        "user_id":"conversation_001"
    },
    "payload":{
        "paragraph": 0,
        "index":0,
        "time":0,
        "begin_time":0,
        "speaker_id":"",
        "result":"",
        "confidence":0,
        "volume": 0,        
        "words":null
    }
}
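
Before streaming audio, a client may want to confirm that the server accepted the request. The sketch below is one possible check in Python. It assumes, based on the examples in this document, that status "00000" means success; the function name `parse_started` is hypothetical.

```python
import json

def parse_started(raw):
    """Return the task_id if the server accepted the request, else raise.

    Assumption: "00000" is the success status, as in the example above.
    """
    header = json.loads(raw)["header"]
    if header.get("name") != "RecognitionStarted" or header.get("status") != "00000":
        raise RuntimeError(
            f"start rejected: {header.get('status')} {header.get('status_text')}"
        )
    # The task_id is worth logging for later troubleshooting.
    return header["task_id"]
```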

2. Sending Audio Data and Receiving Recognition Results

Send audio data in a loop and continuously receive recognition results. It is recommended to send data packets of 7680 Bytes each time.
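
As a rough illustration of the recommended packet size, the Python sketch below splits an audio buffer into 7680-byte packets and estimates how much audio one packet holds. The duration calculation assumes 16-bit mono PCM, which this document does not state; under that assumption, 7680 bytes at 16000 Hz is 7680 / (16000 × 2) = 0.24 s. Both function names are hypothetical.

```python
def iter_chunks(audio, chunk_size=7680):
    """Yield consecutive slices of `audio`, each at most `chunk_size` bytes."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]

def chunk_duration_ms(chunk_size=7680, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Duration of one packet in milliseconds.

    Assumption: raw PCM with 2 bytes per sample and a single channel.
    """
    return chunk_size * 1000 / (sample_rate * bytes_per_sample * channels)
```

A client streaming from a live source could pace its sends by this duration; the final packet is usually shorter than the others.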

The recognition results are divided into "intermediate results" and "final results". For detailed explanations, please refer to the "Basic Terminology" section of the "Speech-to-Text Service User Manual".

The RecognitionResultChanged event indicates that the recognition result has changed, i.e., it carries an intermediate result of the sentence.

If enable_intermediate_result is set to true, the server returns multiple RecognitionResultChanged messages, each carrying an intermediate recognition result.
If enable_intermediate_result is set to false, the server returns no messages during this step.

Note:

The last intermediate result may differ from the final result. Treat the result carried by the RecognitionCompleted event as the final recognition result.

Response Parameters (header object):

Parameter	Type	Description
namespace	String	The namespace to which the message belongs, `SpeechRecognizer` indicates One-sentence Recognition
name	String	Message name, `RecognitionResultChanged` indicates the intermediate results of a sentence
status	String	Status code, indicating whether the request was successful, see service status codes
status_text	String	Status message
task_id	String	The globally unique ID for the task; please record this value for troubleshooting
message_id	String	The ID for this message
user_id	String	The user_id passed in when the connection was established

Response Parameters (payload object):

Parameter	Type	Description
index	Integer	Sentence number, starting from 1 and incrementing
time	Integer	The duration of the audio processed so far, in milliseconds
begin_time	Integer	The time corresponding to the `SentenceBegin` event for the current sentence, in milliseconds
speaker_id	String	Always null for One-sentence Recognition
result	String	The intermediate recognition result of this sentence
confidence	Float	The confidence level of the current result, range [0, 1]
volume	Integer	The current volume, range [0, 100]

Example of a response:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "RecognitionResultChanged",
        "status": "00000",
        "status_text": "success",
        "task_id": "0220a729ac9d4c9997f51592ecc83847",
        "message_id": "43u134hcih2lcp7q1c94dhm5ic2op9l2",
        "user_id":"conversation_001"
    },
    "payload": {
        "index": 1,
        "time": 1920,
        "begin_time": 0,
        "speaker_id": "",
        "result": "天気",
        "confidence": 1,
        "volume": 79,        
        "words": []
    }
}
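
One way a client might separate intermediate results from the final one is to dispatch on the event name in the header. The Python sketch below is illustrative only: the function name and the returned labels are hypothetical, and it again assumes "00000" is the success status.

```python
import json

def handle_message(raw):
    """Classify a server message as ("partial" | "final" | "other", text)."""
    msg = json.loads(raw)
    header = msg["header"]
    if header.get("status") != "00000":
        raise RuntimeError(f"{header.get('status')}: {header.get('status_text')}")
    text = (msg.get("payload") or {}).get("result", "")
    name = header.get("name")
    if name == "RecognitionResultChanged":
        return ("partial", text)   # may still change
    if name == "RecognitionCompleted":
        return ("final", text)     # authoritative result
    return ("other", text)         # e.g. RecognitionStarted
```

A user interface could display "partial" text provisionally and replace it once the "final" text arrives.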

3. Stop and Retrieve Final Results

The client sends a request to stop One-sentence Recognition, notifying the server that the transmission of voice data has ended and that speech recognition should terminate. The server returns the final recognition result and then closes the connection automatically.

Request Parameters (header object):

Parameter	Type	Required	Description
namespace	String	Yes	The namespace to which the message belongs: `SpeechRecognizer` indicates One-Sentence Recognition
name	String	Yes	Event name: `StopRecognition` indicates terminating One-Sentence Recognition

Example of a request:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "StopRecognition"
    }
}
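
Taken together, the three steps define the order of frames a client sends over the connection. The Python sketch below only generates that sequence and does no network I/O. It assumes control messages go out as text frames and audio as binary frames, which this document does not state explicitly; the function name `session_frames` is hypothetical.

```python
import json

def session_frames(start_payload, audio, chunk_size=7680):
    """Yield outgoing frames in protocol order: start, audio packets, stop.

    str items stand for JSON control messages, bytes items for audio.
    A real client would wait for RecognitionStarted before sending audio.
    """
    yield json.dumps({
        "header": {"namespace": "SpeechRecognizer", "name": "StartRecognition"},
        "payload": start_payload,
    })
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]
    yield json.dumps({
        "header": {"namespace": "SpeechRecognizer", "name": "StopRecognition"},
    })
```

Each yielded item could be passed to the send method of whichever WebSocket library the client uses.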

Response Parameters (header object):

Parameter	Type	Description
namespace	String	The namespace to which the message belongs: `SpeechRecognizer` indicates One-Sentence Recognition
name	String	The name of the message, `RecognitionCompleted` indicates that the recognition is complete
status	String	Status code, indicating whether the request was successful, see service status codes
status_text	String	Status message
task_id	String	The globally unique ID for the task; please record this value for troubleshooting
message_id	String	The ID for this message
user_id	String	The user_id passed in when the connection was established

Response Parameters (payload object):

Parameter	Type	Description
index	Integer	Always 1 for One-sentence Recognition
time	Integer	The duration of the audio processed so far, in milliseconds
begin_time	Integer	The time corresponding to the `SentenceBegin` event for the current sentence, in milliseconds
speaker_id	String	Always null for One-sentence Recognition
result	String	The final recognition result of this sentence
confidence	Float	The confidence level of the current result, range [0, 1]
words	Dict[]	Final result word information for this sentence, only returned when `enable_words` is set to true
volume	Integer	The current volume, range [0, 100]

The structure of the final result word information object is as follows:

Parameter	Type	Description
word	String	The text of the word.
start_time	Integer	The start time of the word, in milliseconds.
end_time	Integer	The end time of the word, in milliseconds.
type	String	The type of the word: `normal` indicates regular text, `forbidden` indicates sensitive words, `modal` indicates filler words, `punc` indicates punctuation marks

Example of a response:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "RecognitionCompleted",
        "status": "00000",
        "status_text": "success",
        "task_id": "0220a729ac9d4c9997f51592ecc83847",
        "message_id": "45kbrouk4yvz81fjueyao2s7y7o6gjz6",
        "user_id":"conversation_001"
    },
    "payload": {
        "index": 1,
        "time": 5292,
        "begin_time": 0,
        "speaker_id": "",
        "result": "天気がいいから、散歩しましょう。",
        "confidence": 0.9,
        "volume": 76,        
        "words": [{
            "word": "天気",
            "start_time": 390,
            "end_time": 1110,
            "type": "normal"
        }, {
            "word": "が",
            "start_time": 1110,
            "end_time": 1440,
            "type": "normal"
        }, {
            "word": "いい",
            "start_time": 1440,
            "end_time": 2130,
            "type": "normal"
        }, {
            "word": "から",
            "start_time": 2160,
            "end_time": 3570,
            "type": "normal"
        }, {
            "word": "、",
            "start_time": 4290,
            "end_time": 4860,
            "type": "punc"
        },
        …… (omitted) ……
        ]
    }
}
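
The word information can be post-processed on the client, for instance to keep only regular text with its timings. The Python sketch below is one hypothetical way to do so; the function name is not part of the service, and it treats a missing or null `words` field as an empty list.

```python
def spoken_words(payload):
    """Return (word, start_ms, end_ms) for entries whose type is "normal"."""
    words = payload.get("words") or []   # absent or null unless enable_words is true
    return [
        (w["word"], w["start_time"], w["end_time"])
        for w in words
        if w.get("type") == "normal"
    ]
```

Applied to the first, second, and fifth entries of the example above, this keeps 天気 and が and drops the punctuation mark.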

HTTP API (Non-streaming)

HTTP Request Line

Protocol	URL	Method
HTTP/1.1	`http://<ip_address>:7100/api/v1`	POST

Request Headers

HTTP request headers consist of "key/value" pairs, one pair per line, with the key and value separated by a colon (:). The settings are as follows:

Name	Type	Required	Description
Content-Type	String	Yes	Must be "application/octet-stream", indicating that the data in the HTTP body is binary

Request Parameters

The client sends a One-sentence Recognition request, with parameters passed as query parameters in the request URL. The meanings of the parameters are as follows:

Parameter	Type	Required	Description	Default
lang_type	String	Yes	Language code, refer to the "Language Support" section in the "Speech-to-Text Service User Manual"	Required
format	String	No	Audio encoding format, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual"	pcm
sample_rate	Integer	No	Audio sampling rate, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual"	16000
enable_punctuation_prediction	Boolean	No	Whether to add punctuation	false
enable_inverse_text_normalization	Boolean	No	Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual"	false
enable_modal_particle_filter	Boolean	No	Whether to enable filler word removal, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual"	false
hotwords_id	String	No	Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol"	Null
hotwords_weight	Float	No	Hotwords weight, range [0.1, 1.0]	0.4
correction_words_id	String	No	Forced correction words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Supports multiple IDs, separated by a vertical bar `|`; `all` indicates using all IDs	Null
forbidden_words_id	String	No	Forbidden words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Supports multiple IDs, separated by a vertical bar `|`; `all` indicates using all IDs	Null
gain	Integer	No	Amplitude gain factor, range [1, 20], see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". A value of 1 means no amplification, 2 doubles the original amplitude, and so on.	1

Request Body

The HTTP request body contains binary audio data, and the Content-Type in the HTTP request header must be set to application/octet-stream.

Example of Request

curl --location --request POST 'http://127.0.0.1:7100/api/v1?lang_type=ja-JP&format=pcm&sample_rate=16000&enable_punctuation_prediction=true&enable_inverse_text_normalization=true' \
--header 'Content-Type: application/octet-stream' \
--data-binary '@audio.pcm'
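
The same request can be prepared with Python's standard library. The sketch below only builds the request object and sends nothing. The function name `build_request` is hypothetical, and rendering booleans as lowercase `true`/`false` follows the curl example above rather than a documented rule.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_request(host, audio, **params):
    """Build a POST request for the non-streaming API without sending it."""
    # Render Python booleans as "true"/"false", as in the curl example.
    query = urlencode({
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    })
    return Request(
        f"http://{host}:7100/api/v1?{query}",
        data=audio,  # raw binary audio in the request body
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call against a running service.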

Response

The response results are in the Body. The fields in the response results are as follows:

Parameter	Type	Description
task_id	String	The globally unique ID for the task; please record this value for troubleshooting
result	String	The recognition result of this sentence
status	String	Status code, indicating whether the request was successful, see service status codes
message	String	Status message

Successful Response

The status field in the body is 00000.

{
    "task_id": "cf7b0c5339244ee29cd4e43fb97fd52e",
    "result": "天気がいいから、散歩しましょう。",
    "status":"00000",
    "message":"SUCCESS"
}

Error Response

Any response whose status field in the body is not 00000 is an error response. This field can be used to check whether the request was successful.

Service Status Codes

Status Code	Reason
20001	Parameters parsing failed
20002, 20003	File processing failed
20111	WebSocket upgrading failed
20114	Request body is empty
20115	Audio size/duration limit exceeded
20116	Unsupported `sample_rate`
20190	Missing parameter
20191	Invalid parameter
20192	Processing failed (decoder)
20193	Processing failed (service)
20194	Connection timed out (no data received from the client for a long time)
20195	Other error
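
A client can map these codes to readable errors when handling a response body. The Python sketch below is illustrative: the names `STATUS_REASONS` and `result_or_raise` are hypothetical, the status is compared as a string as in the examples, and the error body is assumed to carry the same `status` and `message` fields as the successful one.

```python
import json

# Reasons copied from the status code table above.
STATUS_REASONS = {
    "20001": "Parameters parsing failed",
    "20002": "File processing failed",
    "20003": "File processing failed",
    "20111": "WebSocket upgrading failed",
    "20114": "Request body is empty",
    "20115": "Audio size/duration limit exceeded",
    "20116": "Unsupported sample_rate",
    "20190": "Missing parameter",
    "20191": "Invalid parameter",
    "20192": "Processing failed (decoder)",
    "20193": "Processing failed (service)",
    "20194": "Connection timed out",
    "20195": "Other error",
}

def result_or_raise(body):
    """Return the recognized text, or raise with the documented reason."""
    data = json.loads(body)
    status = str(data.get("status"))
    if status != "00000":
        reason = STATUS_REASONS.get(status) or data.get("message") or "unknown error"
        raise RuntimeError(f"{status}: {reason}")
    return data["result"]
```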