Main APIs

Short Speech Recognition API

Feature Introduction

Recognizes short voice input of up to 60 seconds. It suits scenarios such as conversational chat and voice control commands, where speech-to-text is needed for short audio clips.

WebSocket API (Streaming)

Request URL

ws://<ip_address>:7100/ws/v1

Interaction Process

1. Start and Send Parameters

The client sends a start request, and the server validates it. All parameters are set in the request message.

Request Parameters (header object):

Parameter | Type | Required | Description
namespace | String | Yes | Namespace of the message; SpeechRecognizer indicates one-sentence recognition
name | String | Yes | Event name; StartRecognition indicates the start phase

Request Parameters (payload object):

Parameter | Type | Required | Description | Default
lang_type | String | Yes | Language code; see the "Language Support" section of the "Speech-to-Text Service User Manual" | None (required)
format | String | No | Audio encoding format; see the "Audio Encoding" section of the "Speech-to-Text Service User Manual". For call-center (8 kHz) PCM audio, pass pcm_8000 | pcm
sample_rate | Integer | No | Sampling rate of the model (rather than of the audio itself) | 16000
enable_intermediate_result | Boolean | No | Whether to return intermediate recognition results | false
enable_punctuation_prediction | Boolean | No | Whether to add punctuation | false
enable_inverse_text_normalization | Boolean | No | Whether to perform inverse text normalization (ITN); see the "Basic Terms" section of the "Speech-to-Text Service User Manual" | false
enable_words | Boolean | No | Whether to return word information in the final result; see the "Basic Terms" section of the "Speech-to-Text Service User Manual" | false
enable_modal_particle_filter | Boolean | No | Whether to filter modal particles; see the "Practical Features" section of the "Speech-to-Text Service User Manual" | false
hotwords_id | String | No | Hotwords ID; see the "Practical Features" section of the "Speech-to-Text Service User Manual" and the "Hotword API Protocol" | Null
hotwords_list | List<String> | No | Hotword list, effective only for this connection. When given together with hotwords_id, hotwords_list takes precedence | Null
hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0] | 0.4
correction_words_id | String | No | Forced correction words ID; see the "Practical Features" section of the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Multiple IDs may be given, separated by a vertical bar ("|"); all means use all IDs | Null
forbidden_words_id | String | No | Forbidden words ID; see the "Practical Features" section of the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Multiple IDs may be given, separated by a vertical bar ("|"); all means use all IDs | Null
gain | Integer | No | Amplitude gain factor, range [1, 20]; see the "Basic Knowledge" and "Practical Features" sections of the "Speech-to-Text Service User Manual". 1 means no amplification, 2 doubles the original amplitude, and so on | 1
max_suffix_silence | Integer | No | Trailing-silence detection threshold in seconds, range 1 to 10. If the silence at the end of a sentence exceeds this threshold, recognition terminates automatically. When set to 0 or omitted, trailing-silence detection is disabled | 0
user_id | String | No | Custom user information, returned unchanged in response messages; maximum length 36 characters | Null

Example of a request:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "StartRecognition"
    },
    "payload": {
        "lang_type": "ja-JP",
        "format": "pcm",
        "sample_rate": 16000,
        "enable_intermediate_result": true,
        "enable_punctuation_prediction": true,
        "enable_inverse_text_normalization": true,
        "enable_words":true,
        "user_id":"conversation_001"
    }
}
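
The start message can also be assembled in code. The following is a minimal Python sketch, not a reference implementation: `build_start_message` is a hypothetical helper name, and its range checks only mirror the ranges listed in the payload table (the server may validate differently).

```python
import json


def build_start_message(lang_type, **options):
    """Return the StartRecognition message as a JSON string.

    `options` holds optional payload fields from the table above,
    for example format="pcm", sample_rate=16000 or gain=2.
    """
    if not lang_type:
        raise ValueError("lang_type is required")

    # Client-side checks that mirror the documented ranges (an assumption:
    # the server may apply further validation of its own).
    weight = options.get("hotwords_weight")
    if weight is not None and not 0.1 <= weight <= 1.0:
        raise ValueError("hotwords_weight must be within [0.1, 1.0]")
    gain = options.get("gain")
    if gain is not None and not 1 <= gain <= 20:
        raise ValueError("gain must be within [1, 20]")
    silence = options.get("max_suffix_silence")
    if silence is not None and not 0 <= silence <= 10:
        raise ValueError("max_suffix_silence must be 0 (disabled) or 1 to 10")
    user_id = options.get("user_id")
    if user_id is not None and len(user_id) > 36:
        raise ValueError("user_id must not exceed 36 characters")

    message = {
        "header": {"namespace": "SpeechRecognizer", "name": "StartRecognition"},
        "payload": {"lang_type": lang_type, **options},
    }
    # ensure_ascii=False keeps non-ASCII hotwords readable in the JSON text.
    return json.dumps(message, ensure_ascii=False)
```

For instance, `build_start_message("ja-JP", format="pcm", sample_rate=16000, enable_words=True)` yields a message with the same shape as the example request.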

Response Parameters (header object):

Parameter | Type | Description
namespace | String | Namespace of the message; SpeechRecognizer indicates one-sentence recognition
name | String | Event name; RecognitionStarted indicates the start phase
status | String | Status code
status_text | String | Explanation of the status code
task_id | String | Globally unique ID of the task; record this value for troubleshooting
user_id | String | The user_id passed in when the connection was established

Example of a response:

{
    "header":{
        "namespace":"SpeechRecognizer",
        "name":"RecognitionStarted",
        "appkey":"",
        "status":"00000",
        "status_text":"success",
        "task_id":"0220a729ac9d4c9997f51592ecc83847",
        "message_id":"",
        "user_id":"conversation_001"
    },
    "payload":{
        "paragraph": 0,
        "index":0,
        "time":0,
        "begin_time":0,
        "speaker_id":"",
        "result":"",
        "confidence":0,
        "volume": 0,        
        "words":null
    }
}
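
A client will usually confirm that the start request was accepted before sending audio. The sketch below shows one way to do that in Python; `check_started` is a hypothetical name, and it assumes that a rejected request is reported through a header status other than "00000".

```python
import json


def check_started(raw_message):
    """Return the task_id if the server accepted the request, else raise."""
    header = json.loads(raw_message)["header"]
    # Assumption: any status other than "00000" means the start failed.
    if header.get("status") != "00000":
        raise RuntimeError(
            f"start failed: {header.get('status')} {header.get('status_text')}"
        )
    if header.get("name") != "RecognitionStarted":
        raise RuntimeError(f"unexpected message: {header.get('name')}")
    # Keep the task_id for troubleshooting, as the table recommends.
    return header["task_id"]
```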

2. Sending Audio Data and Receiving Recognition Results

Send the audio data in a loop while continuously receiving recognition results. The recommended packet size is 7680 bytes per send.
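
As a rough illustration of the packet size: assuming 16-bit mono PCM at 16 kHz (the sample width and channel count are assumptions, not stated here), 7680 bytes correspond to 7680 / (16000 × 2) = 0.24 s of audio. The Python sketch below splits a buffer into packets of that size; the helper names are hypothetical.

```python
CHUNK_BYTES = 7680  # recommended packet size


def iter_chunks(audio, chunk_bytes=CHUNK_BYTES):
    """Yield successive slices of `audio` (bytes); the last may be shorter."""
    for offset in range(0, len(audio), chunk_bytes):
        yield audio[offset:offset + chunk_bytes]


def chunk_duration_ms(chunk_bytes=CHUNK_BYTES, sample_rate=16000, bytes_per_sample=2):
    """Playback time of one chunk in milliseconds, assuming mono PCM."""
    return chunk_bytes * 1000 // (sample_rate * bytes_per_sample)
```

Under the same assumptions, `chunk_duration_ms()` evaluates to 240, so a client that streams live audio would send roughly one packet every 240 ms.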

The recognition results are divided into "intermediate results" and "final results". For detailed explanations, please refer to the "Basic Terminology" section of the "Speech-to-Text Service User Manual".

The RecognitionResultChanged event indicates that the recognition result has changed, that is, it carries an intermediate result of the sentence.

  • If enable_intermediate_result is set to true, the server returns multiple RecognitionResultChanged messages, each carrying an intermediate recognition result.

  • If enable_intermediate_result is set to false, the server returns no messages during this step.

Note:

The last intermediate result may differ from the final result. Always take the result carried by the RecognitionCompleted message as the final recognition result.

Response Parameters (header object):

Parameter | Type | Description
namespace | String | Namespace of the message; SpeechRecognizer indicates one-sentence recognition
name | String | Message name; RecognitionResultChanged indicates an intermediate result of the sentence
status | String | Status code indicating whether the request succeeded; see "Service Status Codes"
status_text | String | Status message
task_id | String | Globally unique ID of the task; record this value for troubleshooting
message_id | String | ID of this message
user_id | String | The user_id passed in when the connection was established

Response Parameters (payload object):

Parameter | Type | Description
index | Integer | Sentence number, starting from 1 and incrementing
time | Integer | Duration of the audio processed so far, in milliseconds
begin_time | Integer | Start time of the current sentence, in milliseconds
speaker_id | String | Unused in one-sentence recognition (always null or empty)
result | String | Intermediate recognition result of this sentence
confidence | Float | Confidence of the current result, range [0, 1]
volume | Integer | Current volume, range [0, 100]

Example of a response:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "RecognitionResultChanged",
        "status": "00000",
        "status_text": "success",
        "task_id": "0220a729ac9d4c9997f51592ecc83847",
        "message_id": "43u134hcih2lcp7q1c94dhm5ic2op9l2",
        "user_id":"conversation_001"
    },
    "payload": {
        "index": 1,
        "time": 1920,
        "begin_time": 0,
        "speaker_id": "",
        "result": "天気",
        "confidence": 1,
        "volume": 79,        
        "words": []
    }
}
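
One way to handle the incoming messages is to keep only the latest intermediate text and to treat the completion message as authoritative. The Python sketch below assumes the message names used in the response examples (RecognitionResultChanged here, RecognitionCompleted in step 3); `ResultCollector` is a hypothetical class name.

```python
import json


class ResultCollector:
    """Keep the latest intermediate text and the final text of one session."""

    def __init__(self):
        self.intermediate = ""
        self.final = None

    def feed(self, raw_message):
        """Process one server message; return True once the final result arrived."""
        message = json.loads(raw_message)
        header = message["header"]
        payload = message.get("payload") or {}
        if header.get("status") != "00000":
            raise RuntimeError(f"{header.get('status')}: {header.get('status_text')}")
        name = header.get("name")
        if name == "RecognitionResultChanged":
            # Each intermediate result replaces the previous one.
            self.intermediate = payload.get("result", "")
        elif name == "RecognitionCompleted":
            # The final result may differ from the last intermediate result.
            self.final = payload.get("result", "")
        return self.final is not None
```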

3. Stop and Retrieve Final Results

The client sends a stop request to tell the server that no more audio will be sent and that recognition should end. The server returns the final recognition result and then closes the connection.

Request Parameters (header object):

Parameter | Type | Required | Description
namespace | String | Yes | Namespace of the message; SpeechRecognizer indicates one-sentence recognition
name | String | Yes | Event name; StopRecognition indicates terminating one-sentence recognition

Example of a request:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "StopRecognition"
    }
}

Response Parameters (header object):

Parameter | Type | Description
namespace | String | Namespace of the message; SpeechRecognizer indicates one-sentence recognition
name | String | Message name; RecognitionCompleted indicates that recognition is complete
status | String | Status code indicating whether the request succeeded; see "Service Status Codes"
status_text | String | Status message
task_id | String | Globally unique ID of the task; record this value for troubleshooting
message_id | String | ID of this message
user_id | String | The user_id passed in when the connection was established

Response Parameters (payload object):

Parameter | Type | Description
index | Integer | Always 1 for one-sentence recognition
time | Integer | Duration of the audio processed so far, in milliseconds
begin_time | Integer | Start time of the current sentence, in milliseconds
speaker_id | String | Unused in one-sentence recognition (always null or empty)
result | String | Final recognition result of this sentence
confidence | Float | Confidence of the current result, range [0, 1]
words | Dict[] | Word information for the final result of this sentence; returned only when enable_words is set to true
volume | Integer | Current volume, range [0, 100]

The structure of the final result word information object is as follows:

Parameter | Type | Description
word | String | Text of the word
start_time | Integer | Start time of the word, in milliseconds
end_time | Integer | End time of the word, in milliseconds
type | String | Type of the word: normal indicates regular text, forbidden indicates sensitive words, modal indicates modal particles, punc indicates punctuation marks
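
The word information can be post-processed on the client, for example to drop punctuation entries and compute how long each word lasted. The following Python sketch uses only the four fields described above; `spoken_words` is a hypothetical helper name.

```python
def spoken_words(words):
    """Return (text, duration_ms) for every entry that is not punctuation.

    `words` is the list from the payload; None (enable_words not set)
    is treated as an empty list.
    """
    return [
        (w["word"], w["end_time"] - w["start_time"])
        for w in (words or [])
        if w.get("type") != "punc"
    ]
```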

Example of a response:

{
    "header": {
        "namespace": "SpeechRecognizer",
        "name": "RecognitionCompleted",
        "status": "00000",
        "status_text": "success",
        "task_id": "0220a729ac9d4c9997f51592ecc83847",
        "message_id": "45kbrouk4yvz81fjueyao2s7y7o6gjz6",
        "user_id":"conversation_001"
    },
    "payload": {
        "index": 1,
        "time": 5292,
        "begin_time": 0,
        "speaker_id": "",
        "result": "天気がいいから、散歩しましょう。",
        "confidence": 0.9,
        "volume": 76,        
        "words": [{
            "word": "天気",
            "start_time": 390,
            "end_time": 1110,
            "type": "normal"
        }, {
            "word": "が",
            "start_time": 1110,
            "end_time": 1440,
            "type": "normal"
        }, {
            "word": "いい",
            "start_time": 1440,
            "end_time": 2130,
            "type": "normal"
        }, {
            "word": "から",
            "start_time": 2160,
            "end_time": 3570,
            "type": "normal"
        }, {
            "word": "、",
            "start_time": 4290,
            "end_time": 4860,
            "type": "punc"
        },
        …… (omitted) ……
        ]
    }
}
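
The three steps can be tied together as in the Python sketch below. It is a simplified illustration under several assumptions: `conn` stands for any connection object with asynchronous `send` and `recv` methods (for example one obtained from a third-party WebSocket client library), audio is sent as binary frames, and all audio is sent before any result is read. A production client would normally read results while it is still sending and pace live audio in real time.

```python
import asyncio
import json

CHUNK_BYTES = 7680  # recommended packet size from step 2


async def recognize(conn, audio, payload):
    """Run one start / send-audio / stop exchange and return the final text.

    `conn` is a hypothetical connection object exposing async
    `send(data)` and `recv()`; `payload` is the StartRecognition payload.
    """
    start = {
        "header": {"namespace": "SpeechRecognizer", "name": "StartRecognition"},
        "payload": payload,
    }
    await conn.send(json.dumps(start))
    started = json.loads(await conn.recv())
    if started["header"]["status"] != "00000":
        raise RuntimeError(started["header"]["status_text"])

    # Assumption: raw audio bytes are sent as binary frames.
    for offset in range(0, len(audio), CHUNK_BYTES):
        await conn.send(audio[offset:offset + CHUNK_BYTES])

    stop = {"header": {"namespace": "SpeechRecognizer", "name": "StopRecognition"}}
    await conn.send(json.dumps(stop))

    # Skip intermediate results until the completion message arrives.
    while True:
        message = json.loads(await conn.recv())
        header = message["header"]
        if header["status"] != "00000":
            raise RuntimeError(header["status_text"])
        if header["name"] == "RecognitionCompleted":
            return message["payload"]["result"]
```

Because the function depends only on `send` and `recv`, it can be exercised against a fake connection in tests before it is pointed at ws://<ip_address>:7100/ws/v1.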

HTTP API (Non-streaming)

HTTP Request Line

Protocol | URL | Method
HTTP/1.1 | http://<ip_address>:7100/api/v1 | POST

Request Headers

HTTP request headers consist of key/value pairs, one pair per line, with the key and value separated by a colon (:). Set them as follows:

Name | Type | Required | Description
Content-Type | String | Yes | Must be "application/octet-stream", indicating that the HTTP body contains binary data

Request Parameters

The client sends a one-sentence recognition request, with the parameters passed as URL query parameters. The parameters are as follows:

Parameter | Type | Required | Description | Default
lang_type | String | Yes | Language code; see the "Language Support" section of the "Speech-to-Text Service User Manual" | None (required)
format | String | No | Audio encoding format; see the "Audio Encoding" section of the "Speech-to-Text Service User Manual" | pcm
sample_rate | Integer | No | Audio sampling rate; see the "Basic Terms" section of the "Speech-to-Text Service User Manual" | 16000
enable_punctuation_prediction | Boolean | No | Whether to add punctuation | false
enable_inverse_text_normalization | Boolean | No | Whether to perform inverse text normalization (ITN); see the "Basic Terms" section of the "Speech-to-Text Service User Manual" | false
enable_modal_particle_filter | Boolean | No | Whether to filter modal particles; see the "Practical Features" section of the "Speech-to-Text Service User Manual" | false
hotwords_id | String | No | Hotwords ID; see the "Practical Features" section of the "Speech-to-Text Service User Manual" and the "Hotword API Protocol" | Null
hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0] | 0.4
correction_words_id | String | No | Forced correction words ID; see the "Practical Features" section of the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Multiple IDs may be given, separated by a vertical bar ("|"); all means use all IDs | Null
forbidden_words_id | String | No | Forbidden words ID; see the "Practical Features" section of the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Multiple IDs may be given, separated by a vertical bar ("|"); all means use all IDs | Null
gain | Integer | No | Amplitude gain factor, range [1, 20]; see the "Basic Knowledge" and "Practical Features" sections of the "Speech-to-Text Service User Manual". 1 means no amplification, 2 doubles the original amplitude, and so on | 1

Request Body

The HTTP request body contains binary audio data, and the Content-Type in the HTTP request header must be set to application/octet-stream.

Example of Request

curl --location --request POST 'http://127.0.0.1:7100/api/v1?lang_type=ja-JP&format=pcm&sample_rate=16000&enable_punctuation_prediction=true&enable_inverse_text_normalization=true' \
--header 'Content-Type: application/octet-stream' \
--data-binary '@audio.pcm'
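
The same request can be issued from Python with only the standard library. The sketch below is an illustration under stated assumptions: `build_url` and `recognize_file` are hypothetical helper names, and Boolean parameters are rendered as lowercase true/false to match the curl example.

```python
import json
import urllib.parse
import urllib.request


def build_url(host, **params):
    """Compose the request URL; booleans are rendered as true/false."""
    query = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"http://{host}:7100/api/v1?" + urllib.parse.urlencode(query)


def recognize_file(host, path, **params):
    """POST the raw audio file and return the decoded JSON body."""
    with open(path, "rb") as f:
        audio = f.read()
    request = urllib.request.Request(
        build_url(host, **params),
        data=audio,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for non-2xx HTTP responses.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `recognize_file("127.0.0.1", "audio.pcm", lang_type="ja-JP", format="pcm", sample_rate=16000)` would mirror the curl command, minus the two optional switches.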

Response

The result is returned in the HTTP response body as JSON, with the following fields:

Parameter | Type | Description
task_id | String | Globally unique ID of the task; record this value for troubleshooting
result | String | Recognition result of this sentence
status | String | Status code indicating whether the request succeeded; see "Service Status Codes"
message | String | Status message

Successful Response

The status field in the body is 00000.

{
    "task_id": "cf7b0c5339244ee29cd4e43fb97fd52e",
    "result": "天気がいいから、散歩しましょう。",
    "status":"00000",
    "message":"SUCCESS"
}

Error Response

A response whose status field is anything other than 00000 is an error response. Use this field to determine whether the request succeeded.
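
A client can turn that rule into an exception. The following Python sketch is one possible shape; `RecognitionError` and `result_or_raise` are hypothetical names, and the status is compared as text so that a numeric value would be handled as well.

```python
import json


class RecognitionError(Exception):
    """Raised when the body carries a status other than "00000"."""

    def __init__(self, status, message, task_id):
        super().__init__(f"{status}: {message} (task_id={task_id})")
        self.status = status
        self.task_id = task_id


def result_or_raise(body_text):
    """Return the recognized text, or raise RecognitionError."""
    body = json.loads(body_text)
    if str(body.get("status")) != "00000":
        # Keep the task_id in the error so it can be used for troubleshooting.
        raise RecognitionError(
            body.get("status"), body.get("message"), body.get("task_id")
        )
    return body.get("result", "")
```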

Service Status Codes

Status Code | Reason
20001 | Parameter parsing failed
20002, 20003 | File processing failed
20111 | WebSocket upgrade failed
20114 | Request body is empty
20115 | Audio size/duration limit exceeded
20116 | Unsupported sample_rate
20190 | Missing parameter
20191 | Invalid parameter
20192 | Processing failed (decoder)
20193 | Processing failed (service)
20194 | Connection timed out (no data received from the client for a long time)
20195 | Other error
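
For logging, the codes can be kept in a lookup table on the client. This is a small Python sketch; the dictionary simply restates the table above plus the 00000 success code from the response examples, and `describe_status` is a hypothetical helper name.

```python
STATUS_REASONS = {
    "00000": "Success",
    "20001": "Parameter parsing failed",
    "20002": "File processing failed",
    "20003": "File processing failed",
    "20111": "WebSocket upgrade failed",
    "20114": "Request body is empty",
    "20115": "Audio size/duration limit exceeded",
    "20116": "Unsupported sample_rate",
    "20190": "Missing parameter",
    "20191": "Invalid parameter",
    "20192": "Processing failed (decoder)",
    "20193": "Processing failed (service)",
    "20194": "Connection timed out (no data received from the client)",
    "20195": "Other error",
}


def describe_status(code):
    """Map a status code (string or number) to its documented reason."""
    return STATUS_REASONS.get(str(code), "Unknown status code")
```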