Real-time Speech Recognition API
Feature Introduction
Designed for long-duration speech data streams, this API suits scenarios that require continuous recognition over extended periods, such as conference speeches and live video streaming.
Real-time speech recognition provides only a WebSocket (streaming) interface. A single connection theoretically supports up to about 37 hours of audio.
Request URL
ws://<ip_address>:7100/ws/v1

Interaction Process
1. Start and Send Parameters
The client initiates a request, and the server confirms the validity of the request. Parameters must be set within the request message.
Request Parameters (header object):
| Parameter | Type | Required | Description |
|---|---|---|---|
| namespace | String | Yes | The namespace to which the message belongs: SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Yes | Event name: StartTranscription indicates the start phase |
Request Parameters (payload object):
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| lang_type | String | Yes | Language code, refer to the "Language Support" section in the "Speech-to-Text Service User Manual" | None |
| format | String | No | Audio encoding format, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual". For call center (8 kHz) PCM audio, pass the value pcm_8000 | pcm |
| sample_rate | Integer | No | Sampling rate of the model (not of the audio itself) | 16000 |
| enable_intermediate_result | Boolean | No | Whether to return intermediate recognition results | false |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation | false |
| enable_inverse_text_normalization | Boolean | No | Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| max_sentence_silence | Integer | No | Sentence-break detection threshold: silence longer than this threshold is treated as a sentence break. Valid range [200, 5000], unit: milliseconds | 450 |
| enable_words | Boolean | No | Whether to enable returning word information in the final results, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| enable_modal_particle_filter | Boolean | No | Whether to enable modal particle filtering, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" | false |
| hotwords_id | String | No | Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol" | Null |
| hotwords_list | List<String> | No | Hotword list, effective only for this connection. When used in conjunction with the hotwords_id parameter, the hotwords_list takes precedence | Null |
| hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0] | 0.4 |
| correction_words_id | String | No | Forced correction words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Supports multiple IDs separated by a vertical bar (\|); all means use all IDs | Null |
| forbidden_words_id | String | No | Forbidden words ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Supports multiple IDs separated by a vertical bar (\|); all means use all IDs | Null |
| gain | Integer | No | Amplitude gain factor, range [1, 20], see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". A value of 1 means no amplification, 2 doubles the original amplitude, and so on | 1 |
| enable_sse | Boolean | No | Whether to enable HTTP SSE streaming results, refer to the "Additional Features" section of this document | false |
| user_id | String | No | Custom user information, which will be returned unchanged in the response message, with a maximum length of 36 characters | Null |
| source_url | String | No | Audio source URL; when set, audio is fetched from this address as the input for speech recognition (the format parameter must be set to a supported audio encoding format). Supports RTSP streams, refer to the "Audio Encoding" section in the "Speech-to-Text Service User Manual" | Null |
| enable_lang_label | Boolean | No | Return the language code in recognition results when the language switches; only effective for mixed-language models that include English (e.g., Japanese-English, Chinese-English). Note: enabling this feature may delay responses when the language switches | false |
| paragraph_condition | Integer | No | Paragraph length control: once the set character count is reached, the next sentence within the same speaker_id starts a new paragraph number. Range [100, 2000]; values outside the range disable this feature | 0 |
| decode_silence_segment | Boolean | No | Control whether to perform speech recognition processing on the silent segments determined by VAD, suitable for far-field recording environments (supported in version 2.5.11 and above). | false |
Example of a request:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "StartTranscription"
},
"payload": {
"lang_type": "ja-JP",
"format": "pcm",
"sample_rate": 16000,
"enable_intermediate_result": true,
"enable_punctuation_prediction": true,
"enable_inverse_text_normalization": true,
"max_sentence_silence": 800,
"enable_words":true,
"user_id":"conversation_001"
}
}

Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs: SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Event name: TranscriptionStarted indicates the initiation phase |
| status | String | Status code |
| status_text | String | Explanation of the status code |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting. |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Example of a response:
{
"header":{
"namespace":"SpeechTranscriber",
"name":"TranscriptionStarted",
"appkey":"",
"status":"00000",
"status_text":"success",
"task_id":"0220a729ac9d4c9997f51592ecc83847",
"message_id":"49b680abe737488cf50f3cd9e3953b97",
"user_id":"conversation_001"
},
"payload":{
"index":0,
"time":0,
"begin_time":0,
"speaker_id":"",
"result":"",
"confidence":0,
"words":null
}
}

2. Sending Audio Data and Receiving Recognition Results
Note:
If the source_url parameter is set during the connection, the system automatically retrieves audio data from the specified source, and you do not need to send audio data as described in this section.
Send audio data in a loop and continuously receive recognition results. It is recommended to send data packets of 7,680 bytes at a time.
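As a concrete illustration, here is a minimal Python sketch of the connection, the StartTranscription request from step 1, and this send loop. It uses the third-party websocket-client package; the server address, audio file, and payload values are illustrative, and the concurrent receive and heartbeat logic described elsewhere in this document is omitted:

```python
import json
import time

import websocket  # third-party package: pip install websocket-client

ws = websocket.create_connection("ws://127.0.0.1:7100/ws/v1")  # illustrative address

# Step 1: StartTranscription with a minimal payload (see the parameter table above).
ws.send(json.dumps({
    "header": {"namespace": "SpeechTranscriber", "name": "StartTranscription"},
    "payload": {"lang_type": "ja-JP", "format": "pcm", "sample_rate": 16000},
}))
started = json.loads(ws.recv())
assert started["header"]["name"] == "TranscriptionStarted"
task_id = started["header"]["task_id"]  # record for troubleshooting

# Step 2: stream raw PCM in the recommended 7,680-byte packets.
with open("audio.pcm", "rb") as f:  # illustrative file: 16 kHz, 16-bit, mono PCM
    while chunk := f.read(7680):
        ws.send_binary(chunk)
        time.sleep(0.24)  # 7,680 bytes ≈ 240 ms of 16 kHz 16-bit mono audio
```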
The Real-time Speech Transcription service automatically breaks sentences, determining the beginning and end of each sentence from the length of silence between utterances; these boundaries are represented as events in the returned results. The SentenceBegin and SentenceEnd events indicate the start and end of a sentence, respectively, while the TranscriptionResultChanged event carries the intermediate recognition results of a sentence.
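In a real client these events are received concurrently with the send loop above (for example, on a separate thread). A sketch of a dispatch loop over the events detailed below, reusing the ws connection from the previous sketch:

```python
import json
import threading

def receive_events(ws) -> None:
    # Consume server events until the transcription completes or fails.
    while True:
        msg = json.loads(ws.recv())
        name = msg["header"]["name"]
        payload = msg.get("payload") or {}
        if name == "SentenceBegin":
            print(f"sentence {payload['index']} begins at {payload['begin_time']} ms")
        elif name == "TranscriptionResultChanged":
            print("intermediate:", payload["result"])
        elif name == "SentenceEnd":
            print("final:", payload["result"])
        elif name in ("TranscriptionCompleted", "TaskFailed"):
            break

threading.Thread(target=receive_events, args=(ws,), daemon=True).start()
```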
SentenceBegin Event
The SentenceBegin event indicates that the server has detected the beginning of a sentence.
Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs, SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Message name, SentenceBegin indicates the start of a sentence |
| status | String | Status code, indicating whether the request was successful, see service status codes |
| status_text | String | Status message |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| lang_type | String | Returns the language code when enable_lang_label is enabled |
| paragraph | Integer | Paragraph number, starting from 1 and incrementing |
| index | Integer | Sentence number, starting from 1 and incrementing |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Speaker number or network sound card channel number, see "Additional Features - Speaker ID" and "Additional Features - Network Sound Card" sections below |
| result | String | Recognition result, which may be empty |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| volume | Integer | The current volume, range [0, 100] |
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SentenceBegin",
"status": "00000",
"status_text": "success",
"task_id": "3ee284e922dd4554bb6ccda7989d1973",
"message_id": "9b680abe73748f50f83cd9e3953b974c",
"user_id":"conversation_001"
},
"payload": {
"lang_type": "ja-JP",
"paragraph": 1,
"index": 1,
"time": 240,
"begin_time": 0,
"speaker_id": "",
"result": "",
"confidence": 0,
"volume": 0
}
}TranscriptionResultChanged Event
The recognition results are divided into "intermediate results" and "final results". For detailed explanations, please refer to the "Basic Terms" section of the "Speech-to-Text Service User Manual".
The TranscriptionResultChanged event indicates that there has been a change in the recognition results, i.e., the intermediate results of a sentence.
- If enable_intermediate_result is set to true, the server continues to return multiple TranscriptionResultChanged messages, which are the intermediate recognition results.
- If enable_intermediate_result is set to false, the server does not return any messages for this step.
Note:
The last intermediate result obtained may not be the same as the final result. Take the result carried by the SentenceEnd event as the final recognition result.
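A common consumption pattern that respects this caveat is to treat each TranscriptionResultChanged result as a full replacement of the in-progress sentence and to commit text only on SentenceEnd; a sketch:

```python
transcript: list[str] = []  # committed sentences (SentenceEnd results)
pending = ""                # display-only text for the sentence in progress

def on_result(name: str, payload: dict) -> None:
    global pending
    if name == "TranscriptionResultChanged":
        pending = payload["result"]           # overwrite, never append
    elif name == "SentenceEnd":
        transcript.append(payload["result"])  # commit the final result
        pending = ""
```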
Response Parameters (header object):
The header object parameters are the same as in the SentenceBegin event, with name being TranscriptionResultChanged to indicate an intermediate recognition result for a sentence.
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| lang_type | String | Returns the language code when enable_lang_label is enabled |
| paragraph | Integer | Paragraph number, starting from 1 and incrementing |
| index | Integer | Sentence number, starting from 1 and incrementing |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Speaker number, see "Additional Features - Speaker ID" sections below |
| result | String | The intermediate recognition result of this sentence |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| volume | Integer | The current volume, range [0, 100] |
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "TranscriptionResultChanged",
"status": "00000",
"status_text": "success",
"task_id": "3ee284e922dd4554bb6ccda7989d1973",
"message_id": "749b680abe733cd488cf50f9e3953b97",
"user_id":"conversation_001"
},
"payload": {
"lang_type": "ja-JP",
"paragraph": 1,
"index": 1,
"time": 1920,
"begin_time": 0,
"speaker_id": "",
"result": "天気",
"confidence": 1,
"volume": 79,
"words": [
{
"word": "天気",
"start_time": 0,
"end_time": 1920,
"stable": false
}
]
}
}SentenceEnd Event
The SentenceEnd event signifies that the service has detected the end of a spoken sentence and returns the final transcription result for that sentence.
Response Parameters (header object):
The header object parameters are consistent with those described in the SentenceBegin event, with the name attribute being SentenceEnd to indicate that the end of a sentence was recognized.
Response Parameters (payload object):
| Parameter | Type | Description |
|---|---|---|
| lang_type | String | Returns the language code when enable_lang_label is enabled |
| paragraph | Integer | Paragraph number, starting from 1 and incrementing |
| index | Integer | Sentence number, starting from 1 and incrementing |
| time | Integer | The duration of the audio processed so far, in milliseconds |
| begin_time | Integer | The time corresponding to the SentenceBegin event for the current sentence, in milliseconds |
| speaker_id | String | Speaker number or network sound card channel number, see "Additional Features - Speaker ID" and "Additional Features - Network Sound Card" sections below |
| result | String | The final recognition result for this sentence |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| words | Dict[] | Final result word information for this sentence, only returned when enable_words is set to true |
| volume | Integer | The current volume, range [0, 100] |
The structure of the final result word information object is as follows:
| Parameter | Type | Description |
|---|---|---|
| word | String | The text of the word. |
| start_time | Integer | The start time of the word, in milliseconds. |
| end_time | Integer | The end time of the word, in milliseconds. |
| type | String | The type of the word: normal indicates regular text, forbidden indicates sensitive words, modal indicates modal particles, punc indicates punctuation marks |
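As an illustration, a consumer could use the type field to rebuild a cleaned-up transcript; the masking policy below is a hypothetical example, not service behavior:

```python
def clean_text(words: list) -> str:
    parts = []
    for w in words:
        if w["type"] == "modal":
            continue                            # drop modal particles
        elif w["type"] == "forbidden":
            parts.append("*" * len(w["word"]))  # mask sensitive words
        else:                                   # "normal" text and "punc" marks
            parts.append(w["word"])
    return "".join(parts)
```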
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SentenceEnd",
"status": "00000",
"status_text": "success",
"task_id": "3ee284e922dd4554bb6ccda7989d1973",
"message_id": "749b680abe737488cf50f3cd9e3953b9",
"user_id":"conversation_001"
},
"payload": {
"lang_type": "zh-cmn-Hans-CN",
"paragraph": 1,
"index": 1,
"time": 5670,
"begin_time": 390,
"speaker_id": "speaker1",
"result": "天気がいいから、散歩しましょう。",
"confidence": 0.9,
"volume": 76,
"words": [{
"word": "天気",
"start_time": 390,
"end_time": 1110,
"type": "normal"
}, {
"word": "が",
"start_time": 1110,
"end_time": 1440,
"type": "normal"
}, {
"word": "いい",
"start_time": 1440,
"end_time": 2130,
"type": "normal"
}, {
"word": "から",
"start_time": 2160,
"end_time": 3570,
"type": "normal"
}, {
"word": "、",
"start_time": 4290,
"end_time": 4860,
"type": "punc"
},
…(omitted)…
]
}
}

Additional Features
The following three additional features are only available in Real-time Speech Transcription.
Note:
If the source_url parameter is set during the connection, the forced sentence breaking and custom speaker numbering features are not available.
Forced Sentence Breaking
During the transmission of audio data, sending a SentenceEnd event will force the server to break the sentence at the current position. After the server processes the sentence breaking, the client will receive a SentenceEnd message with the final recognition result for that sentence. This feature is not available when using a network sound card.
Example of sending:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SentenceEnd"
}
}

Custom Speaker Numbering
Before sending audio data, send a SpeakerStart event and specify speaker_id as the speaker number. The server treats the audio data between one SpeakerStart event and the next SpeakerStart or StopTranscription event as belonging to the specified speaker, and returns this information in the speaker_id field of the recognition result. This feature is not available when using a network sound card.
Note:
The speaker_id supports up to 36 characters; any excess is truncated and discarded. If the speaker_id parameter is not passed in the SpeakerStart event, the speaker_id in the result will be empty. The SpeakerStart event triggers a forced sentence break, so send a SpeakerStart event only when switching speakers.
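Expressed as client code, the switching logic might look like the following sketch, which reuses the ws connection from the earlier sketches; the packet-level flow it produces is shown in the example that follows:

```python
import json

def send_speaker_start(ws, speaker_id: str) -> None:
    # Triggers a forced sentence break, so call it only when the speaker changes.
    ws.send(json.dumps({
        "header": {"namespace": "SpeechTranscriber", "name": "SpeakerStart"},
        "payload": {"speaker_id": speaker_id},  # at most 36 characters
    }))

def stream_with_speakers(ws, labelled_audio) -> None:
    """labelled_audio: iterable of (speaker_id, chunks) pairs, e.g. per-channel audio."""
    for speaker_id, chunks in labelled_audio:
        send_speaker_start(ws, speaker_id)
        for chunk in chunks:
            ws.send_binary(chunk)
```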
Example of sending:
Angle brackets <> indicate audio data packets, and curly braces {} indicate JSON data packets.
<Binary Audio Data Packet 1>
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SpeakerStart"
},
"payload": {
"speaker_id": "001"
}
}
<Binary Audio Data Packet 2>
<...>
<Binary Audio Data Packet n>
{
"header": {
"namespace": "SpeechTranscriber",
"name": "SpeakerStart"
},
"payload": {
"speaker_id": "002"
}
}
<Binary Audio Data Packet n+1>
<Binary Audio Data Packet n+2>

SSE Method for Returning Results
When enable_sse is set to true, after establishing a Real-time Speech Transcription WebSocket connection and receiving the TranscriptionStarted event, another client can retrieve the recognition results using the following HTTP request:
GET http://{server_ip}:7100/getAsrResult?task_id=******

The recognition results are streamed back via Server-Sent Events (SSE), in the same format as the content returned over WebSocket. The connection is automatically terminated when recognition ends.
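For example, a Python consumer of this endpoint might look like the sketch below, which assumes the stream uses standard data: SSE framing (requests is a third-party package):

```python
import json

import requests  # third-party package: pip install requests

def follow_results(server_ip: str, task_id: str) -> None:
    url = f"http://{server_ip}:7100/getAsrResult"
    # stream=True keeps the HTTP connection open while events arrive.
    with requests.get(url, params={"task_id": task_id}, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.startswith("data:"):
                event = json.loads(line[len("data:"):].strip())
                print(event["header"]["name"], event.get("payload", {}).get("result"))
```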
Quick Test Example (Linux):
curl -X GET 'http://localhost:7100/getAsrResult?task_id=******'

Quick Test Example (Windows):
curl.exe -X GET "http://localhost:7100/getAsrResult?task_id=******"

3. Maintaining Connection (Heartbeat Mechanism)
When not sending audio, a heartbeat packet must be sent at least once every 10 seconds; otherwise, the connection is automatically terminated. It is recommended that the client send a heartbeat every 8 seconds.
Note:
Heartbeat packets should be sent only after the StartRecognition or StartTranscription event has been sent.
If the source_url parameter is set during the connection, the heartbeat mechanism does not apply.
Example of sending:
{
"header":{
"namespace":"SpeechTranscriber",
"name":"Ping"
}
}

Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "Pong",
"task_id": "71c5cb9b-fbc3-4489-843c-e902b102a569",
"message_id": "6f9ea191-1624-4d3c-9286-0003d323f731"
},
"payload": {}
}

If no data is transmitted to the server within 10 seconds, the server will return an error message and then automatically disconnect the connection.
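A sketch of a background heartbeat along these lines, reusing the ws connection from the earlier sketches and pinging every 8 seconds as recommended; stop_event is an illustrative flag that the sender sets when audio transmission resumes or the session ends:

```python
import json
import threading

PING = json.dumps({"header": {"namespace": "SpeechTranscriber", "name": "Ping"}})

def keep_alive(ws, stop_event: threading.Event) -> None:
    # wait() returns False on timeout, so this pings every 8 seconds until stopped.
    while not stop_event.wait(timeout=8):
        ws.send(PING)

stop_event = threading.Event()
threading.Thread(target=keep_alive, args=(ws, stop_event), daemon=True).start()
```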
4. Stop and Retrieve Final Results
The client sends a request to stop Real-time Speech Transcription, notifying the server that the transmission of voice data has ended and to terminate speech recognition. The server returns the final recognition results and then automatically disconnects the connection.
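Continuing the earlier Python sketches, this step might look like the following; in a real client the dispatch loop from step 2 would usually consume the TranscriptionCompleted event instead. The request and response messages exchanged here are specified below:

```python
import json

def stop_transcription(ws) -> dict:
    """Send StopTranscription, then wait for the final TranscriptionCompleted event."""
    ws.send(json.dumps({
        "header": {"namespace": "SpeechTranscriber", "name": "StopTranscription"},
    }))
    while True:
        msg = json.loads(ws.recv())
        if msg["header"]["name"] in ("TranscriptionCompleted", "TaskFailed"):
            ws.close()
            return msg
```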
Request Parameters (header object):
| Parameter | Type | Required | Description |
|---|---|---|---|
| namespace | String | Yes | The namespace to which the message belongs: SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | Yes | Event name: StopTranscription indicates terminating Real-time Speech Transcription |
Example of a request:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "StopTranscription"
}
}

Response Parameters (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | The namespace to which the message belongs, SpeechTranscriber indicates Real-time Speech Transcription |
| name | String | The name of the message, TranscriptionCompleted indicates that the transcription is complete |
| status | String | Status code, indicating whether the request was successful, see service status codes |
| status_text | String | Status message |
| task_id | String | The globally unique ID for the task; please record this value for troubleshooting |
| message_id | String | The ID for this message |
| user_id | String | The user_id passed in when the connection was established |
Response Parameters (payload object): The format is the same as the SentenceEnd event, but the result and words fields may be empty.
Example of a response:
{
"header":{
"namespace":"SpeechTranscriber",
"name":"TranscriptionCompleted",
"status":"00000",
"status_text":"success",
"task_id":"3ee284e922dd4554bb6ccda7989d1973",
"message_id":"7e729bf2d4064fee83143c4d962dc6f1",
"user_id":""
},
"payload":{
"index":1,
"time":4765,
"begin_time":180,
"speaker_id":"",
"result":"",
"confidence": 0,
"volume": 0,
"words":[]
}
}

5. Error Messages
Error messages are returned through the WebSocket connection, after which the server automatically closes the connection.
Example of a response:
{
"header": {
"namespace": "SpeechTranscriber",
"name": "TaskFailed",
"status": "20105",
"status_text": "JSON serialization failed",
"task_id": "df2c1604e31d4f46a7a064db73cd3b5e",
"message_id": "",
"user_id": ""
},
"payload": {
"index": 1,
"time": 0,
"begin_time": 0,
"speaker_id": "",
"result": "",
"confidence": 1,
"volume": 0,
"words": null
}
}

Service Status Codes
| Status Code | Reason |
|---|---|
| 20001 | Parameter parsing failed |
| 20002, 20003 | File processing failed |
| 20111 | WebSocket upgrade failed |
| 20114 | Request body is empty |
| 20115 | Audio size/duration limit exceeded |
| 20116 | Unsupported sample_rate |
| 20190 | Missing parameter |
| 20191 | Invalid parameter |
| 20192 | Processing failed (decoder) |
| 20193 | Processing failed (service) |
| 20194 | Connection timed out (no data received from the client for a long time) |
| 20195 | Other error |