Audio File Transcription API

Feature Introduction

The Audio File Transcription service transcribes audio files into text.

For more information, please refer to the "Basic Knowledge" section in the "Speech-to-Text Service User Manual".

Invocation Limits

The audio file must be shorter than 10 hours and smaller than 2 GB.

Interaction Process

Uploading Files
- Request File Transcription: The client sends an HTTP POST request containing the file to be transcribed, and the server responds with a task ID.

Getting Transcription Results

There are two ways to obtain transcription results, polling and callback:
- Polling: The client sends an HTTP GET request with the task ID returned by the Request File Transcription API, and the server responds with the transcription results.
- Callback: A callback URL is provided when the file is uploaded. Once the transcription task completes, the system sends the transcription results to that URL in an HTTP POST request.

Request URLs

Interface Name	Protocol	URL	Method
File Transcription	HTTP/1.1	`http://<ip_address>:7100/request`	POST
Get Transcription	HTTP/1.1	`http://<ip_address>:7100/getResult`	GET
Query Task	HTTP/1.1	`http://<ip_address>:7100/getTasks`	GET
Delete Task	HTTP/1.1	`http://<ip_address>:7100/deleteTask`	POST

Request File Transcription API

Request Line

POST /request

Request Header

Name	Type	Required	Description
Content-Type	String	Yes	Must be "multipart/form-data"

Request Parameters

The client sends a request where parameters are set in the request body in form-data format. The meanings of the parameters are as follows:

Parameter	Type	Required	Description	Default
lang_type	String	Yes	Language code of the file to be transcribed; see the "Language Support" section in the "Speech-to-Text Service User Manual"	Required
file	File	One of `file` or `file_url` is required	The file to be transcribed	Null
file_url	String	One of `file` or `file_url` is required	The URL of the audio file to be transcribed. If both `file` and `file_url` are present, `file` takes precedence.	Null
format	String	Yes	The format of the file to be transcribed; see the "Audio Encoding" section in the "Speech-to-Text Service User Manual". Note: When transcribing an 8 kHz PCM audio file with a 16 kHz model, pass the value `pcm_8000`.	Required
sample_rate	Integer	No	Sampling rate of the model (not of the audio itself). Note: The 16 kHz model can also transcribe audio recorded at an 8 kHz sample rate.	16000
output	String	No	`text` for manuscript format, `subtitle` for subtitle format; see the "Practical Features" section in the "Speech-to-Text Service User Manual"	text
channels	Integer	No	Number of audio channels to transcribe; must be ≥ 1. When greater than 1, the first `channels` audio channels are transcribed separately. When set to 1, all audio channels of the file are mixed down to a single channel before transcription. See the note below the table for details.	1
enable_modal_particle_filter	Boolean	No	Whether to enable modal particle filtering, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual"	false
enable_punctuation_prediction	Boolean	No	Whether to add punctuation	true (if `output=text`) false (if `output=subtitle`)
max_sentence_silence	Integer	No	Sentence-break detection threshold: silence longer than this value is treated as a sentence break. Range [200, 5000], in milliseconds.	450
enable_words	Boolean	No	Whether to enable returning word information in the final results, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual"	false
words_type	Integer	No	The minimum unit for returned word information; only valid for Chinese. 0 for words, 1 for characters.	0
enable_inverse_text_normalization	Boolean	No	Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual"	false
hotwords_id	String	No	Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol"	Null
hotwords_weight	Float	No	Hotwords weight, range [0.1, 1.0]	0.4
correction_words_id	String	No	Forced correction vocabulary ID; refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Supports multiple forced correction vocabulary IDs, separated by a vertical bar `|`; `all` uses all forced correction vocabularies.	Null
forbidden_words_id	String	No	Forbidden words ID; refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Supports multiple IDs, separated by a vertical bar `|`; `all` uses all IDs.	Null
gain	Integer	No	Amplitude gain factor, range [1, 20]; see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". The value 1 means no amplification, 2 doubles the original amplitude, and so on.	1
callback_url	String	No	Callback URL, e.g., `http://10.0.0.2:8080`	Null
clusters	Integer	No	Number of speaker diarization categories, range [0, 10]: the number of speakers to distinguish by voiceprint. 1 means no speaker diarization is performed; 0 means the system determines the number of speakers automatically. When `channels` > 1 is also set, the audio is processed as multi-channel audio and speaker diarization is not performed. This feature relies on the voiceprint component (module_vpr).	1
enable_lang_label	Boolean	No	[In Development v2.5.11] Return language code in recognition results when switching languages, only effective for mixed English languages (e.g., Japanese-English, Chinese-English)	false
paragraph_condition	Integer	No	[In Development v2.5.11] Controls paragraph length by character count: once the set character count is reached, the next sentence within the same cluster_id is given a new paragraph number. Range [100, 2000]; values outside the range disable this feature.	0
decode_silence_segment	Boolean	No	Control whether to perform speech recognition processing on the silent segments determined by VAD, suitable for far-field recording environments (supported in version 2.5.11 and above).	false

Note:

Regarding the channels parameter
- When channels = 1, the audio is converted to a single-channel audio before transcription.
- When channels = 2, the first two channels of all channels are processed and transcribed separately. When processing begins, if there are at least two available concurrent processes, two are used; if there is only one available concurrent process, then only one is used. The transcription results use cluster_id to differentiate the channels (with values of 1 and 2).
- When channels = 3, the first three channels of all channels are processed and transcribed separately... and so on.
- When channels > the actual number of channels in the audio, the effective value of channels is the actual number of channels in the audio. That is, the effective value of channels = min(channels parameter value, actual number of audio channels).
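
The documented value ranges and the channels rule above can be checked on the client before a file is uploaded. The following is a minimal sketch, not part of the service: the function names are hypothetical, and the bounds are copied from the parameter table.

```python
# Inclusive bounds copied from the parameter table above.
PARAM_RANGES = {
    "max_sentence_silence": (200, 5000),  # milliseconds
    "hotwords_weight": (0.1, 1.0),
    "gain": (1, 20),
    "clusters": (0, 10),
}


def check_ranges(params):
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    for name, (low, high) in PARAM_RANGES.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name}={params[name]} is outside [{low}, {high}]")
    if params.get("channels", 1) < 1:
        problems.append("channels must be >= 1")
    return problems


def effective_channels(requested, actual):
    """Effective channels = min(channels parameter value, channels in the audio)."""
    return min(requested, actual)


def diarization_applies(channels, clusters):
    """Diarization is skipped when channels > 1 or when clusters == 1."""
    return channels <= 1 and clusters != 1
```

For example, `effective_channels(3, 2)` gives 2, so a stereo file submitted with `channels=3` would be transcribed as two separate channels.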

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

Name	Type	Description
status	String	Status code
message	String	Description of the status code
data	Dict	Transcription information
├─ duration	Integer	Audio duration (seconds)
├─ task_id	String	The task ID for this transcription. Record this value; the Get Transcription Results and Delete Transcription Task APIs require it.

(Note: The table above describes the structure of the JSON response typically expected from the API. The actual response may include additional fields or differ slightly based on the implementation details of the service.)

Successful Response

The status field in the body is 00000.

Example of a successful response:

{
    "status": "00000",
    "message": "Success",
    "data": {
        "duration": 600,
        "task_id": "96e25f64-1727-4dbe-889c-4a9052c2eb28"
    }
}

Error Response

Any status field in the body that is not 00000 is considered an error response, and this field can be used as an indicator of whether the response was successful.

Example of an error response:

{
    "status": "20322",
    "message": "lang_type invalid",
    "data": null
}
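
A client can separate the two cases by checking the status field, as described above. The sketch below assumes the response body has already been read as text; `TranscriptionError` and `parse_submit_response` are hypothetical names used only for illustration.

```python
import json


class TranscriptionError(Exception):
    """Raised when the service returns a status other than "00000"."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def parse_submit_response(body):
    """Return (task_id, duration_seconds) from a POST /request response body."""
    payload = json.loads(body)
    if payload.get("status") != "00000":
        # Any status other than "00000" is an error response; data is null.
        raise TranscriptionError(payload.get("status"), payload.get("message"))
    data = payload["data"]
    return data["task_id"], data["duration"]
```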

Get Transcription Results API (Polling)

Request Line

GET /getResult

Request Parameters

The client sends a request with the following query parameters:

Parameter	Type	Required	Description
task_id	String	Yes	Task ID, returned by the Request File Transcription API

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

Name	Type	Description
status	String	Status code
message	String	Description of the status code
data	Dict	Transcription information
statistics	Dict	Statistical information

Successful Response

When transcription is not completed yet:

The status field in the body is 20302.

The data field contains a description of the current transcription task.

The data field is defined as follows:

Name	Type	Description
desc	String	Description of the transcription status code
file_name	String	The name of the file being transcribed
insert_time	String	The time of the request for file transcription
process_time	String	The time when the transcription actually started after queuing
progress	Integer	Transcription progress, a negative number `-n` indicates queuing, with n transcription tasks ahead; a positive number `n` indicates that transcription is in progress, with a processing progress of n percent; `0` indicates transcription failure.

Example of a successful response:

{
    "status": "20302",
    "message": "The task is still in the queue or being processed.",
    "data": {
        "desc": "Transcription in progress",
        "file_name": "test.mp3",
        "insert_time": "2022-11-11 12:42:04",
        "progress": 92
    }
}
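
The sign convention of the progress field can be turned into a readable message on the client. This is a small sketch; the function name and the message wording are illustrative and are not produced by the service.

```python
def describe_progress(progress):
    """Translate the progress value of a 20302 response into text."""
    if progress < 0:
        # A negative value -n means the task is queued behind n other tasks.
        return f"queued, {-progress} task(s) ahead"
    if progress == 0:
        # Zero is documented as transcription failure.
        return "failed"
    # A positive value n is the percentage processed so far.
    return f"in progress, {progress}% done"
```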

When transcription is completed:

The status field in the body is 00000.

The data field in the body is an array containing the transcription results of this task.

The fields for the transcription results are defined as follows:

Name	Type	Description
lang_type	String	[In Development v2.5.11] The language code returned in the recognition result when switching languages, effective only for mixed English languages (e.g., Japanese-English, Chinese-English)
paragraph	Integer	[In Development v2.5.11] Paragraph number, starting from 1 and increasing incrementally
begin	String	The start time of each segment (relative to the source audio file)
end	String	The end time of each segment (relative to the source audio file)
seg_num	Integer	Segment number (starting from 1 and increasing incrementally)
transcript	String	The transcription result of each segment
confidence	Float	The confidence level of the current result, range [0, 1]
words	Dict[]	Word information (returned when the request parameter `enable_words=true` is set)
cluster_id	Integer	The cluster number, starting from 1 and increasing incrementally. The same `cluster_id` indicates the same speaker, or the same channel in multi-channel transcription. Returned when a valid `clusters` parameter is set or when `channels` >= 2.
volume	Integer	The current volume, range [0, 100]

The structure of the word information object is as follows:

Parameter	Type	Description
text	String	The text of the word or punctuation
start_time	Integer	The start time of the word, in milliseconds
end_time	Integer	The end time of the word, in milliseconds
type	String	The type of the text: `normal` indicates regular text, `forbidden` indicates sensitive words, `modal` indicates modal particles (only returned when the request parameter `words_type = 0`), `punc` indicates punctuation marks

The statistics field in the body contains statistical information for this transcription task:

Parameter	Type	Description
speed	Integer	Average speaking speed (Chinese/Japanese/Korean: characters per minute; other languages: words per minute)
word_count	Integer	Word count (Japanese/Chinese/Korean: character count; other languages: word count)
insert_time	String	The time of the request for file transcription
process_time	String	The time when the transcription actually started after queuing
finish_time	String	The time when the transcription task was completed

Example of a successful response:

{
    "status": "00000",
    "message": "",
    "data": [
        {
            "begin": "00:00:01,170",
            "end": "00:00:04,610",
            "lang_type": "ja-JP",
            "paragraph": 1,
            "seg_num": 1,
            "transcript": "初めまして、お名前は？",
            "confidence": 0.9,
            "volume": 76,
            "words": [
                {
                    "word": "初め",
                    "start_time": 80,
                    "end_time": 400,
                    "is_punc": false,
                    "keywords_index": 0,
                    "type": "normal"
                },
                {
                    "word": "まして",
                    "start_time": 80,
                    "end_time": 400,
                    "is_punc": false,
                    "keywords_index": 0,
                    "type": "normal"
                },
                {
                    "word": "、",
                    "start_time": 400,
                    "end_time": 860,
                    "is_punc": true,
                    "keywords_index": 0,
                    "type": "punc"
                },
                ... (omitted) ...
            ]
        },
        {
            "begin": "00:00:04,630",
            "end": "00:00:07,010",
            "lang_type": "zh-cmn-Hans-CN",
            "paragraph": 1,
            "seg_num": 2,
            "transcript": "初めまして、田中と申します。",
            "confidence": 0.95,
            "volume": 82,
            "words": [
                ... (omitted) ...
            ]
        }
    ],
    "statistics": {
        "finish_time": "2022-11-11 12:47:12",
        "insert_time": "2022-11-11 12:42:04",
        "process_time": "2022-11-11 12:46:28",
        "speed": 210,
        "word_count": 86
    }
}
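
The begin and end values in the example above are strings. A client that needs numeric offsets could convert them as sketched below. This assumes the `HH:MM:SS,mmm` layout seen in the examples is the only format returned, which this document does not state explicitly; the function names are hypothetical.

```python
import re

# Assumed layout, inferred from the examples: hours:minutes:seconds,milliseconds
_TIMESTAMP = re.compile(r"^(\d+):(\d{1,2}):(\d{1,2}),(\d{3})$")


def timestamp_to_ms(value):
    """Convert a begin/end string such as "00:00:01,170" to milliseconds."""
    match = _TIMESTAMP.match(value)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def segment_duration_ms(segment):
    """Length of one entry of the data array, in milliseconds."""
    return timestamp_to_ms(segment["end"]) - timestamp_to_ms(segment["begin"])
```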

Error Response

Any status field in the body that is not 20302 or 00000 is considered an error response. This field can be used as an indicator of whether the request was successful.

Example of an error response:

{
    "status": "20326",
    "message": "task_id invalid",
    "data": null
}
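
Putting the three cases together (20302 while pending, 00000 when complete, anything else an error), a polling loop might look like the sketch below. The `fetch` callable, the 5-second interval, and the attempt limit are assumptions chosen for illustration; the service does not prescribe a polling interval in this document.

```python
import time


def poll_result(fetch, task_id, interval=5.0, max_attempts=120, sleep=time.sleep):
    """Call fetch(task_id) until the task leaves the pending (20302) state.

    `fetch` is any callable returning the decoded JSON body of
    GET /getResult as a dict. It is injected so the loop can run offline.
    """
    for _ in range(max_attempts):
        payload = fetch(task_id)
        status = payload.get("status")
        if status == "00000":
            return payload  # transcription completed
        if status != "20302":
            raise RuntimeError(f"getResult failed: {status} {payload.get('message')}")
        data = payload.get("data") or {}
        if data.get("progress") == 0:
            # progress 0 is documented as transcription failure
            raise RuntimeError(f"transcription failed for task {task_id}")
        sleep(interval)
    raise TimeoutError(f"task {task_id} still pending after {max_attempts} polls")
```

Passing `fetch` and `sleep` as arguments keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.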

Callback of Transcription Results

If the callback_url parameter is specified when calling the Request File Transcription API, the system will automatically send the recognition results to the specified address in the form of an HTTP POST request upon completion (or failure) of the transcription.

The body of the callback request contains the following fields:

Name	Type	Description
status	String	Status code
message	String	Description of the status code
data	Dict	Transcription information

Successful Transcription

The data field for successful transcription is defined as follows:

Name	Type	Description
result	Dict[]	Transcription results, with the same format as the `data` field in the polling API
statistics	Dict	Statistical information, with the same format as the `statistics` field in the polling API
task_id	String	The task ID for this transcription

Example of a successful callback request body:

{
    "status":"00000",
    "message":"success",
    "data":{
        "result":[
            {
                "begin": "00:00:01,170",
                "end": "00:00:04,610",
                "lang_type": "zh-cmn-Hans-CN",
                "paragraph": 1,
                "seg_num": 1,
                "transcript": "初めまして、お名前は？",
                "confidence": 0.9,
                "volume": 82
            },
            {
                "begin": "00:00:04,630",
                "end": "00:00:07,010",
                "lang_type": "zh-cmn-Hans-CN",
                "paragraph": 1,
                "seg_num": 2,
                "transcript": "初めまして、田中と申します。",
                "confidence": 0.95,
                "volume": 76
            },
            ... (omitted) ...
        ],
        "statistics": {
            "keywords":[],
            "speed": 210,
            "word_count": 86,
            "finish_time": "2022-11-11 12:47:12",
            "insert_time": "2022-11-11 12:42:04",
            "process_time": "2022-11-11 12:46:28"
        },
        "task_id":"27dc2487-bb02-4585-bd10-b75b445441bc"
    }
}

Transcription Failure

The field format is the same as described above, but the result field will be empty. For specific information about the task failure, refer to the message field.
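
A receiver for the callback can be sketched with the Python standard library. This is an illustration under assumptions: the callback body is taken to be JSON with the fields described above, and replying with HTTP 200 and an empty body is a guess, since this document does not say what reply the service expects. `extract_callback`, `CallbackHandler`, and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_callback(body):
    """Return (task_id, transcripts, error) from a callback request body."""
    payload = json.loads(body)
    data = payload.get("data") or {}
    segments = data.get("result") or []  # empty when the transcription failed
    transcripts = [segment["transcript"] for segment in segments]
    error = None if payload.get("status") == "00000" else payload.get("message")
    return data.get("task_id"), transcripts, error


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        task_id, transcripts, error = extract_callback(self.rfile.read(length))
        print(task_id, error or " ".join(transcripts))
        self.send_response(200)  # assumed to be an acceptable reply
        self.end_headers()


def serve(port=8080):
    """Block forever, handling callbacks on the given port."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Calling `serve(8080)` on a host reachable by the service and passing that host's address as `callback_url` would print one line per finished task.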

Query Tasks API

Filter tasks based on certain conditions. Tasks cleared according to the expiration policy in the configuration file will not be returned.

Request Line

GET /getTasks

Request Parameters

The client sends a request with the following query parameters:

Parameter	Description	Default
time	Filter by upload time: only audio uploaded within the last n hours is returned	No time limit
type	Filter by type. 0: Failed, 1: Completed, 2: Processing, 3: Queued	No type limit
count	Filter by quantity: the maximum number of tasks returned	No quantity limit

When combining the above filter conditions, they are used in an "AND" relationship.
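
The filters are passed as ordinary query-string parameters. The helper below is a sketch that builds such a URL with the standard library; the function name and its keyword arguments are hypothetical, and only the `time`, `type`, and `count` parameter names come from the table above.

```python
from urllib.parse import urlencode


def build_get_tasks_url(base_url, hours=None, task_type=None, count=None):
    """Build a GET /getTasks URL, including only the filters that are set."""
    query = {}
    if hours is not None:
        query["time"] = hours       # uploaded within the last n hours
    if task_type is not None:
        query["type"] = task_type   # 0 failed, 1 completed, 2 processing, 3 queued
    if count is not None:
        query["count"] = count      # maximum number of tasks returned
    url = base_url.rstrip("/") + "/getTasks"
    return f"{url}?{urlencode(query)}" if query else url
```

For example, failed tasks from the last 24 hours, at most 10 of them, would be requested with `build_get_tasks_url("http://localhost:7100", hours=24, task_type=0, count=10)`.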

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

Parameter	Type	Description
status	String	Status code
message	String	Description of the status code
data	Dict[]	Task information
├─ task_id	String	Task ID
├─ desc	String	Description of the transcription status
├─ file_name	String	Name of the file being transcribed
├─ insert_time	String	Time of the request for file transcription
├─ process_time	String	Time when the transcription actually started after queuing
├─ finish_time	String	Time when the transcription task was completed
├─ progress	Integer	Transcription progress. A negative number -n indicates queuing, with n transcription tasks ahead; A positive number n indicates that transcription is in progress, with a processing progress of n percent; 0 indicates transcription failure; 100 indicates transcription completed.
statistics	Dict	Statistical information
├─ count	Integer	Total number of tasks in the query results
├─ types	Dict[]	Summary of quantities by type
├─├─ type	Integer	0: Failed, 1: Completed, 2: Processing, 3: Queued
├─├─ desc	String	Description of the transcription status
├─├─ count	Integer	Number of tasks

Delete Transcription Task API

This API deletes tasks that are queued or completed. A task that is currently being processed is deleted once its pre-processing phase has finished.

Request Line

POST /deleteTask

Request Parameters

The client sends a request with the following query parameters:

Parameter	Type	Required	Description
task_id	String	Yes	Task ID, returned by the Request File Transcription API

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

Name	Type	Description
status	String	Status code
message	String	Description of the status code
data	Dict	Always null

Service Status Codes

Every response from the service will include the status and message fields, which indicate the status code and the description of the status code for the response. The service status codes are as follows:

Status Code	Reason
20302	Progress update from `getResult` API
20001	Parameters parsing failed
20002, 20003, 20006	File processing failed
20111	WebSocket upgrade failed
20114	Request body is empty or failed to retrieve the audio file
20115	Audio size/duration limit exceeded
20116	Unsupported `sample_rate`
20190	Missing parameter
20191	Invalid parameter
20192	Processing failed (decoder)
20193	Processing failed (service)
20194	Connection timed out (no data received from the client for a long time)
20195	Other error
20330	Speaker diarization related error
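
For logging or error messages, a client can keep the table above as a lookup. The sketch below copies the reasons from the table. Codes that are not listed there, such as the 20322 and 20326 values shown in the earlier examples, fall through to a generic message. The names are hypothetical.

```python
# Reasons copied from the status code table above.
STATUS_REASONS = {
    "20302": "Progress update from getResult API",
    "20001": "Parameters parsing failed",
    "20002": "File processing failed",
    "20003": "File processing failed",
    "20006": "File processing failed",
    "20111": "WebSocket upgrade failed",
    "20114": "Request body is empty or failed to retrieve the audio file",
    "20115": "Audio size/duration limit exceeded",
    "20116": "Unsupported sample_rate",
    "20190": "Missing parameter",
    "20191": "Invalid parameter",
    "20192": "Processing failed (decoder)",
    "20193": "Processing failed (service)",
    "20194": "Connection timed out (no data received from the client for a long time)",
    "20195": "Other error",
    "20330": "Speaker diarization related error",
}


def explain_status(status):
    """Return the documented reason for a status code."""
    if status == "00000":
        return "Success"
    return STATUS_REASONS.get(status, "Unknown status code (not in the documented table)")


def is_pending(status):
    """True while getResult reports that the task is queued or running."""
    return status == "20302"
```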

Quick Test

Linux

Request File Transcription API

curl -X POST -F 'file=@test.mp3' -F 'lang_type=ja-JP' -F 'format=mp3' -F 'sample_rate=16000' 'http://localhost:7100/request'

Get Transcription Results API

curl -X GET 'http://localhost:7100/getResult?task_id=******'

Delete Transcription Task API

curl -X POST 'http://localhost:7100/deleteTask?task_id=******'

Windows

Request File Transcription API

curl.exe -X POST -F "file=@test.mp3" -F "lang_type=ja-JP" -F "format=mp3" -F "sample_rate=16000" "http://localhost:7100/request"

Get Transcription Results API

curl.exe -X GET "http://localhost:7100/getResult?task_id=******"

Delete Transcription Task API

curl.exe -X POST "http://localhost:7100/deleteTask?task_id=******"
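
Python

The upload command can also be issued from Python using only the standard library. The sketch below builds the multipart/form-data body by hand and sends it with `urllib.request`. The helper names `encode_multipart` and `submit_file` are hypothetical, the file part is labelled `application/octet-stream` as an assumption, and the whole file is read into memory, which is unsuitable for files near the 2 GB limit.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body; returns (content_type, body_bytes)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text.encode("utf-8"))
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(head.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


def submit_file(base_url, path, lang_type, audio_format, **extra):
    """POST a local file to /request and return the decoded JSON response."""
    with open(path, "rb") as handle:
        file_bytes = handle.read()
    fields = {"lang_type": lang_type, "format": audio_format, **extra}
    content_type, body = encode_multipart(
        fields, "file", os.path.basename(path), file_bytes
    )
    request = urllib.request.Request(
        base_url.rstrip("/") + "/request",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call equivalent to the curl command above would be `submit_file("http://localhost:7100", "test.mp3", "ja-JP", "mp3", sample_rate=16000)`.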
