Audio File Transcription API

Feature Introduction

The Audio File Transcription service transcribes audio files into text.

For more information, please refer to the "Basic Knowledge" section in the "Speech-to-Text Service User Manual".

Invocation Limits

The duration of the audio file should be under 10 hours, and the file size should be under 2 GB.

Interaction Process

  • Uploading Files

    • Request File Transcription: The client sends an HTTP POST request to the server with the file to be transcribed, and the server returns an HTTP response containing a task ID.
  • Getting Transcription Results: There are two ways to obtain transcription results: polling and callback.

    • Polling: The client sends an HTTP GET request to the server with the task ID returned by the Request File Transcription API, and the server returns an HTTP response that includes the transcription results.
    • Callback: When uploading the file, a callback URL is provided. Once the transcription task is completed, the system automatically sends the transcription results to the specified callback URL via an HTTP POST request.
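
The polling flow above can be sketched as a small loop. This is a minimal sketch, not the service's official client: wait_for_result, fetch, interval, max_polls, and sleep are hypothetical names, and fetch stands for any caller-supplied function that performs GET /getResult and returns the decoded JSON body. The status values 20302 (still pending) and 00000 (finished) are taken from this document.

```python
import time
from typing import Callable

PENDING = "20302"  # task is still queued or being processed
SUCCESS = "00000"  # transcription finished


def wait_for_result(fetch: Callable[[str], dict], task_id: str,
                    interval: float = 5.0, max_polls: int = 120,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the task leaves the pending state.

    `fetch` is a caller-supplied function (hypothetical) that performs
    GET /getResult for the task ID and returns the decoded JSON body.
    """
    for _ in range(max_polls):
        body = fetch(task_id)
        status = body.get("status")
        if status == SUCCESS:
            return body
        if status != PENDING:
            # Any other status is documented as an error response.
            raise RuntimeError(f"transcription failed: {status} {body.get('message')}")
        sleep(interval)
    raise TimeoutError(f"task {task_id} still pending after {max_polls} polls")
```

Injecting fetch and sleep keeps the loop independent of any particular HTTP library and makes it easy to exercise without a running server.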

Request URLs

| Interface Name | Protocol | URL | Method |
| --- | --- | --- | --- |
| File Transcription | HTTP/1.1 | http://<ip_address>:7100/request | POST |
| Get Transcription | HTTP/1.1 | http://<ip_address>:7100/getResult | GET |
| Query Task | HTTP/1.1 | http://<ip_address>:7100/getTasks | GET |
| Delete Task | HTTP/1.1 | http://<ip_address>:7100/deleteTask | POST |

Request File Transcription API

Request Line

POST /request

Request Header

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| Content-Type | String | Yes | Must be "multipart/form-data" |

Request Parameters

The client sends a request where parameters are set in the request body in form-data format. The meanings of the parameters are as follows:

| Parameter | Type | Required | Description | Default |
| --- | --- | --- | --- | --- |
| lang_type | String | Yes | Language code of the file to be transcribed; see the "Language Support" section in the "Speech-to-Text Service User Manual". | Required |
| file | File | One of file or file_url | The file to be transcribed. | Null |
| file_url | String | One of file or file_url | URL of the audio file to be transcribed. If both file and file_url are present, file takes precedence. | Null |
| format | String | Yes | Format of the file to be transcribed; see the "Audio Encoding" section in the "Speech-to-Text Service User Manual". Note: when transcribing an 8 kHz PCM audio file with a 16 kHz model, pass the value pcm_8000. | Required |
| sample_rate | Integer | No | Sampling rate of the model (not of the audio itself). Note: the 16 kHz model can also transcribe audio recorded at an 8 kHz sample rate. | 16000 |
| output | String | No | text for manuscript format, subtitle for subtitle format; see the "Practical Features" section in the manual. | text |
| channels | Integer | No | Number of audio channels to transcribe; the first channels channels of the audio are transcribed separately. Must be ≥ 1. When set to 1, all channels of the file are mixed down to a single channel before transcription. See the note below the table for details. | 1 |
| enable_modal_particle_filter | Boolean | No | Whether to enable modal particle filtering; see the "Practical Features" section in the "Speech-to-Text Service User Manual". | false |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation. | true (if output=text); false (if output=subtitle) |
| max_sentence_silence | Integer | No | Sentence-break detection threshold: silence longer than this value is treated as a sentence break. Range [200, 5000], in milliseconds. | 450 |
| enable_words | Boolean | No | Whether to return word information in the final results; see the "Basic Terms" section in the "Speech-to-Text Service User Manual". | false |
| words_type | Integer | No | Smallest unit of the returned word information; only valid for Chinese. 0 for words, 1 for characters. | 0 |
| enable_inverse_text_normalization | Boolean | No | Whether to perform inverse text normalization (ITN); see the "Basic Terms" section in the "Speech-to-Text Service User Manual". | false |
| hotwords_id | String | No | Hotwords ID; see the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol". | Null |
| hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0]. | 0.4 |
| correction_words_id | String | No | Forced correction vocabulary ID; see the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Multiple vocabulary IDs can be given, separated by the vertical bar character (ASCII 0x7C); all selects every forced correction vocabulary. | |
| forbidden_words_id | String | No | Forbidden words ID; see the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Multiple IDs can be given, separated by the vertical bar character (ASCII 0x7C); all selects every ID. | Null |
| gain | Integer | No | Amplitude gain factor, range [1, 20]; see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". 1 means no amplification, 2 doubles the original amplitude, and so on. | 1 |
| callback_url | String | No | Callback URL, e.g., http://10.0.0.2:8080 | Null |
| clusters | Integer | No | Number of speakers to separate by voiceprint (speaker diarization), range [0, 10]. 1 disables speaker diarization; 0 lets the system determine the number of speakers automatically. If the channels parameter is also set and channels > 1, the request is processed as multi-channel transcription and no speaker diarization is performed. This feature relies on the voiceprint component (module_vpr). | 1 |
| enable_lang_label | Boolean | No | [In Development v2.5.11] Return the language code in recognition results when the language switches; only effective for languages mixed with English (e.g., Japanese-English, Chinese-English). | false |
| paragraph_condition | Integer | No | [In Development v2.5.11] Paragraph length control: once the set character count is reached, the next sentence within the same cluster_id is returned with a new paragraph number. Range [100, 2000]; values outside the range disable this feature. | 0 |
| decode_silence_segment | Boolean | No | Whether to run speech recognition on segments that VAD classified as silence; suitable for far-field recording environments (supported in version 2.5.11 and above). | false |
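
As an illustration of the documented ranges, a client might validate a few optional parameters before sending the form. This is a sketch under stated assumptions: build_form_fields and RANGES are hypothetical helper names, only four range-limited parameters are checked, and sending booleans as lowercase "true"/"false" is an assumption about the wire format that this document does not confirm.

```python
# Documented ranges for a few optional parameters (inclusive bounds).
RANGES = {
    "max_sentence_silence": (200, 5000),  # milliseconds
    "hotwords_weight": (0.1, 1.0),
    "gain": (1, 20),
    "clusters": (0, 10),
}


def build_form_fields(lang_type, audio_format, **options):
    """Return form fields as strings after checking the documented ranges."""
    fields = {"lang_type": lang_type, "format": audio_format}
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        if name == "channels" and value < 1:
            raise ValueError("channels must be >= 1")
        # Assumption: booleans travel as lowercase "true"/"false" in the form.
        fields[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return fields
```

For example, build_form_fields("ja-JP", "mp3", gain=2, enable_words=True) yields string fields ready for a multipart form, while hotwords_weight=1.5 raises ValueError before any request is made.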

Note:

  • Regarding the channels parameter

    • When channels = 1, the audio is converted to a single-channel audio before transcription.

    • When channels = 2, the first two channels of the audio are transcribed separately. When processing begins, two concurrent processes are used if at least two are available; otherwise only one is used. The transcription results use cluster_id to distinguish the channels (with values 1 and 2).

    • When channels = 3, the first three channels of the audio are transcribed separately, and so on.

    • When channels > the actual number of channels in the audio, the effective value of channels is the actual number of channels in the audio. That is, the effective value of channels = min(channels parameter value, actual number of audio channels).
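
The channel rules in the note can be expressed in a few lines. The helper names effective_channels and channel_cluster_ids are hypothetical, and the assumption that cluster_id values continue as 1..n for more than two channels extrapolates the documented two-channel case ("and so on").

```python
def effective_channels(requested: int, actual: int) -> int:
    """Effective value = min(channels parameter value, actual audio channels)."""
    if requested < 1:
        raise ValueError("channels must be >= 1")
    return min(requested, actual)


def channel_cluster_ids(requested: int, actual: int) -> list:
    """cluster_id values that distinguish channels in the results.

    With one effective channel the audio is mixed down, so no channel-based
    cluster_id is expected. Values 1..n for n > 2 are an assumption.
    """
    n = effective_channels(requested, actual)
    return list(range(1, n + 1)) if n >= 2 else []
```

A request with channels = 4 on a stereo file is therefore handled like channels = 2, and its segments would carry cluster_id 1 or 2.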

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

| Name | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Transcription information |
| ├─ duration | Integer | Audio duration (seconds) |
| ├─ task_id | String | The task ID for this transcription; record this value for use in subsequent requests |

Successful Response

The status field in the body is 00000.

Example of a successful response:

{
    "status": "00000",
    "message": "Success",
    "data": {
        "duration": 600,
        "task_id": "96e25f64-1727-4dbe-889c-4a9052c2eb28"
    }
}

Error Response

Any status value other than 00000 indicates an error response, so this field can be used to check whether the request succeeded.

Example of an error response:

{
    "status": "20322",
    "message": "lang_type invalid",
    "data": null
}
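
A client could turn the two response shapes above into either a return value or an exception. This is a minimal sketch; TranscriptionError and parse_submit_response are hypothetical names, and the code relies only on the status, message, and data fields documented here.

```python
import json


class TranscriptionError(Exception):
    """Raised when the service returns a status other than 00000."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def parse_submit_response(raw: str) -> tuple:
    """Return (task_id, duration_seconds) from a /request response body."""
    body = json.loads(raw)
    if body.get("status") != "00000":
        raise TranscriptionError(body.get("status"), body.get("message"))
    data = body["data"]
    return data["task_id"], data["duration"]
```

Applied to the successful example above, the function returns the task ID together with the duration 600; applied to the error example, it raises TranscriptionError carrying status 20322.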

Get Transcription Results API (Polling)

Request Line

GET /getResult

Request Parameters

The client sends a request with the following query parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | String | Yes | Task ID, returned by the Request File Transcription API |

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

| Name | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Transcription information |
| statistics | Dict | Statistical information |

Successful Response

When transcription is not completed yet:

The status field in the body is 20302.

The data field contains a description of the current transcription task.

The data field is defined as follows:

| Name | Type | Description |
| --- | --- | --- |
| desc | String | Description of the transcription status code |
| file_name | String | Name of the file being transcribed |
| insert_time | String | Time of the request for file transcription |
| process_time | String | Time when the transcription actually started after queuing |
| progress | Integer | Transcription progress. A negative number -n means the task is queued with n transcription tasks ahead of it; a positive number n means transcription is in progress and n percent complete; 0 means the transcription failed. |

Example of a successful response:

{
    "status": "20302",
    "message": "The task is still in the queue or being processed.",
    "data": {
        "desc": "Transcription in progress",
        "file_name": "test.mp3",
        "insert_time": "2022-11-11 12:42:04",
        "progress": 92
    }
}
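
The sign convention of the progress field is easy to misread, so a small decoder may help. describe_progress is a hypothetical helper; the treatment of 100 as "completed" follows the description of the same field in the task query response later in this document.

```python
def describe_progress(progress: int) -> str:
    """Translate the progress integer into a human-readable state."""
    if progress < 0:
        # -n means queued, with n transcription tasks ahead.
        return f"queued, {-progress} task(s) ahead"
    if progress == 0:
        return "failed"
    if progress >= 100:
        # Documented for the getTasks response.
        return "completed"
    return f"in progress, {progress}% done"
```

For the example response above, describe_progress(92) returns "in progress, 92% done".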

When transcription is completed:

The status field in the body is 00000.

The data field in the body is an array that contains the transcription results of this task, one element per segment.

The fields for the transcription results are defined as follows:

| Name | Type | Description |
| --- | --- | --- |
| lang_type | String | [In Development v2.5.11] Language code returned in the recognition result when the language switches; only effective for languages mixed with English (e.g., Japanese-English, Chinese-English) |
| paragraph | Integer | [In Development v2.5.11] Paragraph number, starting from 1 and incrementing |
| begin | String | Start time of the segment (relative to the source audio file) |
| end | String | End time of the segment (relative to the source audio file) |
| seg_num | Integer | Segment number, starting from 1 and incrementing |
| transcript | String | Transcription result of the segment |
| confidence | Float | Confidence of the current result, range [0, 1] |
| words | Dict[] | Word information (returned when the request parameter enable_words = true is set) |
| cluster_id | Integer | Cluster (classification) number, starting from 1 and incrementing; segments with the same cluster_id belong to the same speaker. Returned when a valid clusters parameter is set, or when channels >= 2 |
| volume | Integer | Current volume, range [0, 100] |

The structure of the word information object is as follows:

| Parameter | Type | Description |
| --- | --- | --- |
| text | String | Text of the word or punctuation mark |
| start_time | Integer | Start time of the word, in milliseconds |
| end_time | Integer | End time of the word, in milliseconds |
| type | String | Type of the text: normal for regular text, forbidden for sensitive words, modal for modal particles (only returned when the request parameter words_type = 0), punc for punctuation marks |

The statistics field in the body contains statistical information for this transcription task:

| Parameter | Type | Description |
| --- | --- | --- |
| speed | Integer | Average speaking speed (Chinese/Japanese/Korean: characters per minute; other languages: words per minute) |
| word_count | Integer | Word count (Chinese/Japanese/Korean: character count; other languages: word count) |
| insert_time | String | Time of the request for file transcription |
| process_time | String | Time when the transcription actually started after queuing |
| finish_time | String | Time when the transcription task was completed |

Example of a successful response:

{
    "status": "00000",
    "message": "",
    "data": [
        {
            "begin": "00:00:01,170",
            "end": "00:00:04,610",
            "lang_type": "ja-JP",
            "paragraph": 1,
            "seg_num": 1,
            "transcript": "初めまして、お名前は?",
            "confidence": 0.9,
            "volume": 76,
            "words": [
                {
                    "word": "初め",
                    "start_time": 80,
                    "end_time": 400,
                    "is_punc": false,
                    "keywords_index": 0,
                    "type": "normal"
                },
                {
                    "word": "まして",
                    "start_time": 80,
                    "end_time": 400,
                    "is_punc": false,
                    "keywords_index": 0,
                    "type": "normal"
                },
                {
                    "word": "、",
                    "start_time": 400,
                    "end_time": 860,
                    "is_punc": true,
                    "keywords_index": 0,
                    "type": "punc"
                },
                ... (omitted) ...
            ]    
        },
        {
            "begin": "00:00:04,630",
            "end": "00:00:07,010",
            "lang_type": "ja-JP",
            "paragraph": 1,
            "seg_num": 2,
            "transcript": "初めまして、田中と申します。",
            "confidence": 0.95,
            "volume": 82,
            "words": [
                ... (omitted) ...
            ]    
        }
    ],
    "statistics": {
        "finish_time": "2022-11-11 12:47:12",
        "insert_time": "2022-11-11 12:42:04",
        "process_time": "2022-11-11 12:46:28",
        "speed": 210,
        "word_count": 86
    }
}
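
To work with segment timing, the begin and end strings need to be converted to numbers. The sketch below assumes the HH:MM:SS,mmm layout seen in the example (the format is not specified elsewhere in this document); timestamp_to_ms and segment_spans are hypothetical helper names.

```python
import re

# Assumed layout, inferred from the example: hours:minutes:seconds,milliseconds
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def timestamp_to_ms(value: str) -> int:
    """Convert a begin/end string such as 00:00:01,170 to milliseconds."""
    match = _TIMESTAMP.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def segment_spans(segments):
    """Return (seg_num, start_ms, duration_ms) tuples ordered by seg_num."""
    spans = []
    for seg in sorted(segments, key=lambda s: s["seg_num"]):
        start = timestamp_to_ms(seg["begin"])
        spans.append((seg["seg_num"], start, timestamp_to_ms(seg["end"]) - start))
    return spans
```

Passing the data array of a completed response to segment_spans gives the start offset and length of each segment in milliseconds.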

Error Response

Any status value other than 20302 or 00000 indicates an error response, so this field can be used to check whether the request succeeded.

Example of an error response:

{
    "status": "20326",
    "message": "task_id invalid",
    "data": null
}

Callback of Transcription Results

If the callback_url parameter is specified when calling the Request File Transcription API, the system will automatically send the recognition results to the specified address in the form of an HTTP POST request upon completion (or failure) of the transcription.

The body of the callback request contains the following fields:

| Name | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Transcription information |

Successful Transcription

The data field for successful transcription is defined as follows:

| Name | Type | Description |
| --- | --- | --- |
| result | Dict[] | Transcription results, with the same format as the data field in the polling API |
| statistics | Dict | Statistical information, with the same format as the statistics field in the polling API |
| task_id | String | The task ID for this transcription |

Example of a successful callback:

{
    "status":"00000",
    "message":"success",
    "data":{
        "result":[
            {
                "begin": "00:00:01,170",
                "end": "00:00:04,610",
                "lang_type": "ja-JP",
                "paragraph": 1,
                "seg_num": 1,
                "transcript": "初めまして、お名前は?",
                "confidence": 0.9,
                "volume": 82
            },
            {
                "begin": "00:00:04,630",
                "end": "00:00:07,010",
                "lang_type": "ja-JP",
                "paragraph": 1,
                "seg_num": 2,
                "transcript": "初めまして、田中と申します。",
                "confidence": 0.95,
                "volume": 76
            },
            ... (omitted) ...
        ],
        "statistics": {
            "keywords":[],
            "speed": 210,
            "word_count": 86,
            "finish_time": "2022-11-11 12:47:12",
            "insert_time": "2022-11-11 12:42:04",
            "process_time": "2022-11-11 12:46:28"
        },
        "task_id":"27dc2487-bb02-4585-bd10-b75b445441bc"
    }
}

Transcription Failure

The field format is the same as described above, but the result field will be empty. For specific information about the task failure, refer to the message field.
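
A receiver for the callback could look roughly like the following, using only the Python standard library. Everything here is a sketch under assumptions: handle_callback, CallbackHandler, and serve are hypothetical names, port 8080 merely mirrors the callback_url example, and replying with HTTP 200 is an assumption because this document does not say what response the service expects.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_callback(raw: bytes) -> dict:
    """Summarize a callback body: task ID, success flag, message, joined text."""
    body = json.loads(raw.decode("utf-8"))
    data = body.get("data") or {}
    ok = body.get("status") == "00000"
    segments = data.get("result") or []  # empty when the transcription failed
    return {
        "task_id": data.get("task_id"),
        "ok": ok,
        "message": body.get("message", ""),
        "text": "\n".join(seg["transcript"] for seg in segments) if ok else "",
    }


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        summary = handle_callback(self.rfile.read(length))
        print(summary["task_id"], "ok" if summary["ok"] else summary["message"])
        # Assumption: a plain 200 reply is sufficient for the service.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start a blocking HTTP server that accepts callbacks on the given port."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```

Calling serve() starts the listener; the address of that machine and port would then be passed as callback_url when submitting a file.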

Query Tasks API

This API returns the tasks that match the given filter conditions. Tasks that have already been cleared under the expiration policy in the configuration file are not returned.

Request Line

GET /getTasks

Request Parameters

The client sends a request with the following query parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| time | Filter by upload time: audio files uploaded within the last n hours | No time limit |
| type | Filter by type: 0 = Failed, 1 = Completed, 2 = Processing, 3 = Queued | No type limit |
| count | Filter by quantity: upper limit on the number of tasks returned | No quantity limit |

When several filter conditions are given, they are combined with a logical AND.
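
Building the query string for this endpoint might look like the sketch below. get_tasks_url, hours, and task_type are hypothetical names; only the query parameter names time, type, and count and the four type values come from the table above.

```python
from urllib.parse import urlencode

TASK_TYPES = {0: "Failed", 1: "Completed", 2: "Processing", 3: "Queued"}


def get_tasks_url(base_url, hours=None, task_type=None, count=None):
    """Build a /getTasks URL; omitted filters are left out of the query."""
    params = {}
    if hours is not None:
        params["time"] = hours
    if task_type is not None:
        if task_type not in TASK_TYPES:
            raise ValueError(f"type must be one of {sorted(TASK_TYPES)}")
        params["type"] = task_type
    if count is not None:
        params["count"] = count
    query = urlencode(params)
    return f"{base_url}/getTasks" + (f"?{query}" if query else "")
```

For instance, failed tasks uploaded in the last 24 hours would be requested with get_tasks_url("http://localhost:7100", hours=24, task_type=0).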

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

| Parameter | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict[] | Task information |
| ├─ task_id | String | Task ID |
| ├─ desc | String | Description of the transcription status |
| ├─ file_name | String | Name of the file being transcribed |
| ├─ insert_time | String | Time of the request for file transcription |
| ├─ process_time | String | Time when the transcription actually started after queuing |
| ├─ finish_time | String | Time when the transcription task was completed |
| ├─ progress | Integer | Transcription progress. A negative number -n means the task is queued with n transcription tasks ahead of it; a positive number n means transcription is in progress and n percent complete; 0 means the transcription failed; 100 means the transcription is completed. |
| statistics | Dict | Statistical information |
| ├─ count | Integer | Total number of tasks in the query results |
| ├─ types | Dict[] | Summary of quantities by type |
| ├─ ├─ type | Integer | 0: Failed, 1: Completed, 2: Processing, 3: Queued |
| ├─ ├─ desc | String | Description of the transcription status |
| ├─ ├─ count | Integer | Number of tasks |

Delete Transcription Task API

This API deletes tasks that are queued or completed. A task that is currently being processed is deleted once its pre-processing phase has finished.

Request Line

POST /deleteTask

Request Parameters

The client sends a request with the following query parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | String | Yes | Task ID, returned by the Request File Transcription API |

Response Results

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

| Name | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | null |

Service Status Codes

Every response from the service will include the status and message fields, which indicate the status code and the description of the status code for the response. The service status codes are as follows:

| Status Code | Reason |
| --- | --- |
| 20302 | Progress update from the getResult API |
| 20001 | Parameter parsing failed |
| 20002, 20003, 20006 | File processing failed |
| 20111 | WebSocket upgrade failed |
| 20114 | Request body is empty, or the audio file could not be retrieved |
| 20115 | Audio size/duration limit exceeded |
| 20116 | Unsupported sample_rate |
| 20190 | Missing parameter |
| 20191 | Invalid parameter |
| 20192 | Processing failed (decoder) |
| 20193 | Processing failed (service) |
| 20194 | Connection timed out (no data received from the client for a long time) |
| 20195 | Other error |
| 20330 | Speaker diarization related error |
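
On the client side, the table can be mirrored as a lookup that also separates "pending" from "error". classify_status and STATUS_REASONS are hypothetical names; the success code 00000 comes from the response sections of this document, and codes that appear in examples but not in the table (such as 20322) fall through to a generic label rather than an invented reason.

```python
STATUS_REASONS = {
    "20302": "Progress update from the getResult API",
    "20001": "Parameter parsing failed",
    "20002": "File processing failed",
    "20003": "File processing failed",
    "20006": "File processing failed",
    "20111": "WebSocket upgrade failed",
    "20114": "Request body is empty, or the audio file could not be retrieved",
    "20115": "Audio size/duration limit exceeded",
    "20116": "Unsupported sample_rate",
    "20190": "Missing parameter",
    "20191": "Invalid parameter",
    "20192": "Processing failed (decoder)",
    "20193": "Processing failed (service)",
    "20194": "Connection timed out",
    "20195": "Other error",
    "20330": "Speaker diarization related error",
}


def classify_status(status: str) -> tuple:
    """Return (kind, reason) where kind is 'success', 'pending', or 'error'."""
    if status == "00000":
        return ("success", "")
    reason = STATUS_REASONS.get(status, "not listed in the status code table")
    kind = "pending" if status == "20302" else "error"
    return (kind, reason)
```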

Quick Test

Linux

Request File Transcription API

curl -X POST -F 'file=@test.mp3' -F 'lang_type=ja-JP' -F 'format=mp3' -F 'sample_rate=16000' 'http://localhost:7100/request'

Get Transcription Results API

curl -X GET 'http://localhost:7100/getResult?task_id=******'

Delete Transcription Task API

curl -X POST 'http://localhost:7100/deleteTask?task_id=******'

Windows

Request File Transcription API

curl.exe -X POST -F "file=@test.mp3" -F "lang_type=ja-JP" -F "format=mp3" -F "sample_rate=16000" "http://localhost:7100/request"

Get Transcription Results API

curl.exe -X GET "http://localhost:7100/getResult?task_id=******"

Delete Transcription Task API

curl.exe -X POST "http://localhost:7100/deleteTask?task_id=******"
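
Python

For callers who prefer a script over curl, the upload request can be assembled with the Python standard library alone. This is a sketch rather than an official client: encode_multipart and build_upload_request are hypothetical helpers, the file part is labeled application/octet-stream as an assumption, and nothing is sent until the returned request is passed to urllib.request.urlopen.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict, file_name: str, file_bytes: bytes):
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    # The file part uses the documented field name "file".
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(base_url, fields, file_name, file_bytes):
    """Return a POST request for <base_url>/request without sending it."""
    body, content_type = encode_multipart(fields, file_name, file_bytes)
    return urllib.request.Request(
        f"{base_url}/request",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

With the same values as the curl commands above, the fields would be {"lang_type": "ja-JP", "format": "mp3", "sample_rate": "16000"} and the file bytes would be read from test.mp3; passing the result to urllib.request.urlopen performs the upload.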