VIP Version

1 Feature Introduction

The Audio File Transcription service converts audio files into text. It supports two output formats: manuscripts and subtitles.

It supports transcription in languages such as Chinese, English, and Japanese.

For more details, please refer to the Developer Guides.

If you would like to experience the fastest speed of the Audio File Transcription service (VIP Version), please contact us.

2 Invocation Limits

The file duration must be less than 5 hours.

The size of the audio file must be less than 1GB.

Supported file formats (mono or stereo): WAV/PCM/OPUS/MP3/AMR/3GP/AAC.
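
The size and format limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the service: the function name `check_audio_file` is hypothetical, treating "1GB" as 1024**3 bytes is an assumption, and so is inferring the format from the file extension. The duration limit is not checked because that would require decoding the audio.

```python
import os

# Limits and formats as listed above.
# Assumption: "1GB" is interpreted as 1024**3 bytes.
MAX_FILE_BYTES = 1024 ** 3
SUPPORTED_FORMATS = {"wav", "pcm", "opus", "mp3", "amr", "3gp", "aac"}


def check_audio_file(file_name: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    # Assumption: the file extension reflects the audio format.
    extension = os.path.splitext(file_name)[1].lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes >= MAX_FILE_BYTES:
        problems.append("file must be smaller than 1GB")
    return problems
```

For a local file, the size can be obtained with `os.path.getsize(path)` and passed as `size_bytes`.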

3 Invocation Process

(1) Uploading Files (Request File Transcription): The client sends an HTTP POST request to the server with the file to be transcribed, and the server returns an HTTP response containing a task ID.

(2) Getting Transcription Results: There are two methods to obtain transcription results: Polling and Callback.

Polling: The client sends an HTTP GET request to the server with the task ID returned by the Request File Transcription API, and the server returns an HTTP response containing the task's progress or, once the task has completed, the transcription results.

Callback: When uploading the file, a callback URL is provided. Once the transcription task is completed, the transcription results are automatically sent to the specified callback URL via an HTTP POST request.

Please ensure that your callback URL is accessible.
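
The polling method can be sketched as a loop that repeats the GET request until the response contains `data.result`. This is an illustrative sketch under stated assumptions: `wait_for_result`, `fetch`, `interval_s`, and `max_attempts` are hypothetical names, and `fetch` stands for any function that performs the GET request for a task ID and returns the decoded JSON response. The `000000` status code and the `data.result` field are described in the Get Transcription Results API section.

```python
import time
from typing import Callable


def wait_for_result(fetch: Callable[[str], dict], task_id: str,
                    interval_s: float = 5.0, max_attempts: int = 120) -> dict:
    """Poll until the response carries data.result, then return the response."""
    for _ in range(max_attempts):
        response = fetch(task_id)
        # Any status other than "000000" is an error response.
        if response.get("status") != "000000":
            raise RuntimeError(f"{response.get('status')}: {response.get('message')}")
        # data.result is present only once transcription has completed.
        if "result" in (response.get("data") or {}):
            return response
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Passing `fetch` in as a parameter keeps the loop independent of the HTTP library and of any authentication the service may require, neither of which is covered here.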

4 Request URLs

Interface Name	Protocol	URL	Method
Requesting File Transcription	HTTP	https://api.voice.dolphin-ai.jp/v1/asrfile/upload/vip	POST
Getting Transcription Results	HTTP	https://api.voice.dolphin-ai.jp/v1/asrfile/result	GET
Task Query	HTTP	https://api.voice.dolphin-ai.jp/v1/asrfile/tasks	GET

5 Request File Transcription API

5.1 Request Line

POST /v1/asrfile/upload/vip

5.2 Request Header

Name	Type	Required	Description
Content-Type	String	Yes	Must be "multipart/form-data"
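
For clients without a multipart helper, the body and header value can be assembled by hand. The sketch below uses only the Python standard library; `encode_multipart` is a hypothetical helper, and sending the audio part with `application/octet-stream` is an assumption rather than a documented requirement. The part named `file` corresponds to the `file` parameter of this API.

```python
import uuid


def encode_multipart(fields: dict, file_name: str, file_bytes: bytes):
    """Return (content_type, body) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    body = b""
    # One part per plain form field.
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        body += header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n"
    # The audio itself goes in the part named "file".
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    body += header.encode("utf-8") + file_bytes + b"\r\n"
    # Closing boundary.
    body += f"--{boundary}--\r\n".encode("utf-8")
    return f"multipart/form-data; boundary={boundary}", body
```

The returned `content_type` string is the value to send in the Content-Type header; it carries the boundary that separates the parts.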

5.3 Request Parameters

The request parameters are set in the request body in form-data format. The meaning of each parameter is as follows:

Parameter	Type	Required	Description	Default Value
lang_type	String	Yes	The language of the file to be transcribed. See Developer Guides - Language Support.	Null
file	File	No	The file to be transcribed	Null
file_url	String	No	The URL of the audio file to be transcribed	Null
format	String	Yes	The format of the file to be transcribed. See Developer Guides - Audio Encoding.	Null
field	String	No	Application field. `general`: for a sampling rate of 16000 Hz; `call-center`: for a sampling rate of 8000 Hz.	No
output	String	No	`text` for manuscript format, `subtitle` for subtitle format. See Developer Guides - Practical Features.	text
max_sentence_silence	Integer	No	Sentence-break detection threshold. Silence longer than this threshold is treated as a sentence break. Range [200, 1200], unit: milliseconds.	800 (`field=general`), 250 (`field=call-center`)
enable_modal_particle_filter	Boolean	No	Whether to enable filler word removal. See Developer Guides - Practical Features.	false
enable_punctuation_prediction	Boolean	No	Whether to add punctuation	true (`output=text`), false (`output=subtitle`)
enable_words	Boolean	No	Whether to enable returning word information. See Developer Guides - Basic Terms.	false
words_type	Integer	No	The minimum unit for returning word information; valid only for Mandarin. `0` for words, `1` for characters.	0
enable_inverse_text_normalization	Boolean	No	Whether to perform ITN in post-processing. See Developer Guides - Basic Terms.	true
split_clusters	Boolean	No	Whether to distinguish speakers	false
clusters	Integer	No	Number of speakers to distinguish by voiceprint, range 2~10. A value of 0 means that the number of speakers is determined automatically by the system. Note: This field takes effect only if `split_clusters = true`. Passing the correct number improves the performance of speaker diarization.	No
channels	Integer	No	Number of channels, options are 1 or 2. `1` indicates mono channel, while `2` indicates stereo. Note: If a valid `channels` parameter is set, `cluster_id` will be returned. If `cluster_id` is the same, it indicates that the content is from the same channel.	1
hotwords_list	String	No	One-time hotwords list, effective only for the current connection. If both `hotwords_list` and `hotwords_id` parameters exist, `hotwords_list` will be used. Up to 100 entries can be provided at a time. See Developer Guides - Practical Features.	No
hotwords_id	String	No	Hotwords ID. See Developer Guides - Practical Features.	No
hotwords_weight	Float	No	Hotwords weight. Range is [0.1, 1.0]	0.4
correction_words_id	String	No	Forced correction vocabulary ID. See Developer Guides - Practical Features. Supports multiple IDs, separated by a vertical bar `|`. `all` indicates using all IDs.	No
forbidden_words_id	String	No	Forbidden words ID. See Developer Guides - Practical Features. Supports multiple IDs, separated by a vertical bar `|`. `all` indicates using all IDs.	No
keywords_quantity	Integer	No	The number of automatically extracted keywords. Range is [0, 100] (currently supports Japanese-English mixed and Chinese-English mixed languages only)	0
callback_url	String	No	The callback URL, such as `http://domain.net:8080/callback`	No
gain	Integer	No	The amplitude gain factor, range [1, 20]. `1` indicates no amplification, `2` doubles the original amplitude, and so on. See Developer Guides - Practical Features.	1 (`field=general`), 2 (`field=call-center`)
enable_lang_label	Boolean	No	When switching languages, language labels will be returned in the recognition results (currently supports Japanese-English mixed and Chinese-English mixed languages only)	false
paragraph_condition	Integer	No	Character threshold for paragraph breaks: once the text from the same speaker reaches this number of characters, the next sentence is given a new paragraph number. Range [100, 2000]; values outside this range disable the feature.	0
enable_save_log	Boolean	No	Whether to allow us to keep the audio data and recognition result logs so that we can use them to improve the quality of our products and services.	true
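
Several parameters in the table have a documented numeric range, which a client may want to check before sending the request. The following sketch collects the form fields and rejects out-of-range values. `build_form_fields` and `RANGES` are hypothetical names, sending booleans as the strings `true`/`false` is an assumption, and the example values `ja-JP` and `mp3` are illustrative only (see the Developer Guides for the accepted `lang_type` and `format` values).

```python
# Ranges copied from the parameter table above.
RANGES = {
    "max_sentence_silence": (200, 1200),
    "hotwords_weight": (0.1, 1.0),
    "keywords_quantity": (0, 100),
    "gain": (1, 20),
}


def build_form_fields(lang_type: str, audio_format: str, **options) -> dict:
    """Return the form fields as strings, validating documented ranges."""
    fields = {"lang_type": lang_type, "format": audio_format}
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        # Assumption: booleans are sent as lowercase "true"/"false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        fields[name] = str(value)
    return fields
```

For example, `build_form_fields("ja-JP", "mp3", output="subtitle", gain=2)` yields a dictionary that can be encoded into the form-data body alongside the `file` part or the `file_url` field.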

5.4 Response Results

The Content-Type field in the HTTP Headers of all interface responses is always application/json. The response results are in the HTTP Body.

The fields of the response results are as follows:

Name	Type	Description
status	String	Status Code
message	String	Status Code Description
data	Object	Transcription Information
├─ duration	Integer	Duration of Audio (seconds)
├─ task_id	String	Task ID for This Transcription. Please record this value for interface requests

5.4.1 Successful Response

The status field in the body is 000000.

Example of a successful response:

{
    "status": "000000",
    "message": "success",
    "data": {
        "task_id": "01e57746-5f50-490a-8e36-4ea1110c21cd",
        "duration": 23
    }
}

5.4.2 Error Response

If the status field in the body is not 000000, it indicates an error response. This field can be used as an indicator to determine whether the response was successful.

Example of an error response:

{
    "status": "200001",
    "message": "file Parameter Missing",
    "data": {
        "task_id": "38d19b08-b075-443e-af68-e8b990764b1e",
        "duration": 0
    }
}
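
The success check described above can be sketched as follows. `parse_upload_response` and `TranscriptionError` are hypothetical names introduced for illustration; the only behavior taken from this document is that a `status` of `000000` means success and that the task ID is found in `data.task_id`.

```python
import json


class TranscriptionError(Exception):
    """Raised when the status field is not "000000"."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def parse_upload_response(body: str) -> str:
    """Return the task_id from an upload response, or raise on an error status."""
    payload = json.loads(body)
    if payload.get("status") != "000000":
        raise TranscriptionError(payload.get("status"), payload.get("message"))
    return payload["data"]["task_id"]
```

Keeping the status code on the exception lets the caller branch on specific codes without parsing the message text.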

6 Get Transcription Results API (Polling)

6.1 Request Line

GET /v1/asrfile/result

6.2 Request Parameters

The query parameters are as follows:

Parameter	Type	Required	Description
task_id	String	Yes	Task ID, returned by the Request File Transcription API

6.3 Response Results

The Content-Type field in the HTTP Headers of all interface responses is always application/json. The response results are in the HTTP Body.

The fields of the response results are as follows:

Name	Type	Description
status	String	Status Code
message	String	Status Code Description
data	Object	Transcription Information
statistics	Object	Statistical Information

6.3.1 Successful Response

When transcription is not yet completed, the status field in the body is 000000 and the data field in the body describes this transcription task.

The data field is defined as follows:

Name	Type	Description
desc	String	Description of the transcription status code
file_name	String	The name of the file being transcribed
insert_time	String	The time of the request for file transcription
process_time	String	The time when the transcription actually started after queuing
progress	Integer	The percentage of the processing progress

Example of a successful response:

{
    "status": "000000",
    "message": "success",
    "data": {
        "desc": "Transcribing",
        "file_name": "test.mp3",
        "insert_time": "2022-11-11 12:42:04",
        "process_time": "2022-11-11 12:42:04",
        "progress": 67
    }
}

When transcription is completed, the status field in the body is 000000 and the data.result field in the body contains the transcription results of this task.

The transcription result fields are defined as follows:

Name	Type	Description
begin	String	The start time of each segment (relative to the source audio file)
end	String	The end time of each segment (relative to the source audio file)
seg_num	Integer	Segment number (starting from 1 and increasing incrementally)
transcript	String	The transcription result of each segment
confidence	Float	Confidence level of the current result, range [0, 1]
words	Object	Word information (returned when the word information parameter `enable_words = true` is set)
lang_type	String	Language labels when switching languages: currently supports Japanese-English mixed and Chinese-English mixed languages only (Returned when parameter `enable_lang_label` is set to `true`)
paragraph	Integer	Paragraph number, increasing sequentially starting from 1.
cluster_id	Integer	Speaker numbers increase sequentially starting from 1. If the `cluster_id` is the same, it indicates the same speaker. (Returned when a valid `clusters` parameter is set, or when `channels` is set to >=2)

The words object is defined as follows:

Parameter	Type	Description
word	String	Text
start_time	Integer	Word start time, unit is milliseconds
end_time	Integer	Word end time, unit is milliseconds
type	String	Type. `normal` indicates regular text, `modal` indicates filler words (returned only when the request parameter `words_type = 0`), and `punc` indicates punctuation marks.
keywords_index	Integer	The sequence number of the corresponding keyword (0 indicates a non-keyword), this field is returned only when the request parameter `words_type = 0`

The data.statistics field in the body contains statistical information for this transcription task:

Parameter	Type	Description
keywords	Object	Keyword object (returned when the `keywords_quantity` parameter is set)
speed	Integer	Average speaking speed (Japanese/Chinese/Korean: characters per minute; English: words per minute)
word_count	Integer	Word count (Japanese/Chinese/Korean: number of characters; English: number of words)
insert_time	String	Time of the request for file transcription
process_time	String	Time when the transcription actually started after queuing
finish_time	String	Time when the transcription task was completed

The keywords object is defined as follows:

Parameter	Type	Description
index	Integer	Keyword number (starting from 1 and increasing incrementally)
word	String	Keyword
frequency	Integer	Keyword frequency
time	Dict	Keyword appearance time range
├─ begin	String	Keyword start time
├─ end	String	Keyword end time

Keywords are scored and returned based on a combination of frequency and importance (TF-IDF algorithm).

If you need to sort solely by keyword frequency, use the frequency field.
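
Sorting by the frequency field can be done on the client. A minimal sketch, assuming the keyword objects have already been decoded from JSON into dictionaries; `by_frequency` is a hypothetical helper name.

```python
def by_frequency(keywords: list) -> list:
    """Return keywords from most to least frequent; ties keep the returned order."""
    return sorted(keywords, key=lambda kw: kw["frequency"], reverse=True)
```

Because Python's sort is stable, keywords with equal frequency stay in the order in which the service returned them.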

Example of a successful response:

{
    "status": "000000",
    "message": "Success",
    "data": {
        "result": [
            {
                "begin": "00:00:01,170",
                "end": "00:00:04,610",
                "seg_num": 1,
                "transcript": "こんにちは。あの、お名前は何ですか？",
                "confidence": 0.9,
                "paragraph": 1,
                "lang_type": "ja-JP",
                "cluster_id": 1,
                "words": [
                    {
                        "word": "こんにちは",
                        "start_time": 80,
                        "end_time": 400,
                        "is_punc": false,
                        "keywords_index": 2,
                        "type": "normal"
                    },
                    {
                        "word": "。",
                        "start_time": 80,
                        "end_time": 400,
                        "is_punc": true,
                        "keywords_index": 0,
                        "type": "punc"
                    },
                    {
                        "word": "あの",
                        "start_time": 400,
                        "end_time": 860,
                        "is_punc": false,
                        "keywords_index": 0,
                        "type": "normal"
                    },
                     ……more……
                ]
            },
            {
                "begin": "00:00:04,630",
                "end": "00:00:07,010",
                "seg_num": 2,
                "transcript": "こんにちは。私はあきです，お名前は？",
                "confidence": 0.9,
                "paragraph": 1,
                "lang_type": "ja-JP",
                "cluster_id": 1,
                "words": [
                    ……more……
                ]
            }
        ],
        "statistics": {
            "finish_time": "2022-11-11 12:47:12",
            "insert_time": "2022-11-11 12:42:04",
            "process_time": "2022-11-11 12:46:28",
            "keywords": [
                {
                    "index": 1,
                    "words": "お名前",
                    "frequency": 2,
                    "time": [
                        {
                            "begin": "00:00:05,120",
                            "end": "00:00:05,590"
                        },
                        {
                            "begin": "00:00:13,420",
                            "end": "00:00:13,980"
                        }
                    ]
                }
            ],
            "speed": 210,
            "word_count": 86
        }
    }
}
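
A completed response like the one above can be flattened into one line per segment. The sketch below is illustrative: `timestamp_to_ms` and `transcript_lines` are hypothetical helpers, and the `HH:MM:SS,mmm` timestamp layout is inferred from the example rather than from a formal definition.

```python
def timestamp_to_ms(timestamp: str) -> int:
    """Convert "HH:MM:SS,mmm" (the layout used in the example) to milliseconds."""
    clock, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def transcript_lines(response: dict) -> list:
    """Return one "[begin_ms-end_ms] speaker N: text" line per segment."""
    lines = []
    for segment in response["data"]["result"]:
        begin = timestamp_to_ms(segment["begin"])
        end = timestamp_to_ms(segment["end"])
        # cluster_id is returned only when speaker or channel options are set.
        speaker = segment.get("cluster_id")
        prefix = f"speaker {speaker}: " if speaker is not None else ""
        lines.append(f"[{begin}-{end}] {prefix}{segment['transcript']}")
    return lines
```

Converting the timestamps to integers makes it straightforward to compare segment times with the millisecond `start_time` and `end_time` values in the words objects.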

6.3.2 Error Response

If the status field in the body is not 000000, it indicates an error response. This field can be used as an indicator to determine whether the response was successful.

Example of an error response:

{
    "status": "220404",
    "message": "task_id does not exist"
}

7 Callback of Transcription Results

If the callback_url parameter is specified when calling the Request File Transcription API, the system will automatically send the recognition results to the specified address in the form of an HTTP POST request upon completion (or failure) of the transcription.

The fields of the callback request body are as follows:

Name	Type	Description
status	String	Status Code
message	String	Description of the status code
data	Object	Transcription information

7.1.1 Successful Transcription

The data field in the transcription results is defined as follows:

Name	Type	Description
result	Object	Transcription results, with the same format as the `data.result` field in the polling API
statistics	Object	Statistical information, with the same format as the `statistics` field in the polling API
task_id	String	Task ID for this transcription

{
    "status":"000000",
    "message":"success",
    "data":{
        "result":[
            {
                "begin": "00:00:01,170",
                "end": "00:00:04,610",
                "seg_num": 1,
                "transcript": "こんにちは。あの、お名前は何ですか？"
            },
            {
                "begin": "00:00:04,630",
                "end": "00:00:07,010",
                "seg_num": 2,
                "transcript": "こんにちは。私はあきです，お名前は？"
            }
        ],
        "statistics": {
            "keywords":[],
            "speed": 210,
            "word_count": 86,
            "finish_time": "2022-11-11 12:47:12",
            "insert_time": "2022-11-11 12:42:04",
            "process_time": "2022-11-11 12:46:28"
        },
        "task_id":"27dc2487-bb02-4585-bd10-b75b445441bc"
    }
}
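
A callback endpoint needs to accept the POST request and read the JSON body shown above. The following is a minimal sketch using Python's standard `http.server` module. `summarize_callback`, `CallbackHandler`, and `make_server` are hypothetical names, and replying with an empty 200 response is an assumption, since the reply the service expects is not specified here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_callback(raw_body: bytes) -> dict:
    """Extract the task ID, success flag, and segment count from a callback body."""
    payload = json.loads(raw_body.decode("utf-8"))
    data = payload.get("data") or {}
    return {
        "task_id": data.get("task_id"),
        "ok": payload.get("status") == "000000",
        "segments": len(data.get("result") or []),
    }


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        print(summarize_callback(self.rfile.read(length)))
        # Assumption: an empty 200 reply is sufficient acknowledgement.
        self.send_response(200)
        self.end_headers()


def make_server(port: int = 8080) -> HTTPServer:
    """Create (but do not start) a server listening on all interfaces."""
    return HTTPServer(("", port), CallbackHandler)
```

Calling `make_server(8080).serve_forever()` starts the receiver; the host and port must match the `callback_url` passed to the Request File Transcription API.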

8 Task Query Interface

8.1 Request Line

GET /v1/asrfile/tasks

8.2 Request Parameters

The query parameters are as follows:

Parameter	Description	Default Value
time	Filter by upload time: audios uploaded within the last n hours	No time limit
type	Filter by type: 0: Failed, 1: Completed, 2: Processing	No type limit
count	Filter by quantity: the upper limit of the number of tasks returned	No quantity limit

When the above filtering conditions are used in combination, they form an "AND" relationship.
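
Combining the filters amounts to adding each one to the query string. A minimal sketch, assuming the filters are passed as ordinary URL query parameters; `tasks_url` and its argument names are hypothetical, while the URL and the parameter names `time`, `type`, and `count` come from this document.

```python
from urllib.parse import urlencode

TASKS_URL = "https://api.voice.dolphin-ai.jp/v1/asrfile/tasks"


def tasks_url(hours=None, task_type=None, count=None) -> str:
    """Build the Task Query URL, leaving out any filter that is None."""
    filters = {"time": hours, "type": task_type, "count": count}
    # Compare against None so that type=0 (Failed) is still sent.
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{TASKS_URL}?{query}" if query else TASKS_URL
```

For example, `tasks_url(hours=24, task_type=1, count=10)` asks for at most 10 completed tasks uploaded within the last 24 hours.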

8.3 Response Parameters

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

Parameter	Type	Description
status	String	Status code
message	String	Description of the status code
data	Dict	Task information
├─ task_id	String	Task ID
├─ desc	String	Description of the transcription status
├─ file_name	String	Name of the file being transcribed
├─ insert_time	String	Time of the request for file transcription
├─ process_time	String	Time when the transcription actually started after queuing
├─ finish_time	String	Time when the transcription task was completed
statistics	Object	Statistical information
├─ count	Integer	Total number of tasks in the query results
├─ types	Dict	Summary of quantities by type
├─├─ type	Integer	0: Failed, 1: Completed, 2: Processing, 3: Queued
├─├─ desc	String	Description of the transcription status
├─├─ count	Integer	Number of tasks
