Audio File Transcription

VIP Version

1 Feature Introduction

The Audio File Transcription service converts audio files into text. It supports two output formats: manuscripts and subtitles.

It supports transcription in languages such as Chinese, English, and Japanese.

For more details, please refer to the Developer Guides.

To try the VIP Version, which offers the fastest speed of the Audio File Transcription service, please contact us.

2 Invocation Limits

The file duration must be less than 5 hours.

The size of the audio file must be less than 1GB.

Supported file formats (mono or stereo): WAV/PCM/OPUS/MP3/AMR/3GP/AAC.
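
The limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the service itself: the function name, the constant names, the lowercase format strings, and the reading of 1GB as 2**30 bytes are all assumptions made for illustration.

```python
# Limits copied from this section; all names here are illustrative.
MAX_DURATION_SECONDS = 5 * 60 * 60   # duration must be less than 5 hours
MAX_SIZE_BYTES = 1024 ** 3           # size must be less than 1GB (assumed to mean 2**30 bytes)
SUPPORTED_FORMATS = {"wav", "pcm", "opus", "mp3", "amr", "3gp", "aac"}


def check_invocation_limits(fmt: str, size_bytes: int, duration_seconds: float) -> list:
    """Return a list of problems; an empty list means the file is within the limits."""
    problems = []
    if fmt.lower() not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {fmt}")
    if size_bytes >= MAX_SIZE_BYTES:
        problems.append("file size must be less than 1GB")
    if duration_seconds >= MAX_DURATION_SECONDS:
        problems.append("duration must be less than 5 hours")
    return problems
```

Calling `check_invocation_limits("mp3", 10_000, 23)` returns an empty list, which means the file can be submitted.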

3 Invocation Process

(1) Uploading Files (Request File Transcription): The client sends an HTTP POST request containing the file to be transcribed, and the server returns an HTTP response containing a task ID.

(2) Getting Transcription Results: There are two methods to obtain transcription results: Polling and Callback.

Polling: The client sends an HTTP GET request to the server with the task ID returned by the Request File Transcription API, and the server returns an HTTP response containing the transcription results.

Callback: When uploading the file, a callback URL is provided. Once the transcription task is completed, the transcription results are automatically sent to the specified callback URL via an HTTP POST request.

Please ensure that your callback URL is accessible.

4 Request URLs

| Interface Name | Protocol | URL | Method |
| --- | --- | --- | --- |
| Requesting File Transcription | HTTP | https://api.voice.dolphin-ai.jp/v1/asrfile/upload/vip | POST |
| Getting Transcription Results | HTTP | https://api.voice.dolphin-ai.jp/v1/asrfile/result | GET |
| Task Query | HTTP | https://api.voice.dolphin-ai.jp/v1/asrfile/tasks | GET |

5 Request File Transcription API

5.1 Request Line

POST /v1/asrfile/upload/vip

5.2 Request Header

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| Content-Type | String | Yes | Must be "multipart/form-data" |

5.3 Request Parameters

The client sets the request parameters in the request body in form-data format. The parameters are as follows:

| Parameter | Type | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| lang_type | String | Yes | The language of the file to be transcribed. See Developer Guides - Language Support. | Null |
| file | File | No | The file to be transcribed | Null |
| file_url | String | No | The URL of the audio file to be transcribed | Null |
| format | String | Yes | The format of the file to be transcribed. See Developer Guides - Audio Encoding. | Null |
| field | String | No | Field. general: for a sampling rate of 16000 Hz. call-center: for a sampling rate of 8000 Hz. | None |
| output | String | No | text for manuscript format, subtitle for subtitle format. See Developer Guides - Practical Features. | text |
| max_sentence_silence | Integer | No | Sentence-break detection threshold. Silence longer than this threshold is treated as a sentence break. Range [200, 1200], unit: milliseconds. | 800 (field=general), 250 (field=call-center) |
| enable_modal_particle_filter | Boolean | No | Whether to enable filler word removal. See Developer Guides - Practical Features. | false |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation | true (output=text), false (output=subtitle) |
| enable_words | Boolean | No | Whether to return word information. See Developer Guides - Basic Terms. | false |
| words_type | Integer | No | The minimum unit for returned word information, valid only for Mandarin. 0 for words, 1 for characters. | 0 |
| enable_inverse_text_normalization | Boolean | No | Whether to perform inverse text normalization (ITN) in post-processing. See Developer Guides - Basic Terms. | true |
| split_clusters | Boolean | No | Whether to distinguish speakers | false |
| clusters | Integer | No | Number of speaker diarization categories, that is, the number of speakers to distinguish by voiceprint. Range 2~10. A value of 0 means the system determines the number of categories automatically. Note: This parameter takes effect only when split_clusters = true. Passing the correct value improves the performance of speaker diarization. | None |
| channels | Integer | No | Number of channels, either 1 or 2. 1 indicates mono, 2 indicates stereo. Note: If a valid channels parameter is set, cluster_id will be returned. Segments with the same cluster_id come from the same channel. | 1 |
| hotwords_list | String | No | One-time hotwords list, effective only for the current connection. If both hotwords_list and hotwords_id are provided, hotwords_list is used. Up to 100 entries can be provided at a time. See Developer Guides - Practical Features. | None |
| hotwords_id | String | No | Hotwords ID. See Developer Guides - Practical Features. | None |
| hotwords_weight | Float | No | Hotwords weight. Range [0.1, 1.0]. | 0.4 |
| correction_words_id | String | No | Forced correction vocabulary ID. See Developer Guides - Practical Features. Supports multiple IDs separated by a vertical bar (\|). all means all IDs are used. | None |
| forbidden_words_id | String | No | Forbidden words ID. See Developer Guides - Practical Features. Supports multiple IDs separated by a vertical bar (\|). all means all IDs are used. | None |
| keywords_quantity | Integer | No | The number of automatically extracted keywords. Range [0, 100]. Currently supports Japanese-English mixed and Chinese-English mixed languages only. | 0 |
| callback_url | String | No | The callback URL, such as http://domain.net:8080/callback | None |
| gain | Integer | No | The amplitude gain factor, range [1, 20]. See Developer Guides - Practical Features. 1 means no amplification, 2 doubles the original amplitude, and so on. | 1 (field=general), 2 (field=call-center) |
| enable_lang_label | Boolean | No | Whether to return language labels in the recognition results when the language switches. Currently supports Japanese-English mixed and Chinese-English mixed languages only. | false |
| paragraph_condition | Integer | No | Number of characters from the same speaker after which the next sentence is returned with a new paragraph number. Range [100, 2000]. Values outside this range disable the feature. | 0 |
| enable_save_log | Boolean | No | Whether to provide audio data and recognition result logs to us so that we can use them to improve the quality of our products and services. | true |
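
As an illustration of how the parameters above might be assembled, the sketch below builds the text fields of the form-data body and checks the documented numeric ranges before sending. The helper name `build_upload_fields` is hypothetical, and sending booleans as the strings "true" and "false" is an assumption, as is the example format value "mp3".

```python
def build_upload_fields(lang_type, audio_format, **options):
    """Assemble the text fields for POST /v1/asrfile/upload/vip (illustrative helper)."""
    if not lang_type or not audio_format:
        raise ValueError("lang_type and format are required")
    fields = {"lang_type": lang_type, "format": audio_format}

    # Numeric ranges copied from the parameter table.
    ranges = {
        "max_sentence_silence": (200, 1200),
        "hotwords_weight": (0.1, 1.0),
        "keywords_quantity": (0, 100),
        "gain": (1, 20),
    }
    for name, value in options.items():
        if name in ranges:
            low, high = ranges[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be within [{low}, {high}]")
        # Form fields are text; encoding booleans as "true"/"false" is an assumption.
        if isinstance(value, bool):
            value = "true" if value else "false"
        fields[name] = str(value)
    return fields
```

The returned dictionary can be passed as the form fields of a multipart request in the HTTP client of your choice, with the audio attached as the `file` part or referenced through `file_url`. Authentication is not covered in this section and is left out of the sketch.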

5.4 Response Results

The Content-Type field in the HTTP Headers of all interface responses is always application/json. The response results are in the HTTP Body.

The fields of the response results are as follows:

| Name | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Status code description |
| data | Object | Transcription information |
| ├─ duration | Integer | Duration of the audio (seconds) |
| ├─ task_id | String | Task ID for this transcription. Record this value for use in later requests |

5.4.1 Successful Response

The status field in the body is 000000.

Example of a successful response:

{
    "status": "000000",
    "message": "success",
    "data": {
        "task_id": "01e57746-5f50-490a-8e36-4ea1110c21cd",
        "duration": 23
    }
}

5.4.2 Error Response

If the status field in the body is not 000000, it indicates an error response. This field can be used as an indicator to determine whether the response was successful.

Example of an error response:

{
    "status": "200001",
    "message": "file Parameter Missing",
    "data": {
        "task_id": "38d19b08-b075-443e-af68-e8b990764b1e",
        "duration": 0
    }
}
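
A client can use the status field to separate the two cases. The sketch below is one possible way to do so; the exception class and function names are hypothetical and are not part of the API.

```python
import json

SUCCESS_STATUS = "000000"


class TranscriptionApiError(Exception):
    """Raised when a response carries a status other than 000000 (hypothetical helper)."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def parse_response(body: str) -> dict:
    """Decode a JSON response body and return its data field, or raise on an error status."""
    payload = json.loads(body)
    if payload.get("status") != SUCCESS_STATUS:
        raise TranscriptionApiError(payload.get("status"), payload.get("message"))
    return payload.get("data", {})
```

For the successful example above, `parse_response` returns the data object, from which `task_id` can be read. For the error example it raises an exception that carries the status and message.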

6 Get Transcription Results API (Polling)

6.1 Request Line

GET /v1/asrfile/result

6.2 Request Parameters

The query parameters are as follows:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | String | Yes | Task ID, returned by the Request File Transcription API |

6.3 Response Results

The Content-Type field in the HTTP Headers of all interface responses is always application/json. The response results are in the HTTP Body.

The fields of the response results are as follows:

| Name | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Status code description |
| data | Object | Transcription information |
| statistics | Object | Statistical information |

6.3.1 Successful Response

When transcription is not yet complete, the status field in the body is 000000 and the data field describes the transcription task.

The data field is defined as follows:

| Name | Type | Description |
| --- | --- | --- |
| desc | String | Description of the transcription status code |
| file_name | String | The name of the file being transcribed |
| insert_time | String | The time of the request for file transcription |
| process_time | String | The time when the transcription actually started after queuing |
| progress | Integer | The percentage of the processing progress |

Example of a successful response:

{
    "status": "000000",
    "message": "success",
    "data": {
        "desc": "Transcribing",
        "file_name": "test.mp3",
        "insert_time": "2022-11-11 12:42:04",
        "process_time": "2022-11-11 12:42:04",
        "progress": 67
    }
}

When transcription is complete, the status field in the body is 000000 and the data.result field contains the transcription results of the task.

The transcription result fields are defined as follows:

| Name | Type | Description |
| --- | --- | --- |
| begin | String | The start time of each segment (relative to the source audio file) |
| end | String | The end time of each segment (relative to the source audio file) |
| seg_num | Integer | Segment number (starting from 1 and increasing incrementally) |
| transcript | String | The transcription result of each segment |
| confidence | Float | Confidence level of the current result, range [0, 1] |
| words | Object | Word information (returned when enable_words = true) |
| lang_type | String | Language label when the language switches. Currently supports Japanese-English mixed and Chinese-English mixed languages only. Returned when enable_lang_label = true. |
| paragraph | Integer | Paragraph number, increasing sequentially starting from 1 |
| cluster_id | Integer | Speaker number, increasing sequentially starting from 1. Segments with the same cluster_id belong to the same speaker. Returned when a valid clusters parameter is set, or when channels is set to >=2. |

The words object is defined as follows:

| Parameter | Type | Description |
| --- | --- | --- |
| word | String | Text |
| start_time | Integer | Word start time, in milliseconds |
| end_time | Integer | Word end time, in milliseconds |
| type | String | Type. normal indicates regular text. modal indicates filler words (returned only when the request parameter words_type = 0). punc indicates punctuation marks. |
| keywords_index | Integer | The sequence number of the corresponding keyword (0 indicates a non-keyword). Returned only when the request parameter words_type = 0. |

The data.statistics field in the body contains statistical information for this transcription task:

| Parameter | Type | Description |
| --- | --- | --- |
| keywords | Object | Keyword object (returned when the keywords_quantity parameter is set) |
| speed | Integer | Average speaking speed (Japanese/Chinese/Korean: characters per minute; English: words per minute) |
| word_count | Integer | Word count (Japanese/Chinese/Korean: number of characters; English: number of words) |
| insert_time | String | Time of the request for file transcription |
| process_time | String | Time when the transcription actually started after queuing |
| finish_time | String | Time when the transcription task was completed |

The keywords object is defined as follows:

| Parameter | Type | Description |
| --- | --- | --- |
| index | Integer | Keyword number (starting from 1 and increasing incrementally) |
| word | String | Keyword |
| frequency | Integer | Keyword frequency |
| time | Dict | Keyword appearance time range |
| ├─ begin | String | Keyword start time |
| ├─ end | String | Keyword end time |

The keywords objects are scored and returned based on a combination of frequency and importance (TF-IDF algorithm).

If you need to sort solely based on the keyword frequency, please use the frequency field.
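
Re-sorting by frequency takes only a few lines on the client side. The following sketch assumes the statistics object has already been decoded from JSON into a Python dictionary; the function name is illustrative.

```python
def keywords_by_frequency(statistics: dict) -> list:
    """Return the keyword objects ordered from most to least frequent."""
    keywords = statistics.get("keywords") or []
    # sorted() is stable, so keywords with equal frequency keep the order
    # in which the service returned them.
    return sorted(keywords, key=lambda kw: kw["frequency"], reverse=True)
```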

Example of a successful response:

{
    "status": "000000",
    "message": "Success",
    "data": {
        "result": [
            {
                "begin": "00:00:01,170",
                "end": "00:00:04,610",
                "seg_num": 1,
                "transcript": "こんにちは。あの、お名前は何ですか?",
                "confidence": 0.9,
                "paragraph": 1,
                "lang_type": "ja-JP",
                "cluster_id": 1,
                "words": [
                    {
                        "word": "こんにちは",
                        "start_time": 80,
                        "end_time": 400,
                        "is_punc": false,
                        "keywords_index": 2,
                        "type": "normal"
                    },
                    {
                        "word": "。",
                        "start_time": 80,
                        "end_time": 400,
                        "is_punc": true,
                        "keywords_index": 0,
                        "type": "punc"
                    },
                    {
                        "word": "あの",
                        "start_time": 400,
                        "end_time": 860,
                        "is_punc": false,
                        "keywords_index": 0,
                        "type": "normal"
                    },
                     ……more……
                ]
            },
            {
                "begin": "00:00:04,630",
                "end": "00:00:07,010",
                "seg_num": 2,
                "transcript": "こんにちは。私はあきです,お名前は?",
                "confidence": 0.9,
                "paragraph": 1,
                "lang_type": "ja-JP",
                "cluster_id": 1,
                "words": [
                    ……more……
                ]
            }
        ],
        "statistics": {
            "finish_time": "2022-11-11 12:47:12",
            "insert_time": "2022-11-11 12:42:04",
            "process_time": "2022-11-11 12:46:28",
            "keywords": [
                {
                    "index": 1,
                    "words": "お名前",
                    "frequency": 2,
                    "time": [
                        {
                            "begin": "00:00:05,120",
                            "end": "00:00:05,590"
                        },
                        {
                            "begin": "00:00:13,420",
                            "end": "00:00:13,980"
                        }
                    ]
                }
            ],
            "speed": 210,
            "word_count": 86
        }
    }
}
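
The begin and end values in the example use an HH:MM:SS,mmm layout, which is also the timestamp layout of SubRip (SRT) subtitle files. Assuming that layout holds for all segments, the sketch below converts a timestamp to milliseconds and renders data.result as SRT text. Both function names are illustrative.

```python
def timestamp_to_ms(value: str) -> int:
    """Convert an HH:MM:SS,mmm timestamp (layout assumed from the example) to milliseconds."""
    clock, millis = value.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def segments_to_srt(segments: list) -> str:
    """Render the segments of data.result as SubRip (SRT) text."""
    blocks = []
    for segment in sorted(segments, key=lambda s: s["seg_num"]):
        blocks.append(
            f"{segment['seg_num']}\n"
            f"{segment['begin']} --> {segment['end']}\n"
            f"{segment['transcript']}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```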

6.3.2 Error Response

If the status field in the body is not 000000, it indicates an error response. This field can be used as an indicator to determine whether the response was successful.

Example of an error response:

{
    "status": "220404",
    "message": "task_id does not exist"
}
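
Putting the polling responses together, a client loop might look like the sketch below. It is written under several assumptions: completion is detected by the presence of data.result, the polling interval and attempt limit are arbitrary, and `fetch_json` is a caller-supplied function (hypothetical) that performs the HTTP GET and returns the decoded JSON body. How a failed task is reported by the polling API is not described in this section, so the sketch only handles non-000000 statuses.

```python
import time
from urllib.parse import urlencode

RESULT_URL = "https://api.voice.dolphin-ai.jp/v1/asrfile/result"


def poll_result(task_id, fetch_json, interval_seconds=5.0, max_attempts=120, sleep=time.sleep):
    """Poll the result API until data.result is present, then return the data object."""
    url = f"{RESULT_URL}?{urlencode({'task_id': task_id})}"
    for _ in range(max_attempts):
        payload = fetch_json(url)
        if payload.get("status") != "000000":
            raise RuntimeError(f"{payload.get('status')}: {payload.get('message')}")
        data = payload.get("data") or {}
        if "result" in data:
            # Completed: the transcription segments are available.
            return data
        # Not finished yet; data.progress holds the percentage processed so far.
        sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Passing the HTTP call in as `fetch_json` keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.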

7 Callback of Transcription Results

If the callback_url parameter is specified when calling the Request File Transcription API, the system will automatically send the recognition results to the specified address in the form of an HTTP POST request upon completion (or failure) of the transcription.

The fields of the callback request body are as follows:

| Name | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Description of the status code |
| data | Object | Transcription information |

7.1.1 Successful Transcription

The data field in the transcription results is defined as follows:

| Name | Type | Description |
| --- | --- | --- |
| result | Object | Transcription results, with the same format as the data.result field in the polling API |
| statistics | Object | Statistical information, with the same format as the data.statistics field in the polling API |
| task_id | String | Task ID for this transcription |

Example of a callback for a successful transcription:

{
    "status":"000000",
    "message":"success",
    "data":{
        "result":[
            {
                "begin": "00:00:01,170",
                "end": "00:00:04,610",
                "seg_num": 1,
                "transcript": "こんにちは。あの、お名前は何ですか?"
            },
            {
                "begin": "00:00:04,630",
                "end": "00:00:07,010",
                "seg_num": 2,
                "transcript": "こんにちは。私はあきです,お名前は?"
            }
        ],
        "statistics": {
            "keywords":[],
            "speed": 210,
            "word_count": 86,
            "finish_time": "2022-11-11 12:47:12",
            "insert_time": "2022-11-11 12:42:04",
            "process_time": "2022-11-11 12:46:28"
        },
        "task_id":"27dc2487-bb02-4585-bd10-b75b445441bc"
    }
}

7.1.2 Transcription Failure

The field format is the same as described above, but the result field will be empty. For specific information about the task failure, please refer to the message field.
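
A receiving endpoint for these callbacks could be sketched with Python's standard library as follows. This is an illustration under assumptions, not a reference implementation: treating a non-000000 status or an empty result as a failure, replying with an empty 200 response, and all function and class names are choices made for this example. Port 8080 simply mirrors the callback_url example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_callback(body: bytes) -> dict:
    """Summarize a callback body: task ID, success flag, message, and segments."""
    payload = json.loads(body)
    data = payload.get("data") or {}
    segments = data.get("result") or []
    return {
        "task_id": data.get("task_id"),
        # Assumption: a non-000000 status or an empty result means the task failed.
        "ok": payload.get("status") == "000000" and bool(segments),
        "message": payload.get("message"),
        "segments": segments,
    }


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        summary = handle_callback(self.rfile.read(length))
        print(f"task {summary['task_id']}: ok={summary['ok']} message={summary['message']}")
        # Assumption: the expected reply is not specified, so an empty 200 is returned.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start a blocking HTTP server that accepts callback POST requests."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Call `serve()` to start listening, and make sure the address given in callback_url routes to this server.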

8 Task Query Interface

Tasks can be filtered by the conditions listed below. Transcription results are stored on the platform for only one month, so please retrieve them as soon as possible.

8.1 Request Line

GET /v1/asrfile/tasks

8.2 Request Parameters

The query parameters are as follows:

| Parameter | Description | Default Value |
| --- | --- | --- |
| time | Filter by upload time: audio files uploaded within the last n hours | No time limit |
| type | Filter by type: 0: Failed, 1: Completed, 2: Processing | No type limit |
| count | Filter by quantity: the upper limit of the number of tasks returned | No quantity limit |

When the above filtering conditions are used in combination, they form an "AND" relationship.
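
As a small illustration of combining the filters, the sketch below builds the query URL and omits any filter that is not set. The function name and its keyword argument names are hypothetical; only the query parameter names time, type, and count come from the table above.

```python
from urllib.parse import urlencode

TASKS_URL = "https://api.voice.dolphin-ai.jp/v1/asrfile/tasks"
TASK_TYPES = {0: "Failed", 1: "Completed", 2: "Processing"}


def build_tasks_url(hours=None, task_type=None, count=None) -> str:
    """Build the Task Query URL; filters left as None are not sent."""
    if task_type is not None and task_type not in TASK_TYPES:
        raise ValueError("type must be 0, 1, or 2")
    params = {"time": hours, "type": task_type, "count": count}
    query = urlencode({name: value for name, value in params.items() if value is not None})
    return f"{TASKS_URL}?{query}" if query else TASKS_URL
```

For example, `build_tasks_url(hours=24, task_type=1, count=10)` asks for at most 10 completed tasks uploaded within the last 24 hours.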

8.3 Response Parameters

The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.

The response result fields are as follows:

| Parameter | Type | Description |
| --- | --- | --- |
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Task information |
| ├─ task_id | String | Task ID |
| ├─ desc | String | Description of the transcription status |
| ├─ file_name | String | Name of the file being transcribed |
| ├─ insert_time | String | Time of the request for file transcription |
| ├─ process_time | String | Time when the transcription actually started after queuing |
| ├─ finish_time | String | Time when the transcription task was completed |
| statistics | Object | Statistical information |
| ├─ count | Integer | Total number of tasks in the query results |
| ├─ types | Dict | Summary of quantities by type |
| ├─├─ type | Integer | 0: Failed, 1: Completed, 2: Processing, 3: Queued |
| ├─├─ desc | String | Description of the transcription status |
| ├─├─ count | Integer | Number of tasks |