VIP Version
1 Feature Introduction
The Audio File Transcription service provides a feature that converts audio files into text. It supports two output formats: manuscripts and subtitles.
It supports transcription in languages such as Chinese, English, and Japanese.
For more details, please refer to the Developer Guides.
2 Invocation Limits
The file duration must be less than 5 hours.
The size of the audio file must be less than 1GB.
Supported file formats (mono or stereo): WAV/PCM/OPUS/MP3/AMR/3GP/AAC.
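The limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the service: the function name `check_limits` is hypothetical, the format is guessed from the file extension, 1GB is assumed to mean 2**30 bytes, and the caller is assumed to already know the audio duration.

```python
from pathlib import Path

# Limits taken from the "Invocation Limits" section.
MAX_DURATION_SECONDS = 5 * 60 * 60   # duration must be less than 5 hours
MAX_SIZE_BYTES = 1024 ** 3           # assumption: "1GB" means 2**30 bytes
SUPPORTED_FORMATS = {"wav", "pcm", "opus", "mp3", "amr", "3gp", "aac"}


def check_limits(file_name: str, size_bytes: int, duration_seconds: float) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    # Assumption: the audio format can be inferred from the file extension.
    extension = Path(file_name).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes >= MAX_SIZE_BYTES:
        problems.append("file must be smaller than 1GB")
    if duration_seconds >= MAX_DURATION_SECONDS:
        problems.append("duration must be less than 5 hours")
    return problems
```

A check like this only avoids obviously invalid uploads; the server remains the authority on what it accepts.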
3 Invocation Process
(1) Uploading Files (Request File Transcription): The client sends an HTTP POST request to the server with the file to be transcribed, and the server returns an HTTP response containing a task ID.
(2) Getting Transcription Results: There are two methods to obtain transcription results: Polling and Callback.
Polling: The client sends an HTTP GET request to the server with the task ID returned by the Request File Transcription API, and the server returns an HTTP response containing the transcription status or, once the task is complete, the transcription results.
Callback: When uploading the file, a callback URL is provided. Once the transcription task is completed, the transcription results are automatically sent to the specified callback URL via an HTTP POST request.
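The polling method can be sketched as a simple loop. This is an illustrative outline rather than an official client: `wait_for_result` and its parameters are hypothetical names, `fetch` is a caller-supplied function that performs the GET request and returns the decoded JSON body, and a finished task is assumed to be recognisable by the presence of `data.result` (the section does not define an explicit completion flag).

```python
import time
from typing import Callable


def wait_for_result(fetch: Callable[[], dict], interval_seconds: float = 5.0,
                    max_attempts: int = 120) -> dict:
    """Call `fetch` repeatedly until the response carries data.result."""
    for attempt in range(max_attempts):
        body = fetch()
        # Any status other than 000000 is an error response.
        if body.get("status") != "000000":
            raise RuntimeError(f"{body.get('status')}: {body.get('message')}")
        # Assumption: completion is signalled by the presence of data.result.
        if "result" in (body.get("data") or {}):
            return body
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError("transcription did not finish within the polling budget")
```

The interval and attempt budget are arbitrary defaults; long audio files may need a larger budget, and the callback method avoids polling entirely.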
4 Request URLs
| Interface Name | Protocol | URL | Method |
|---|---|---|---|
| Requesting File Transcription | HTTP | https://api.voice.dolphin-ai.jp/v1/asrfile/upload/vip | POST |
| Getting Transcription Results | HTTP | https://api.voice.dolphin-ai.jp/v1/asrfile/result | GET |
| Task Query | HTTP | https://api.voice.dolphin-ai.jp/v1/asrfile/tasks | GET |
5 Request File Transcription API
5.1 Request Line
POST /v1/asrfile/upload/vip
5.2 Request Header
| Name | Type | Required | Description |
|---|---|---|---|
| Content-type | String | Yes | Must be "multipart/form-data" |
5.3 Request Parameters
The client sends a request, in which parameters need to be set in the request body in form-data format. The meaning of each parameter is as follows:
| Parameter | Type | Required | Description | Default Value |
|---|---|---|---|---|
| lang_type | String | Yes | The language of the file to be transcribed. See Developer Guides - Language Support. | Null |
| file | File | No | The file to be transcribed | Null |
| file_url | String | No | The URL of the audio file to be transcribed | Null |
| format | String | Yes | The format of the file to be transcribed. See Developer Guides - Audio Encoding. | Null |
| field | String | No | Field. general: for a sampling rate of 16000 Hz; call-center: for a sampling rate of 8000 Hz | No |
| output | String | No | text for manuscript format, subtitle for subtitle format. See Developer Guides - Practical Features. | text |
| max_sentence_silence | Integer | No | Sentence-break detection threshold. Silence longer than this threshold is treated as a sentence break. Range [200, 1200], unit: milliseconds | 800 (field=general); 250 (field=call-center) |
| enable_modal_particle_filter | Boolean | No | Whether to enable filler word removal. See Developer Guides - Practical Features. | false |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation | true (output=text); false (output=subtitle) |
| enable_words | Boolean | No | Whether to enable returning word information. See Developer Guides - Basic Terms. | false |
| words_type | Integer | No | The minimum unit for returning word information, valid only for Mandarin. 0 for words, 1 for characters | 0 |
| enable_inverse_text_normalization | Boolean | No | Whether to perform ITN in post-processing. See Developer Guides - Basic Terms. | true |
| split_clusters | Boolean | No | Whether to distinguish speakers | false |
| clusters | Integer | No | Number of speakers to distinguish by voiceprint (speaker diarization categories), range 2~10. A value of 0 means the system determines the number automatically. Note: This field takes effect only if split_clusters = true. Passing the correct number improves the performance of speaker diarization. | No |
| channels | Integer | No | Number of channels, options are 1 or 2. 1 indicates mono, while 2 indicates stereo. Note: If a valid channels parameter is set, cluster_id will be returned. If cluster_id is the same, it indicates that the content is from the same channel. | 1 |
| hotwords_list | String | No | One-time hotwords list, effective only for the current connection. If both hotwords_list and hotwords_id parameters exist, hotwords_list will be used. Up to 100 entries can be provided at a time. See Developer Guides - Practical Features. | No |
| hotwords_id | String | No | Hotwords ID. See Developer Guides - Practical Features. | No |
| hotwords_weight | Float | No | Hotwords weight. Range is [0.1, 1.0] | 0.4 |
| correction_words_id | String | No | Forced correction vocabulary ID. See Developer Guides - Practical Features. Supports multiple IDs, separated by a vertical bar (\|). all indicates using all IDs. | No |
| forbidden_words_id | String | No | Forbidden words ID. See Developer Guides - Practical Features. Supports multiple IDs, separated by a vertical bar (\|). all indicates using all IDs. | No |
| keywords_quantity | Integer | No | The number of automatically extracted keywords. Range is [0, 100] (currently supports Japanese-English mixed and Chinese-English mixed languages only) | 0 |
| callback_url | String | No | The callback URL, such as http://domain.net:8080/callback | No |
| gain | Integer | No | The amplitude gain factor, range [1, 20]. See Developer Guides - Practical Features. 1 indicates no amplification, 2 indicates double the original amplitude, and so on. | 1 (field=general); 2 (field=call-center) |
| enable_lang_label | Boolean | No | When switching languages, language labels will be returned in the recognition results (currently supports Japanese-English mixed and Chinese-English mixed languages only) | false |
| paragraph_condition | Integer | No | Character count threshold for paragraph breaks: once the text of the same speaker reaches this number of characters, the next sentence is returned with a new paragraph number. Range [100, 2000]; values outside this range disable the feature. | 0 |
| enable_save_log | Boolean | No | Whether to provide audio data and recognition result logs so that we can use them to improve the quality of our products and services. | true |
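As a rough sketch of how these parameters travel in a form-data body, the code below builds the upload request with the Python standard library only. `encode_multipart` and `build_upload_request` are hypothetical helper names, the lowercase `true`/`false` spelling for Boolean values is an assumption, and any authentication the service may require is not described in this section and is therefore not shown.

```python
import urllib.request
import uuid

UPLOAD_URL = "https://api.voice.dolphin-ai.jp/v1/asrfile/upload/vip"


def encode_multipart(fields: dict, file_name: str, file_bytes: bytes) -> tuple:
    """Encode text fields plus one `file` part as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        chunks.append(part.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(chunks)


def build_upload_request(file_name, file_bytes, lang_type, audio_format, **optional):
    """Build (but do not send) the POST request for the upload endpoint."""
    # lang_type and format are the two required parameters.
    fields = {"lang_type": lang_type, "format": audio_format}
    for name, value in optional.items():
        # Assumption: Booleans are sent as lowercase "true" / "false".
        fields[name] = str(value).lower() if isinstance(value, bool) else str(value)
    content_type, body = encode_multipart(fields, file_name, file_bytes)
    return urllib.request.Request(
        UPLOAD_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

A request built this way could be sent with `urllib.request.urlopen(request)`. The values passed for `lang_type` and `format` should come from the Developer Guides; this sketch does not validate them.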
5.4 Response Results
The Content-Type field in the HTTP Headers of all interface responses is always application/json. The response results are in the HTTP Body.
The fields of the response results are as follows:
| Name | Type | Description |
|---|---|---|
| status | String | Status Code |
| message | String | Status Code Description |
| data | Object | Transcription Information |
| ├─ duration | Integer | Duration of Audio (seconds) |
| ├─ task_id | String | Task ID for This Transcription. Please record this value for interface requests |
5.4.1 Successful Response
The status field in the body is 000000.
Example of a successful response:
{
"status": "000000",
"message": "success",
"data": {
"task_id": "01e57746-5f50-490a-8e36-4ea1110c21cd",
"duration": 23
}
}
5.4.2 Error Response
If the status field in the body is not 000000, it indicates an error response. This field can be used as an indicator to determine whether the response was successful.
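A client can apply this rule with a small helper. The sketch below is one possible way to do it; `TranscriptionApiError` and `extract_task_id` are hypothetical names introduced for illustration.

```python
import json


class TranscriptionApiError(Exception):
    """Raised when the status field is anything other than 000000."""

    def __init__(self, status: str, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def extract_task_id(raw_body: str) -> str:
    """Return data.task_id from an upload response, or raise on an error status."""
    body = json.loads(raw_body)
    if body.get("status") != "000000":
        raise TranscriptionApiError(body.get("status", ""), body.get("message", ""))
    return body["data"]["task_id"]
```

Keeping the status code and message on the exception lets the caller log or branch on them without parsing the response again.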
Example of an error response:
{
"status": "200001",
"message": "file Parameter Missing",
"data": {
"task_id": "38d19b08-b075-443e-af68-e8b990764b1e",
"duration": 0
}
}
6 Get Transcription Results API (Polling)
6.1 Request Line
GET /v1/asrfile/result
6.2 Request Parameters
When the client sends a request, the meaning of the query parameters is as follows:
| Parameter | Type | Required | Description |
|---|---|---|---|
| task_id | String | Yes | Task ID, returned by the Request File Transcription API |
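For illustration, the query string can be assembled with `urllib.parse.urlencode`. `result_url` and `fetch_result` are hypothetical helper names, and the sketch assumes that the response body can be decoded as JSON whenever the HTTP request itself succeeds; how the service maps errors to HTTP status codes is not stated here.

```python
import json
import urllib.request
from urllib.parse import urlencode

RESULT_URL = "https://api.voice.dolphin-ai.jp/v1/asrfile/result"


def result_url(task_id: str) -> str:
    """Build the polling URL with task_id as a query parameter."""
    return f"{RESULT_URL}?{urlencode({'task_id': task_id})}"


def fetch_result(task_id: str, timeout_seconds: float = 30.0) -> dict:
    """Send the GET request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(result_url(task_id), timeout=timeout_seconds) as response:
        return json.load(response)
```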
6.3 Response Results
The Content-Type field in the HTTP Headers of all interface responses is always application/json. The response results are in the HTTP Body.
The fields of the response results are as follows:
| Name | Type | Description |
|---|---|---|
| status | String | Status Code |
| message | String | Status Code Description |
| data | Object | Transcription Information |
| statistics | Object | Statistical Information |
6.3.1 Successful Response
When transcription is not yet complete, the status field in the body is 000000 and the data field in the body describes the transcription task.
The data field is defined as follows:
| Name | Type | Description |
|---|---|---|
| desc | String | Description of the transcription status code |
| file_name | String | The name of the file being transcribed |
| insert_time | String | The time of the request for file transcription |
| process_time | String | The time when the transcription actually started after queuing |
| progress | Integer | The percentage of the processing progress |
Example of a successful response:
{
"status": "000000",
"message": "success",
"data": {
"desc": "Transcribing",
"file_name": "test.mp3",
"insert_time": "2022-11-11 12:42:04",
"process_time": "2022-11-11 12:42:04",
"progress": 67
}
}
When transcription is complete, the status field in the body is 000000 and the data.result field in the body contains the transcription results of the task.
The transcription result fields are defined as follows:
| Name | Type | Description |
|---|---|---|
| begin | String | The start time of each segment (relative to the source audio file) |
| end | String | The end time of each segment (relative to the source audio file) |
| seg_num | Integer | Segment number (starting from 1 and increasing incrementally) |
| transcript | String | The transcription result of each segment |
| confidence | Float | Confidence level of the current result, range [0, 1] |
| words | Object | Word information (returned when the word information parameter enable_words = true is set) |
| lang_type | String | Language labels when switching languages: currently supports Japanese-English mixed and Chinese-English mixed languages only (Returned when parameter enable_lang_label is set to true) |
| paragraph | Integer | Paragraph number, increasing sequentially starting from 1. |
| cluster_id | Integer | Speaker number, increasing sequentially from 1. The same cluster_id indicates the same speaker. (Returned when a valid clusters parameter is set, or when channels is set to >= 2) |
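The begin and end fields are strings, so they usually need converting before any arithmetic. The sketch below assumes the `HH:MM:SS,mmm` layout seen in the example responses of this document; the format is not formally specified here, and `timestamp_to_ms` and `segment_duration_ms` are hypothetical helper names.

```python
def timestamp_to_ms(stamp: str) -> int:
    """Convert a timestamp such as "00:00:01,170" to milliseconds."""
    # Assumption: the layout is HH:MM:SS,mmm as in the example responses.
    clock, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def segment_duration_ms(segment: dict) -> int:
    """Length of one result segment, from its begin and end fields."""
    return timestamp_to_ms(segment["end"]) - timestamp_to_ms(segment["begin"])
```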
Among them, the words object:
| Parameter | Type | Description |
|---|---|---|
| word | String | Text |
| start_time | Integer | Word start time, unit is milliseconds |
| end_time | Integer | Word end time, unit is milliseconds |
| type | String | Type. normal indicates regular text, modal indicates filler words (returned only when the request parameter words_type = 0), and punc indicates punctuation marks |
| keywords_index | Integer | The sequence number of the corresponding keyword (0 indicates a non-keyword), this field is returned only when the request parameter words_type = 0 |
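The type field makes it possible to rebuild a segment's text without punctuation or filler words. The following is a small sketch under two assumptions: `spoken_text` is a hypothetical helper name, and words are joined without a separator, which suits Japanese or Chinese output (English output would need spaces).

```python
def spoken_text(words: list, keep_fillers: bool = False, separator: str = "") -> str:
    """Join word entries, skipping punctuation and (by default) filler words."""
    kept = []
    for item in words:
        if item["type"] == "punc":
            continue  # punctuation marks
        if item["type"] == "modal" and not keep_fillers:
            continue  # filler words
        kept.append(item["word"])
    return separator.join(kept)
```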
The data.statistics field in the body contains statistical information for this transcription task:
| Parameter | Type | Description |
|---|---|---|
| keywords | Object | Keyword object (returned when the keywords_quantity parameter is set) |
| speed | Integer | Average speaking speed (Japanese/Chinese/Korean: characters per minute; English: words per minute) |
| word_count | Integer | Word count (Japanese/Chinese/Korean: number of characters; English: number of words) |
| insert_time | String | Time of the request for file transcription |
| process_time | String | Time when the transcription actually started after queuing |
| finish_time | String | Time when the transcription task was completed |
Among them, the keywords object:
| Parameter | Type | Description |
|---|---|---|
| index | Integer | Keyword number (starting from 1 and increasing incrementally) |
| word | String | Keyword |
| frequency | Integer | Keyword frequency |
| time | Dict | Keyword appearance time range |
| ├─ begin | String | Keyword start time |
| ├─ end | String | Keyword end time |
The keywords objects are scored and returned based on a combination of frequency and importance (TF-IDF algorithm). If you need to sort solely by keyword frequency, use the frequency field.
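Re-sorting by frequency on the client is a one-liner. In this sketch, `by_frequency` is a hypothetical helper name; because Python's sort is stable, keywords with equal frequency keep the order in which the service returned them.

```python
def by_frequency(keywords: list) -> list:
    """Return the keyword objects ordered from most to least frequent."""
    return sorted(keywords, key=lambda keyword: keyword["frequency"], reverse=True)
```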
Example of a successful response:
{
"status": "000000",
"message": "Success",
"data": {
"result": [
{
"begin": "00:00:01,170",
"end": "00:00:04,610",
"seg_num": 1,
"transcript": "こんにちは。あの、お名前は何ですか?",
"confidence": 0.9,
"paragraph": 1,
"lang_type": "ja-JP",
"cluster_id": 1,
"words": [
{
"word": "こんにちは",
"start_time": 80,
"end_time": 400,
"is_punc": false,
"keywords_index": 2,
"type": "normal"
},
{
"word": "。",
"start_time": 80,
"end_time": 400,
"is_punc": true,
"keywords_index": 0,
"type": "punc"
},
{
"word": "あの",
"start_time": 400,
"end_time": 860,
"is_punc": false,
"keywords_index": 0,
"type": "normal"
},
……more……
]
},
{
"begin": "00:00:04,630",
"end": "00:00:07,010",
"seg_num": 2,
"transcript": "こんにちは。私はあきです,お名前は?",
"confidence": 0.9,
"paragraph": 1,
"lang_type": "ja-JP",
"cluster_id": 1,
"words": [
……more……
]
}
],
"statistics": {
"finish_time": "2022-11-11 12:47:12",
"insert_time": "2022-11-11 12:42:04",
"process_time": "2022-11-11 12:46:28",
"keywords": [
{
"index": 1,
"words": "お名前",
"frequency": 2,
"time": [
{
"begin": "00:00:05,120",
"end": "00:00:05,590"
},
{
"begin": "00:00:13,420",
"end": "00:00:13,980"
}
]
}
],
"speed": 210,
"word_count": 86
}
}
}
6.3.2 Error Response
If the status field in the body is not 000000, it indicates an error response. This field can be used as an indicator to determine whether the response was successful.
Example of an error response:
{
"status": "220404",
"message": "task_id does not exist"
}
7 Callback of Transcription Results
If the callback_url parameter is specified when calling the Request File Transcription API, the system will automatically send the recognition results to the specified address in the form of an HTTP POST request upon completion (or failure) of the transcription.
The fields of the callback request body are as follows:
| Name | Type | Description |
|---|---|---|
| status | String | Status Code |
| message | String | Description of the status code |
| data | Object | Transcription information |
7.1.1 Successful Transcription
The data field in the transcription results is defined as follows:
| Name | Type | Description |
|---|---|---|
| result | Object | Transcription results, with the same format as the data.result field in the polling API |
| statistics | Object | Statistical information, with the same format as the statistics field in the polling API |
| task_id | String | Task ID for this transcription |
{
"status":"000000",
"message":"success",
"data":{
"result":[
{
"begin": "00:00:01,170",
"end": "00:00:04,610",
"seg_num": 1,
"transcript": "こんにちは。あの、お名前は何ですか?"
},
{
"begin": "00:00:04,630",
"end": "00:00:07,010",
"seg_num": 2,
"transcript": "こんにちは。私はあきです,お名前は?"
}
],
"statistics": {
"keywords":[],
"speed": 210,
"word_count": 86,
"finish_time": "2022-11-11 12:47:12",
"insert_time": "2022-11-11 12:42:04",
"process_time": "2022-11-11 12:46:28"
},
"task_id":"27dc2487-bb02-4585-bd10-b75b445441bc"
}
}
7.1.2 Transcription Failure
The field format is the same as described above, but the result field will be empty. For specific information about the task failure, please refer to the message field.
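A receiving endpoint for the callback could look roughly like the sketch below, built on Python's `http.server`. Several points are assumptions rather than documented behaviour: the callback body is treated as JSON (as in the example), a status other than 000000 is treated as a failed transcription, and a plain 200 response is assumed to be a sufficient acknowledgement. `parse_callback`, `CallbackHandler`, and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(raw_body: bytes) -> dict:
    """Extract the task ID, the segments, and an error message (None on success)."""
    body = json.loads(raw_body)
    data = body.get("data") or {}
    # Assumption: a failed task carries a status other than 000000.
    failed = body.get("status") != "000000"
    return {
        "task_id": data.get("task_id"),
        "segments": data.get("result") or [],  # empty when transcription failed
        "error": body.get("message") if failed else None,
    }


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        outcome = parse_callback(self.rfile.read(length))
        summary = outcome["error"] or f"{len(outcome['segments'])} segments"
        print(outcome["task_id"], summary)
        # Assumption: the service only needs a 2xx acknowledgement.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen for callbacks until interrupted."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Running `serve()` on a host reachable at the address given in callback_url would print one line per finished task. A production endpoint would normally also validate the sender and persist the result.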
8 Task Query Interface
Tasks can be filtered by the conditions described below. Transcription results are stored on the platform for only one month, so please retrieve them as soon as possible.
8.1 Request Line
GET /v1/asrfile/tasks
8.2 Request Parameters
The client sends a request, and the meaning of the query parameters is as follows:
| Parameter | Description | Default Value |
|---|---|---|
| time | Filter by upload time: audios uploaded within the last n hours | No time limit |
| type | Filter by type: 0: Failed, 1: Completed, 2: Processing | No type limit |
| count | Filter by quantity: the upper limit of the number of tasks returned | No quantity limit |
When the above filtering conditions are used in combination, they form an "AND" relationship.
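As an illustration of combining the filters, the sketch below builds the query URL and leaves out any filter that is not set, so the service's defaults apply. `tasks_url` and its keyword argument names are hypothetical; only the query parameter names time, type, and count come from the table above.

```python
from typing import Optional
from urllib.parse import urlencode

TASKS_URL = "https://api.voice.dolphin-ai.jp/v1/asrfile/tasks"


def tasks_url(hours: Optional[int] = None, task_type: Optional[int] = None,
              count: Optional[int] = None) -> str:
    """Build the task query URL; filters that are None are omitted."""
    params = {"time": hours, "type": task_type, "count": count}
    query = urlencode({name: value for name, value in params.items() if value is not None})
    return f"{TASKS_URL}?{query}" if query else TASKS_URL
```

For example, `tasks_url(hours=24, task_type=1, count=10)` asks for at most 10 completed tasks uploaded within the last 24 hours.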
8.3 Response Parameters
The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.
The response result fields are as follows:
| Parameter | Type | Description |
|---|---|---|
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Task information |
| ├─ task_id | String | Task ID |
| ├─ desc | String | Description of the transcription status |
| ├─ file_name | String | Name of the file being transcribed |
| ├─ insert_time | String | Time of the request for file transcription |
| ├─ process_time | String | Time when the transcription actually started after queuing |
| ├─ finish_time | String | Time when the transcription task was completed |
| statistics | Object | Statistical information |
| ├─ count | Integer | Total number of tasks in the query results |
| ├─ types | Dict | Summary of quantities by type |
| ├─├─ type | Integer | 0: Failed, 1: Completed, 2: Processing, 3: Queued |
| ├─├─ desc | String | Description of the transcription status |
| ├─├─ count | Integer | Number of tasks |