Audio File Transcription API
Feature Introduction
The Audio File Transcription service transcribes audio files into text.
For more information, please refer to the "Basic Knowledge" section in the "Speech-to-Text Service User Manual".
Invocation Limits
The audio file must be under 10 hours long, and the file size must be under 2 GB.
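A client can check these limits locally before uploading. The following is a minimal sketch, not part of the service: the function name `check_limits` is hypothetical, the caller is assumed to already know the audio duration in seconds, and "2 GB" is assumed to mean 2 × 1024³ bytes because the manual does not say which definition applies.

```python
MAX_DURATION_SECONDS = 10 * 60 * 60   # "under 10 hours"
MAX_SIZE_BYTES = 2 * 1024 ** 3        # "under 2 GB" (binary gigabytes assumed)


def check_limits(size_bytes, duration_seconds):
    """Return a list of limit violations; the list is empty if the file is acceptable."""
    problems = []
    if size_bytes >= MAX_SIZE_BYTES:
        problems.append("file size must be under 2 GB")
    if duration_seconds >= MAX_DURATION_SECONDS:
        problems.append("duration must be under 10 hours")
    return problems
```

The server still enforces its own limits and reports violations with status code 20115, so a local check only saves an upload round trip.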
Interaction Process
- Uploading Files
  - Request File Transcription: The client sends an HTTP POST method request to the server with the file to be transcribed, and the server responds with an HTTP response containing a task ID.
- Getting Transcription Results: There are two methods to obtain transcription results, polling and callback.
  - Polling: The client sends an HTTP GET method request to the server with the task ID returned by the Request File Transcription API, and the server responds with an HTTP response that includes the transcription results.
  - Callback: When uploading the file, a callback URL is provided. Once the transcription task is completed, the system automatically sends the transcription results to the specified callback URL via an HTTP POST request.
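The polling method can be sketched as a small loop. This is an illustration under stated assumptions rather than an official client: `fetch_result` is a hypothetical callable that performs GET /getResult for a task ID and returns the decoded JSON body as a dict, and the interval and attempt limit are arbitrary choices. The status values `00000` (completed) and `20302` (still queued or processing) are the ones documented for the Get Transcription Results API.

```python
import time


def wait_for_result(fetch_result, task_id, interval_seconds=5.0,
                    max_attempts=120, sleep=time.sleep):
    """Poll until the task completes, fails, or the attempt limit is reached.

    fetch_result is a hypothetical callable: task_id -> decoded JSON dict.
    """
    for _ in range(max_attempts):
        body = fetch_result(task_id)
        status = body.get("status")
        if status == "00000":       # transcription completed
            return body
        if status != "20302":       # neither completed nor in progress: error
            raise RuntimeError(
                f"task {task_id} failed: {status} {body.get('message')}")
        sleep(interval_seconds)     # 20302: still queued or being processed
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays. With the callback method no loop is needed, because the service pushes the result to the callback URL.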
Request URLs
| Interface Name | Protocol | URL | Method |
|---|---|---|---|
| File Transcription | HTTP/1.1 | http://<ip_address>:7100/request | POST |
| Get Transcription | HTTP/1.1 | http://<ip_address>:7100/getResult | GET |
| Query Task | HTTP/1.1 | http://<ip_address>:7100/getTasks | GET |
| Delete Task | HTTP/1.1 | http://<ip_address>:7100/deleteTask | POST |
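The table above maps naturally onto a small lookup. The sketch below only builds the method and URL for each interface; the helper name `endpoint` and the dictionary layout are illustrative assumptions, and the default port 7100 is taken from the table.

```python
from urllib.parse import urlencode

# Interface name -> (HTTP method, path), as listed in the Request URLs table.
ENDPOINTS = {
    "request": ("POST", "/request"),
    "getResult": ("GET", "/getResult"),
    "getTasks": ("GET", "/getTasks"),
    "deleteTask": ("POST", "/deleteTask"),
}


def endpoint(host, name, port=7100, **query):
    """Return (method, url) for one of the four interfaces."""
    method, path = ENDPOINTS[name]
    url = f"http://{host}:{port}{path}"
    if query:
        url += "?" + urlencode(query)   # query parameters such as task_id
    return method, url
```

For example, `endpoint("localhost", "getResult", task_id="abc")` yields the GET URL used for polling.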
Request File Transcription API
Request Line
POST /request
Request Header
| Name | Type | Required | Description |
|---|---|---|---|
| Content-type | String | Yes | Must be "multipart/form-data" |
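For readers who want to see what a multipart/form-data body looks like on the wire, the following is a minimal hand-written encoder that uses only the Python standard library. It is a sketch: the function name `encode_multipart` is hypothetical, the file part is labelled `application/octet-stream` as an assumption, and field names and file names are assumed not to contain quotes or line breaks. In practice an HTTP client library normally does this work.

```python
import uuid


def encode_multipart(fields, file_part=None):
    """Encode text fields and an optional file as multipart/form-data.

    fields:    dict of parameter name -> value (converted with str()).
    file_part: optional (field_name, file_name, content_bytes) tuple.
    Returns (content_type_header_value, body_bytes).
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(f"--{boundary}\r\n".encode())
        chunks.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        chunks.append(str(value).encode("utf-8") + b"\r\n")
    if file_part is not None:
        name, file_name, content = file_part
        chunks.append(f"--{boundary}\r\n".encode())
        chunks.append(
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{file_name}"\r\n'.encode())
        chunks.append(b"Content-Type: application/octet-stream\r\n\r\n")
        chunks.append(content + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())   # closing delimiter
    return f"multipart/form-data; boundary={boundary}", b"".join(chunks)
```

The returned header value goes into the Content-Type request header, and the bytes become the body of POST /request.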
Request Parameters
The client sends a request where parameters are set in the request body in form-data format. The meanings of the parameters are as follows:
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| lang_type | String | Yes | Language code of the file to be transcribed, see "Language Support" section in the "Speech-to-Text Service User Manual" | Required |
| file | File | Either with file_url | The file to be transcribed | Null |
| file_url | String | Either with file | The URL of the audio file to be transcribed, if both file and file_url are present, file takes precedence | Null |
| format | String | Yes | The format of the file to be transcribed; see the "Audio Encoding" section in the "Speech-to-Text Service User Manual". Note: When transcribing an 8 kHz PCM audio file with a 16 kHz model, pass the value pcm_8000. | Required |
| sample_rate | Integer | No | Sampling rate of the model (not of the audio itself). Note: The 16 kHz model can also transcribe audio recorded at an 8 kHz sample rate. | 16000 |
| output | String | No | Output format: text for manuscript format, subtitle for subtitle format; see the "Practical Features" section in the manual | text |
| channels | Integer | No | Number of audio channels to transcribe: the first `channels` audio channels are transcribed separately. The value must be ≥ 1. When channels is set to 1, all audio channels of the file are mixed down to a single channel before transcription; for more details, see the note below the table. | 1 |
| enable_modal_particle_filter | Boolean | No | Whether to enable modal particle filtering, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" | false |
| enable_punctuation_prediction | Boolean | No | Whether to add punctuation | true (if output=text); false (if output=subtitle) |
| max_sentence_silence | Integer | No | Sentence-break detection threshold: silence longer than this value is treated as a sentence break. Range [200, 5000]; unit: milliseconds. | 450 |
| enable_words | Boolean | No | Whether to enable returning word information in the final results, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| words_type | Integer | No | The minimum unit for returned word information; only valid for Chinese. 0 for words, 1 for characters. | 0 |
| enable_inverse_text_normalization | Boolean | No | Whether to perform ITN, refer to the "Basic Terms" section in the "Speech-to-Text Service User Manual" | false |
| hotwords_id | String | No | Hotwords ID, refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Hotword API Protocol" | Null |
| hotwords_weight | Float | No | Hotwords weight, range [0.1, 1.0] | 0.4 |
| correction_words_id | String | No | Forced correction vocabulary ID; refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forced Correction API Protocol". Multiple IDs can be given, separated by a vertical bar (\|); all selects every forced correction vocabulary ID. | |
| forbidden_words_id | String | No | Forbidden words ID; refer to the "Practical Features" section in the "Speech-to-Text Service User Manual" and the "Forbidden Word API Protocol". Multiple IDs can be given, separated by a vertical bar (\|); all selects every ID. | Null |
| gain | Integer | No | Amplitude gain factor, range [1, 20]; see the "Basic Knowledge" and "Practical Features" sections in the "Speech-to-Text Service User Manual". A value of 1 means no amplification, 2 doubles the original amplitude, and so on. | 1 |
| callback_url | String | No | Callback URL, e.g., http://10.0.0.2:8080 | Null |
| clusters | Integer | No | Number of speaker diarization categories, range [0, 10]: the number of speakers to separate by voiceprint. 1 means no speaker diarization is performed; 0 means the system determines the number of categories automatically. When this parameter is set together with channels > 1, the audio is processed as multi-channel audio and speaker diarization is not performed. This feature relies on the voiceprint component (module_vpr). | 1 |
| enable_lang_label | Boolean | No | [In Development v2.5.11] Return the language code in recognition results when the language switches; only effective for languages mixed with English (e.g., Japanese-English, Chinese-English) | false |
| paragraph_condition | Integer | No | [In Development v2.5.11] Paragraph length control, in characters: once the set character count is reached, the next sentence within the same cluster_id receives a new paragraph number. Range [100, 2000]; values outside the range disable this feature. | 0 |
| decode_silence_segment | Boolean | No | Control whether to perform speech recognition processing on the silent segments determined by VAD, suitable for far-field recording environments (supported in version 2.5.11 and above). | false |
Note:
- Regarding the channels parameter:
  - When channels = 1, the audio is converted to a single-channel audio before transcription.
  - When channels = 2, the first two channels of all channels are processed and transcribed separately. When processing begins, if there are at least two available concurrent processes, two are used; if there is only one available concurrent process, then only one is used. The transcription results use cluster_id to differentiate the channels (with values of 1 and 2).
  - When channels = 3, the first three channels of all channels are processed and transcribed separately, and so on.
  - When channels > the actual number of channels in the audio, the effective value of channels is the actual number of channels in the audio. That is, the effective value of channels = min(channels parameter value, actual number of audio channels).
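The ranges and rules from the parameter table and the note above can be checked on the client before sending a request. The sketch below is illustrative only: `validate_params` and `effective_channels` are hypothetical helper names, only a subset of parameters is covered, and the server remains the authority on what it accepts.

```python
# Inclusive ranges taken from the parameter table.
RANGES = {
    "max_sentence_silence": (200, 5000),   # milliseconds
    "hotwords_weight": (0.1, 1.0),
    "gain": (1, 20),
    "clusters": (0, 10),
}


def validate_params(params):
    """Return a list of problems found in a dict of request parameters."""
    problems = []
    for required in ("lang_type", "format"):
        if not params.get(required):
            problems.append(f"{required} is required")
    if "file" not in params and "file_url" not in params:
        problems.append("either file or file_url is required")
    if params.get("channels", 1) < 1:
        problems.append("channels must be >= 1")
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name} must be in [{low}, {high}]")
    return problems


def effective_channels(requested, actual):
    """Channels actually transcribed: min(channels parameter, channels in the audio)."""
    return min(requested, actual)
```

For instance, requesting `channels=3` for a stereo file gives `effective_channels(3, 2) == 2`, matching the last rule in the note.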
Response Results
The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.
The response result fields are as follows:
| Name | Type | Description |
|---|---|---|
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Transcription information |
| ├─ duration | Integer | Audio duration (seconds) |
| ├─ task_id | String | The task ID for this transcription; record this value for use in subsequent requests |
Successful Response
The status field in the body is 00000.
Example of a successful response:
{
"status": "00000",
"message": "Success",
"data": {
"duration": 600,
"task_id": "96e25f64-1727-4dbe-889c-4a9052c2eb28"
}
}
Error Response
Any status value other than 00000 indicates an error response; use this field to determine whether the request succeeded.
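The success check can be sketched as follows. The function name `extract_task_id` is hypothetical; the field names `status`, `message`, `data`, and `task_id` are those documented in the response table.

```python
import json


def extract_task_id(response_text):
    """Return data.task_id from a /request response body, or raise on an error status."""
    body = json.loads(response_text)
    if body.get("status") != "00000":
        raise RuntimeError(
            f"request rejected: {body.get('status')} {body.get('message')}")
    return body["data"]["task_id"]
```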
Example of an error response:
{
"status": "20322",
"message": "lang_type invalid",
"data": null
}
Get Transcription Results API (Polling)
Request Line
GET /getResult
Request Parameters
The client sends a request with the following query parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| task_id | String | Yes | Task ID, returned by the Request File Transcription API |
Response Results
The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.
The response result fields are as follows:
| Name | Type | Description |
|---|---|---|
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Transcription information |
| statistics | Dict | Statistical information |
Successful Response
When transcription is not completed yet:
The status field in the body is 20302.
The data field contains a description of the current transcription task.
The data field is defined as follows:
| Name | Type | Description |
|---|---|---|
| desc | String | Description of the transcription status code |
| file_name | String | The name of the file being transcribed |
| insert_time | String | The time of the request for file transcription |
| process_time | String | The time when the transcription actually started after queuing |
| progress | Integer | Transcription progress, a negative number -n indicates queuing, with n transcription tasks ahead; a positive number n indicates that transcription is in progress, with a processing progress of n percent; 0 indicates transcription failure. |
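The three cases of the progress field can be turned into readable text with a few lines. This is a small sketch; the function name and the wording of the returned strings are illustrative choices.

```python
def describe_progress(progress):
    """Translate the integer progress field into a short description."""
    if progress < 0:
        return f"queued, {-progress} task(s) ahead"   # -n: n tasks ahead in the queue
    if progress == 0:
        return "failed"                               # 0: transcription failure
    return f"in progress, {progress}%"                # n: n percent processed
```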
Example of a successful response:
{
"status": "20302",
"message": "The task is still in the queue or being processed.",
"data": {
"desc": "Transcription in progress",
"file_name": "test.mp3",
"insert_time": "2022-11-11 12:42:04",
"progress": 92
}
}
When transcription is completed:
The status field in the body is 00000.
The data[] field in the body contains the transcription results of this task.
The fields for the transcription results are defined as follows:
| Name | Type | Description |
|---|---|---|
| lang_type | String | [In Development v2.5.11] The language code returned in the recognition result when the language switches; effective only for languages mixed with English (e.g., Japanese-English, Chinese-English) |
| paragraph | Integer | [In Development v2.5.11] Paragraph number, starting from 1 and increasing incrementally |
| begin | String | The start time of each segment (relative to the source audio file) |
| end | String | The end time of each segment (relative to the source audio file) |
| seg_num | Integer | Segment number (starting from 1 and increasing incrementally) |
| transcript | String | The transcription result of each segment |
| confidence | Float | The confidence level of the current result, range [0, 1] |
| words | Dict[] | Word information (returned when the request parameter enable_words = true is set) |
| cluster_id | Integer | The cluster (classification) number, starting from 1 and increasing incrementally; the same cluster_id indicates the same speaker. Returned when a valid clusters parameter is set, or when channels >= 2. |
| volume | Integer | The current volume, range [0, 100] |
The structure of the word information object is as follows:
| Parameter | Type | Description |
|---|---|---|
| text | String | The text of the word or punctuation |
| start_time | Integer | The start time of the word, in milliseconds |
| end_time | Integer | The end time of the word, in milliseconds |
| type | String | The type of the text: normal indicates regular text, forbidden indicates sensitive words, modal indicates modal particles (only returned when the request parameter words_type = 0), punc indicates punctuation marks |
The statistics field in the body contains statistical information for this transcription task:
| Parameter | Type | Description |
|---|---|---|
| speed | Integer | Average speaking speed (Chinese/Japanese/Korean: characters per minute; other languages: words per minute) |
| word_count | Integer | Word count (Japanese/Chinese/Korean: character count; other languages: word count) |
| insert_time | String | The time of the request for file transcription |
| process_time | String | The time when the transcription actually started after queuing |
| finish_time | String | The time when the transcription task was completed |
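When post-processing results, the begin and end strings usually need to be converted to numbers. The sketch below assumes they follow an `HH:MM:SS,mmm` pattern; that format is inferred from the example values in this document and is not stated explicitly, and both helper names are hypothetical.

```python
import re

# Assumed format: hours, two-digit minutes and seconds, three-digit milliseconds.
_TIMESTAMP = re.compile(r"^(\d+):(\d{2}):(\d{2}),(\d{3})$")


def timestamp_to_ms(value):
    """Convert an 'HH:MM:SS,mmm' string to milliseconds."""
    match = _TIMESTAMP.match(value)
    if match is None:
        raise ValueError(f"unexpected timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def segment_durations(segments):
    """Map each seg_num to the segment length in milliseconds."""
    return {
        seg["seg_num"]: timestamp_to_ms(seg["end"]) - timestamp_to_ms(seg["begin"])
        for seg in segments
    }
```

Note that the word-level start_time and end_time fields are already integers in milliseconds and need no conversion.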
Example of a successful response:
{
"status": "00000",
"message": "",
"data": [
{
"begin": "00:00:01,170",
"end": "00:00:04,610",
"lang_type": "ja-JP",
"paragraph": 1,
"seg_num": 1,
"transcript": "初めまして、お名前は?",
"confidence": 0.9,
"volume": 76,
"words": [
{
"word": "初め",
"start_time": 80,
"end_time": 400,
"is_punc": false,
"keywords_index": 0,
"type": "normal"
},
{
"word": "まして",
"start_time": 80,
"end_time": 400,
"is_punc": false,
"keywords_index": 0,
"type": "normal"
},
{
"word": "、",
"start_time": 400,
"end_time": 860,
"is_punc": true,
"keywords_index": 0,
"type": "punc"
},
…… (omitted) ……
]
},
{
"begin": "00:00:04,630",
"end": "00:00:07,010",
"lang_type": "zh-cmn-Hans-CN",
"paragraph": 1,
"seg_num": 2,
"transcript": "初めまして、田中と申します。",
"confidence": 0.95,
"volume": 82,
"words": [
…… (omitted) ……
]
}
],
"statistics": {
"finish_time": "2022-11-11 12:47:12",
"insert_time": "2022-11-11 12:42:04",
"process_time": "2022-11-11 12:46:28",
"speed": 210,
"word_count": 86
}
}
Error Response
Any status value other than 20302 or 00000 indicates an error response. Use this field to determine whether the request succeeded.
Example of an error response:
{
"status": "20326",
"message": "task_id invalid",
"data": null
}
Callback of Transcription Results
If the callback_url parameter is specified when calling the Request File Transcription API, the system will automatically send the recognition results to the specified address in the form of an HTTP POST request upon completion (or failure) of the transcription.
The body of the callback request contains the following fields:
| Name | Type | Description |
|---|---|---|
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | Transcription information |
Successful Transcription
The data field for successful transcription is defined as follows:
| Name | Type | Description |
|---|---|---|
| result | Dict[] | Transcription results, with the same format as the data field in the polling API |
| statistics | Dict | Statistical information, with the same format as the statistics field in the polling API |
| task_id | String | The task ID for this transcription |
{
"status":"00000",
"message":"success",
"data":{
"result":[
{
"begin": "00:00:01,170",
"end": "00:00:04,610",
"lang_type": "zh-cmn-Hans-CN",
"paragraph": 1,
"seg_num": 1,
"transcript": "初めまして、お名前は?",
"confidence": 0.9,
"volume": 82
},
{
"begin": "00:00:04,630",
"end": "00:00:07,010",
"lang_type": "zh-cmn-Hans-CN",
"paragraph": 1,
"seg_num": 2,
"transcript": "初めまして、田中と申します。",
"confidence": 0.95,
"volume": 76
},
…… (omitted) ……
],
"statistics": {
"keywords":[],
"speed": 210,
"word_count": 86,
"finish_time": "2022-11-11 12:47:12",
"insert_time": "2022-11-11 12:42:04",
"process_time": "2022-11-11 12:46:28"
},
"task_id":"27dc2487-bb02-4585-bd10-b75b445441bc"
}
}
Transcription Failure
The field format is the same as described above, but the result field will be empty. For specific information about the task failure, refer to the message field.
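On the receiving side, the body of the callback POST can be handled with a single function that covers both the success and the failure case. This is a sketch under assumptions: `handle_callback` is a hypothetical name, the function only parses the body (wiring it into an HTTP server is left to the application), and a failed task is assumed to carry a status other than 00000.

```python
import json


def handle_callback(raw_body):
    """Parse a callback body and return (task_id, segments, error_message).

    On success, segments is the result list and error_message is None.
    On failure, segments is empty and error_message holds the message field.
    """
    body = json.loads(raw_body)
    data = body.get("data") or {}          # tolerate a null data field
    task_id = data.get("task_id")
    if body.get("status") == "00000":
        return task_id, data.get("result") or [], None
    return task_id, [], body.get("message")
```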
Query Tasks API
Lists tasks that match the given filter conditions. Tasks already cleared under the expiration policy in the configuration file are not returned.
Request Line
GET /getTasks
Request Parameters
The client sends a request with the following query parameters:
| Parameter | Description | Default |
|---|---|---|
| time | Filter by upload time: only audio uploaded within the last n hours | No time limit |
| type | Filter by type, 0: Failed, 1: Completed, 2: Processing, 3: Queued | No type limit |
| count | Filter by quantity, the upper limit of the number of tasks returned | No quantity limit |
When several filter conditions are given, they are combined with a logical AND.
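Building the query string for this interface can be sketched as below. The helper name `tasks_query` and the symbolic names in `TASK_TYPES` are illustrative assumptions; the numeric type values and the parameter names time, type, and count come from the table above.

```python
from urllib.parse import urlencode

# Symbolic names (hypothetical) for the documented type values.
TASK_TYPES = {"failed": 0, "completed": 1, "processing": 2, "queued": 3}


def tasks_query(hours=None, task_type=None, count=None):
    """Build the path and query string for GET /getTasks; omitted filters are not sent."""
    query = {}
    if hours is not None:
        query["time"] = hours               # uploaded within the last n hours
    if task_type is not None:
        query["type"] = TASK_TYPES[task_type]
    if count is not None:
        query["count"] = count              # upper limit on returned tasks
    return "/getTasks" + ("?" + urlencode(query) if query else "")
```

For example, `tasks_query(hours=24, task_type="failed", count=10)` asks for at most ten tasks that failed within the last 24 hours.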
Response Parameters
The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.
The response result fields are as follows:
| Parameter | Type | Description |
|---|---|---|
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict[] | Task information |
| ├─ task_id | String | Task ID |
| ├─ desc | String | Description of the transcription status |
| ├─ file_name | String | Name of the file being transcribed |
| ├─ insert_time | String | Time of the request for file transcription |
| ├─ process_time | String | Time when the transcription actually started after queuing |
| ├─ finish_time | String | Time when the transcription task was completed |
| ├─ progress | Integer | Transcription progress. A negative number -n indicates queuing, with n transcription tasks ahead; A positive number n indicates that transcription is in progress, with a processing progress of n percent; 0 indicates transcription failure; 100 indicates transcription completed. |
| statistics | Dict | Statistical information |
| ├─ count | Integer | Total number of tasks in the query results |
| ├─ types | Dict[] | Summary of quantities by type |
| ├─├─ type | Integer | 0: Failed, 1: Completed, 2: Processing, 3: Queued |
| ├─├─ desc | String | Description of the transcription status |
| ├─├─ count | Integer | Number of tasks |
Delete Transcription Task API
This API deletes tasks that are queued or completed. Tasks that are currently being processed are deleted after the pre-processing phase is completed.
Request Line
POST /deleteTask
Request Parameters
The client sends a request with the following query parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| task_id | String | Yes | Task ID, returned by the Request File Transcription API |
Response Results
The Content-Type field in the HTTP Headers of all interface responses is application/json, and the response results are in the HTTP Body.
The response result fields are as follows:
| Name | Type | Description |
|---|---|---|
| status | String | Status code |
| message | String | Description of the status code |
| data | Dict | null |
Service Status Codes
Every response from the service will include the status and message fields, which indicate the status code and the description of the status code for the response. The service status codes are as follows:
| Status Code | Reason |
|---|---|
| 20302 | Progress update from getResult API |
| 20001 | Parameters parsing failed |
| 20002, 20003, 20006 | File processing failed |
| 20111 | WebSocket upgrading failed |
| 20114 | Request body is empty or failed to retrieve the audio file. |
| 20115 | Audio size/duration limit exceeded |
| 20116 | Unsupported sample_rate |
| 20190 | Missing parameter |
| 20191 | Invalid parameter |
| 20192 | Processing failed (decoder) |
| 20193 | Processing failed (service) |
| 20194 | Connection timed out (no data received from the client for a long time) |
| 20195 | Other error |
| 20330 | Speaker diarization related error |
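A client can keep the table above as a lookup for logging and error reporting. The sketch below is illustrative: `explain_status` is a hypothetical helper, and it falls back to the message field of the response because some codes that appear in the examples in this document (such as 20322 and 20326) are not listed in the table.

```python
# Status code -> reason, copied from the Service Status Codes table.
STATUS_REASONS = {
    "20302": "Progress update from getResult API",
    "20001": "Parameters parsing failed",
    "20002": "File processing failed",
    "20003": "File processing failed",
    "20006": "File processing failed",
    "20111": "WebSocket upgrading failed",
    "20114": "Request body is empty or failed to retrieve the audio file.",
    "20115": "Audio size/duration limit exceeded",
    "20116": "Unsupported sample_rate",
    "20190": "Missing parameter",
    "20191": "Invalid parameter",
    "20192": "Processing failed (decoder)",
    "20193": "Processing failed (service)",
    "20194": "Connection timed out (no data received from the client for a long time)",
    "20195": "Other error",
    "20330": "Speaker diarization related error",
}


def explain_status(status, message=""):
    """Return a readable reason for a status code, falling back to the message field."""
    if status == "00000":
        return "Success"
    return STATUS_REASONS.get(status, message or "Unknown status code")
```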
Quick Test
Linux
Request File Transcription API
curl -X POST -F 'file=@test.mp3' -F 'lang_type=ja-JP' -F 'format=mp3' -F 'sample_rate=16000' 'http://localhost:7100/request'
Get Transcription Results API
curl -X GET 'http://localhost:7100/getResult?task_id=******'
Delete Transcription Task API
curl -X POST 'http://localhost:7100/deleteTask?task_id=******'
Windows
Request File Transcription API
curl.exe -X POST -F "file=@test.mp3" -F "lang_type=ja-JP" -F "format=mp3" -F "sample_rate=16000" "http://localhost:7100/request"
Get Transcription Results API
curl.exe -X GET "http://localhost:7100/getResult?task_id=******"
Delete Transcription Task API
curl.exe -X POST "http://localhost:7100/deleteTask?task_id=******"