Cloud API
Functional Introduction
The Speech Synthesis service synthesizes input text into binary audio data.
It supports output in wav, pcm, and mp3 encoding formats.
It supports setting the voice, emotion, speech rate, pitch rate, and volume.
WebSocket Interface (Streaming)
Send text and receive audio data over a full-duplex WebSocket stream to improve the real-time performance of synthesis.
Call Constraint
The input text must be encoded in UTF-8.
The input text must not exceed 1024 bytes, which is approximately 300 Chinese characters.
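Both constraints can be checked on the client before any request is sent. The following is a minimal sketch; the helper name `check_text` and the constant `MAX_TEXT_BYTES` are illustrative and not part of the API.

```python
MAX_TEXT_BYTES = 1024  # limit stated in the call constraints


def check_text(text: str) -> bytes:
    """Return the UTF-8 encoding of text, or raise ValueError if it is too long."""
    encoded = text.encode("utf-8")
    if len(encoded) > MAX_TEXT_BYTES:
        raise ValueError(f"text is {len(encoded)} bytes; the limit is {MAX_TEXT_BYTES}")
    return encoded
```

The limit is counted in bytes, not characters: most Chinese characters occupy 3 bytes in UTF-8, so the sentence 今天是星期三。 (7 characters) uses 21 bytes.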
Service Address
wss://api.voice.dolphin-ai.jp/v1/tts/ws
Interaction Process

1. Authorization
When the client establishes a WebSocket connection to the server, the handshake request must include the following header:
| Name | Type | Required | Description |
|---|---|---|---|
| Authorization | String | Yes | Standard HTTP header for authentication. The format must be in the standard Bearer <Token> form (note the space after Bearer). |
For more information on authorization, refer to Authentication and Authorization.
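As a small illustration of the header format, the sketch below builds the header as a dictionary. The function name `build_auth_headers` is hypothetical, and how the dictionary is passed to the handshake depends on the WebSocket client library in use.

```python
def build_auth_headers(token: str) -> dict[str, str]:
    """Build the handshake header: 'Bearer', one space, then the token."""
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    return {"Authorization": f"Bearer {token}"}
```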
2. Start and Send Parameters
The client sends a speech synthesis request as a JSON message that carries the synthesis parameters.
Send Parameter (header object):
| Parameter | Type | Required | Description |
|---|---|---|---|
| namespace | String | Yes | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis |
| name | String | Yes | Event name: StartSynthesis indicates the start phase |
Send Parameter (payload object):
| Parameter | Type | Required | Description | Default Value |
|---|---|---|---|---|
| text | String | Yes | Text to be synthesized, length limit: 1024 bytes (UTF-8 encoding) | No |
| lang_type | String | Yes | Language option, refer to Development Guide - Language and Voice Support | No |
| voice | String | No | Voice ID, refer to Development Guide - Language and Voice Support | Japanese: Yuko; English: Julie; Chinese: Xiaohui |
| sample_rate | Integer | No | Audio sample rate, options are 8000, 16000, 24000 | 24000 |
| format | String | No | Audio encoding format: wav, pcm, or mp3. Note: wav does not support streaming | pcm |
| speech_rate | Float | No | Speech rate, parameter range [0.2, 3]; one decimal place is usually sufficient | 1 |
| volume | Float | No | Volume, parameter range [0.1, 3]; one decimal place is usually sufficient | 1 |
| pitch_rate | Float | No | Pitch rate, parameter range [0.1, 3]; one decimal place is usually sufficient | 1 |
| emotion | String | No | Emotion/style, refer to Development Guide - Language and Voice Support | No |
| silence_duration | Integer | No | Silence duration at the end of the sentence, parameter range [0, 10000], in ms | 125 |
| enable_timestamp | Boolean | No | Set to true to have timestamps for the original text returned. Note: multiple consecutive punctuation marks or spaces in the original text are still processed, and they do not affect the continuity of the timestamps | false |
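The allowed values and ranges in the table above can be validated on the client before the message is sent. The sketch below is one way to do that; `build_start_message` and the constants are hypothetical names, and the server remains the authority on what it accepts.

```python
import json

# Allowed values and ranges, copied from the payload table above.
SAMPLE_RATES = {8000, 16000, 24000}
FORMATS = {"wav", "pcm", "mp3"}
RANGES = {
    "speech_rate": (0.2, 3),
    "volume": (0.1, 3),
    "pitch_rate": (0.1, 3),
    "silence_duration": (0, 10000),
}


def build_start_message(text: str, lang_type: str, **options) -> str:
    """Serialize a StartSynthesis request; options are optional payload fields."""
    if len(text.encode("utf-8")) > 1024:
        raise ValueError("text exceeds 1024 bytes in UTF-8")
    if options.get("sample_rate", 24000) not in SAMPLE_RATES:
        raise ValueError("sample_rate must be 8000, 16000, or 24000")
    if options.get("format", "pcm") not in FORMATS:
        raise ValueError("format must be wav, pcm, or mp3")
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be in [{low}, {high}]")
    message = {
        "header": {"namespace": "SpeechSynthesizer", "name": "StartSynthesis"},
        "payload": {"text": text, "lang_type": lang_type, **options},
    }
    # ensure_ascii=False keeps non-ASCII text readable in the serialized JSON.
    return json.dumps(message, ensure_ascii=False)
```

Optional fields that are left out are simply omitted from the payload, so the server-side defaults listed in the table apply.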
Send Example:
{
"header":{
"namespace":"SpeechSynthesizer",
"name":"StartSynthesis"
},
"payload":{
"lang_type":"zh-cmn-Hans-CN",
"voice":"",
"text":"今天是星期三。",
"format":"pcm",
"sample_rate":16000,
"volume":1,
"speech_rate":1,
"pitch_rate":1,
"enable_timestamp":true
}
}
Return Parameter (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis |
| name | String | Event name: SynthesisStarted indicates the start phase |
| status | String | Status code |
| status_text | String | Description of the status code |
| app_id | String | Application ID |
| task_id | String | Globally unique task ID, please record this value for troubleshooting. |
| message_id | String | ID of the current message |
Return example:
{
"header": {
"namespace": "SpeechSynthesizer",
"name": "SynthesisStarted",
"status": "000000",
"status_text": "Success",
"app_id": "33a7d8c6-6cbe-40ff7-95f3-0a89bf6eb1fb",
"task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
"message_id": "f0cfbfa3-ed78-400b-9c62-2a904ea48961"
},
"payload": {}
}
3. Receiving Synthesized Audio Data
The server will continuously return the synthesized audio binary data, audio duration, and timestamp information.
Note: Timestamp information and audio duration are only returned when enable_timestamp is enabled.
During this phase, the server will ignore all subsequent request messages within the current connection. To send another synthesis request, establish a new connection.
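A client has to tell audio data apart from the JSON events during this phase. The sketch below shows one way to fold received frames into a result object, independent of any particular WebSocket library. It assumes that audio arrives as binary frames and events as text frames, and that a status other than 000000 signals an error on this interface too; neither is stated explicitly here. `SynthesisResult` and `handle_frame` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class SynthesisResult:
    audio: bytearray = field(default_factory=bytearray)
    duration_ms: Optional[int] = None
    timestamp: Optional[str] = None  # raw JSON string, as delivered
    completed: bool = False


def handle_frame(frame: Union[bytes, str], result: SynthesisResult) -> None:
    """Fold one received WebSocket frame into result."""
    if isinstance(frame, (bytes, bytearray)):
        # Assumption: audio chunks arrive as binary frames.
        result.audio.extend(frame)
        return
    message = json.loads(frame)
    header = message["header"]
    if header["status"] != "000000":
        # Assumption: any other status code means the task failed.
        raise RuntimeError(
            f"{header['status']}: {header.get('status_text', '')} "
            f"(task_id={header.get('task_id')})"
        )
    name = header["name"]
    payload = message.get("payload", {})
    if name == "SynthesisDuration":
        result.duration_ms = int(payload["duration"])
    elif name == "SynthesisTimestamp":
        result.timestamp = payload["timestamp"]
    elif name == "SynthesisCompleted":
        result.completed = True
    # Other events, such as SynthesisStarted, need no action here.
```

With a WebSocket client library, `handle_frame` would be called for every received message until `result.completed` is true or the server closes the connection.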
Audio Duration Information:
Return Parameter (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis |
| name | String | Event name: SynthesisDuration indicates audio duration information |
| status | String | Status code |
| status_text | String | Description of the status code |
| app_id | String | Application ID |
| task_id | String | Globally unique task ID, please record this value for troubleshooting. |
| message_id | String | ID of the current message |
Return Parameter (payload object):
| Parameter | Type | Description |
|---|---|---|
| duration | String | Length of the audio, in ms |
Return Example:
{
"header": {
"namespace": "SpeechSynthesizer",
"name": "SynthesisDuration",
"status": "000000",
"status_text": "Success",
"app_id": "33a7d8c6-6cbe-40ff7-95f3-0a89bf6eb1fb",
"task_id": "c43bdfda-0019-35cc-be0d-a85643b104d9",
"message_id": "259e92ca-24f4-4958-9fe1-aa93640fd4dd"
},
"payload": {
"duration": "1305"
}
}
Timestamp Information:
Return Parameter (header object):
| Parameter | Type | Description |
|---|---|---|
| namespace | String | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis |
| name | String | Event name: SynthesisTimestamp indicates timestamp information |
| status | String | Status code |
| status_text | String | Description of the status code |
| app_id | String | Application ID |
| task_id | String | Globally unique task ID, please record this value for troubleshooting. |
| message_id | String | ID of the current message |
Return Parameter (payload object):
| Parameter | Type | Description |
|---|---|---|
| timestamp | String | Timestamp information, serialized as a JSON string |
| ├─words | List | Word-level timestamp information |
| ├─phonemes | List | Phoneme-level timestamp information |
Each entry in words (word-level timestamp information) has the following fields:
| Parameter | Type | Description |
|---|---|---|
| word | String | Text |
| start_time | Float | Start time, in s |
| end_time | Float | End time, in s |
| unit_type | String | Type: text represents plain text, and mark denotes a punctuation mark. |
Each entry in phonemes (phoneme-level timestamp information) has the following fields:
| Parameter | Type | Description |
|---|---|---|
| phone | String | Phoneme |
| start_time | Float | Start time, in s |
| end_time | Float | End time, in s |
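Because the timestamp field has type String, the event message has to be decoded once as JSON and the field itself a second time. The following sketch covers that second step; the function names are illustrative only.

```python
import json


def parse_timestamp(raw: str) -> tuple[list[dict], list[dict]]:
    """Decode the timestamp string into (words, phonemes) lists."""
    decoded = json.loads(raw)
    return decoded.get("words", []), decoded.get("phonemes", [])


def spoken_text(words: list[dict]) -> str:
    """Join the entries whose unit_type is 'text', skipping punctuation marks."""
    return "".join(w["word"] for w in words if w["unit_type"] == "text")
```

Times are in seconds, so multiply by 1000 before comparing them with the duration value, which is in milliseconds.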
Return Example:
{
"header": {
"namespace": "SpeechSynthesizer",
"name": "SynthesisTimestamp",
"status": "000000",
"status_text": "Success",
"app_id": "33a7d8c6-0cbe-56ff7-95f3-0a89bf6eb1fb",
"task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
"message_id": "e70e5628-be9a-426b-87d3-a2f4527a490a"
},
"payload": {
"timestamp": "{\"words\":[{\"word\":\"今\",\"start_time\":0.02,\"end_time\":0.175,\"unit_type\":\"text\"},{\"word\":\"天\",\"start_time\":0.175,\"end_time\":0.355,\"unit_type\":\"text\"},{\"word\":\"是\",\"start_time\":0.355,\"end_time\":0.55,\"unit_type\":\"text\"},{\"word\":\"星\",\"start_time\":0.55,\"end_time\":0.725,\"unit_type\":\"text\"},{\"word\":\"期\",\"start_time\":0.725,\"end_time\":0.865,\"unit_type\":\"text\"},{\"word\":\"三\",\"start_time\":0.865,\"end_time\":1.145,\"unit_type\":\"text\"},{\"word\":\"。\",\"start_time\":1.145,\"end_time\":1.305,\"unit_type\":\"mark\"}],\"phonemes\":[{\"phone\":\"C0j\",\"start_time\":0.02,\"end_time\":0.085},{\"phone\":\"C0in\",\"start_time\":0.085,\"end_time\":0.175},{\"phone\":\"C0t\",\"start_time\":0.175,\"end_time\":0.25},{\"phone\":\"C0ian\",\"start_time\":0.25,\"end_time\":0.355},{\"phone\":\"C0sh\",\"start_time\":0.355,\"end_time\":0.485},{\"phone\":\"C0iii\",\"start_time\":0.485,\"end_time\":0.55},{\"phone\":\"C0x\",\"start_time\":0.55,\"end_time\":0.655},{\"phone\":\"C0ing\",\"start_time\":0.655,\"end_time\":0.725},{\"phone\":\"C0q\",\"start_time\":0.725,\"end_time\":0.825},{\"phone\":\"C0i\",\"start_time\":0.825,\"end_time\":0.865},{\"phone\":\"C0s\",\"start_time\":0.865,\"end_time\":0.965},{\"phone\":\"C0an\",\"start_time\":0.965,\"end_time\":1.145},{\"phone\":\"。\",\"start_time\":1.145,\"end_time\":1.305}]}"
}
}
4. Speech Synthesis Complete
When speech synthesis is complete, the server sends a synthesis completion event and then closes the connection automatically.
Return Example:
{
"header": {
"namespace": "SpeechSynthesizer",
"name": "SynthesisCompleted",
"status": "000000",
"status_text": "Success",
"app_id": "33a7d8c6-0cbe-56ff7-95f3-0a89bf6eb1fb",
"task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
"message_id": "031aab78-c7ab-45ac-8f4c-4dfaf45855b6"
},
"payload": {}
}
HTTP Interface (non-streaming)
Send the text in a POST request; once all content has been synthesized, the server returns all the audio data at once.
Call Constraint
- The input text must be encoded in UTF-8.
- The input text must not exceed 1024 bytes, which is approximately 300 Chinese characters.
Interaction Process
The client sends an HTTP POST request with text content to the server, and the server returns an HTTP response containing synthesized speech data.
HTTP Request Line
| Protocol | URL | Method |
|---|---|---|
| HTTP | https://api.voice.dolphin-ai.jp/v1/tts/ws | POST |
Request Header
The HTTP request header consists of key/value pairs, one pair per line, with the key and value separated by a colon (:). Set the following header:
| Name | Type | Required | Description |
|---|---|---|---|
| Content-Type | String | Yes | It must be "application/json", indicating that the data in the HTTP body is in JSON format |
Request Parameter
The client sends a speech synthesis request with the parameters set in the request body. The parameters are as follows:
| Name | Type | Required | Description | Default Value |
|---|---|---|---|---|
| text | String | Yes | Text to be synthesized, length limit: 1024 bytes (UTF-8 encoding) | No |
| lang_type | String | Yes | Language option, refer to Development Guide - Language and Voice Support | No |
| voice | String | No | Voice ID, refer to Development Guide - Language and Voice Support | Japanese: Yuko; English: Julie; Chinese: Xiaohui |
| sample_rate | Integer | No | Audio sample rate, options are 8000, 16000, 24000 | 24000 |
| format | String | No | Audio encoding format: wav, pcm, or mp3. Note: wav does not support streaming | pcm |
| speech_rate | Float | No | Speech rate, parameter range [0.2, 3]; one decimal place is usually sufficient | 1 |
| volume | Float | No | Volume, parameter range [0.1, 3]; one decimal place is usually sufficient | 1 |
| pitch_rate | Float | No | Pitch rate, parameter range [0.1, 3]; one decimal place is usually sufficient | 1 |
| emotion | String | No | Emotion/style, refer to Development Guide - Language and Voice Support | No |
| silence_duration | Integer | No | Silence duration at the end of the sentence, in ms | 125 |
| enable_timestamp | Boolean | No | Set to true to have timestamps for the original text returned. Note: multiple consecutive punctuation marks or spaces in the original text are still processed, and they do not affect the continuity of the timestamps | false |
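For clients written in Python, the request described above can be assembled with the standard library alone. This is a sketch under stated assumptions: `build_tts_request` and `TTS_URL` are hypothetical names, and the function only builds the request without sending it.

```python
import json
import urllib.request

TTS_URL = "https://api.voice.dolphin-ai.jp/v1/tts/ws"  # URL from the request line table


def build_tts_request(text: str, lang_type: str, **options) -> urllib.request.Request:
    """Build (but do not send) the POST request for the non-streaming interface."""
    body = json.dumps({"text": text, "lang_type": lang_type, **options}, ensure_ascii=False)
    return urllib.request.Request(
        TTS_URL,
        data=body.encode("utf-8"),  # the body must be UTF-8 encoded
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)`. This section lists only Content-Type as a request header; whether an Authorization header is also needed is not stated here, so check Authentication and Authorization.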
Request Body Example
curl --location --request POST 'https://api.voice.dolphin-ai.jp/v1/tts/ws' \
--header 'Content-Type: application/json' \
--data '{
"text":"今天天气不错。",
"lang_type":"zh-cmn-Hans-CN",
"format":"wav"
}'
Response Result
The Content-Type header of the response is application/json, and the result is carried in the response body. The fields of the response result are as follows:
| Name | Type | Description |
|---|---|---|
| status | String | Status code |
| message | String | Status code description |
| data | Object | |
| ├─task_id | String | Globally unique task ID, please record this value for troubleshooting. |
| ├─result | String | Base64 encoded audio data |
| ├─duration | String | Length of the audio, in ms |
| ├─timestamp | String | Timestamp information, returned only when enable_timestamp is enabled. |
| ├─├─words | List<Word> | Word-level timestamp information |
| ├─├─├─word | String | Text |
| ├─├─├─start_time | Float | Start time, in s |
| ├─├─├─end_time | Float | End time, in s |
| ├─├─├─unit_type | String | Type: text represents plain text, and mark denotes a punctuation mark. |
| ├─├─phonemes | List<Phoneme> | Phoneme-level timestamp information |
| ├─├─├─phone | String | Phoneme |
| ├─├─├─start_time | Float | Start time, in s |
| ├─├─├─end_time | Float | End time, in s |
Successful Response
A status field of 000000 in the body indicates a successful response. The result field within data holds the Base64-encoded synthesized audio; decoding it yields the audio file. Example of a successful response:
{
"status": "000000",
"message": "Success",
"data": {
"task_id": "569ba392-f7d4-4f60-8685-b572a821126d",
"duration": "1100",
"result": "//PoxAAAAAAAAAAA...",
"timestamp": "{\"words\":[{\"word\":\"今\",\"start_time\":0.02,\"end_time\":0.17,\"unit_type\":\"text\"},{\"word\":\"天\",\"start_time\":0.17,\"end_time\":0.345,\"unit_type\":\"text\"},{\"word\":\"天\",\"start_time\":0.345,\"end_time\":0.515,\"unit_type\":\"text\"},{\"word\":\"气\",\"start_time\":0.515,\"end_time\":0.66,\"unit_type\":\"text\"},{\"word\":\"不\",\"start_time\":0.66,\"end_time\":0.775,\"unit_type\":\"text\"},{\"word\":\"错\",\"start_time\":0.775,\"end_time\":1.055,\"unit_type\":\"text\"},{\"word\":\"。\",\"start_time\":1.055,\"end_time\":1.095,\"unit_type\":\"mark\"}],\"phonemes\":[{\"phone\":\"C0j\",\"start_time\":0.02,\"end_time\":0.08},{\"phone\":\"C0in\",\"start_time\":0.08,\"end_time\":0.17},{\"phone\":\"C0t\",\"start_time\":0.17,\"end_time\":0.235},{\"phone\":\"C0ian\",\"start_time\":0.235,\"end_time\":0.345},{\"phone\":\"C0t\",\"start_time\":0.345,\"end_time\":0.43},{\"phone\":\"C0ian\",\"start_time\":0.43,\"end_time\":0.515},{\"phone\":\"C0q\",\"start_time\":0.515,\"end_time\":0.625},{\"phone\":\"C0i\",\"start_time\":0.625,\"end_time\":0.66},{\"phone\":\"C0b\",\"start_time\":0.66,\"end_time\":0.73},{\"phone\":\"C0u\",\"start_time\":0.73,\"end_time\":0.775},{\"phone\":\"C0c\",\"start_time\":0.775,\"end_time\":0.9},{\"phone\":\"C0uo\",\"start_time\":0.9,\"end_time\":1.055},{\"phone\":\"。\",\"start_time\":1.055,\"end_time\":1.095}]}"
}
}
Failed Response
Any status value other than 000000 in the body indicates a failed response, so this field can be used to tell success from failure. Example of a failed response:
{
"status": "300000",
"message": "lang_type Invalid Parameter",
"data": {
"task_id": "a9017448-efa3-4159-9b95-770d81859ed9",
"duration": "",
"result": "",
"timestamp": ""
}
}
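The success and failure cases can be handled in one place on the client. The sketch below checks the status field and Base64-decodes the audio; `decode_response` and `SynthesisError` are hypothetical names introduced for illustration.

```python
import base64
import json


class SynthesisError(Exception):
    """Raised when the status field is not '000000'."""


def decode_response(body: str) -> tuple[bytes, dict]:
    """Return (audio_bytes, data) from a response body, or raise SynthesisError."""
    response = json.loads(body)
    data = response.get("data") or {}
    if response["status"] != "000000":
        raise SynthesisError(
            f"{response['status']}: {response.get('message', '')} "
            f"(task_id={data.get('task_id', '')})"
        )
    return base64.b64decode(data["result"]), data
```

Including the task_id in the error message keeps it available for troubleshooting, as the field description recommends. The returned bytes can be written directly to a file whose extension matches the requested format.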