Cloud API

Functional Introduction

The Speech Synthesis service provides the functionality to synthesize input text into binary audio data.

Supports output of data in wav, pcm, and mp3 encoding formats.

Supports setting voice, emotion, speech rate, pitch rate, and volume.

WebSocket Interface (Streaming)

Send text and receive audio data through WebSocket duplex streaming to enhance the real-time performance of synthesis.

Call Constraint

The input text must be encoded in UTF-8.

The input text must not exceed 1024 bytes, which is approximately 300 Chinese characters.
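
The two constraints above can be checked on the client before a request is sent. The following is a minimal sketch; the helper name `check_text` and the constant `MAX_TEXT_BYTES` are illustrative and not part of the service API.

```python
MAX_TEXT_BYTES = 1024  # limit stated in the call constraints


def check_text(text: str) -> bytes:
    """Encode the text as UTF-8 and enforce the 1024-byte limit."""
    encoded = text.encode("utf-8")
    if len(encoded) > MAX_TEXT_BYTES:
        raise ValueError(
            f"text is {len(encoded)} bytes in UTF-8; the limit is {MAX_TEXT_BYTES}"
        )
    return encoded
```

Note that the limit applies to the encoded byte length, not to the number of characters, so the check is done after encoding.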

Service Address

wss://api.voice.dolphin-ai.jp/v1/tts/ws

Interaction Process

1. Authorization

When the client establishes a WebSocket connection to the server, the connection request must include the following header:

Name	Type	Required	Description
Authorization	String	Yes	Standard HTTP header for authentication. The format must be in the standard Bearer <Token> form (note the space after Bearer).

For more information on authorization, refer to Authentication and Authorization.
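
As an illustration of the header format described above, the sketch below builds the header value from a token. The function name `build_auth_headers` is hypothetical, and how the resulting dictionary is passed to the WebSocket handshake depends on the client library in use, which is not shown here.

```python
def build_auth_headers(token: str) -> dict:
    """Return the Authorization header in the 'Bearer <Token>' form."""
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    # Exactly one space separates the scheme name from the token.
    return {"Authorization": f"Bearer {token}"}
```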

2. Start and Send Parameter

The client sends a speech synthesis request as a JSON message that carries the parameter settings.

Send parameter (header object):

Parameter	Type	Required	Description
namespace	String	Yes	Namespace of the message: `SpeechSynthesizer` indicates Speech Synthesis
name	String	Yes	Event name: `StartSynthesis` indicates the start phase

Send Parameter (payload object):

Parameter	Type	Required	Description	Default Value
text	String	Yes	Text to be synthesized, length limit: 1024 bytes (UTF-8 encoding)	No
lang_type	String	Yes	Language option, refer to Development Guide - Language and Voice Support	No
voice	String	No	Voice ID, refer to Development Guide - Language and Voice Support	Japanese: Yuko; English: Julie; Chinese: Xiaohui
sample_rate	Integer	No	Audio sample rate, options are 8000, 16000, 24000	24000
format	String	No	Audio encoding format: wav, pcm, or mp3. Note: wav does not support streaming	pcm
speech_rate	Float	No	Speech rate, parameter range [0.2, 3], usually retaining one decimal place is sufficient	1
volume	Float	No	Volume, parameter range [0.1, 3], usually retaining one decimal place is sufficient	1
pitch_rate	Float	No	Pitch rate, parameter range [0.1, 3], usually retaining one decimal place is sufficient	1
emotion	String	No	Emotion/style, refer to Development Guide - Language and Voice Support	No
silence_duration	Integer	No	Silence duration at the end of the sentence, parameter range [0, 10000], in ms	125
enable_timestamp	Boolean	No	Whether to return timestamps for the original text; set to true to enable. Note: multiple consecutive punctuation marks or spaces in the original text are still processed, and this does not affect the continuity of the timestamps	false

Send Example:

{
    "header":{
        "namespace":"SpeechSynthesizer",
        "name":"StartSynthesis"
    },
    "payload":{
        "lang_type":"zh-cmn-Hans-CN",
        "voice":"",
        "text":"今天是星期三。",
        "format":"pcm",
        "sample_rate":16000,
        "volume":1,
        "speech_rate":1,
        "pitch_rate":1,
        "enable_timestamp":true
    }
}
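
A request message like the one above could be assembled programmatically. The sketch below validates the documented ranges and options before serializing; the function name `build_start_message` and the lookup tables are illustrative, and only the parameter names and limits come from the tables in this section.

```python
import json

# Ranges and options taken from the payload parameter table.
RANGES = {
    "speech_rate": (0.2, 3),
    "volume": (0.1, 3),
    "pitch_rate": (0.1, 3),
    "silence_duration": (0, 10000),
}
SAMPLE_RATES = {8000, 16000, 24000}
FORMATS = {"wav", "pcm", "mp3"}


def build_start_message(text: str, lang_type: str, **options) -> str:
    """Build the StartSynthesis message as a JSON string."""
    if "sample_rate" in options and options["sample_rate"] not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {options['sample_rate']}")
    if "format" in options and options["format"] not in FORMATS:
        raise ValueError(f"unsupported format: {options['format']}")
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be within [{low}, {high}]")
    message = {
        "header": {"namespace": "SpeechSynthesizer", "name": "StartSynthesis"},
        "payload": {"text": text, "lang_type": lang_type, **options},
    }
    # ensure_ascii=False keeps non-ASCII text readable in the JSON output.
    return json.dumps(message, ensure_ascii=False)
```

Optional parameters that are omitted are left out of the payload, so the server-side defaults listed in the table apply.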

Return Parameter (header object):

Parameter	Type	Description
namespace	String	Namespace of the message: `SpeechSynthesizer` indicates Speech Synthesis
name	String	Event name: `SynthesisStarted` indicates the start phase
status	String	Status code
status_text	String	Description of the status code
app_id	String	Application ID
task_id	String	Globally unique task ID, please record this value for troubleshooting.
message_id	String	ID of the current message

Return example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisStarted",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-6cbe-40ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
        "message_id": "f0cfbfa3-ed78-400b-9c62-2a904ea48961"
    },
    "payload": {}
}

3. Receiving Synthesized Audio Data

The server will continuously return the synthesized audio binary data, audio duration, and timestamp information.

Note: Timestamp information and audio duration are returned only when enable_timestamp is enabled.

During this phase, the server will ignore all subsequent request messages within the current connection. To send another synthesis request, establish a new connection.

Audio Duration Information:

Return Parameter (header object):

Parameter	Type	Description
namespace	String	Namespace of the message: `SpeechSynthesizer` indicates Speech Synthesis
name	String	Event name: `SynthesisDuration` indicates audio duration information
status	String	Status code
status_text	String	Description of the status code
app_id	String	Application ID
task_id	String	Globally unique task ID, please record this value for troubleshooting.
message_id	String	ID of the current message

Return Parameter (payload object):

Parameter	Type	Description
duration	String	Length of the audio, in ms

Return Example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisDuration",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-6cbe-40ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-0019-35cc-be0d-a85643b104d9",
        "message_id": "259e92ca-24f4-4958-9fe1-aa93640fd4dd"
    },
    "payload": {
        "duration": "1305"
    }
}

Timestamp Information:

Return Parameter (header object):

Parameter	Type	Description
namespace	String	Namespace of the message: `SpeechSynthesizer` indicates Speech Synthesis
name	String	Event name: `SynthesisTimestamp` indicates timestamp information
status	String	Status code
status_text	String	Description of the status code
app_id	String	Application ID
task_id	String	Globally unique task ID, please record this value for troubleshooting.
message_id	String	ID of the current message

Return Parameter (payload object):

Parameter	Type	Description
timestamp	String	Timestamp information, a JSON-encoded string containing the `words` and `phonemes` fields
├─words	`List<Word>`	Word-level timestamp information
├─phonemes	`List<Phoneme>`	Phoneme-level timestamp information

Word-level timestamp information (words object):

Parameter	Type	Description
word	String	Text
start_time	Float	Start time, in s
end_time	Float	End time, in s
unit_type	String	Type: `text` represents plain text, and `mark` denotes punctuation marks.

Phoneme-level timestamp information:

Parameter	Type	Description
phone	String	Phoneme
start_time	Float	Start time, in s
end_time	Float	End time, in s

Return Example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisTimestamp",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-0cbe-56ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
        "message_id": "e70e5628-be9a-426b-87d3-a2f4527a490a"
    },
    "payload": {
        "timestamp": "{\"words\":[{\"word\":\"今\",\"start_time\":0.02,\"end_time\":0.175,\"unit_type\":\"text\"},{\"word\":\"天\",\"start_time\":0.175,\"end_time\":0.355,\"unit_type\":\"text\"},{\"word\":\"是\",\"start_time\":0.355,\"end_time\":0.55,\"unit_type\":\"text\"},{\"word\":\"星\",\"start_time\":0.55,\"end_time\":0.725,\"unit_type\":\"text\"},{\"word\":\"期\",\"start_time\":0.725,\"end_time\":0.865,\"unit_type\":\"text\"},{\"word\":\"三\",\"start_time\":0.865,\"end_time\":1.145,\"unit_type\":\"text\"},{\"word\":\"。\",\"start_time\":1.145,\"end_time\":1.305,\"unit_type\":\"mark\"}],\"phonemes\":[{\"phone\":\"C0j\",\"start_time\":0.02,\"end_time\":0.085},{\"phone\":\"C0in\",\"start_time\":0.085,\"end_time\":0.175},{\"phone\":\"C0t\",\"start_time\":0.175,\"end_time\":0.25},{\"phone\":\"C0ian\",\"start_time\":0.25,\"end_time\":0.355},{\"phone\":\"C0sh\",\"start_time\":0.355,\"end_time\":0.485},{\"phone\":\"C0iii\",\"start_time\":0.485,\"end_time\":0.55},{\"phone\":\"C0x\",\"start_time\":0.55,\"end_time\":0.655},{\"phone\":\"C0ing\",\"start_time\":0.655,\"end_time\":0.725},{\"phone\":\"C0q\",\"start_time\":0.725,\"end_time\":0.825},{\"phone\":\"C0i\",\"start_time\":0.825,\"end_time\":0.865},{\"phone\":\"C0s\",\"start_time\":0.865,\"end_time\":0.965},{\"phone\":\"C0an\",\"start_time\":0.965,\"end_time\":1.145},{\"phone\":\"。\",\"start_time\":1.145,\"end_time\":1.305}]}"
    }
}
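
In the example above, `payload.timestamp` is a string that itself contains JSON, so it has to be decoded a second time after the message is parsed. A minimal sketch, with the hypothetical helper name `parse_timestamp_event`:

```python
import json


def parse_timestamp_event(raw: str):
    """Return (words, phonemes) from a SynthesisTimestamp message."""
    event = json.loads(raw)
    if event["header"]["name"] != "SynthesisTimestamp":
        raise ValueError(f"unexpected event: {event['header']['name']}")
    # The timestamp field is a JSON document encoded as a string.
    timestamp = json.loads(event["payload"]["timestamp"])
    return timestamp["words"], timestamp["phonemes"]
```

Each entry in `words` then carries `word`, `start_time`, `end_time`, and `unit_type`, and each entry in `phonemes` carries `phone`, `start_time`, and `end_time`, as in the tables above.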

4. Speech Synthesis Complete

When speech synthesis is complete, the server sends a synthesis completion event and then closes the connection automatically.

Return Example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisCompleted",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-0cbe-56ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
        "message_id": "031aab78-c7ab-45ac-8f4c-4dfaf45855b6"
    },
    "payload": {}
}
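
The receive phase described in steps 3 and 4 can be sketched as a small message handler that is independent of any particular WebSocket library. This sketch makes two assumptions that are not stated in this document: audio arrives as binary frames while events arrive as JSON text frames, and a `status` other than `000000` in an event signals a failure. The names `SynthesisState` and `handle_message` are illustrative.

```python
import json
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class SynthesisState:
    audio: bytearray = field(default_factory=bytearray)
    duration_ms: Optional[int] = None
    task_id: Optional[str] = None
    completed: bool = False


def handle_message(state: SynthesisState, message: Union[bytes, str]) -> SynthesisState:
    """Update the state with one received WebSocket message."""
    if isinstance(message, (bytes, bytearray)):
        # Assumption: binary frames carry synthesized audio data.
        state.audio.extend(message)
        return state
    event = json.loads(message)
    header = event["header"]
    if header["status"] != "000000":
        # Assumption: non-"000000" status marks a failed event.
        raise RuntimeError(f"{header['status']}: {header.get('status_text')}")
    state.task_id = header.get("task_id", state.task_id)
    if header["name"] == "SynthesisDuration":
        state.duration_ms = int(event["payload"]["duration"])
    elif header["name"] == "SynthesisCompleted":
        state.completed = True
    return state
```

A client loop would call `handle_message` for every frame until `state.completed` is true or the connection closes, and keep `state.task_id` for troubleshooting.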

HTTP Interface (non-streaming)

Send the text via a POST request, and after the synthesis of all content is completed, return all the audio data at once.

Call Constraint

The input text must be encoded in UTF-8.

The input text must not exceed 1024 bytes, which is approximately 300 Chinese characters.

Interaction Process

The client sends an HTTP POST request with text content to the server, and the server returns an HTTP response containing synthesized speech data.

HTTP Request Line

Protocol	URL	Method
HTTP	https://api.voice.dolphin-ai.jp/v1/tts/ws	POST

Request Header

The HTTP request header consists of key/value pairs, one pair per line, with the key and value separated by a colon (:). Set the following header:

Name	Type	Required	Description
Content-Type	String	Yes	Must be "application/json", indicating that the data in the HTTP body is in JSON format

Request Parameter

The client sends a speech synthesis request with the parameters set in the request body. The parameters are as follows:

Name	Type	Required	Description	Default Value
text	String	Yes	Text to be synthesized, length limit: 1024 bytes (UTF-8 encoding)	No
lang_type	String	Yes	Language option, refer to Development Guide - Language and Voice Support	No
voice	String	No	Voice ID, refer to Development Guide - Language and Voice Support	Japanese: Yuko; English: Julie; Chinese: Xiaohui
sample_rate	Integer	No	Audio sample rate, options are 8000, 16000, 24000	24000
format	String	No	Audio encoding format: wav, pcm, or mp3. Note: wav does not support streaming	pcm
speech_rate	Float	No	Speech rate, parameter range [0.2, 3], usually retaining one decimal place is sufficient	1
volume	Float	No	Volume, parameter range [0.1, 3], usually retaining one decimal place is sufficient	1
pitch_rate	Float	No	Pitch rate, parameter range [0.1, 3], usually retaining one decimal place is sufficient	1
emotion	String	No	Emotion/style, refer to Development Guide - Language and Voice Support	No
silence_duration	Integer	No	Silence duration at the end of the sentence, in ms	125
enable_timestamp	Boolean	No	Whether to return timestamps for the original text; set to true to enable. Note: multiple consecutive punctuation marks or spaces in the original text are still processed, and this does not affect the continuity of the timestamps	false

Request Body Example

curl --location --request POST 'https://api.voice.dolphin-ai.jp/v1/tts/ws' \
--header 'Content-Type: application/json' \
--data '{
    "text":"今天天气不错。",
    "lang_type":"zh-cmn-Hans-CN",
    "format":"wav"
}'
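
The same request could be prepared in Python with the standard library. This is a sketch only: the function name `build_tts_request` is illustrative, and it sets just the `Content-Type` header shown in the curl example, so any additional headers a deployment requires would need to be added.

```python
import json
import urllib.request

TTS_URL = "https://api.voice.dolphin-ai.jp/v1/tts/ws"


def build_tts_request(text: str, lang_type: str, **options) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for speech synthesis."""
    body = json.dumps(
        {"text": text, "lang_type": lang_type, **options},
        ensure_ascii=False,
    ).encode("utf-8")  # the input text must be UTF-8 encoded
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can be sent with `urllib.request.urlopen(request)`, and the response body read and decoded as UTF-8 JSON.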

Response Result

Name	Type	Description
status	String	Status code
message	String	Status code description
data	Object
├─task_id	String	Globally unique task ID, please record this value for troubleshooting.
├─result	String	Base64 encoded audio data
├─duration	String	Length of the audio, in ms
├─timestamp	String	Timestamp information, returned only when `enable_timestamp` is enabled.
├─├─words	`List<Word>`	Word-level timestamp information
├─├─├─word	String	Text
├─├─├─start_time	Float	Start time, in s
├─├─├─end_time	Float	End time, in s
├─├─├─unit_type	String	Type `text` represents plain text, and `mark` denotes punctuation marks.
├─├─phonemes	`List<Phoneme>`	Phoneme-level timestamp information
├─├─├─phone	String	Phoneme
├─├─├─start_time	Float	Start time, in s
├─├─├─end_time	Float	End time, in s

In the body, a status field of 000000 indicates a successful response. The result field within data holds the synthesized audio as base64-encoded data; decoding it yields the audio file. Example of a successful response:

{
    "status": "000000",
    "message": "Success",
    "data": {
        "task_id": "569ba392-f7d4-4f60-8685-b572a821126d",
        "duration": "1100",
        "result": "//PoxAAAAAAAAAAA...",
        "timestamp": "{\"words\":[{\"word\":\"今\",\"start_time\":0.02,\"end_time\":0.17,\"unit_type\":\"text\"},{\"word\":\"天\",\"start_time\":0.17,\"end_time\":0.345,\"unit_type\":\"text\"},{\"word\":\"天\",\"start_time\":0.345,\"end_time\":0.515,\"unit_type\":\"text\"},{\"word\":\"气\",\"start_time\":0.515,\"end_time\":0.66,\"unit_type\":\"text\"},{\"word\":\"不\",\"start_time\":0.66,\"end_time\":0.775,\"unit_type\":\"text\"},{\"word\":\"错\",\"start_time\":0.775,\"end_time\":1.055,\"unit_type\":\"text\"},{\"word\":\"。\",\"start_time\":1.055,\"end_time\":1.095,\"unit_type\":\"mark\"}],\"phonemes\":[{\"phone\":\"C0j\",\"start_time\":0.02,\"end_time\":0.08},{\"phone\":\"C0in\",\"start_time\":0.08,\"end_time\":0.17},{\"phone\":\"C0t\",\"start_time\":0.17,\"end_time\":0.235},{\"phone\":\"C0ian\",\"start_time\":0.235,\"end_time\":0.345},{\"phone\":\"C0t\",\"start_time\":0.345,\"end_time\":0.43},{\"phone\":\"C0ian\",\"start_time\":0.43,\"end_time\":0.515},{\"phone\":\"C0q\",\"start_time\":0.515,\"end_time\":0.625},{\"phone\":\"C0i\",\"start_time\":0.625,\"end_time\":0.66},{\"phone\":\"C0b\",\"start_time\":0.66,\"end_time\":0.73},{\"phone\":\"C0u\",\"start_time\":0.73,\"end_time\":0.775},{\"phone\":\"C0c\",\"start_time\":0.775,\"end_time\":0.9},{\"phone\":\"C0uo\",\"start_time\":0.9,\"end_time\":1.055},{\"phone\":\"。\",\"start_time\":1.055,\"end_time\":1.095}]}"
    }
}
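
Decoding a successful response like the one above might look as follows. The helper name `decode_tts_data` is illustrative, and the sketch assumes the `status` field has already been checked.

```python
import base64
import json


def decode_tts_data(response_body: str):
    """Return (audio_bytes, duration_ms, timestamp) from a successful response."""
    data = json.loads(response_body)["data"]
    audio = base64.b64decode(data["result"])
    duration_ms = int(data["duration"])  # duration is a string, in ms
    # timestamp is present only when enable_timestamp was set; it is a
    # JSON document encoded as a string, so it is decoded a second time.
    timestamp = json.loads(data["timestamp"]) if data.get("timestamp") else None
    return audio, duration_ms, timestamp
```

The returned bytes can be written to a file in binary mode, using the extension that matches the requested `format`.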

Failed Response

In the body, if the status field is not 000000, it indicates a failed response. This field can be used as an indicator of whether the response is successful or not. Example of a failed response:

{
    "status": "300000",
    "message": "lang_type Invalid Parameter",
    "data": {
        "task_id": "a9017448-efa3-4159-9b95-770d81859ed9",
        "duration": "",
        "result": "",
        "timestamp": ""
    }
}
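
A client can turn the status check described above into an exception so that failed responses are never mistaken for audio. A minimal sketch, in which the exception class `TTSError` and the function `raise_for_status` are hypothetical names:

```python
class TTSError(Exception):
    """Raised when the response status is not 000000."""

    def __init__(self, status, message, task_id):
        super().__init__(f"{status}: {message} (task_id={task_id})")
        self.status = status
        self.task_id = task_id


def raise_for_status(body: dict) -> dict:
    """Return body['data'] on success, otherwise raise TTSError."""
    if body.get("status") != "000000":
        task_id = (body.get("data") or {}).get("task_id")
        raise TTSError(body.get("status"), body.get("message"), task_id)
    return body["data"]
```

Including the `task_id` in the error keeps the value available for troubleshooting, as recommended in the response table.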
