Short Text to Speech

Cloud API

Functional Introduction

The Speech Synthesis service converts input text into binary audio data.

Supports output of data in wav, pcm, and mp3 encoding formats.

Supports setting voice, emotion, speech rate, pitch rate, and volume.

WebSocket Interface (Streaming)

Send text and receive audio data over a duplex WebSocket stream, which improves the real-time performance of synthesis.

Call Constraint

The input text must be encoded in UTF-8.

The input text must not exceed 1024 bytes, which is approximately 300 Chinese characters.
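
The two constraints above can be checked on the client before a request is sent. The following is a minimal sketch; the names MAX_TEXT_BYTES, utf8_length, and check_text are illustrative and not part of the service.

```python
MAX_TEXT_BYTES = 1024  # limit stated in the constraints above


def utf8_length(text: str) -> int:
    """Return the size of text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def check_text(text: str) -> bytes:
    """Encode text as UTF-8 and reject input over the 1024-byte limit."""
    encoded = text.encode("utf-8")
    if len(encoded) > MAX_TEXT_BYTES:
        raise ValueError(
            f"text is {len(encoded)} bytes in UTF-8; the limit is {MAX_TEXT_BYTES}"
        )
    return encoded
```

Most Chinese characters occupy 3 bytes in UTF-8, so the limit is counted in bytes rather than in characters.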

Service Address

wss://api.voice.dolphin-ai.jp/v1/tts/ws

Interaction Process

[Figure: interaction process diagram, tts_en.jpg]

1. Authorization

When the client establishes a WebSocket connection to the server, the handshake request must include the following header:

Name | Type | Required | Description
--- | --- | --- | ---
Authorization | String | Yes | Standard HTTP header for authentication. The value must use the standard Bearer <Token> form (note the space after Bearer).

For more information on authorization, refer to Authentication and Authorization.
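
As a small illustration of the header format, the sketch below builds the header as a dictionary. The function name auth_headers is hypothetical. How the token is obtained is covered in Authentication and Authorization, and how the dictionary is passed to the WebSocket handshake depends on the client library in use.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build the Authorization header in the 'Bearer <Token>' form."""
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    # A single space separates the scheme name from the token.
    return {"Authorization": f"Bearer {token}"}
```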

2. Start and Send Parameter

The client sends a speech synthesis request as a JSON message, with the synthesis parameters set in the message.

Send parameter (header object):

Parameter | Type | Required | Description
--- | --- | --- | ---
namespace | String | Yes | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis
name | String | Yes | Event name: StartSynthesis indicates the start phase

Send Parameter (payload object):

Parameter | Type | Required | Description | Default Value
--- | --- | --- | --- | ---
text | String | Yes | Text to be synthesized; length limit: 1024 bytes (UTF-8 encoding) | None
lang_type | String | Yes | Language option; refer to Development Guide - Language and Voice Support | None
voice | String | No | Voice ID; refer to Development Guide - Language and Voice Support | Japanese: Yuko; English: Julie; Chinese: Xiaohui
sample_rate | Integer | No | Audio sample rate; options are 8000, 16000, 24000 | 24000
format | String | No | Audio encoding format: wav / pcm / mp3. Note: wav does not support streaming | pcm
speech_rate | Float | No | Speech rate, range [0.2, 3]; one decimal place is usually sufficient | 1
volume | Float | No | Volume, range [0.1, 3]; one decimal place is usually sufficient | 1
pitch_rate | Float | No | Pitch rate, range [0.1, 3]; one decimal place is usually sufficient | 1
emotion | String | No | Emotion/style; refer to Development Guide - Language and Voice Support | None
silence_duration | Integer | No | Silence duration at the end of the sentence, range [0, 10000], in ms | 125
enable_timestamp | Boolean | No | Set to true to return timestamps for the original text. Note: multiple consecutive punctuation marks or spaces in the original text are still processed, and this does not affect the continuity of the timestamps | false

Send Example:

{
    "header":{
        "namespace":"SpeechSynthesizer",
        "name":"StartSynthesis"
    },
    "payload":{
        "lang_type":"zh-cmn-Hans-CN",
        "voice":"",
        "text":"今天是星期三。",
        "format":"pcm",
        "sample_rate":16000,
        "volume":1,
        "speech_rate":1,
        "pitch_rate":1,
        "enable_timestamp":true
    }
}
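
A request like the one above can be assembled programmatically. The sketch below validates the documented ranges and options before serializing the message; the helper name build_start_message and the choice to validate on the client side are illustrative assumptions, not part of the service.

```python
import json

# Ranges and options copied from the payload parameter table.
RANGES = {
    "speech_rate": (0.2, 3),
    "volume": (0.1, 3),
    "pitch_rate": (0.1, 3),
    "silence_duration": (0, 10000),
}
SAMPLE_RATES = {8000, 16000, 24000}
FORMATS = {"wav", "pcm", "mp3"}


def build_start_message(text: str, lang_type: str, **options) -> str:
    """Return a StartSynthesis request serialized as a JSON string."""
    if len(text.encode("utf-8")) > 1024:
        raise ValueError("text exceeds 1024 bytes in UTF-8")
    for key, (low, high) in RANGES.items():
        if key in options and not low <= options[key] <= high:
            raise ValueError(f"{key} must be within [{low}, {high}]")
    if options.get("sample_rate", 24000) not in SAMPLE_RATES:
        raise ValueError("sample_rate must be 8000, 16000, or 24000")
    if options.get("format", "pcm") not in FORMATS:
        raise ValueError("format must be wav, pcm, or mp3")
    message = {
        "header": {"namespace": "SpeechSynthesizer", "name": "StartSynthesis"},
        "payload": {"text": text, "lang_type": lang_type, **options},
    }
    # ensure_ascii=False keeps non-ASCII text readable in the serialized JSON.
    return json.dumps(message, ensure_ascii=False)
```

For example, build_start_message("今天是星期三。", "zh-cmn-Hans-CN", format="pcm", sample_rate=16000) produces a message with the same header as the example above.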

Return Parameter (header object):

Parameter | Type | Description
--- | --- | ---
namespace | String | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis
name | String | Event name: SynthesisStarted indicates the start phase
status | String | Status code
status_text | String | Description of the status code
app_id | String | Application ID
task_id | String | Globally unique task ID; please record this value for troubleshooting
message_id | String | ID of the current message

Return example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisStarted",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-6cbe-40ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
        "message_id": "f0cfbfa3-ed78-400b-9c62-2a904ea48961"
    },
    "payload": {}
}

3. Receiving Synthesized Audio Data

The server will continuously return the synthesized audio binary data, audio duration, and timestamp information.

Note: Timestamp information and audio duration are only returned when enable_timestamp is enabled.

During this phase, the server will ignore all subsequent request messages within the current connection. To send another synthesis request, establish a new connection.
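
The JSON events in this exchange (SynthesisStarted, SynthesisDuration, SynthesisTimestamp, SynthesisCompleted) share the same header fields, so one helper can validate all of them. The sketch below is hedged: the name parse_event is hypothetical, and it assumes that a status other than "000000" signals a failure, which mirrors the rule stated for the HTTP interface.

```python
import json

SUCCESS = "000000"  # status value shown in the examples in this document


def parse_event(raw: str, expected_name: str) -> dict:
    """Parse a JSON event and verify its status and event name."""
    message = json.loads(raw)
    header = message["header"]
    if header["status"] != SUCCESS:
        # Include task_id because the documentation asks to record it.
        raise RuntimeError(
            f"task {header.get('task_id')}: "
            f"{header['status']} {header.get('status_text')}"
        )
    if header["name"] != expected_name:
        raise RuntimeError(f"expected {expected_name}, got {header['name']}")
    return message
```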

Audio Duration Information:

Return Parameter (header object):

Parameter | Type | Description
--- | --- | ---
namespace | String | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis
name | String | Event name: SynthesisDuration indicates audio duration information
status | String | Status code
status_text | String | Description of the status code
app_id | String | Application ID
task_id | String | Globally unique task ID; please record this value for troubleshooting
message_id | String | ID of the current message

Return Parameter (payload object):

Parameter | Type | Description
--- | --- | ---
duration | String | Length of the audio, in ms

Return Example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisDuration",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-6cbe-40ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-0019-35cc-be0d-a85643b104d9",
        "message_id": "259e92ca-24f4-4958-9fe1-aa93640fd4dd"
    },
    "payload": {
        "duration": "1305"
    }
}

Timestamp Information:

Return Parameter (header object):

Parameter | Type | Description
--- | --- | ---
namespace | String | Namespace of the message: SpeechSynthesizer indicates Speech Synthesis
name | String | Event name: SynthesisTimestamp indicates timestamp information
status | String | Status code
status_text | String | Description of the status code
app_id | String | Application ID
task_id | String | Globally unique task ID; please record this value for troubleshooting
message_id | String | ID of the current message

Return Parameter (payload object):

Parameter | Type | Description
--- | --- | ---
timestamp | String | Timestamp information, serialized as a JSON string
├─words | List<Word> | Word-level timestamp information
├─phonemes | List<Phoneme> | Phoneme-level timestamp information

Each entry in words (word-level timestamp information) has the following fields:

Parameter | Type | Description
--- | --- | ---
word | String | Text
start_time | Float | Start time, in s
end_time | Float | End time, in s
unit_type | String | Type: text represents plain text, and mark denotes punctuation marks

Phoneme-level timestamp information:

Parameter | Type | Description
--- | --- | ---
phone | String | Phoneme
start_time | Float | Start time, in s
end_time | Float | End time, in s

Return Example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisTimestamp",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-0cbe-56ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
        "message_id": "e70e5628-be9a-426b-87d3-a2f4527a490a"
    },
    "payload": {
        "timestamp": "{\"words\":[{\"word\":\"\",\"start_time\":0.02,\"end_time\":0.175,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.175,\"end_time\":0.355,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.355,\"end_time\":0.55,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.55,\"end_time\":0.725,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.725,\"end_time\":0.865,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.865,\"end_time\":1.145,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":1.145,\"end_time\":1.305,\"unit_type\":\"mark\"}],\"phonemes\":[{\"phone\":\"C0j\",\"start_time\":0.02,\"end_time\":0.085},{\"phone\":\"C0in\",\"start_time\":0.085,\"end_time\":0.175},{\"phone\":\"C0t\",\"start_time\":0.175,\"end_time\":0.25},{\"phone\":\"C0ian\",\"start_time\":0.25,\"end_time\":0.355},{\"phone\":\"C0sh\",\"start_time\":0.355,\"end_time\":0.485},{\"phone\":\"C0iii\",\"start_time\":0.485,\"end_time\":0.55},{\"phone\":\"C0x\",\"start_time\":0.55,\"end_time\":0.655},{\"phone\":\"C0ing\",\"start_time\":0.655,\"end_time\":0.725},{\"phone\":\"C0q\",\"start_time\":0.725,\"end_time\":0.825},{\"phone\":\"C0i\",\"start_time\":0.825,\"end_time\":0.865},{\"phone\":\"C0s\",\"start_time\":0.865,\"end_time\":0.965},{\"phone\":\"C0an\",\"start_time\":0.965,\"end_time\":1.145},{\"phone\":\"\",\"start_time\":1.145,\"end_time\":1.305}]}"
    }
}
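
In the example above, the timestamp value is itself a JSON string, so it has to be decoded a second time after the outer message is parsed. A minimal sketch follows; the Word dataclass and the parse_words helper are illustrative names.

```python
import json
from dataclasses import dataclass


@dataclass
class Word:
    word: str
    start_time: float  # seconds
    end_time: float    # seconds
    unit_type: str     # "text" or "mark"


def parse_words(payload: dict) -> list[Word]:
    """Decode payload["timestamp"] (a JSON string) into Word records."""
    timestamp = json.loads(payload["timestamp"])
    return [
        Word(
            word=item["word"],
            start_time=item["start_time"],
            end_time=item["end_time"],
            unit_type=item["unit_type"],
        )
        for item in timestamp["words"]
    ]
```

Filtering on unit_type == "text" then leaves only the spoken units and drops punctuation marks.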

4. Speech Synthesis Complete

When speech synthesis is complete, the server sends a SynthesisCompleted event and then closes the connection automatically.

Return Example:

{
    "header": {
        "namespace": "SpeechSynthesizer",
        "name": "SynthesisCompleted",
        "status": "000000",
        "status_text": "Success",
        "app_id": "33a7d8c6-0cbe-56ff7-95f3-0a89bf6eb1fb",
        "task_id": "c43bdfda-6319-49cc-be8d-a89852b104d9",
        "message_id": "031aab78-c7ab-45ac-8f4c-4dfaf45855b6"
    },
    "payload": {}
}
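
The whole receive phase can be sketched as a loop that gathers audio until the completion event arrives. This sketch makes two assumptions that the text does not state explicitly: audio chunks arrive as binary WebSocket frames and events as text frames, and a status other than "000000" means failure. The frames argument stands in for whatever message iterator the WebSocket client library provides; collect_audio is a hypothetical name.

```python
import json
from typing import Iterable, Union


def collect_audio(frames: Iterable[Union[bytes, str]]) -> bytes:
    """Concatenate audio chunks until the SynthesisCompleted event arrives."""
    audio = bytearray()
    for frame in frames:
        if isinstance(frame, (bytes, bytearray)):
            audio.extend(frame)  # binary frame: a chunk of synthesized audio
            continue
        header = json.loads(frame)["header"]  # text frame: a JSON event
        if header["status"] != "000000":
            raise RuntimeError(
                f"synthesis failed: {header['status']} {header.get('status_text', '')}"
            )
        if header["name"] == "SynthesisCompleted":
            return bytes(audio)
    # The iterator ended without a completion event.
    raise ConnectionError("connection closed before SynthesisCompleted")
```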

HTTP Interface (non-streaming)

Send the text in a POST request; after all content has been synthesized, the server returns all the audio data at once.

Call Constraint

  • The input text must be encoded in UTF-8
  • The input text must not exceed 1024 bytes, which is approximately 300 Chinese characters.

Interaction Process

The client sends an HTTP POST request with text content to the server, and the server returns an HTTP response containing synthesized speech data.

HTTP Request Line

Request Header

The HTTP request header consists of key/value pairs, one pair per line, with the key and value separated by a colon (:). Set the following:

Name | Type | Required | Description
--- | --- | --- | ---
Content-Type | String | Yes | Must be "application/json", indicating that the data in the HTTP body is in JSON format

Request Parameter

The client sends a speech synthesis request with the parameters set in the request body. The parameters are as follows:

Name | Type | Required | Description | Default Value
--- | --- | --- | --- | ---
text | String | Yes | Text to be synthesized; length limit: 1024 bytes (UTF-8 encoding) | None
lang_type | String | Yes | Language option; refer to Development Guide - Language and Voice Support | None
voice | String | No | Voice ID; refer to Development Guide - Language and Voice Support | Japanese: Yuko; English: Julie; Chinese: Xiaohui
sample_rate | Integer | No | Audio sample rate; options are 8000, 16000, 24000 | 24000
format | String | No | Audio encoding format: wav / pcm / mp3. Note: wav does not support streaming | pcm
speech_rate | Float | No | Speech rate, range [0.2, 3]; one decimal place is usually sufficient | 1
volume | Float | No | Volume, range [0.1, 3]; one decimal place is usually sufficient | 1
pitch_rate | Float | No | Pitch rate, range [0.1, 3]; one decimal place is usually sufficient | 1
emotion | String | No | Emotion/style; refer to Development Guide - Language and Voice Support | None
silence_duration | Integer | No | Silence duration at the end of the sentence, in ms | 125
enable_timestamp | Boolean | No | Set to true to return timestamps for the original text. Note: multiple consecutive punctuation marks or spaces in the original text are still processed, and this does not affect the continuity of the timestamps | false

Request Body Example

curl --location --request POST 'https://api.voice.dolphin-ai.jp/v1/tts/ws' \
--header 'Content-Type: application/json' \
--data '{
    "text":"今天天气不错。",
    "lang_type":"zh-cmn-Hans-CN",
    "format":"wav"
}'
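
The same request can be prepared with the Python standard library. In this sketch the URL is copied verbatim from the curl example above, and build_request is a hypothetical helper name. Like the curl example, it sets only the Content-Type header; any authentication header the service requires would need to be added.

```python
import json
import urllib.request

# URL copied verbatim from the curl example above.
TTS_URL = "https://api.voice.dolphin-ai.jp/v1/tts/ws"


def build_request(text: str, lang_type: str, **options) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for speech synthesis."""
    body = json.dumps(
        {"text": text, "lang_type": lang_type, **options},
        ensure_ascii=False,
    ).encode("utf-8")  # the service requires UTF-8 encoded text
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request; the response body can then be read with json.load.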

Response Result

The Content-Type of the response is application/json, and the result is carried in the response body. The fields of the result are as follows:

Name | Type | Description
--- | --- | ---
status | String | Status code
message | String | Status code description
data | Object |
├─task_id | String | Globally unique task ID; please record this value for troubleshooting
├─result | String | Base64-encoded audio data
├─duration | String | Length of the audio, in ms
├─timestamp | String | Timestamp information, returned only when enable_timestamp is enabled
├─├─words | List<Word> | Word-level timestamp information
├─├─├─word | String | Text
├─├─├─start_time | Float | Start time, in s
├─├─├─end_time | Float | End time, in s
├─├─├─unit_type | String | Type: text represents plain text, and mark denotes punctuation marks
├─├─phonemes | List<Phoneme> | Phoneme-level timestamp information
├─├─├─phone | String | Phoneme
├─├─├─start_time | Float | Start time, in s
├─├─├─end_time | Float | End time, in s

Successful Response

A status field of 000000 in the body indicates a successful response. The result field within data holds the synthesized audio as Base64-encoded data; decoding it yields the audio file. Example of a successful response:

{
    "status": "000000",
    "message": "Success",
    "data": {
        "task_id": "569ba392-f7d4-4f60-8685-b572a821126d",
        "duration": "1100",
        "result": "//PoxAAAAAAAAAAA...",
        "timestamp": "{\"words\":[{\"word\":\"\",\"start_time\":0.02,\"end_time\":0.17,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.17,\"end_time\":0.345,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.345,\"end_time\":0.515,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.515,\"end_time\":0.66,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.66,\"end_time\":0.775,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":0.775,\"end_time\":1.055,\"unit_type\":\"text\"},{\"word\":\"\",\"start_time\":1.055,\"end_time\":1.095,\"unit_type\":\"mark\"}],\"phonemes\":[{\"phone\":\"C0j\",\"start_time\":0.02,\"end_time\":0.08},{\"phone\":\"C0in\",\"start_time\":0.08,\"end_time\":0.17},{\"phone\":\"C0t\",\"start_time\":0.17,\"end_time\":0.235},{\"phone\":\"C0ian\",\"start_time\":0.235,\"end_time\":0.345},{\"phone\":\"C0t\",\"start_time\":0.345,\"end_time\":0.43},{\"phone\":\"C0ian\",\"start_time\":0.43,\"end_time\":0.515},{\"phone\":\"C0q\",\"start_time\":0.515,\"end_time\":0.625},{\"phone\":\"C0i\",\"start_time\":0.625,\"end_time\":0.66},{\"phone\":\"C0b\",\"start_time\":0.66,\"end_time\":0.73},{\"phone\":\"C0u\",\"start_time\":0.73,\"end_time\":0.775},{\"phone\":\"C0c\",\"start_time\":0.775,\"end_time\":0.9},{\"phone\":\"C0uo\",\"start_time\":0.9,\"end_time\":1.055},{\"phone\":\"\",\"start_time\":1.055,\"end_time\":1.095}]}"
    }
}
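
Decoding the result field can be sketched as follows, assuming the response body has already been parsed into a dictionary. The helper names decode_audio and save_audio are illustrative.

```python
import base64


def decode_audio(response: dict) -> bytes:
    """Return the raw audio bytes carried in data.result."""
    return base64.b64decode(response["data"]["result"])


def save_audio(response: dict, path: str) -> int:
    """Write the decoded audio to path and return the number of bytes written."""
    with open(path, "wb") as f:
        return f.write(decode_audio(response))
```

The file extension should match the format requested (wav, pcm, or mp3), since the response itself does not name the encoding.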

Failed Response

In the body, if the status field is not 000000, it indicates a failed response. This field can be used as an indicator of whether the response is successful or not. Example of a failed response:

{
    "status": "300000",
    "message": "lang_type Invalid Parameter",
    "data": {
        "task_id": "a9017448-efa3-4159-9b95-770d81859ed9",
        "duration": "",
        "result": "",
        "timestamp": ""
    }
}
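
A client can turn the status check into an exception so that failures are not mistaken for empty audio. This is a sketch only; TTSError and raise_for_status are hypothetical names, not part of the service.

```python
class TTSError(Exception):
    """Raised when the status field of a response is not "000000"."""

    def __init__(self, status, message, task_id):
        super().__init__(f"{status}: {message} (task_id={task_id})")
        self.status = status
        self.message = message
        self.task_id = task_id  # worth logging for troubleshooting


def raise_for_status(response: dict) -> dict:
    """Return response["data"] on success; raise TTSError otherwise."""
    if response.get("status") != "000000":
        data = response.get("data") or {}
        raise TTSError(
            response.get("status"), response.get("message"), data.get("task_id")
        )
    return response["data"]
```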