Product List

Speech-to-Text

Speech-to-Text, also called Automatic Speech Recognition (ASR), is a service that converts spoken audio into text in the corresponding language. With the Speech-to-Text service, you can integrate speech recognition into your applications.

The Speech-to-Text category includes the following products:

One-sentence Recognition: Audio data is sent to the Speech-to-Text service via an HTTP or WebSocket request. Each request accepts at most 60 seconds of audio and does not break sentences automatically. With the HTTP API, the final recognition result is returned all at once; with the WebSocket API, intermediate and final results are returned while data is still being sent, so recognized text can be displayed in real time.
Real-time Speech Transcription: Audio data is transcribed over a full-duplex WebSocket stream. The theoretical upper limit on audio duration is 37 hours, and sentences are broken automatically. Intermediate and final results are returned while data is still being sent, so recognized text can be displayed in real time.
Audio File Transcription: Audio files (or audio URLs) are sent to the Speech-to-Text service via an HTTP request to create an asynchronous transcription task. After the task is created, its progress can be checked through a query API, and the final transcription result can be retrieved once the task finishes normally. An audio file must not exceed 10 hours in duration and must be smaller than 2 GB.
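
The asynchronous flow of Audio File Transcription (create a task, then query it until it ends) can be sketched as a simple polling loop. This is only a minimal sketch: the `query_task` callable, the `status` field, and the status names `SUCCEEDED` and `FAILED` are hypothetical placeholders, not the service's actual API, so substitute the names from the API reference.

```python
import time
from typing import Callable, Dict

# Hypothetical status names; the real query API may use different values.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def wait_for_transcription(
    query_task: Callable[[str], Dict],
    task_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call query_task(task_id) repeatedly until the task reaches a terminal status.

    query_task is supplied by the caller (for example, a thin wrapper around
    the HTTP query API) and must return a dict with a "status" key.
    """
    waited = 0.0
    while True:
        task = query_task(task_id)
        if task["status"] in TERMINAL_STATUSES:
            return task
        if waited >= timeout_s:
            raise TimeoutError(f"task {task_id} did not finish within {timeout_s} s")
        sleep(interval_s)
        waited += interval_s
```

Passing the query function and the sleep function as parameters keeps the loop independent of any particular HTTP client and makes it easy to exercise without a network connection.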

A brief comparison of each product is shown in the table below.

Item/Product	One-sentence Recognition	Real-time Speech Transcription	Audio File Transcription
Function	Recognizes short audio and returns the result all at once (HTTP) or incrementally during processing (WebSocket).	Transcribes streaming audio and returns results incrementally during processing.	Transcribes audio files into scripts or subtitles and returns the result all at once after processing is complete.
Interface Protocol	HTTP and WebSocket	WebSocket	HTTP
Audio Limit	60 seconds	Theoretical upper limit of about 37 hours	10 hours, below 2 GB
Supported Formats	WAV/PCM	WAV/PCM	WAV/PCM/OPUS/MP3/MP4/M4A, etc.
Automatic Sentence Breaking	No	Yes	Yes
Word Information, ITN (Inverse Text Normalization), and Other Practical Features	Yes	Yes	Yes
Typical Usage Scenarios	Voice assistants	Real-time subtitles	Audio file transcription, video subtitle generation
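
As a rough rule of thumb, the limits in the comparison can be turned into a small selection helper. This is a sketch under stated assumptions: the function name and parameters are hypothetical, "2 GB" is interpreted as 2 GiB, and audio format support is not checked.

```python
# Limits taken from the product comparison.
ONE_SENTENCE_MAX_S = 60
REALTIME_MAX_S = 37 * 3600        # theoretical upper limit, about 37 hours
FILE_MAX_S = 10 * 3600
FILE_MAX_BYTES = 2 * 1024 ** 3    # assumption: "2 GB" read as 2 GiB


def choose_product(
    duration_s: float,
    streaming: bool,
    size_bytes: int = 0,
    needs_sentence_breaking: bool = False,
) -> str:
    """Suggest a Speech-to-Text product for the given audio.

    streaming=True means the audio arrives live; False means a complete file.
    """
    short = duration_s <= ONE_SENTENCE_MAX_S and not needs_sentence_breaking
    if short:
        # One-sentence Recognition works over HTTP or WebSocket.
        return "One-sentence Recognition"
    if streaming:
        if duration_s <= REALTIME_MAX_S:
            return "Real-time Speech Transcription"
        raise ValueError("stream exceeds the real-time duration limit")
    if duration_s <= FILE_MAX_S and size_bytes < FILE_MAX_BYTES:
        return "Audio File Transcription"
    raise ValueError("file exceeds the 10-hour or 2 GB limit")
```

For example, `choose_product(3600, streaming=True)` suggests Real-time Speech Transcription, while a one-hour recorded file suggests Audio File Transcription.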

Voiceprint Recognition

Voiceprint recognition (VPR) is a service that identifies a speaker from voice audio. With the voiceprint recognition service, you can add secure, efficient speaker authentication to your application.

The voiceprint recognition service supports registering, verifying, and deregistering voiceprints. For verification, you upload a voice audio clip, which is compared with the voiceprint records in the voiceprint library to identify the speaker in the clip.

Voiceprint verification includes two modes:

1-v-1 Verification: An audio clip is compared with one specific voiceprint record in the voiceprint library, and the match score is returned.
1-v-N Verification: An audio clip is compared with all voiceprint records in the voiceprint library, and the three records with the highest match scores are returned.
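
The difference between the two modes can be illustrated with a toy example. The service's actual model and scoring method are not described here, so the sketch below assumes, purely for illustration, that each voiceprint is a fixed-length embedding vector and that the match score is cosine similarity; all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (assumed stand-in for the match score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify_one_to_one(
    probe: Sequence[float],
    library: Dict[str, Sequence[float]],
    speaker_id: str,
) -> float:
    """1-v-1: score the probe against one specific voiceprint record."""
    return cosine_similarity(probe, library[speaker_id])


def verify_one_to_n(
    probe: Sequence[float],
    library: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """1-v-N: score the probe against every record and keep the top_k matches."""
    scores = [(sid, cosine_similarity(probe, vec)) for sid, vec in library.items()]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]
```

In this sketch, 1-v-1 answers "is this the claimed speaker?" with a single score, while 1-v-N answers "who is this?" with a ranked shortlist. The caller still has to decide what score threshold counts as a match.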
