Product List
Speech-to-Text
Speech-to-Text (also called Automatic Speech Recognition, ASR) is a service that converts spoken audio into text in the corresponding language. With the Speech-to-Text services, you can integrate speech recognition into your applications with little effort.
The Speech-to-Text category includes the following products:
- One-sentence Recognition: Audio data is sent to the Speech-to-Text service in an HTTP or WebSocket request. Each request accepts at most 60 seconds of audio and does not break the audio into sentences automatically. The HTTP API returns the final recognition result in a single response; the WebSocket API returns intermediate and final results while audio is still being sent, so results can be displayed in real time.
- Real-time Speech Transcription: Audio data is transcribed over a duplex WebSocket stream. The theoretical upper limit on audio duration is 37 hours, and sentences are broken automatically. Intermediate and final results are returned while audio is still being sent, so results can be displayed in real time.
- Audio File Transcription: An audio file (or audio URL) is sent to the Speech-to-Text service in an HTTP request, which creates an asynchronous transcription task. Once the task is created, its progress can be checked through a query API, and the final transcription result is available if the task completes normally. An audio file must not exceed 10 hours in duration and must be smaller than 2 GB.
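The create-then-query workflow of Audio File Transcription can be sketched as a simple polling loop. This is a minimal illustration, not the service's client: the status names, the `status` and `result` keys, and the `query_task` callable are all hypothetical stand-ins, since this page does not specify the query API's request or response format.

```python
import time
from typing import Callable, Optional

# Hypothetical status names; the real task states are not given on this page.
RUNNING, SUCCEEDED, FAILED = "RUNNING", "SUCCEEDED", "FAILED"


def wait_for_transcription(
    query_task: Callable[[], dict],
    poll_interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a transcription task until it ends.

    Returns the transcript on success and None if the task failed.
    `query_task` stands in for one call to the query API and is assumed
    to return a dict with a "status" key and, on success, a "result" key.
    """
    waited = 0.0
    while waited <= timeout_s:
        task = query_task()
        if task["status"] == SUCCEEDED:
            return task["result"]
        if task["status"] == FAILED:
            return None
        # Task is still running: wait before asking again.
        sleep(poll_interval_s)
        waited += poll_interval_s
    raise TimeoutError("transcription task did not finish in time")
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays; in an application the default `time.sleep` is usually enough.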
A brief comparison of each product is shown in the table below.
| Item/Product | One-sentence Recognition | Real-time Speech Transcription | Audio File Transcription |
|---|---|---|---|
| Function | Recognizes short audio and returns the result in one response or incrementally during processing. | Transcribes streaming audio and returns results incrementally during processing. | Transcribes audio files into scripts or subtitles and returns the full result once processing is complete. |
| Interface Protocol | HTTP and WebSocket | WebSocket | HTTP |
| Audio Limit | 60 seconds | Theoretical upper limit of about 37 hours | 10 hours, below 2 GB |
| Supported Formats | WAV/PCM | WAV/PCM | WAV/PCM/OPUS/MP3/MP4/M4A, etc. |
| Automatic Sentence Breaking | No | Yes | Yes |
| Word Information/ITN and Other Practical Features | Yes | Yes | Yes |
| Typical Usage Scenarios | Voice assistants | Real-time subtitles | Audio file transcription, video subtitle generation |
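The limits in the table can be turned into a small selection helper. The sketch below is a simplification under stated assumptions: the function name and parameters are hypothetical, "2 GB" is read as 2 × 10^9 bytes, and format support and sentence breaking are ignored. Only the 60-second, 37-hour, 10-hour, and 2 GB figures come from the table.

```python
ONE_SENTENCE_MAX_S = 60          # One-sentence Recognition limit
REALTIME_MAX_S = 37 * 3600       # theoretical limit for Real-time Speech Transcription
FILE_MAX_S = 10 * 3600           # Audio File Transcription duration limit
FILE_MAX_BYTES = 2 * 1000**3     # assumption: "2 GB" means 2 * 10^9 bytes


def choose_product(duration_s: float, live: bool, size_bytes: int = 0) -> str:
    """Pick a Speech-to-Text product from audio duration and delivery mode.

    `live` is True for audio streamed as it is captured, False for a
    stored file. Raises ValueError when no product fits the limits.
    """
    if duration_s <= ONE_SENTENCE_MAX_S:
        # Short audio fits One-sentence Recognition over HTTP or WebSocket.
        return "One-sentence Recognition"
    if live:
        if duration_s <= REALTIME_MAX_S:
            return "Real-time Speech Transcription"
        raise ValueError("stream exceeds the 37-hour theoretical limit")
    if duration_s <= FILE_MAX_S and size_bytes < FILE_MAX_BYTES:
        return "Audio File Transcription"
    raise ValueError("file exceeds the 10-hour or 2 GB limit")
```

For example, `choose_product(3600, live=True)` selects Real-time Speech Transcription, while the same duration with `live=False` selects Audio File Transcription.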
Voiceprint Recognition
Voiceprint recognition (VPR) is a service that identifies a speaker by analyzing voice audio. With the voiceprint recognition services, you can quickly add secure and efficient voice-based authentication to your application.
The voiceprint recognition service provides functions to register, verify, and deregister voiceprints. For verification, you upload a voice audio clip, which is compared with the voiceprint records in the voiceprint library to identify the speaker in the clip.
Voiceprint verification includes two modes:
- 1-v-1 Verification: An audio clip is compared with a specific voiceprint record in the voiceprint library, and the match score is returned.
- 1-v-N Verification: An audio clip is compared with all the voiceprint records in the voiceprint library, and the three records with the highest match scores are returned.
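The difference between the two modes can be illustrated with a short sketch. It assumes the match scores have already been computed and are given as a plain dictionary keyed by a hypothetical voiceprint ID; how the service computes scores, and the shape of its actual responses, are not described on this page.

```python
from typing import Dict, List, Tuple


def verify_one_to_one(scores: Dict[str, float], voiceprint_id: str) -> float:
    """1-v-1: return the match score against one specific voiceprint record."""
    return scores[voiceprint_id]


def verify_one_to_n(scores: Dict[str, float], top_n: int = 3) -> List[Tuple[str, float]]:
    """1-v-N: return the top_n records with the highest match scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical scores for one audio clip against four registered voiceprints.
scores = {"alice": 0.91, "bob": 0.42, "carol": 0.77, "dave": 0.15}
one_to_one = verify_one_to_one(scores, "carol")
top_three = verify_one_to_n(scores)
```

Here `one_to_one` holds the single score for `carol`, while `top_three` holds the three best-matching records in descending score order.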