Built by Metorial, the integration platform for agentic AI.
Provider Summary
transcribe audio to text
synthesize speech from text
stream real-time transcription
identify speakers via diarization
adapt models with phrases
generate subtitle captions
manage recognizer configurations
select synthesis voice types
recognize multi-channel audio
synthesize long-form audio
Transcribe audio to text and synthesize natural-sounding speech from text using Google's neural network models. Perform synchronous, asynchronous (batch), and streaming speech-to-text recognition across 125+ languages. Create and manage recognizer configurations to reuse transcription settings. Adapt speech models with custom phrase sets, custom classes, and boost values to improve accuracy on domain-specific vocabulary. Identify distinct speakers via speaker diarization and recognize multi-channel audio. Generate subtitle/caption output in SRT format. Synthesize text or SSML into audio using Standard, WaveNet, Neural2, Studio, and Chirp 3 HD voices with configurable pitch, speaking rate, volume, and encoding. Produce long-form audio content asynchronously.
Start an asynchronous batch transcription of one or more audio files stored in Google Cloud Storage (GCS). Returns a long-running operation that can be monitored with the Get Operation tool. Suitable for audio files longer than 1 minute and up to 8 hours. Results can be written to a GCS output location or returned inline when the operation completes.
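The duration limits above (synchronous recognition for clips up to 1 minute, batch for longer audio up to 8 hours) suggest a simple routing rule. The following is a minimal sketch of that rule; the function name `choose_transcription_mode` and the string labels are hypothetical and not part of this integration.

```python
# Limits taken from the tool descriptions in this document.
SYNC_LIMIT_SECONDS = 60              # synchronous recognition: up to 1 minute
BATCH_LIMIT_SECONDS = 8 * 60 * 60    # batch recognition: up to 8 hours


def choose_transcription_mode(duration_seconds: float) -> str:
    """Return 'sync' or 'batch' for an audio clip of the given length.

    Hypothetical helper: the labels are illustrative and do not correspond
    to real tool identifiers.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "sync"
    if duration_seconds <= BATCH_LIMIT_SECONDS:
        return "batch"
    raise ValueError("audio exceeds the 8-hour batch limit; split it first")
```

A caller would typically read the clip duration from file metadata before deciding which tool to invoke.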
Check the status and retrieve the results of a long-running Speech-to-Text operation. Use this to monitor batch transcription jobs started with the Batch Transcribe Audio tool. Returns the current status and, once the operation completes, the full transcription results or error details.
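Monitoring a long-running operation usually amounts to a polling loop. The sketch below assumes the operation is a dictionary with a boolean `done` flag and either a `response` or an `error` entry, which is the general shape of Google long-running operations. The `fetch` callable is a hypothetical stand-in for whatever actually invokes the Get Operation tool.

```python
import time
from typing import Any, Callable, Dict


def wait_for_operation(
    fetch: Callable[[str], Dict[str, Any]],
    operation_name: str,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> Dict[str, Any]:
    """Poll until the operation reports done, then return its response.

    `fetch` is a hypothetical callable that takes an operation name and
    returns the operation as a dict. The 'done', 'response', and 'error'
    keys are assumed, not confirmed by this document.
    """
    for _ in range(max_polls):
        operation = fetch(operation_name)
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation.get("response", {})
        time.sleep(poll_seconds)
    raise TimeoutError(f"{operation_name} did not finish after {max_polls} polls")
```

A fixed interval keeps the sketch short; for multi-hour batch jobs, a longer interval or exponential backoff may be more appropriate.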
List available Text-to-Speech voices. Optionally filter by language code to find voices for a specific language. Returns voice names, genders, supported languages, and native sample rates.
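Once the voice list is returned, a caller may want to narrow it further by gender or sample rate. The sketch below assumes each voice is a dictionary with `name`, `ssmlGender`, and `naturalSampleRateHertz` keys, modeled on the Google Text-to-Speech voice listing. The exact keys returned by this tool are not shown in this document, and `pick_voices` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def pick_voices(
    voices: List[Dict[str, Any]],
    gender: Optional[str] = None,
    min_sample_rate: int = 0,
) -> List[str]:
    """Return the sorted names of voices matching the given constraints.

    The key names ('name', 'ssmlGender', 'naturalSampleRateHertz') are
    assumptions about the shape of the returned voice entries.
    """
    matches = []
    for voice in voices:
        if gender is not None and voice.get("ssmlGender") != gender:
            continue
        if voice.get("naturalSampleRateHertz", 0) < min_sample_rate:
            continue
        matches.append(voice["name"])
    return sorted(matches)


# Made-up sample data for illustration only; these are not real voice names.
sample_voices = [
    {"name": "sample-voice-b", "ssmlGender": "FEMALE", "naturalSampleRateHertz": 24000},
    {"name": "sample-voice-a", "ssmlGender": "MALE", "naturalSampleRateHertz": 22050},
    {"name": "sample-voice-c", "ssmlGender": "FEMALE", "naturalSampleRateHertz": 16000},
]
high_rate_female = pick_voices(sample_voices, gender="FEMALE", min_sample_rate=22050)
```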
Create a named recognizer configuration for Speech-to-Text v2. A recognizer stores default settings like model, language, and recognition features so they don't need to be repeated in every transcription request.
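To illustrate what "default settings" might look like, the sketch below assembles a recognizer configuration as a plain dictionary. The field names (`defaultRecognitionConfig`, `languageCodes`, `features`, and so on) are assumptions modeled on the Speech-to-Text v2 REST resource and should be checked against the tool's actual input schema. `build_recognizer_config` is a hypothetical helper.

```python
from typing import Any, Dict, Sequence


def build_recognizer_config(
    model: str,
    language_codes: Sequence[str],
    punctuation: bool = True,
    word_offsets: bool = False,
) -> Dict[str, Any]:
    """Assemble reusable default settings for a recognizer.

    All field names are assumed from the Speech-to-Text v2 REST API and
    are not confirmed by this document.
    """
    if not language_codes:
        raise ValueError("at least one language code is required")
    return {
        "defaultRecognitionConfig": {
            "model": model,
            "languageCodes": list(language_codes),
            "features": {
                "enableAutomaticPunctuation": punctuation,
                "enableWordTimeOffsets": word_offsets,
            },
        }
    }


recognizer_body = build_recognizer_config("long", ["en-US"], word_offsets=True)
```

Keeping these defaults on the recognizer means an individual transcription request only needs to supply the audio and any per-request overrides.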
Convert text or SSML into natural-sounding speech audio using Google Cloud Text-to-Speech. Returns base64-encoded audio data in the requested format. Supports multiple voice types including Standard, WaveNet, Neural2, Studio, and Chirp 3 HD voices. Customize pitch, speaking rate, and volume.
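Because the audio comes back base64-encoded, it has to be decoded before it can be saved or played. The sketch below builds a request body and decodes a response. The field names (`input`, `voice`, `audioConfig`, `audioContent`) follow the Google Cloud Text-to-Speech REST API and are an assumption about what this tool accepts and returns. Both helper functions are hypothetical.

```python
import base64
from typing import Any, Dict


def build_synthesis_request(
    text: str,
    language_code: str,
    voice_name: str,
    encoding: str = "MP3",
    pitch: float = 0.0,
    speaking_rate: float = 1.0,
    volume_gain_db: float = 0.0,
) -> Dict[str, Any]:
    """Build a synthesis request, treating input that starts with <speak as SSML."""
    is_ssml = text.lstrip().startswith("<speak")
    return {
        "input": {"ssml": text} if is_ssml else {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": encoding,
            "pitch": pitch,
            "speakingRate": speaking_rate,
            "volumeGainDb": volume_gain_db,
        },
    }


def decode_audio(response: Dict[str, Any]) -> bytes:
    """Decode the base64 audio payload (assumed key: 'audioContent')."""
    return base64.b64decode(response["audioContent"])
```

The decoded bytes can be written directly to a file opened in binary mode, for example `open("speech.mp3", "wb")`.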
Transcribe audio to text using Google Cloud Speech-to-Text (synchronous recognition). Supports inline base64-encoded audio or audio files in Google Cloud Storage. Use for audio files up to 1 minute in duration. Configure language, model, punctuation, word-level details, speaker diarization, and speech adaptation hints.
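The two audio sources described above, inline base64 content or a Google Cloud Storage URI, are mutually exclusive. The sketch below shows one way to prepare such a request. The body layout (`config`, `content`, `uri`, `autoDecodingConfig`) is modeled on the Speech-to-Text v2 REST API and is an assumption about this tool's input. `build_recognize_request` is a hypothetical helper.

```python
import base64
from typing import Any, Dict, Optional


def build_recognize_request(
    audio_bytes: Optional[bytes] = None,
    gcs_uri: Optional[str] = None,
    language_code: str = "en-US",
    model: str = "long",
    punctuation: bool = True,
) -> Dict[str, Any]:
    """Build a synchronous recognition request from exactly one audio source.

    Field names are assumed from the Speech-to-Text v2 REST API.
    """
    if (audio_bytes is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of audio_bytes or gcs_uri")
    request: Dict[str, Any] = {
        "config": {
            "autoDecodingConfig": {},  # let the service detect the encoding
            "languageCodes": [language_code],
            "model": model,
            "features": {"enableAutomaticPunctuation": punctuation},
        }
    }
    if audio_bytes is not None:
        # Inline audio is sent as base64 text.
        request["content"] = base64.b64encode(audio_bytes).decode("ascii")
    else:
        request["uri"] = gcs_uri
    return request
```

Inline content suits short local clips; a storage URI avoids transferring the audio through the caller.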
This integration is licensed under the AGPL-3.0 License.