Openai
Generate text, images, audio, and video using large language models and multimodal AI. Create chat completions, generate and edit images from text prompts, convert text to speech, transcribe and translate audio, generate video, and create text embeddings for search and retrieval. Fine-tune models on custom training data, run evaluations to measure model performance, and moderate content against policy categories. Manage vector stores for semantic file search, upload and organize files, and submit batch processing jobs for asynchronous bulk requests. Conduct real-time speech-to-speech conversations via WebRTC or SIP. Administer organizations, projects, users, API keys, and audit logs programmatically. Receive webhook notifications for background responses, batch jobs, fine-tuning jobs, eval runs, and incoming realtime calls.
Aiml API
Unified gateway to 400+ AI/ML models for text generation, image generation, video generation, music generation, speech-to-text, text-to-speech, content moderation, 3D model generation, vision/OCR, embeddings, and AI-powered web search. Generate chat completions and reasoning with models like GPT, Claude, Gemini, DeepSeek, and Llama. Create images from text prompts using Flux, Stable Diffusion, and DALL-E. Generate videos from text or images asynchronously. Convert speech to text and text to speech in 120+ languages. Moderate content for safety classification. Generate 3D objects from text or images. Extract text and structured data from images via OCR. Produce text embeddings for semantic search. Search the web for real-time information. Create AI Assistants for customer support and data analysis. Interact in real time via WebSocket for voice and text. Receive webhook notifications for async operation completion.
Aivoov
Convert text to speech audio using 2300+ AI voices from Google, Amazon, IBM, and Microsoft. Generate audio in MP3 and WAV formats across 155+ languages. Combine multiple voices in a single request to create conversational audio. Control pitch, speaking rate, and volume per text segment using SSML. Browse and filter available voices by language.
Apipie Ai
Access hundreds of AI models from multiple providers (OpenAI, Anthropic, Google, Meta, etc.) through a unified OpenAI-compatible API. Send chat completions to language models with streaming, function calling, and structured output. Generate images, convert text to speech, analyze images with vision models, and create text embeddings. Augment responses with real-time web search grounding, upload documents for RAG-based retrieval, and reduce hallucinations via integrity checking. Discover and filter available models by type, provider, pricing, and performance. Manage routing preferences (cost or performance optimized), configure model pooling for redundancy, enable persistent conversational memory, and track API usage with detailed cost and token analytics.
Assemblyai
Transcribe pre-recorded and live audio/video to text with support for 99+ languages, speaker diarization, and multichannel audio. Apply audio intelligence models to extract summaries, sentiment analysis, entity detection, topic detection, key phrases, and content moderation from transcripts. Redact personally identifiable information from text and audio. Generate SRT/VTT subtitles and segment transcripts into paragraphs, sentences, or auto-chapters. Stream real-time speech-to-text via WebSocket connections. Upload audio/video files for processing. Manage and delete transcripts. Access an LLM gateway to apply large language models (Claude, GPT, Gemini) to transcribed speech data for summarization, Q&A, and custom analysis. Translate transcripts across 99+ languages. Receive webhook notifications when transcriptions complete or fail.
Astica Ai
Analyze images using computer vision for object detection, face detection, OCR, content moderation, tagging, and GPT-powered descriptions. Generate AI images from text prompts. Convert text to speech with 500+ voices, voice cloning, and multilingual support. Transcribe speech to text from audio files or streams. Generate natural language text using GPT-S for question answering, content creation, and diverse text generation. Upscale images using AI enhancement. Train and run custom AI models for vision and NLP tasks.
Azure Speech
Transcribe audio to text using real-time, fast, or batch transcription modes with speaker diarization and language identification. Convert text to synthesized speech using neural, custom, or personal voices with SSML control over pronunciation and prosody. Generate photorealistic avatar videos from text. Translate speech across multiple languages. Verify and identify speakers by voice characteristics. Assess pronunciation accuracy, fluency, completeness, and prosody for language learning. Supports custom speech models trained on domain-specific data and LLM-enhanced transcription for captions, meeting summaries, and call center assistance.
Bolna
Create, configure, and manage conversational Voice AI agents that make and receive phone calls. Initiate outbound calls with dynamic context variables, handle inbound calls with caller identification, and automate batch calling campaigns via CSV uploads with scheduling and auto-retry. Upload PDF documents and URLs as knowledge bases for RAG-powered conversations. Configure function calling tools for live call transfers, calendar booking, and custom API integrations. Retrieve call execution history including transcripts, recordings, cost breakdowns, and extracted data. Purchase and manage phone numbers, import or clone custom voices, and connect external LLM, TTS, ASR, and telephony providers.
Deepgram
Transcribe pre-recorded and live streaming audio to text in 45+ languages with speaker diarization, smart formatting, and keyword boosting. Convert text to natural-sounding speech with 40+ voice options. Analyze transcripts for sentiment, topics, summaries, and intents. Build conversational voice agents with integrated STT, LLM reasoning, and TTS in a single session. Manage projects, API keys, members, billing, and usage. Discover available models and their metadata. Supports asynchronous processing via callbacks for both transcription and speech synthesis.
Witai
Extract meaning from text and voice inputs using natural language processing. Detect intents, entities, and traits from text messages with confidence scores. Transcribe and analyze audio files and streams via speech recognition. Synthesize human-like speech from text. Manage multi-turn conversational flows with dialogue state and context. Detect languages from text input. Create, update, and delete entities, intents, and training utterances programmatically. Manage Wit.ai apps, export and import app configurations, and train NLU models with annotated samples.
Elevenlabs
Convert text to lifelike speech with customizable voices, intonation, and emotional awareness across 70+ languages. Transcribe speech to text with real-time streaming or batch processing. Clone, generate, and manage voices. Generate music, sound effects, and multi-speaker dialogue from text descriptions. Dub and translate audio/video content into other languages. Deploy and manage conversational voice agents with phone integration, knowledge bases, and analytics. Isolate vocals from background noise, align text to audio timestamps, and remix voice characteristics. Manage pronunciation dictionaries, access generation history, and retrieve usage statistics. Supports webhook notifications for call completions, transcription results, and voice events.
Falai
Run inference on 1,000+ generative AI models for image, video, audio, 3D, and multimodal content generation. Generate images from text or other images using models like FLUX, Stable Diffusion, Ideogram, and Recraft with support for LoRA adapters. Generate videos from text, images, or other videos using models like Veo, Sora, Kling, and LTX. Transcribe audio to text with speaker diarization and generate speech from text with voice cloning. Convert images to 3D models or generate 3D from text. Submit requests synchronously or via an asynchronous queue with polling and webhook notifications. Upload files to built-in CDN storage for use in model inputs. Discover available models, retrieve pricing, track usage, and query analytics. Manage API keys programmatically. Deploy custom models to serverless infrastructure and manage dedicated GPU compute instances.
Fireflies
Record, transcribe, and analyze meeting conversations from platforms like Zoom, Google Meet, and Webex. Retrieve, search, and manage meeting transcripts with AI-generated summaries, action items, sentiment analysis, and keywords. Upload audio files for transcription. Ask questions about meetings using the AskFred AI assistant. Add a bot to live meetings for automatic recording, pause and resume recordings, and create live action items or soundbites. Manage users and teams, organize meetings into channels, query contacts, and receive webhook notifications when transcriptions complete.
Gladia
Transcribe audio and video files to text using asynchronous or real-time streaming modes. Supports 100+ languages with automatic language detection and code-switching. Perform speaker diarization to identify different speakers. Translate transcripts into multiple target languages. Generate summaries, sentiment analysis, named entity recognition, and chapter segmentation from audio. Extract structured data and produce subtitles in SRT/VTT formats. Apply custom vocabulary and spelling corrections, content moderation, and name consistency. Send custom prompts to generate LLM-powered responses from transcripts. Receive results via polling, callback URLs, or account-level webhooks.
Google Cloud Speech
Convert audio to text transcriptions and synthesize natural-sounding speech from text using Google's neural network models. Perform synchronous, asynchronous, and streaming speech-to-text recognition across 125+ languages. Create and manage recognizer configurations for reusable transcription settings. Adapt speech models with custom phrase sets, custom classes, and boost values to improve accuracy for domain-specific vocabulary. Identify distinct speakers via speaker diarization and recognize multi-channel audio. Generate subtitle/caption output in SRT format. Synthesize text or SSML into audio using Standard, WaveNet, Neural2, Studio, and Chirp voice types with configurable pitch, speaking rate, volume, and encoding. Produce long-form audio content asynchronously.
Happy Scribe
Transcribe, subtitle, and translate audio and video files using AI-powered or human professional services. Upload media files, create transcription and subtitling orders in 120+ languages, translate transcriptions into 70+ target languages, and export results in multiple formats (TXT, DOCX, PDF, SRT, VTT, STL, XML, JSON, and more). Manage transcriptions with folders, tags, glossaries, and style guides. Configure webhooks for order status notifications.
Heygen
Generate AI avatar videos from text or audio scripts, with customizable avatars, voices, backgrounds, and multi-scene layouts. Translate videos into multiple languages with lip-sync. Create personalized video campaigns at scale using templates with dynamic variables. Stream real-time interactive avatar sessions for conversational AI experiences. Generate text-to-speech audio. Create photo avatar videos from uploaded photos. Upload and manage image, video, and audio assets. Monitor video generation status via webhooks and check account credit balances.
Recallai
Create and manage meeting bots that join video conferences on Zoom, Google Meet, Microsoft Teams, Webex, Slack Huddles, and GoTo Meeting to capture recordings, transcripts, and metadata. Schedule bots to join meetings automatically via Google Calendar and Outlook integrations. Capture meeting recordings in multiple formats (MP4, MP3, per-participant audio/video). Generate real-time and async transcripts using multiple providers (Deepgram, AssemblyAI, AWS Transcribe, Rev, Speechmatics, or platform captions). Stream real-time audio (PCM), video (PNG/H264), and RTMP. Send and read chat messages through bots. Build interactive AI agents that output speech and video into meetings. Access participant information including names, emails, join/leave events, and host status. Monitor bot status and recording lifecycle via webhooks. Support botless recording via Desktop SDK and native platform integrations (Zoom RTMS, Google Meet Media API).
Retell Ai
Build, deploy, and manage AI phone and chat agents. Create and configure voice agents with customizable voices, conversation flows, and knowledge bases. Initiate outbound phone calls, manage inbound calls, and run batch calling campaigns. Purchase and manage phone numbers with separate inbound/outbound agent assignments. Create web-based voice calls via WebRTC. Send SMS messages and manage chat sessions. Define and run batch simulation tests against agents. Monitor call concurrency, retrieve call transcripts, recordings, cost breakdowns, and post-call analysis including sentiment, summaries, and custom insights. Manage Retell LLM configurations, conversation flows, and reusable flow components. Clone and browse voices filtered by provider, gender, accent, and age. Receive real-time webhooks for call lifecycle, transcript updates, and transfer events.
Synthflow Ai
Create, configure, and manage AI-powered voice agents for automating inbound and outbound phone calls. Initiate live calls, monitor active conversations, and retrieve call history with transcripts and recordings. Upload domain knowledge bases, browse and assign voices, and provision phone numbers. Run simulations and test cases before deploying agents. Register custom actions to integrate external APIs during calls. Manage contacts, subaccounts, webhook logs, and export analytics. Supports post-call and inbound call webhooks for real-time notifications and dynamic call routing.
Vapi
Build and manage voice AI agents that make and receive phone calls and conduct web-based voice conversations. Create and configure assistants combining speech-to-text, LLM, and text-to-speech providers. Initiate outbound calls, handle inbound calls, and orchestrate multi-assistant squads with context-preserving transfers. Define node-based conversational workflows with AI-driven routing. Provision and manage phone numbers across Twilio, Vonage, Telnyx, and SIP providers. Attach tools for real-time API calls and code execution during conversations. Upload knowledge base files for retrieval-augmented generation. Run outbound call campaigns for bulk calling. Query call analytics, retrieve transcripts and recordings, and perform post-call analysis including summaries, structured data extraction, and success evaluation. Manage text-based chat sessions with an OpenAI-compatible endpoint. Receive webhook events for call status changes, transcripts, tool invocations, transfers, and conversation updates.
Elevenreader
Convert text to lifelike speech audio with customizable AI voices, models, languages, and output formats. Transcribe speech to text in real-time or batch mode. Manage, clone, and design voices from audio samples or text prompts. Generate music and sound effects from text descriptions. Dub and translate audio/video content into other languages. Isolate vocals from background noise, change voices in audio, and generate multi-speaker dialogue. Create and manage conversational AI agents with knowledge bases and tool integrations. Manage studio projects for long-form audio productions like audiobooks. Configure pronunciation dictionaries and administer workspace settings, users, and usage analytics.
Ganai
Generate personalized videos, text-to-speech audio, AI avatars, lip-synced videos, and sound effects. Create AI avatars from video or photo, then produce HD videos with synthesized speech from text scripts. Convert text to natural-sounding speech in 70+ languages including 22+ Indic languages. Generate lip-synchronized videos by combining audio with source video. Create audio sound effects from text descriptions. Run personalized video campaigns at scale by specifying per-recipient variables for bulk video generation with unique landing pages. Manage voices, avatars, projects, and workspaces. Receive webhook notifications for avatar creation, video generation, and lip-sync job completion status.
Groqcloud
Run AI inference on open-source language models with ultra-low latency using Groq's LPU hardware. Generate text via chat completions, produce structured JSON outputs, and perform function calling with built-in, remote (MCP), or local tools. Transcribe and translate audio using Whisper models, convert text to speech, and analyze images with multimodal vision models. Support chain-of-thought reasoning, content moderation with custom policies, and asynchronous batch processing for large-scale workloads. List and query available hosted models.
Jigsawstack
Scrape websites using natural language prompts to extract structured data. Perform AI-powered web search and deep research on topics. Analyze text sentiment, summarize content, translate text across 160+ languages, and check spelling. Convert natural language to SQL queries. Extract structured data from images using vision OCR (e.g., receipts, documents). Transcribe audio and video files to text with speaker diarization and language detection. Generate images from text prompts using multiple model backends. Convert HTML to PDF or images. Detect NSFW content, profanity, and spam. Classify text into custom categories. Upload, retrieve, and delete files in cloud storage. Search for addresses and places with geocoding. Generate text embeddings for semantic search. Receive webhook callbacks for long-running task results.