CallMind: An AI-Powered Real-Time Voice Agent Platform Using a Pluggable Speech-to-Text, Large Language Model, and Text-to-Speech Pipeline

Sahil Makandar; Rohini Magamdum; Aditya Mali; Sudesh Kumbhar; Radheshyam Sah; Rajendra Hiremath

doi:10.17148/IJARCCE.2026.15570

Volume 15, Issue 5, May 2026

Abstract: Intelligent voice agents offer a scalable alternative to traditional Interactive Voice Response (IVR) systems and human call centres, yet their deployment remains technically complex. This paper presents CallMind, an AI-powered voice agent platform that enables businesses and individuals to deploy conversational telephone agents within minutes. The platform implements a pluggable, channel-agnostic pipeline that normalises all input modalities—phone calls, browser audio, and direct text—into a unified QueryPayload abstraction before processing. The core intelligence pipeline routes each query through Azure Cognitive Speech Services for real-time streaming Speech-to-Text (STT), Groq API (LLaMA 3.3 70B) for large language model inference, and Azure Neural Text-to-Speech (TTS) for audio synthesis. A sentence-by-sentence streaming architecture achieves end-to-end response latency of approximately 780 ms, comparable to natural human conversational response time. The system is built on a dual-backend architecture: a managed Supabase layer for multi-tenant agent configuration, knowledge base management, and conversation persistence; and a containerised Python FastAPI server on Microsoft Azure for the real-time AI pipeline. Live validation through Twilio Programmable Voice demonstrates natural conversation quality with accurate transcription, contextually relevant responses, and seamless audio delivery. The architecture provides a clear migration path from full-context LLM prompting to Retrieval-Augmented Generation (RAG) and from Docker Compose to Kubernetes without modifying the application layer.
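The normalisation and sentence-by-sentence streaming steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the QueryPayload field names, the normalise helper, and the stream_sentences generator are all hypothetical, and a naive punctuation rule stands in for whatever sentence segmentation the platform actually uses. The idea shown is only that LLM output can be cut into complete sentences as tokens arrive, so each sentence could be handed to TTS before the full reply exists.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class QueryPayload:
    """Hypothetical unified query; the real field names are not given in the paper."""
    channel: str      # "phone", "browser", or "text"
    session_id: str
    text: str         # transcript (after STT) or direct text input


def normalise(channel: str, session_id: str, text: str) -> QueryPayload:
    """Map any input modality onto one payload shape (assumed behaviour)."""
    if channel not in {"phone", "browser", "text"}:
        raise ValueError(f"unsupported channel: {channel}")
    # Collapse runs of whitespace so downstream stages see uniform text.
    return QueryPayload(channel, session_id, " ".join(text.split()))


# Naive boundary: whitespace that follows '.', '!' or '?'.
# Abbreviations such as "Dr. Smith" would be split incorrectly.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they can be cut from a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever remains when the stream ends


if __name__ == "__main__":
    payload = normalise("phone", "call-001", "  what are   your opening hours ")
    print(payload)
    # Stand-in for streamed LLM tokens; no real API is called here.
    fake_tokens = ["We open", " at nine.", " We close", " at five.", " Thanks"]
    for sentence in stream_sentences(fake_tokens):
        print("to TTS:", sentence)
```

In this sketch a sentence is released only once the whitespace that follows its terminal punctuation arrives, so the final fragment is flushed when the stream ends. A production system would likely need a more robust segmenter and per-sentence TTS requests issued concurrently with ongoing LLM generation.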

Keywords: Voice Agent, Speech-to-Text, Text-to-Speech, Large Language Model, Real-Time Pipeline, WebSocket, Twilio, FastAPI, Docker, Pluggable Architecture, Conversational AI, Redis, Supabase, Retrieval-Augmented Generation

How to Cite:

[1] Sahil Makandar, Rohini Magamdum, Aditya Mali, Sudesh Kumbhar, Radheshyam Sah, Rajendra Hiremath, “CallMind: An AI-Powered Real-Time Voice Agent Platform Using a Pluggable Speech-to-Text, Large Language Model, and Text-to-Speech Pipeline,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2026.15570

This work is licensed under a Creative Commons Attribution 4.0 International License.