How to Build a Voice AI Agent from Scratch
Building a voice AI agent from scratch in 2026 is far more accessible than it sounds. The pieces — LLM, text-to-speech, speech-to-text, and telephony — are all available as APIs. Your job is to wire them together intelligently. This tutorial walks you through the complete architecture.
The Architecture of a Voice AI Agent
A voice AI agent has five components working in real time:
- Telephony layer — Handles the phone call (Twilio, Retell's built-in, or VAPI)
- Speech-to-text (STT) — Converts spoken audio to text (Deepgram, Whisper)
- LLM (the brain) — Processes the text and generates a response (GPT-4o, Claude)
- Text-to-speech (TTS) — Converts the LLM response to audio (ElevenLabs, Cartesia)
- Orchestration — Manages the flow, timing, and data capture
The total latency from user speech ending to AI response beginning needs to be under 700ms for a natural conversation. Platform selection directly affects this.
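To make the 700ms target concrete, here is a minimal sketch of one caller turn passing through STT, LLM, and TTS, with each stage timed. The three stage functions are hypothetical stand-ins (they return canned values instead of calling Deepgram, an LLM, or ElevenLabs), and the sketch runs the stages strictly in sequence. Production systems usually stream between stages, so treat the sum as a simple upper-bound model rather than a real measurement.

```python
import time
from typing import Callable

LATENCY_BUDGET_MS = 700  # target from the text: end of user speech to start of AI audio


def transcribe(audio: bytes) -> str:
    """Hypothetical stand-in for an STT call. Returns a fixed transcript."""
    return "I'd like to book an oil change"


def generate_reply(transcript: str) -> str:
    """Hypothetical stand-in for an LLM call. Returns a canned response."""
    return f"Sure, I can help with that. You said: {transcript}"


def synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a TTS call. Returns fake audio bytes."""
    return text.encode("utf-8")


def timed(stage: Callable, arg):
    """Run one stage and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = stage(arg)
    return result, (time.perf_counter() - start) * 1000


def handle_turn(audio: bytes) -> tuple[bytes, dict[str, float]]:
    """Run STT -> LLM -> TTS for one caller turn and record per-stage latency."""
    timings: dict[str, float] = {}
    transcript, timings["stt"] = timed(transcribe, audio)
    reply, timings["llm"] = timed(generate_reply, transcript)
    speech, timings["tts"] = timed(synthesize, reply)
    timings["total"] = timings["stt"] + timings["llm"] + timings["tts"]
    return speech, timings


def within_budget(timings: dict[str, float], budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True when the summed stage latency fits inside the budget."""
    return timings["total"] <= budget_ms
```

Swapping the stand-ins for real API calls and logging `timings` per turn shows which stage is consuming the budget.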
Platform Options
Option A: Retell AI (Recommended for Beginners)
Retell handles STT, TTS, telephony, and orchestration. You bring your LLM key and a system prompt. Latency: 400-600ms. Best for: agencies building voice agents without deep technical expertise.
Option B: VAPI (Recommended for Advanced Builders)
VAPI is more configurable. You can swap every component (different STT, different TTS, custom LLM). Latency: 500-800ms depending on configuration. Best for: developers who need fine-grained control.
Option C: Full Custom Build
Use Twilio Media Streams + Deepgram + OpenAI + ElevenLabs directly. Maximum control, maximum complexity. Latency: 600-900ms (more overhead from orchestration). Best for: engineering teams with specific requirements.
For 90% of agency use cases, Retell AI is the right choice.
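One way to read the latency ranges quoted for the three options is to compare each against the 700ms target. The sketch below does only that arithmetic. The `Platform` class and `budget_status` function are illustrative names, and the numbers are the ranges stated above, not measurements.

```python
from dataclasses import dataclass

LATENCY_BUDGET_MS = 700  # "under 700ms" target from the architecture section


@dataclass(frozen=True)
class Platform:
    name: str
    min_ms: int
    max_ms: int


# Latency ranges as quoted in the platform descriptions above.
PLATFORMS = [
    Platform("Retell AI", 400, 600),
    Platform("VAPI", 500, 800),
    Platform("Full custom build", 600, 900),
]


def budget_status(platform: Platform, budget_ms: int = LATENCY_BUDGET_MS) -> str:
    """Classify a platform by how its quoted range sits relative to the budget."""
    if platform.max_ms < budget_ms:
        return "always under budget"
    if platform.min_ms < budget_ms:
        return "depends on configuration"
    return "over budget"


if __name__ == "__main__":
    for p in PLATFORMS:
        print(f"{p.name}: {p.min_ms}-{p.max_ms}ms, {budget_status(p)}")
```

By these figures, only Retell's whole range sits under the target. The other two can meet it, but only with careful configuration.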
Step 1: Design the Conversation Flow
Before touching any platform, write out your conversation flow in plain English.
Sample: Auto Shop Appointment Booking Agent
Entry: "Thank you for calling [Shop Name]. This is Maya — how can I help you today?"
Intents to handle:
- Schedule an appointment
- Get pricing information
- Check appointment status
- Speak to a human
Appointment booking flow:
- Confirm they want to schedule
- Ask for vehicle year/make/model
- Ask what service they need
- Check available slots (from calendar API)
- Offer 2-3 options
- Confirm selection
- Ask for name and phone number
- Confirm appointment: "You're scheduled for [service] on [day] at [time]. You'll get a text confirmation shortly."
Escalation: "Let me get you connected with one of our service advisors — please hold just a moment."
Document this before building. The conversation design is 60% of the work.
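Once the flow is written down in plain English, it can help to capture it as data so that missing steps become obvious. The following is a rough sketch of the auto shop booking flow as an ordered list of steps with a small tracker. The step names and the `BookingFlow` class are made up for illustration and are not part of any platform's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Steps from the appointment booking flow, in the order the agent should handle them.
BOOKING_STEPS = (
    "confirm_wants_to_schedule",
    "vehicle_year_make_model",
    "service_needed",
    "check_available_slots",
    "offer_options",
    "confirm_selection",
    "name_and_phone",
    "confirm_appointment",
)

ESCALATION_LINE = (
    "Let me get you connected with one of our service advisors. "
    "Please hold just a moment."
)


@dataclass
class BookingFlow:
    """Tracks which step the call is on and what the caller has said so far."""
    step_index: int = 0
    answers: Dict[str, str] = field(default_factory=dict)

    @property
    def current_step(self) -> Optional[str]:
        """Name of the step to handle next, or None once the flow is complete."""
        if self.step_index < len(BOOKING_STEPS):
            return BOOKING_STEPS[self.step_index]
        return None

    def record(self, answer: str) -> None:
        """Store the result of the current step and move to the next one."""
        step = self.current_step
        if step is None:
            raise ValueError("Booking flow is already complete")
        self.answers[step] = answer
        self.step_index += 1


def confirmation_line(service: str, day: str, time: str) -> str:
    """Fill in the confirmation script from the sample flow."""
    return (
        f"You're scheduled for {service} on {day} at {time}. "
        "You'll get a text confirmation shortly."
    )
```

The same list can double as a checklist when writing the system prompt: every step should have a matching instruction.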
Step 2: Choose and Configure Your Voice
Voice selection strongly shapes how callers perceive the business. A voice that does not match the brand can undermine trust.
ElevenLabs Voice Selection Guide
| Voice Characteristic | Best For |
|---|---|
| Warm, slow cadence | Healthcare, legal, wellness |
| Professional, neutral | Business services, real estate |
| Energetic, friendly | Retail, hospitality, restaurants |
| Authoritative | Financial services, law enforcement |
For most business use cases: start with ElevenLabs "Rachel" (professional, warm) or create a custom clone from the business owner's voice (powerful for brand alignment).
Custom voice cloning with ElevenLabs:
- Record 30-60 minutes of clean audio
- Upload to ElevenLabs Voice Lab
- Train (15-30 minutes)
- Use the voice ID in your Retell configuration
Step 3: Build the LLM System Prompt
The system prompt defines how your agent behaves. It needs:
- Identity and role: Who the agent is and what it can do
- Knowledge base: Business hours, pricing, services, FAQs
- Conversation rules: Tone, pacing, handling unclear input
- Fallback instructions: What to do when the agent cannot handle a request
- Data collection instructions: What information to gather and how to confirm it
Keep prompts under 2,000 tokens for best performance. Test with edge cases: angry callers, unclear requests, questions outside scope.
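As a rough sketch of how the five sections and the 2,000-token guideline could be enforced before a prompt goes live, the helper below assembles the sections in a fixed order and rejects prompts that are incomplete or too long. The function names are illustrative. The token count uses the common rule of thumb of about 4 characters per token for English text, which is an approximation. For exact counts, use the tokenizer of the LLM you deploy with.

```python
MAX_PROMPT_TOKENS = 2000  # guideline from the text

# The five sections listed above, in the order they appear in the prompt.
SECTION_ORDER = (
    "Identity and role",
    "Knowledge base",
    "Conversation rules",
    "Fallback instructions",
    "Data collection instructions",
)


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def build_system_prompt(sections: dict[str, str]) -> str:
    """Join the required sections into one prompt, or raise if something is off."""
    missing = [name for name in SECTION_ORDER if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"Missing prompt sections: {', '.join(missing)}")

    prompt = "\n\n".join(
        f"{name}:\n{sections[name].strip()}" for name in SECTION_ORDER
    )

    tokens = estimate_tokens(prompt)
    if tokens > MAX_PROMPT_TOKENS:
        raise ValueError(
            f"Prompt is roughly {tokens} tokens, over the {MAX_PROMPT_TOKENS}-token guideline"
        )
    return prompt
```

Running this check whenever the knowledge base changes catches prompts that have quietly grown past the guideline.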
Step 4: Connect to Calendar/CRM
For appointment booking agents, real-time calendar access is essential. Without it, the agent can confirm slots that are already taken or do not exist.
Integration options:
- Google Calendar: Google Calendar API (free, relatively easy)
- Acuity/Calendly: Webhook + API (straightforward)
- Industry-specific software: Usually requires a middleware layer via Make.com or n8n
Build a Make.com workflow that:
- Receives a webhook request from Retell during the call
- At the "check availability" step, queries the calendar
- Returns available slots to the agent
- At booking confirmation, creates the calendar event
- Sends confirmation SMS to caller
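The workflow steps above can be sketched in code to show the logic a Make.com scenario (or a small custom webhook handler) would need. This is a simplified illustration with an in-memory calendar. The payload fields (`action`, `slot`, `service`, `name`, `phone`) are hypothetical and do not reflect Retell's actual webhook schema, and the SMS text is returned rather than sent.

```python
from dataclasses import dataclass, field


@dataclass
class Calendar:
    """In-memory stand-in for a real calendar API (hypothetical)."""
    open_slots: list[str]
    events: dict[str, dict] = field(default_factory=dict)

    def available(self, limit: int = 3) -> list[str]:
        """Return up to `limit` slots that have not been booked yet."""
        return [s for s in self.open_slots if s not in self.events][:limit]

    def book(self, slot: str, details: dict) -> bool:
        """Create the event if the slot exists and is still free."""
        if slot not in self.open_slots or slot in self.events:
            return False
        self.events[slot] = details
        return True


def handle_webhook(payload: dict, calendar: Calendar) -> dict:
    """Route one webhook request to the matching calendar operation."""
    action = payload.get("action")

    if action == "check_availability":
        # The agent offers these slots to the caller (2-3 options).
        return {"slots": calendar.available()}

    if action == "book":
        slot = payload["slot"]
        details = {
            "service": payload["service"],
            "name": payload["name"],
            "phone": payload["phone"],
        }
        if not calendar.book(slot, details):
            # Slot was taken or never existed: send fresh options back.
            return {"booked": False, "slots": calendar.available()}
        # A real workflow would pass this text to an SMS provider.
        sms = f"You're scheduled for {details['service']} on {slot}."
        return {"booked": True, "sms": sms}

    return {"error": f"unknown action: {action}"}
```

Re-checking availability inside `book` matters because another caller may take the slot between the availability check and the confirmation.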
Step 5: Test Rigorously
Before going live, run through at minimum:
- 20 test calls covering normal scenarios
- 5 calls where the user says something unexpected
- 3 calls where the user wants to speak to a human
- 2 calls where the user tries to schedule outside available hours
- 1 call that gets disconnected mid-conversation
Listen to call recordings and read transcripts. Iterate on the prompt until the agent handles 85%+ of scenarios gracefully.
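The minimum test plan above adds up to 31 calls, so an 85% bar means at least 27 of them need to go well. The short sketch below does that arithmetic. It assumes the 85% threshold is applied to the test calls themselves, which is one reasonable reading of "85%+ of scenarios". The category labels are shorthand for the list above.

```python
import math

# Minimum test calls per category, as listed above.
TEST_PLAN = {
    "normal scenarios": 20,
    "unexpected input": 5,
    "human handoff": 3,
    "outside available hours": 2,
    "disconnected mid-call": 1,
}
PASS_THRESHOLD = 0.85


def total_calls(plan: dict[str, int] = TEST_PLAN) -> int:
    """Total number of test calls in the plan."""
    return sum(plan.values())


def required_passes(plan: dict[str, int] = TEST_PLAN, threshold: float = PASS_THRESHOLD) -> int:
    """Smallest number of gracefully handled calls that meets the threshold."""
    return math.ceil(total_calls(plan) * threshold)


def ready_for_launch(graceful_calls: int) -> bool:
    """True when enough test calls were handled gracefully."""
    return graceful_calls >= required_passes()


if __name__ == "__main__":
    print(f"{total_calls()} test calls, {required_passes()} must be handled gracefully")
```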
Deployment Checklist
- Phone number acquired and assigned
- Voice tested by 3+ people (not just you)
- Calendar integration tested with real bookings
- Webhook for post-call data capture working
- SMS confirmation triggers on booking
- Escalation path (human handoff) tested
- Error monitoring set up (webhook alert on failed calls)
- Client trained on how to review call logs