April 10, 2026

How I built a Voice AI receptionist with Twilio + Gemini Live

An honest walkthrough of the architecture, the trade-offs, and the gotchas you only hit in production.

Voice AI is having a moment. Every founder I talk to wants a receptionist agent. The demos are easy. Production is not.

This post walks through the architecture I shipped for a multi-tenant Voice AI platform, with the trade-offs I'd tell you about over coffee.

The naive architecture (don't ship this)

The "Hello World" of voice agents looks like:

twilio webhook → start streaming → call OpenAI → stream back to Twilio

It works in 50 lines. It also breaks in production for at least four reasons:

  1. No retrieval. The agent hallucinates business hours.
  2. No tools. It can't book an appointment.
  3. No tenant isolation. Multi-tenant data leaks are one missing filter away.
  4. No fallback path. When the model returns garbage, the call is over.
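To make the fallback point concrete, here is a minimal sketch of a per-turn decision: speak the model's reply if it looks usable, re-prompt once if it doesn't, and hand off to a human after repeated bad turns. The `AgentAction` shape, the `nextAction` name, the 600-character limit, and the two-strike threshold are all assumptions for illustration, not part of any SDK.

```typescript
// Hypothetical action shape; not part of Twilio's or any model vendor's API.
type AgentAction =
  | { kind: "speak"; text: string }
  | { kind: "transfer"; reason: string };

// Assumed limit: very long replies are treated as a sign the model went off the rails.
const MAX_REPLY_CHARS = 600;

function looksUsable(text: string | null | undefined): text is string {
  if (!text) return false;
  const trimmed = text.trim();
  return trimmed.length > 0 && trimmed.length <= MAX_REPLY_CHARS;
}

/** Decide what to do with one model reply, given how many bad turns came before it. */
function nextAction(
  modelText: string | null,
  badTurnsSoFar: number,
  maxBadTurns = 2,
): AgentAction {
  if (looksUsable(modelText)) {
    return { kind: "speak", text: modelText.trim() };
  }
  if (badTurnsSoFar + 1 >= maxBadTurns) {
    // Too many unusable replies in this call: stop guessing and hand off.
    return { kind: "transfer", reason: "model returned unusable output" };
  }
  return { kind: "speak", text: "Sorry, I didn't catch that. Could you say it again?" };
}
```

What counts as "garbage" is the part you will tune per product; the point is only that the call has somewhere to go when the model fails.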

The architecture I actually shipped

Two backends, on purpose

The slow path (config, billing, webhooks) lives in Xano. The fast path (audio streaming, model orchestration) lives in NestJS. One stack doesn't have to do both jobs.

Caller ──► Twilio ──► NestJS gateway
                         │
                         ├──► Gemini Live (audio streaming)
                         │
                         ├──► RAG (pgvector, tenant-scoped)
                         │
                         └──► Tool API (Xano, tenant-scoped)

Every layer asserts tenant_id. No exceptions, no shortcuts.
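One way to read "every layer asserts tenant_id" is a guard that runs on data as it leaves each layer and fails loudly on a mismatch. This is a sketch under my own naming: `TenantScoped`, `TenantMismatchError`, and `assertTenant` are hypothetical, and a real version would sit alongside the tenant filter in the query itself rather than replace it.

```typescript
// Any record that crosses a layer boundary carries its tenant.
interface TenantScoped {
  tenantId: string;
}

class TenantMismatchError extends Error {
  constructor(expected: string, actual: string) {
    super(`tenant mismatch: expected ${expected}, got ${actual}`);
    this.name = "TenantMismatchError";
  }
}

/**
 * Throw if any row belongs to another tenant.
 * Throwing (rather than silently filtering) surfaces the missing filter upstream.
 */
function assertTenant<T extends TenantScoped>(
  expected: string,
  rows: readonly T[],
): readonly T[] {
  for (const row of rows) {
    if (row.tenantId !== expected) {
      throw new TenantMismatchError(expected, row.tenantId);
    }
  }
  return rows;
}

// Usage: check results before they reach the model or the caller.
const hours = [{ tenantId: "t_1", text: "Open Mon-Fri, 9 to 5" }];
const safeHours = assertTenant("t_1", hours);
```

The guard is deliberately redundant with the query filter: it exists for the day someone forgets the filter.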

Gemini Live vs OpenAI Realtime

I get asked this every week. Honest answer: both work.

  • Pick Gemini Live when the client is on GCP, latency from EU/US matters, or you want multimodal pricing rolled into existing credits.
  • Pick OpenAI Realtime when you need rock-solid function-calling reliability for complex tool chains, you're already in the OpenAI ecosystem, or you want voice cloning through ElevenLabs running in parallel.

There is no "winner." The right choice depends on your stack.

Three gotchas you'll hit

1. Audio chunk sizes matter

Twilio streams audio in 20ms chunks. Most models want 100–250ms windows. Buffer too little and responses come out choppy; buffer too much and you add up to 2 seconds of tail latency. Get the buffering right early, because every later tuning decision depends on it.
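Here is a minimal sketch of that buffering: collect 20ms frames until a window is full, then emit one chunk. It assumes the frames are already base64-decoded and that you are on Twilio's default 8 kHz mu-law format, where a 20ms frame is 160 bytes. `FrameAggregator` and the 100ms window are illustrative choices, not values from my production config.

```typescript
class FrameAggregator {
  private pending: Uint8Array[] = [];
  private pendingBytes = 0;
  private readonly windowMs: number;
  private readonly frameMs: number;

  constructor(windowMs: number, frameMs = 20) {
    this.windowMs = windowMs;
    this.frameMs = frameMs;
  }

  /** Add one frame. Returns a full window once enough audio has accumulated, else null. */
  push(frame: Uint8Array): Uint8Array | null {
    this.pending.push(frame);
    this.pendingBytes += frame.length;
    if (this.pending.length * this.frameMs < this.windowMs) {
      return null;
    }
    return this.flush();
  }

  /** Emit whatever is buffered. Call this at end of speech so the tail is not held back. */
  flush(): Uint8Array | null {
    if (this.pendingBytes === 0) return null;
    const out = new Uint8Array(this.pendingBytes);
    let offset = 0;
    for (const frame of this.pending) {
      out.set(frame, offset);
      offset += frame.length;
    }
    this.pending = [];
    this.pendingBytes = 0;
    return out;
  }
}

// Usage: five 160-byte frames (5 x 20ms) fill one 100ms window of 800 bytes.
const aggregator = new FrameAggregator(100);
```

The explicit `flush()` is what keeps tail latency down: without it, the last partial window waits for audio that never arrives.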

2. RAG retrieval is per-turn, not per-call

Don't retrieve once at call start. Retrieve before every model turn, because the user's question shifts as the call goes on. Yes, it costs more. Yes, the user notices when you don't.
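In code, per-turn retrieval just means the retriever is called with the latest utterance on every turn, never cached for the whole call. The sketch below uses a hypothetical `Retriever` interface and an in-memory keyword matcher as a stand-in; the SQL in the comment shows roughly what a pgvector-backed version could run, with table and column names that are assumptions.

```typescript
// Hypothetical interface. A pgvector-backed implementation might run something like:
//   SELECT content FROM chunks
//   WHERE tenant_id = $1
//   ORDER BY embedding <=> $2   -- cosine distance to the embedded utterance
//   LIMIT $3
interface Retriever {
  search(tenantId: string, query: string, k: number): Promise<string[]>;
}

/** Build the context for ONE model turn from the caller's latest utterance. */
async function contextForTurn(
  retriever: Retriever,
  tenantId: string,
  utterance: string,
  k = 3,
): Promise<string> {
  const chunks = await retriever.search(tenantId, utterance, k);
  return chunks.map((chunk, i) => `[${i + 1}] ${chunk}`).join("\n");
}

// Stand-in for local testing: naive keyword match instead of vector search.
class InMemoryRetriever implements Retriever {
  calls = 0;
  private readonly docs: { tenantId: string; content: string }[];

  constructor(docs: { tenantId: string; content: string }[]) {
    this.docs = docs;
  }

  async search(tenantId: string, query: string, k: number): Promise<string[]> {
    this.calls += 1;
    const words = query.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
    return this.docs
      .filter((d) => d.tenantId === tenantId) // tenant scope comes first
      .filter((d) => words.some((w) => d.content.toLowerCase().includes(w)))
      .slice(0, k)
      .map((d) => d.content);
  }
}
```

With this shape, "What are your opening hours?" and a follow-up "And where is the parking?" each trigger their own search, so the second turn gets parking context instead of stale opening hours.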

3. Tools must be idempotent

The model will double-call them. Either:

  • Make every tool idempotent (preferred), or
  • Track tool-call IDs server-side and dedupe repeats.

There is no third option.
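Both options can be sketched in a few lines. `ToolCallDeduper` remembers the result for each tool-call ID so a repeated call returns the stored result instead of running again, and `bookingKey` shows the idempotency-key flavor, where the key comes from what the booking means rather than from when it was requested. Both names are hypothetical, and the in-memory `Map` stands in for whatever shared store you would actually use.

```typescript
class ToolCallDeduper {
  private readonly results = new Map<string, unknown>();

  /**
   * Run `execute` at most once per callId.
   * If `execute` returns a Promise, the Promise itself is stored,
   * so concurrent duplicates share the same in-flight work.
   */
  run<T>(callId: string, execute: () => T): T {
    if (this.results.has(callId)) {
      return this.results.get(callId) as T;
    }
    const result = execute();
    this.results.set(callId, result);
    return result;
  }
}

/** Same tenant, caller, and slot always produce the same key, however often the model asks. */
function bookingKey(tenantId: string, callerPhone: string, slotIso: string): string {
  return [tenantId, callerPhone, slotIso].join("|");
}

// Usage: the second call with the same key does not book again.
const deduper = new ToolCallDeduper();
let bookingsMade = 0;
const key = bookingKey("t_1", "+15550100", "2026-04-14T10:00:00Z");
const firstResult = deduper.run(key, () => { bookingsMade += 1; return "confirmed"; });
const secondResult = deduper.run(key, () => { bookingsMade += 1; return "confirmed"; });
```

A production version would also expire old entries and avoid caching failures, so a failed booking can be retried.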

What I'd do next

The next iteration adds:

  • Conversation memory across calls (per-caller, not per-tenant)
  • Sentiment-based escalation to a human agent
  • Outbound calls (the API is symmetric; the product is not)

If you're building something like this and want to compare architecture notes, send a message. I'm always happy to talk shop.

