April 10, 2026

How I built a Voice AI receptionist with Twilio + Gemini Live

An honest walkthrough of the architecture, the trade-offs, and the gotchas you only hit in production.

Voice AI is having a moment. Every founder I talk to wants a receptionist agent. The demos are easy. Production is not.

This post walks through the architecture I shipped for a multi-tenant Voice AI platform, with the trade-offs I'd tell you about over coffee.

The naive architecture (don't ship this)

The "Hello World" of voice agents looks like:

Twilio webhook → start streaming → call OpenAI → stream back to Twilio

It works in 50 lines. It also breaks in production for at least four reasons:

No retrieval. The agent hallucinates business hours.
No tools. It can't book an appointment.
No tenant isolation. Multi-tenant data leaks are one missing filter away.
No fallback path. When the model returns garbage, the call is over.
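To make the last point concrete, here is a minimal sketch of what a fallback path could look like. It assumes the model call is wrapped in a function that returns a promise; `ModelReply`, `withFallback`, and `FALLBACK_LINE` are hypothetical names for illustration, not part of any SDK.

```typescript
// Hypothetical names throughout: ModelReply, withFallback, FALLBACK_LINE.
type ModelReply = { text: string };

const FALLBACK_LINE =
  "Sorry, I didn't catch that. Let me connect you to someone who can help.";

// A reply counts as usable only if it exists and contains non-whitespace text.
function isUsable(reply: ModelReply | null | undefined): reply is ModelReply {
  return !!reply && reply.text.trim().length > 0;
}

// Races the model call against a timer. An error, an empty reply, or a
// timeout all end in the same canned line instead of a dead call.
async function withFallback(
  askModel: () => Promise<ModelReply>,
  timeoutMs: number,
): Promise<{ text: string; fellBack: boolean }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  try {
    const reply = await Promise.race([askModel(), timeout]);
    if (isUsable(reply)) return { text: reply.text, fellBack: false };
  } catch {
    // Model errors are treated the same as an empty reply.
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
  return { text: FALLBACK_LINE, fellBack: true };
}
```

The design choice worth noting is that every failure mode collapses into one branch, so the caller always hears something. What the canned line says, and whether it hands off to a human, is a product decision.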

The architecture I actually shipped

Two backends, on purpose

The slow path (config, billing, webhooks) lives in Xano. The fast path (audio streaming, model orchestration) lives in NestJS. One stack doesn't have to do both jobs.

Caller ──► Twilio ──► NestJS gateway
                         │
                         ├──► Gemini Live (audio streaming)
                         │
                         ├──► RAG (pgvector, tenant-scoped)
                         │
                         └──► Tool API (Xano, tenant-scoped)

Every layer asserts tenant_id. No exceptions, no shortcuts.
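As a rough sketch of what "every layer asserts tenant_id" can mean for the RAG layer: a guard that refuses to proceed without a tenant, and a query builder that cannot produce an unscoped search. The table name `kb_chunks`, its columns, and the helper names `assertTenant` and `buildScopedSearch` are assumptions for illustration. `<=>` is pgvector's cosine distance operator.

```typescript
// Hypothetical types and names; only tenant_id and pgvector come from the post.
type TenantContext = { tenantId: string };
type SqlQuery = { text: string; values: unknown[] };

// Throws unless a non-empty tenant ID is present on the context.
function assertTenant(ctx: Partial<TenantContext> | undefined): TenantContext {
  if (!ctx || typeof ctx.tenantId !== "string" || ctx.tenantId.length === 0) {
    throw new Error("tenant_id missing: refusing to continue");
  }
  return { tenantId: ctx.tenantId };
}

// Builds a parameterized similarity search. The tenant filter is part of
// the query text itself, so a caller cannot forget to add it.
function buildScopedSearch(
  ctx: Partial<TenantContext> | undefined,
  embedding: number[],
  limit = 5,
): SqlQuery {
  const { tenantId } = assertTenant(ctx);
  return {
    text:
      "SELECT id, content FROM kb_chunks " + // assumed table and columns
      "WHERE tenant_id = $1 " +
      "ORDER BY embedding <=> $2::vector " +
      "LIMIT $3",
    // pgvector accepts a vector as the text literal "[x,y,z]".
    values: [tenantId, `[${embedding.join(",")}]`, limit],
  };
}
```

The returned `{ text, values }` object can be handed to a Postgres client that accepts parameterized queries. The Tool API layer would repeat the same guard before every call into Xano.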

Gemini Live vs OpenAI Realtime

I get asked this every week. Honest answer: both work.

Pick Gemini Live when the client is on GCP, when latency from EU or US regions matters, or when you want multimodal pricing rolled into existing credits.
Pick OpenAI Realtime when you need rock-solid function-calling reliability for complex tool chains, when you're already in the OpenAI ecosystem, or when you want voice cloning from ElevenLabs running in parallel.

There is no "winner." The right choice depends on your stack.

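Three gotchas you'll hit

3. Tools must be idempotent

A tool call can reach your backend more than once. Either:

Make every tool idempotent (preferred), or
Track call IDs server-side and dedupe.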

There is no third option.
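The second option, tracking call IDs server-side, can be sketched in a few lines. This assumes each tool call arrives with a stable ID; `ToolCall`, `ToolDeduper`, and the booking handler are hypothetical names. The in-memory `Map` and the synchronous handler are simplifications, and a real deployment would likely want a shared store with expiry.

```typescript
// Hypothetical shapes: ToolCall, ToolDeduper and the booking handler.
type ToolCall = { callId: string; name: string; args: Record<string, unknown> };

class ToolDeduper<R> {
  // In-memory only: an assumption for the sketch, not a production store.
  private readonly results = new Map<string, R>();

  // Runs the handler once per (tenant, callId) and replays the stored
  // result if the same call shows up again.
  run(tenantId: string, call: ToolCall, handler: (call: ToolCall) => R): R {
    const key = `${tenantId}:${call.callId}`;
    if (this.results.has(key)) return this.results.get(key) as R;
    const result = handler(call);
    this.results.set(key, result);
    return result;
  }
}

// Usage: the same call delivered twice books only once.
let bookings = 0;
const book = (call: ToolCall): string => {
  bookings += 1;
  return `booked:${String(call.args.slot)}`;
};

const deduper = new ToolDeduper<string>();
const call: ToolCall = {
  callId: "call_123",
  name: "book_appointment",
  args: { slot: "2026-04-14T10:00" },
};

const first = deduper.run("t_1", call, book);
const second = deduper.run("t_1", call, book);
```

Keying on tenant plus call ID keeps the dedupe table tenant-scoped like everything else, so two tenants that happen to receive the same call ID never share a result.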

What I'd do next

The next iteration adds:

Conversation memory across calls (per-caller, not per-tenant)
Sentiment-based escalation to a human agent
Outbound calls (the API is symmetric; the product is not)

If you're building something like this and want to compare architecture notes, send a message — I'm always happy to talk shop.

Tags: #AI #Voice #Twilio #Gemini #RAG


