Some early calls lasted up to 5 minutes, and users stayed engaged. The assistant reflected answers accurately and kept momentum toward a booking.
All call control and model streaming ran over WebSockets (audio frames, partial transcripts, token streams). The pipeline (PSTN → STT → LLM orchestration → cloud TTS) maintained a ~700–800 ms average response time per turn under load, while preserving barge-in behavior.
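For illustration, here is a minimal asyncio sketch of the barge-in pattern: TTS audio streams out on a task that gets cancelled the moment the STT side signals speech. The message schema, frame sizes, placeholder URL, and the `synthesize` stub are assumptions for the sketch, not our production protocol.

```python
import asyncio
import json

import websockets  # pip install websockets


def synthesize(text):
    """Stand-in for the cloud-TTS call; yields 20 ms PCM frames of silence."""
    return [b"\x00" * 640 for _ in range(50)]  # ~1 s placeholder audio


async def stream_tts(ws, frames):
    """Pace synthesized audio downstream; cancellable mid-utterance for barge-in."""
    for frame in frames:
        await ws.send(frame)        # binary frame out to the telephony leg
        await asyncio.sleep(0.02)   # roughly real-time pacing


async def call_loop(ws):
    tts_task = None
    async for message in ws:
        if isinstance(message, bytes):
            continue                # inbound caller audio -> STT (omitted here)
        event = json.loads(message)
        if event["type"] == "speech_started":
            if tts_task and not tts_task.done():
                tts_task.cancel()   # barge-in: stop speaking when the caller does
        elif event["type"] == "assistant_reply":
            tts_task = asyncio.create_task(stream_tts(ws, synthesize(event["text"])))


async def main():
    async with websockets.connect("wss://example.invalid/call") as ws:  # placeholder
        await call_loop(ws)

if __name__ == "__main__":
    asyncio.run(main())
```

Keeping synthesis on its own cancellable task is what makes the ~700–800 ms turn budget compatible with interruptions: the generation pipeline never blocks the listening path.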
We automated calls/SMS after booking and post-meeting. The first wave reduced no-shows; the second captured outcomes and reactivated stalled leads.
Users could reschedule via call or SMS. The assistant handled change requests inline, updated the calendar, and re-issued confirmations automatically.
Typical guided conversations ran 3–5 minutes: brief qualification, one or two product suggestions, and immediate booking if the user was ready.
On-the-fly product suggestion (RecSys)
During the call, we ranked products using rules + a lightweight learning-to-rank layer (embeddings over product descriptors). When scores were close, a gentle multi-armed bandit bias learned which option converted better by segment. The assistant offered 1–2 options and booked the appropriate line (advisors / underwriters / loans).
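A compressed sketch of that ranking step is below. The product fields, the 0.05 "close scores" margin, and the epsilon-greedy tie-break (standing in for the production bandit) are all assumptions for illustration.

```python
import random

import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def suggest(user_vec, products, rules, segment, conv_rates, margin=0.05, eps=0.1):
    """Return 1-2 product options: rule score + embedding score, bandit-ish tie-break."""
    scored = sorted(
        ((sum(w for cond, w in rules if cond(p)) + cosine(user_vec, p["embedding"]), p)
         for p in products),
        key=lambda t: t[0], reverse=True,
    )
    top = [p for _, p in scored[:2]]
    # When the top two are nearly tied, bias toward whichever converts better
    # for this segment, with a small exploration rate (epsilon-greedy stand-in
    # for the production multi-armed bandit).
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        if random.random() < eps:
            random.shuffle(top)
        else:
            top.sort(key=lambda p: conv_rates.get(segment, {}).get(p["id"], 0.0),
                     reverse=True)
    return top
```

The bandit only intervenes inside the margin, so hard business rules and the learned ranking stay in charge when one option is clearly better.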
TTS: We standardized on cloud TTS for consistent prosody and fast starts, controlling cost via concise responses and caching of common prompts.
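The caching half of that cost control is simple enough to show. A sketch, assuming a `synthesize_cloud` callable and hash-based keying (production would also bound the cache, e.g. LRU):

```python
import hashlib


class TtsCache:
    """Synthesize recurring short responses once, replay from memory after."""

    def __init__(self, synthesize_cloud):
        self._synth = synthesize_cloud    # callable: (text, voice) -> audio bytes
        self._store = {}

    def get_audio(self, text, voice="default"):
        key = hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()
        if key not in self._store:
            self._store[key] = self._synth(text, voice)   # pay the API cost once
        return self._store[key]
```

Greetings, confirmations, and compliance lines repeat on nearly every call, so they dominate the cache hit rate.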
STT: Self-hosted Whisper-family plus major cloud engines. Non-English calls were the hardest: accents, numerals, and domain terms all tripped recognition, so we added exact-phrase lists and domain lexicons for critical slots.
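One way to apply that on a Whisper-family model is to seed the decoder with an `initial_prompt` and run an exact-phrase pass afterward; the toy lexicon below is illustrative, the production lists were per-locale and per-slot.

```python
import whisper  # pip install openai-whisper

DOMAIN_LEXICON = {"a p r": "APR", "under writer": "underwriter"}  # illustrative pairs

model = whisper.load_model("small")
result = model.transcribe(
    "call_segment.wav",
    initial_prompt="Mortgage, APR, underwriter, amortization, fixed-rate",  # bias seed
)

text = result["text"].lower()
for wrong, right in DOMAIN_LEXICON.items():
    text = text.replace(wrong, right)  # exact-phrase fix-ups for critical slots
```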
Wordy non-English intros raised average handle time (AHT) without improving bookings → kept openings concise.
Full local TTS for rare languages underperformed → cloud TTS with custom pronunciations for key terms proved more reliable.
Long first-call questionnaires reduced completion → trimmed to 6–8 core fields; moved the rest to short SMS follow-ups.
LLM + structured extraction
An LLM-orchestrated flow produced structured outputs (JSON) for key fields (income band, timeline, product interest, constraints). A light CoT-style planner asked clarifying follow-ups (“net or gross?”, “monthly or yearly?”) when confidence dipped, then emitted validated JSON to the CRM.
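A compressed sketch of that extract-validate-clarify loop, using pydantic for validation; the schema fields, the 0.7 confidence threshold, and the `call_llm` callable are assumptions, and the real flow ran inside the dialogue orchestrator:

```python
import json
from typing import Optional

from pydantic import BaseModel, ValidationError


class Qualification(BaseModel):
    income_band: Optional[str] = None     # e.g. "50-75k"
    income_basis: Optional[str] = None    # "net" or "gross"
    timeline: Optional[str] = None
    product_interest: Optional[str] = None
    confidence: float = 0.0


CLARIFIERS = {
    "income_basis": "Is that net or gross?",
    "timeline": "Monthly or yearly?",
}


def extract_turn(call_llm, transcript):
    """Return (fields, follow_up_question); follow_up is None when we can proceed."""
    raw = call_llm(f"Extract qualification fields as JSON:\n{transcript}")
    try:
        fields = Qualification(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None, "Sorry, could you repeat that?"
    if fields.confidence < 0.7:           # threshold is an assumption
        missing = next((k for k, q in CLARIFIERS.items()
                        if getattr(fields, k) is None), None)
        if missing:
            return fields, CLARIFIERS[missing]
    return fields, None                   # validated JSON ready for the CRM
```

Validation failures turn into a re-prompt rather than a crash, which is what keeps malformed model output from ever reaching the CRM.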
We trialed energy-based and neural VAD. Neural was steadier in noise; we tuned sensitivity per locale for quieter speakers and kept a hybrid approach so callers could interject naturally.
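A sketch of that hybrid gate: a cheap detector (webrtcvad here) screens every frame, and a neural model (Silero VAD, as one example) confirms in noise, with a per-locale threshold. The locale table and frame handling are placeholders, not our production tuning.

```python
import numpy as np
import torch
import webrtcvad  # pip install webrtcvad

silero, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
fast_vad = webrtcvad.Vad(2)                # aggressiveness 0-3

LOCALE_THRESHOLD = {"en": 0.5, "de": 0.35}  # lower = more sensitive (placeholder values)


def is_speech(frame_bytes: bytes, locale: str = "en", sr: int = 16000) -> bool:
    """30 ms 16-bit mono PCM frame -> speech decision."""
    if not fast_vad.is_speech(frame_bytes, sr):    # cheap first pass
        return False
    pcm = np.frombuffer(frame_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    if pcm.shape[0] < 512:
        pcm = np.pad(pcm, (0, 512 - pcm.shape[0]))  # Silero's 16 kHz chunk size
    prob = silero(torch.from_numpy(pcm), sr).item()  # neural confirmation
    return prob >= LOCALE_THRESHOLD.get(locale, 0.5)
```

The cheap pass keeps per-frame cost low; the neural pass only pays off where energy-based detection was flaky, which matches what we saw in noisy calls.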
Start small, iterate fast
First pass was simple: incoming call → consent → short qualification → product suggestion → book directly into the team’s shared calendar. Then we tightened timing, extraction accuracy, and scheduling UX.
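That first pass is small enough to express as a toy state machine; the state names mirror the steps above, and the transition flags are assumptions simplified to the happy path.

```python
from enum import Enum, auto


class CallState(Enum):
    CONSENT = auto()
    QUALIFY = auto()
    SUGGEST = auto()
    BOOK = auto()
    DONE = auto()


def next_state(state: CallState, turn: dict) -> CallState:
    """Advance the call flow based on flags produced by the current turn."""
    if state is CallState.CONSENT:
        return CallState.QUALIFY if turn.get("consented") else CallState.DONE
    if state is CallState.QUALIFY:
        return CallState.SUGGEST if turn.get("fields_complete") else CallState.QUALIFY
    if state is CallState.SUGGEST:
        return CallState.BOOK if turn.get("accepted_option") else CallState.SUGGEST
    if state is CallState.BOOK:
        return CallState.DONE   # write to the shared calendar (side effect omitted)
    return CallState.DONE
```

Starting from a flow this explicit made the later iterations (timing, extraction accuracy, scheduling UX) local changes rather than rewrites.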