K Om Senapati
Real-Time Voice Meets RAG: Building a Domain-Specific AI Chatbot

AssemblyAI Voice Agents Challenge: Domain Expert

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

I recently built a small side project: a voice-based chatbot that answers sociology questions using a domain-trained RAG agent. It's called Sociopal.

It’s powered by LangGraph, does corrective RAG, and can also search the web when it doesn’t have the answer. AssemblyAI handles speech-to-text, and ElevenLabs takes care of the speech output.

You ask a sociology-related question using your voice. The app transcribes your speech to text, queries a backend agent trained on sociology docs, and returns a response. If the answer isn't found in the vector DB, the agent falls back to web search and tries again.

The final response is both displayed and spoken aloud using ElevenLabs.

Demo

Not deployed yet, but here’s a short demo video:

GitHub Repository

⭐ GitHub 👇

Sociopal

A domain expert AI voice agent for sociology.

Learn more about Sociopal

Sociopal is a Corrective RAG (CRAG) agent backed by a vector DB of curated sociology information, with web search as a fallback. It is designed to answer questions and provide detailed explanations related to sociology.

Technology Stack

Frontend:

  • Next.js
  • AssemblyAI (speech-to-text)
  • ElevenLabs (text-to-speech)

Backend:

  • FastAPI
  • LangGraph
  • Groq
  • ChromaDB
  • DuckDuckGo (web search)

Getting Started

1. Clone the Repository

git clone https://github.com/k0msenapati/sociopal.git

2. Navigate to the Project Directory

cd sociopal

Frontend Setup

cd ui
bun i
cp .env.example .env.local

Fill in your ElevenLabs and AssemblyAI API keys in .env.local.

Start the development server:

bun dev

Backend Setup

cd ../agent-py
uv sync
source .venv/bin/activate
cp .env.example .env

Fill in your Groq API key in .env.

Index the Data

uv run --active -m sociology_agent.index

Run the Server

uv run --active uvicorn sociology_agent.server:app --reload



Installation steps are included in the README.

Technical Implementation

AssemblyAI Integration

I’m using AssemblyAI’s Universal-Streaming API to handle real-time voice input. Here’s the rough flow:

1. Getting a Temporary Token

There's an API route (/api/token) that fetches a temporary token:

const url = `https://streaming.assemblyai.com/v3/token?expires_in_seconds=60` 
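A server-side route like this keeps the AssemblyAI API key off the client and hands the browser only a short-lived token. This is a minimal sketch, not the repo's actual handler: the raw-API-key `Authorization` header and the `{ token }` response shape are assumptions about the AssemblyAI token endpoint.

```typescript
// Builds the temporary-token URL (endpoint and expires_in_seconds
// parameter taken from the snippet above).
export function tokenUrl(expiresInSeconds: number): string {
  return `https://streaming.assemblyai.com/v3/token?expires_in_seconds=${expiresInSeconds}`;
}

// app/api/token/route.ts — hypothetical Next.js route handler.
// Assumes the API key is accepted as a raw Authorization header and
// that the endpoint replies with { token }.
export async function GET(): Promise<Response> {
  const res = await fetch(tokenUrl(60), {
    headers: { Authorization: process.env.ASSEMBLYAI_API_KEY ?? "" },
  });
  if (!res.ok) {
    return Response.json({ error: "token request failed" }, { status: 500 });
  }
  const { token } = await res.json();
  return Response.json({ token });
}
```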

2. Connecting via WebSocket

Once the token is ready, a WebSocket connection is opened to stream audio:

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&formatted_finals=true&token=${token} 
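Opening the socket and surfacing transcripts might look like the sketch below. The URL parameters come from the post; the shape of the incoming message (a JSON payload with a `transcript` field) is an assumption about the v3 streaming protocol, so check the actual payloads in practice.

```typescript
// Builds the streaming WebSocket URL shown above.
export function wsUrl(token: string, sampleRate = 16000): string {
  return `wss://streaming.assemblyai.com/v3/ws?sample_rate=${sampleRate}&formatted_finals=true&token=${token}`;
}

// Hypothetical browser-side wiring: parse each message and pass any
// transcript text to a callback for display.
export function connectTranscriber(
  token: string,
  onTranscript: (text: string) => void,
): WebSocket {
  const ws = new WebSocket(wsUrl(token));
  ws.onmessage = (event) => {
    const msg = JSON.parse(event.data as string);
    if (typeof msg.transcript === "string") onTranscript(msg.transcript);
  };
  return ws;
}
```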

On the frontend, I use getUserMedia() to access the mic, then convert the audio to 16-bit PCM and send it over the socket. AssemblyAI returns transcripts in real time, which I display as the user speaks.
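The Float32-to-16-bit conversion step is the one piece of this that is pure math: Web Audio gives samples in [-1, 1], and the socket expects signed 16-bit PCM. A small helper (illustrative, not the repo's code) could look like:

```typescript
// Converts Float32 samples from the Web Audio API ([-1, 1]) to the
// signed 16-bit PCM frames sent over the WebSocket.
export function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples don't wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```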

It works smoothly with low latency, and transcripts are surprisingly accurate even with casual speech.

Backend Agent

The backend runs a FastAPI app with a /query route. It accepts user queries, passes them to the LangGraph agent, and returns the response.
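From the frontend's side, calling that route is a plain JSON POST. This is a hedged sketch: the `{ query }` request body and `{ response }` reply field are assumptions about the route's schema, not taken from the repo.

```typescript
// Serializes the request body for the /query route (field name is an
// assumption about the backend's schema).
export function queryPayload(query: string): string {
  return JSON.stringify({ query });
}

// Hypothetical client call to the FastAPI backend.
export async function askAgent(
  query: string,
  baseUrl = "http://localhost:8000",
): Promise<string> {
  const res = await fetch(`${baseUrl}/query`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: queryPayload(query),
  });
  if (!res.ok) throw new Error(`agent request failed: ${res.status}`);
  const data = await res.json();
  return data.response;
}
```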

The agent uses corrective RAG: if the retrieved context is graded as incomplete or irrelevant, it retries with a refined query. It's also hooked up to a web search tool in case the answer isn't in the vector DB.
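The control flow above can be sketched language-agnostically. The real agent is a Python LangGraph graph; every function name below is illustrative, with the retrieve/grade/refine/search/generate steps injected so the routing logic is visible on its own.

```typescript
// Sketch of the corrective-RAG routing described above. All dependency
// names are hypothetical stand-ins for the LangGraph nodes.
export type Grade = "relevant" | "irrelevant";

export interface CragDeps {
  retrieve: (q: string) => string[];            // vector-DB lookup
  grade: (q: string, docs: string[]) => Grade;  // LLM relevance check
  refine: (q: string) => string;                // query rewriting
  webSearch: (q: string) => string[];           // fallback tool
  generate: (q: string, docs: string[]) => string;
}

export function answer(query: string, deps: CragDeps): string {
  let docs = deps.retrieve(query);
  if (deps.grade(query, docs) === "irrelevant") {
    // Corrective step: rewrite the query and fall back to web search.
    docs = deps.webSearch(deps.refine(query));
  }
  return deps.generate(query, docs);
}
```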

Final Thoughts

Building this was a fun way to explore how voice can enhance AI agents. Using real-time transcription with AssemblyAI and natural-sounding speech with ElevenLabs made the voice interface smooth to implement.

While this one is trained on sociology data, the setup is actually domain-agnostic. You can swap the vector database's contents for any other domain-specific corpus, and the agent will work just as well.

Definitely worth trying if you're into voice UIs or building smarter assistants.


Thanks for reading, and I look forward to connecting with you again soon!

Follow me for more content like this!

Twitter | GitHub | YouTube

Bye

Top comments (6)

Rohan Sharma

Great job, man!

Abhinav

Amazing 🫡

Ayush Jhawar

Great Work 🎉

Pheonix Coder 🐦‍🔥

Voice addition is just 🫰

Tuhin Banerjee

cool stuff...
