This is a submission for the AssemblyAI Voice Agents Challenge
What I Built
EmpathyAI is a real-time voice-powered mental health support application that provides compassionate AI-driven conversations for individuals experiencing emotional distress. The system processes spoken input through advanced speech recognition, analyzes emotional content using AI, and responds with empathetic voice-based support.
Demo
GitHub Repository
React frontend app
https://github.com/vpjigin/EmpathyAIReact.git
Spring Boot backend
https://github.com/vpjigin/EmpathyAISpringBoot.git
AssemblyAI Universal-Streaming Technology
This application demonstrates advanced real-time audio processing powered by AssemblyAI’s Universal-Streaming API. The system delivers low-latency, turn-based, secure transcription, enabling emotionally intelligent AI conversations.
Core Architecture
The architecture follows a multi-layered streaming pipeline:
Client Audio → WebSocket Handler → AssemblyAI Streaming → AI Processing → Response
AssemblyAI Streaming Implementation
- Real-time WebSocket Connection

The backend creates a persistent WebSocket connection to AssemblyAI’s streaming endpoint:
```java
private static final String ASSEMBLYAI_STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws";

public CompletableFuture<StreamingSession> createStreamingSession(String sessionId, TranscriptCallback callback) {
    String connectionUrl = ASSEMBLYAI_STREAMING_URL + "?sample_rate=16000&format_turns=true";
    Map<String, String> headers = new HashMap<>();
    headers.put("Authorization", apiKey);

    URI serverUri = URI.create(connectionUrl);
    WebSocketClient client = new WebSocketClient(serverUri, headers) {
        @Override
        public void onMessage(String message) {
            try {
                JsonNode jsonMessage = objectMapper.readTree(message);
                String messageType = jsonMessage.get("type").asText();
                if ("Turn".equals(messageType)) {
                    String transcript = jsonMessage.get("transcript").asText();
                    boolean isFormatted = jsonMessage.get("turn_is_formatted").asBoolean();
                    if (isFormatted) {
                        // Only deliver fully formatted turns to the callback
                        callback.onTranscript(transcript, true);
                    }
                }
            } catch (JsonProcessingException e) {
                log.error("Failed to parse AssemblyAI message", e);
            }
        }
        // onOpen, onClose, onError omitted for brevity
    };
    // ...connect and wrap the live session in a CompletableFuture
}
```
- Audio Streaming Handler

The AudioStreamingWebSocketHandler component bridges client-side audio to the AssemblyAI session:
```java
@Component
public class AudioStreamingWebSocketHandler implements WebSocketHandler {

    @Autowired
    private AssemblyAIStreamingServiceV2 assemblyAIStreamingService;

    private void handleBinaryMessage(WebSocketSession session, BinaryMessage message) {
        StreamingSessionV2 assemblySession = assemblyAISessions.get(session.getId());
        if (assemblySession != null) {
            // Copy the raw PCM bytes out of the buffer and forward them to AssemblyAI
            ByteBuffer audioData = message.getPayload();
            byte[] audioBytes = new byte[audioData.remaining()];
            audioData.get(audioBytes);
            assemblySession.sendAudioData(audioBytes);
        }
    }

    private void startStreaming(WebSocketSession session, String conversationUuid) {
        assemblyAIStreamingService.createStreamingSession(session.getId(), new TranscriptCallback() {
            @Override
            public void onTranscript(String text, boolean isFinal) {
                if (isFinal) {
                    handleFinalTranscript(session, conversationUuid, text);
                }
            }
        });
    }
}
```
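The handler above forwards raw 16-bit PCM bytes. In this project that conversion from the browser’s Float32 samples happens client-side in the React app; as a rough Java sketch of the same transform (names are illustrative, not from the repo):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PcmConverter {
    /** Converts [-1.0, 1.0] float samples to 16-bit signed little-endian PCM bytes. */
    public static byte[] floatToPcm16(float[] samples) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * 2)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float sample : samples) {
            // Clamp to the valid range before scaling to the 16-bit range
            float clamped = Math.max(-1.0f, Math.min(1.0f, sample));
            buffer.putShort((short) (clamped * 32767));
        }
        return buffer.array();
    }
}
```

Each float becomes two bytes, so a 16kHz stream produces 32,000 bytes per second of audio sent over the WebSocket.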
- Advanced Features Utilized
- Turn-based Transcription: `format_turns=true` enables a natural, human-like conversational flow
- 16kHz Audio: `sample_rate=16000` ensures clear, accurate transcription
- TLS/SSL Security: Connections secured over `wss://` with valid certificates
- Concurrent Streaming: Supports multiple simultaneous sessions
- Message Type Handling: Handles the "Begin", "Turn", and "Termination" message types
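The message-type dispatch can be sketched as follows. This is a minimal, self-contained illustration rather than the repo’s code: the real handler parses JSON with Jackson, while this version pulls out the type field with a regex so it runs standalone:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MessageDispatcher {
    // Naive extraction of the "type" field; the real code uses a JSON parser
    private static final Pattern TYPE = Pattern.compile("\"type\"\\s*:\\s*\"(\\w+)\"");

    /** Maps each AssemblyAI streaming message type to the action taken. */
    public static String dispatch(String json) {
        Matcher m = TYPE.matcher(json);
        if (!m.find()) return "ignored";
        switch (m.group(1)) {
            case "Begin":       return "session-started";
            case "Turn":        return "transcript-received";
            case "Termination": return "session-closed";
            default:            return "ignored";
        }
    }
}
```

In the application, "Turn" messages drive the empathy pipeline, while "Begin" and "Termination" bracket the session lifecycle.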
Dual Implementation Strategy
I implemented two parallel streaming strategies:
- AssemblyAIStreamingService: uses the Java-WebSocket library for low-level WebSocket handling
- AssemblyAIStreamingServiceV2: uses Spring’s StandardWebSocketClient for seamless Spring Boot integration
```java
// Spring-based implementation
public CompletableFuture<StreamingSessionV2> createStreamingSession(String sessionId, TranscriptCallback callback) {
    StandardWebSocketClient client = new StandardWebSocketClient();
    WebSocketHttpHeaders headers = new WebSocketHttpHeaders();
    headers.add("Authorization", apiKey);

    WebSocketHandler handler = new WebSocketHandler() {
        @Override
        public void handleMessage(WebSocketSession session, WebSocketMessage<?> message) {
            // Handle messages using the Spring WebSocket framework
        }
        // afterConnectionEstablished, handleTransportError, etc. omitted for brevity
    };

    WebSocketSession session = client.doHandshake(handler, headers, serverUri).get();
    // ...wrap the session in a StreamingSessionV2 and complete the future
}
```
Technical Capabilities Leveraged
1. Real-time Binary Audio Streaming
2. Low-latency (<1s) Transcription
3. Turn-based Conversation Context
4. Error Recovery & Retry Mechanism
5. Scalable Concurrent Sessions
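The error-recovery idea from the list above can be sketched as a generic retry helper with exponential backoff. This is an illustrative sketch of the pattern, not the repo’s actual implementation; names are hypothetical:

```java
import java.util.concurrent.Callable;

public class RetryHelper {
    /** Runs the task, retrying up to maxAttempts times with a doubling delay. */
    public static <T> T withRetry(Callable<T> task, int maxAttempts, long baseDelayMs)
            throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        long delay = baseDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff between attempts
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Wrapping the WebSocket connection attempt in a helper like this lets a dropped AssemblyAI session reconnect transparently without surfacing transient failures to the user.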
Project Structure (Brief)
```
├── controller/
├── service/
├── websocket/
├── model/
└── config/
```