Vision Agents is an open-source video AI framework for building real-time voice and video applications, built and maintained by the team at Stream. It ships with Stream Video as its default low-latency transport, powered by Stream's global edge network. The framework is edge/transport agnostic, meaning developers can also bring any edge layer they like.
What can you build?
Vision Agents makes it simple to prototype and scale a wide range of AI-powered video apps, including: - Coaching & Training — live sports coaching, guided workouts
- Collaboration — meeting assistants, note-taking, transcription
- Automation & Robotics — IoT control, surveillance, manufacturing workflows
- Video AI — video avatars, character agents
Built-in AI integrations
Out of the box, Vision Agents supports popular providers across the AI stack: - LLMs: OpenAI, Anthropic, Gemini, xAI
- Realtime APIs: Gemini (websockets), OpenAI (WebRTC)
- Speech-to-Text (STT): Deepgram, Moonshine, Assembly AI
- Text-to-Speech (TTS): ElevenLabs, Assembly AI, Cartesia, Moonshine
- Turn / Voice Detection: Fal, Silero, Krisp
- Audio & Video Processing: YOLO
- Memory & Context: In-memory, Stream Chat
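To see how these layers fit together conceptually, here is a minimal sketch of a single voice turn: audio goes through STT, the transcript goes to an LLM, and the reply is synthesized by TTS. The function names (`stt`, `llm`, `tts`, `handle_turn`) are stand-ins for illustration, not the Vision Agents API — in a real app each would be backed by one of the providers listed above.

```python
# Conceptual pipeline sketch, NOT the real Vision Agents API.
# Each stand-in function models one layer of the stack above.

def stt(audio: bytes) -> str:
    """Stand-in for a speech-to-text provider (e.g. Deepgram)."""
    return "hello agent"

def llm(prompt: str) -> str:
    """Stand-in for an LLM provider (e.g. OpenAI, Gemini)."""
    return f"You said: {prompt}"

def tts(text: str) -> bytes:
    """Stand-in for a text-to-speech provider (e.g. ElevenLabs)."""
    return text.encode()

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: transcribe, respond, speak."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)

print(handle_turn(b"..."))  # → b'You said: hello agent'
```

In practice, turn/voice detection (Fal, Silero, Krisp) decides *when* `handle_turn` fires, and the memory layer carries context between turns.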
Each integration is built on extensible base classes. For example, with BaseProcessor or VideoProcessorMixin, you can plug in custom computer-vision models like Ultralytics YOLO.
👉 Ready to dive in? Follow the installation guide to build your first Agent.
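The extensibility pattern can be sketched as follows. This is a simplified stand-in, assuming a base class with a per-frame `process` hook — `BaseProcessor` and `DetectionProcessor` here are illustrative only, not the actual Vision Agents classes; check the framework docs for the real signatures.

```python
# Illustrative sketch of the extensible-base-class pattern.
# These are stand-in classes, NOT the real Vision Agents API.
from abc import ABC, abstractmethod

class BaseProcessor(ABC):
    """Stand-in for the framework's processor base class."""
    @abstractmethod
    def process(self, frame):
        """Called once per video frame; returns processor output."""
        ...

class DetectionProcessor(BaseProcessor):
    """Custom processor that runs an object detector on each frame."""
    def __init__(self, model=None):
        # In a real app, `model` might be an Ultralytics YOLO model.
        self.model = model

    def process(self, frame):
        if self.model is None:
            return []  # no detector configured
        return self.model(frame)

# Usage with a fake detector standing in for YOLO:
fake_model = lambda frame: [{"label": "person", "conf": 0.9}]
proc = DetectionProcessor(fake_model)
print(proc.process("frame-bytes"))  # → [{'label': 'person', 'conf': 0.9}]
```

The point of the pattern: the framework calls `process` on whatever subclass you register, so swapping in a different vision model means writing one class, not touching the pipeline.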