- The gateway starts sending response chunks as soon as the model begins generating content
- Each response chunk contains a delta (increment) of the content
- The final chunk indicates the completion of the response
## Examples
You can enable streaming by setting the `stream` parameter to `true` in your inference request. The response is returned as a Server-Sent Events (SSE) stream, followed by a final `[DONE]` message. When using a client library, the client handles the SSE stream under the hood and returns a stream of chunk objects. See the API Reference for more details. You can also find a runnable example on GitHub.
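As a minimal sketch, the snippet below streams an inference over raw HTTP with Python's `requests` library and prints each chunk as it arrives. The endpoint path, function name, and input shape are illustrative assumptions; adapt them to your deployment.

```python
import json

import requests

# Assumed endpoint and request shape; adjust to your gateway deployment.
response = requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "my_function",  # hypothetical function name
        "input": {"messages": [{"role": "user", "content": "Hello!"}]},
        "stream": True,  # enable SSE streaming
    },
    stream=True,  # tell requests not to buffer the whole body
)

for line in response.iter_lines():
    if not line:
        continue  # skip SSE keep-alive blank lines
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break  # the final SSE message marks the end of the stream
    chunk = json.loads(payload)
    print(chunk)
```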
### Chat Functions
In chat functions, each chunk will typically contain a delta (increment) of the text content.
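For example, reassembling the full message means concatenating the deltas in arrival order. This is a sketch only: the exact chunk schema depends on your gateway version, and the `content` field name here is an assumption.

```python
# Hypothetical chunk payloads; assume each carries a text delta in `content`.
chunks = [
    {"content": "The capital "},
    {"content": "of Japan "},
    {"content": "is Tokyo."},
]

# Concatenate the deltas in arrival order to rebuild the full message.
full_text = "".join(chunk["content"] for chunk in chunks)
print(full_text)  # The capital of Japan is Tokyo.
```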
### JSON Functions

For JSON functions, each chunk contains a portion of the JSON string being generated. Note that the chunks may not be valid JSON on their own: you'll need to concatenate them to get the complete JSON response. The gateway doesn't return parsed or validated JSON objects when streaming.
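A minimal sketch of the client-side concatenation (the partial strings below are illustrative):

```python
import json

# Illustrative partial chunks: none is valid JSON on its own.
chunks = ['{"city": "To', 'kyo", "country"', ': "Japan"}']

raw = "".join(chunks)   # concatenate the raw string fragments
data = json.loads(raw)  # parse only once the stream is complete
print(data["city"])     # Tokyo
```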
## Technical Notes

- Token usage information is only available in the final chunk with content, before the `[DONE]` message (see the sketch after this list)
- Streaming may not be available with certain inference-time optimizations
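As a sketch of collecting usage from a stream (the `usage` field name and its shape are assumptions, not a documented schema), you can keep the last value seen while iterating over chunks:

```python
def extract_usage(chunks: list[dict]) -> dict | None:
    """Return token usage from the last chunk that carries it, if any."""
    usage = None
    for chunk in chunks:
        if "usage" in chunk:  # assumed field name on the final content chunk
            usage = chunk["usage"]
    return usage

# Example with made-up chunks:
print(extract_usage([
    {"content": "Hel"},
    {"content": "lo!", "usage": {"input_tokens": 10, "output_tokens": 3}},
]))  # {'input_tokens': 10, 'output_tokens': 3}
```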