The ability to run large language models (LLMs) directly in the browser has opened new possibilities for privacy-preserving, client-side AI applications. In this blog post, we’ll explore how to run DeepSeek Janus-Pro-1B, a compact multimodal model that handles both text-to-image generation and image understanding, entirely in the browser using WebGPU and Hugging Face’s Transformers.js library.
Why Browser-Based Inference?
- Privacy: Data never leaves the user’s device.
- Cost Efficiency: No server infrastructure required.
- Accessibility: Runs on any device with a modern browser and WebGPU support.
DeepSeek Janus-Pro-1B, designed for multimodal tasks like text-to-image generation, is now accessible via browser-based inference thanks to optimizations in Transformers.js and WebGPU acceleration.
Key Tools & Libraries
- Transformers.js: A JavaScript port of Hugging Face’s Transformers library, optimized for browser execution (a minimal usage sketch follows this list).
- WebGPU: A modern browser API for GPU access that succeeds WebGL and offers better performance for ML workloads.
- ONNX Runtime: Executes the exported ONNX model graphs; Transformers.js uses its WebAssembly and WebGPU backends under the hood.
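To give a feel for the Transformers.js API before diving into the full demo, here is a minimal, illustrative sketch of loading a generation pipeline on WebGPU. The model id and options below are assumptions chosen for illustration, not part of the Janus demo:

```js
// Minimal sketch: load a Transformers.js pipeline on WebGPU.
// The model id is illustrative; any compatible ONNX-exported model works.
import { pipeline } from "@huggingface/transformers";

// Create a text-generation pipeline on the GPU, with 4-bit weights
// to keep the download small.
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct", // example model id (assumption)
  { device: "webgpu", dtype: "q4" },
);

// Run a short prompt and log the generated continuation.
const output = await generator("WebGPU lets browsers", { max_new_tokens: 32 });
console.log(output[0].generated_text);
```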
Demo Code Walkthrough
The following example demonstrates how to load and run DeepSeek Janus-Pro-1B in a Web Worker for non-blocking inference. The full code is available in the GitHub repository.
```js
import {
  AutoProcessor,
  MultiModalityCausalLM,
  BaseStreamer,
  TextStreamer,
  InterruptableStoppingCriteria,
} from "@huggingface/transformers";

// Define constants
const IMAGE_GENERATION_COMMAND_PREFIX = "/imagine ";
const MAX_NEW_TEXT_TOKENS = 1024;

/**
 * Helper function to perform WebGPU feature detection
 */
let fp16_supported = false;
async function check() {
  try {
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
      throw new Error("WebGPU is not supported (no adapter found)");
    }
    fp16_supported = adapter.features.has("shader-f16");
    self.postMessage({
      status: "success",
      data: fp16_supported,
    });
  } catch (e) {
    self.postMessage({
      status: "error",
      data: e.toString(),
    });
  }
}

/**
 * This class uses the Singleton pattern to enable lazy-loading of the pipeline
 */
class ImageGenerationPipeline {
  static model_id = "onnx-community/Janus-Pro-1B-ONNX";

  static async getInstance(progress_callback = null) {
    this.processor ??= AutoProcessor.from_pretrained(this.model_id, {
      progress_callback,
    });

    this.model ??= MultiModalityCausalLM.from_pretrained(this.model_id, {
      dtype: fp16_supported
        ? {
            prepare_inputs_embeds: "q4",
            language_model: "q4f16",
            lm_head: "fp16",
            gen_head: "fp16",
            gen_img_embeds: "fp16",
            image_decode: "fp32",
          }
        : {
            prepare_inputs_embeds: "fp32",
            language_model: "q4",
            lm_head: "fp32",
            gen_head: "fp32",
            gen_img_embeds: "fp32",
            image_decode: "fp32",
          },
      device: {
        prepare_inputs_embeds: "wasm", // TODO use "webgpu" when bug is fixed
        language_model: "webgpu",
        lm_head: "webgpu",
        gen_head: "webgpu",
        gen_img_embeds: "webgpu",
        image_decode: "webgpu",
      },
      progress_callback,
    });

    return Promise.all([this.processor, this.model]);
  }
}

class ProgressStreamer extends BaseStreamer {
  constructor(total, on_progress) {
    super();
    this.total = total;
    this.on_progress = on_progress;

    this.count = null;
    this.start_time = null;
  }

  put(value) {
    if (this.count === null) {
      // Ignore the first batch of tokens (prompt)
      this.count = 0;
      this.start_time = performance.now();
      return;
    }

    const progress = ++this.count / this.total;

    this.on_progress({
      count: this.count,
      total: this.total,
      progress,
      time: performance.now() - this.start_time,
    });
  }

  end() {
    /* do nothing */
  }
}

const stopping_criteria = new InterruptableStoppingCriteria();

async function generate(messages) {
  // For this demo, we only respond to the last message
  const message = messages.at(-1);

  // Tell the main thread we are starting
  self.postMessage({ status: "start" });

  // Load the pipeline
  const [processor, model] = await ImageGenerationPipeline.getInstance();

  // Determine if the user wants to generate an image or text
  if (message.content.startsWith(IMAGE_GENERATION_COMMAND_PREFIX)) {
    const text = message.content.replace(IMAGE_GENERATION_COMMAND_PREFIX, "");

    const conversation = [
      {
        role: "<|User|>", // uses title case
        content: text,
      },
    ];

    const inputs = await processor(conversation, {
      chat_template: "text_to_image",
    });

    const callback_function = (output) => {
      self.postMessage({
        status: "image-update",
        ...output,
      });
    };

    const num_image_tokens = processor.num_image_tokens;
    const streamer = new ProgressStreamer(num_image_tokens, callback_function);

    const outputs = await model.generate_images({
      ...inputs,
      min_new_tokens: num_image_tokens,
      max_new_tokens: num_image_tokens,
      do_sample: true,
      streamer,
    });

    const blob = await outputs[0].toBlob();

    // Send the output back to the main thread
    self.postMessage({
      status: "image-update",
      blob,
    });
  } else {
    const inputs = await processor(
      message.image
        ? [
            {
              role: "<|User|>",
              content: "<image_placeholder>\n" + message.content,
              images: [message.image],
            },
          ]
        : [
            {
              role: "<|System|>",
              content:
                "You are a helpful assistant. Answer the user's questions in a concise manner.",
            },
            {
              role: "<|User|>",
              content: message.content,
            },
          ],
    );

    let startTime;
    let numTokens = 0;
    let tps;
    const token_callback_function = () => {
      startTime ??= performance.now();
      if (numTokens++ > 0) {
        tps = (numTokens / (performance.now() - startTime)) * 1000;
      }
    };
    const callback_function = (output) => {
      self.postMessage({
        status: "text-update",
        output,
        tps,
        numTokens,
      });
    };

    const streamer = new TextStreamer(processor.tokenizer, {
      skip_prompt: true,
      skip_special_tokens: true,
      callback_function,
      token_callback_function,
    });

    // Generate response
    const outputs = await model.generate({
      ...inputs,
      max_new_tokens: MAX_NEW_TEXT_TOKENS,
      do_sample: false,
      streamer,
      stopping_criteria,
    });
  }

  // Tell the main thread we are done
  self.postMessage({
    status: "complete",
  });
}

async function load() {
  self.postMessage({
    status: "loading",
    data: "Loading model...",
  });

  // Load the pipeline and save it for future use.
  const [processor, model] = await ImageGenerationPipeline.getInstance((x) => {
    // We also add a progress callback to the pipeline so that we can
    // track model loading.
    self.postMessage(x);
  });

  self.postMessage({ status: "ready" });
}

// Listen for messages from the main thread
self.addEventListener("message", async (e) => {
  const { type, data } = e.data;

  switch (type) {
    case "check":
      check();
      break;

    case "load":
      load();
      break;

    case "generate":
      stopping_criteria.reset();
      generate(data);
      break;

    case "interrupt":
      stopping_criteria.interrupt();
      break;

    case "reset":
      stopping_criteria.reset();
      break;
  }
});
```

Running the Demo
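The worker above only reacts to messages, so the hosting page drives it. Here is a minimal, hypothetical sketch of that main-thread side; the file name, element id, prompt, and message handling are assumptions based on the worker's postMessage calls, not code from the demo repository:

```js
// Hypothetical main-thread wiring for the worker shown above.
const worker = new Worker(new URL("./worker.js", import.meta.url), {
  type: "module",
});

worker.addEventListener("message", (e) => {
  const msg = e.data;
  switch (msg.status) {
    case "ready":
      // Model finished loading: request an image via the "/imagine " prefix.
      worker.postMessage({
        type: "generate",
        data: [{ role: "user", content: "/imagine a watercolor fox in a misty forest" }],
      });
      break;
    case "image-update":
      // The final message carries a Blob; intermediate ones carry progress info.
      if (msg.blob) {
        document.querySelector("#result").src = URL.createObjectURL(msg.blob); // assumed <img id="result">
      } else {
        console.log(`Image progress: ${Math.round((msg.progress ?? 0) * 100)}%`);
      }
      break;
    case "text-update":
      console.log(msg.output); // streamed text chunk for non-image prompts
      break;
    case "complete":
      console.log("Generation finished");
      break;
    default:
      console.log(msg); // model-loading / download-progress updates, etc.
  }
});

// Feature-detect WebGPU (sets fp16 support in the worker), then load the model.
worker.postMessage({ type: "check" });
worker.postMessage({ type: "load" });
```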
Check out the live demo here: DeepSeek Janus-Pro-1B Browser Demo.
Key Features of the Demo:
- Real-time progress updates during model loading and inference.
- WebGPU-accelerated generation (requires Chrome 113+ or Edge 113+).
- Full client-side execution—no data is sent to external servers.
Challenges & Optimizations
- Model Quantization: The language model weights ship as 4-bit quantized variants (q4 or q4f16), with fp16/fp32 kept for the generation and image-decoding heads, which shrinks the download and speeds up loading.
- Memory Management: Inference runs inside a Web Worker, so the UI thread never freezes.
- Browser Compatibility: WebGPU is still relatively new but critical for performance; feature-detect it and fall back where needed (see the sketch below).
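For that last point, a page can check for WebGPU up front and fall back to the WebAssembly backend that Transformers.js also supports. A minimal sketch (the fallback policy here is an assumption, not taken from the demo):

```js
// Pick an inference backend: prefer WebGPU, otherwise fall back to WASM.
async function pickDevice() {
  if (!("gpu" in navigator)) return "wasm";
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}

const device = await pickDevice();
console.log(`Running inference on: ${device}`);
// Pass the result as the `device` option to from_pretrained(...) or pipeline(...),
// as the worker code above does per-module.
```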
Conclusion
Running DeepSeek Janus-Pro-1B in the browser showcases the potential of client-side AI. With tools like Transformers.js and WebGPU, complex models can now operate efficiently in constrained environments while preserving user privacy.
Next Steps:
- Experiment with different prompts and model configurations.
- Explore fine-tuning the model for domain-specific tasks.
- Monitor WebGPU adoption to ensure broader compatibility.
For developers, this marks an exciting shift toward decentralized, user-centric AI applications. Dive into the example code and start building! 🚀