Analyze image files using the Gemini API

You can ask a Gemini model to analyze image files that you provide either inline (base64-encoded) or via URL. When you use Firebase AI Logic, you can make this request directly from your app.

With this capability, you can do things like:

  • Create captions or answer questions about images
  • Write a short story or a poem about an image
  • Detect objects in an image and return bounding box coordinates for them
  • Label or categorize a set of images by sentiment, style, or another characteristic



See other guides for additional options for working with images: Generate structured output, Multi-turn chat, Analyze images on-device, and Generate images.

Before you begin

Click your Gemini API provider to view provider-specific content and code on this page.

If you haven't already, complete the getting started guide, which describes how to set up your Firebase project, connect your app to Firebase, add the SDK, initialize the backend service for your chosen Gemini API provider, and create a GenerativeModel instance.

For testing and iterating on your prompts, we recommend using Google AI Studio.

Generate text from image files (base64-encoded)

Before trying this sample, complete the Before you begin section of this guide to set up your project and app.
In that section, you'll also click a button for your chosen Gemini API provider so that you see provider-specific content on this page.

You can ask a Gemini model to generate text by prompting with text and images—providing each input file's mimeType and the file itself. Find requirements and recommendations for input files later on this page.

Swift

You can call generateContent() to generate text from multimodal input of text and images.

Single file input

import FirebaseAILogic
import UIKit

// Initialize the Gemini Developer API backend service
let ai = FirebaseAI.firebaseAI(backend: .googleAI())

// Create a `GenerativeModel` instance with a model that supports your use case
let model = ai.generativeModel(modelName: "gemini-2.5-flash")

guard let image = UIImage(systemName: "bicycle") else { fatalError() }

// Provide a text prompt to include with the image
let prompt = "What's in this picture?"

// To generate text output, call generateContent and pass in the prompt
let response = try await model.generateContent(image, prompt)
print(response.text ?? "No text in response.")

Multiple file input

import FirebaseAILogic
import UIKit

// Initialize the Gemini Developer API backend service
let ai = FirebaseAI.firebaseAI(backend: .googleAI())

// Create a `GenerativeModel` instance with a model that supports your use case
let model = ai.generativeModel(modelName: "gemini-2.5-flash")

guard let image1 = UIImage(systemName: "car") else { fatalError() }
guard let image2 = UIImage(systemName: "car.2") else { fatalError() }

// Provide a text prompt to include with the images
let prompt = "What's different between these pictures?"

// To generate text output, call generateContent and pass in the prompt
let response = try await model.generateContent(image1, image2, prompt)
print(response.text ?? "No text in response.")

Kotlin

You can call generateContent() to generate text from multimodal input of text and images.

For Kotlin, the methods in this SDK are suspend functions and need to be called from a Coroutine scope.

Single file input

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
val model = Firebase.ai(backend = GenerativeBackend.googleAI())
    .generativeModel("gemini-2.5-flash")

// Loads an image from the app/res/drawable/ directory
val bitmap: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)

// Provide a prompt that includes the image specified above and text
val prompt = content {
  image(bitmap)
  text("What developer tool is this mascot from?")
}

// To generate text output, call generateContent with the prompt
val response = model.generateContent(prompt)
print(response.text)

Multiple file input

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
val model = Firebase.ai(backend = GenerativeBackend.googleAI())
    .generativeModel("gemini-2.5-flash")

// Load images from the app/res/drawable/ directory
val bitmap1: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky)
val bitmap2: Bitmap = BitmapFactory.decodeResource(resources, R.drawable.sparky_eats_pizza)

// Provide a prompt that includes the images specified above and text
val prompt = content {
  image(bitmap1)
  image(bitmap2)
  text("What is different between these pictures?")
}

// To generate text output, call generateContent with the prompt
val response = model.generateContent(prompt)
print(response.text)

Java

You can call generateContent() to generate text from multimodal input of text and images.

For Java, the methods in this SDK return a ListenableFuture.

Single file input

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
GenerativeModel ai = FirebaseAI.getInstance(GenerativeBackend.googleAI())
    .generativeModel("gemini-2.5-flash");

// Use the GenerativeModelFutures Java compatibility layer which offers
// support for ListenableFuture and Publisher APIs
GenerativeModelFutures model = GenerativeModelFutures.from(ai);

Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);

// Provide a prompt that includes the image specified above and text
Content content = new Content.Builder()
    .addImage(bitmap)
    .addText("What developer tool is this mascot from?")
    .build();

// To generate text output, call generateContent with the prompt
ListenableFuture<GenerateContentResponse> response = model.generateContent(content);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
  @Override
  public void onSuccess(GenerateContentResponse result) {
    String resultText = result.getText();
    System.out.println(resultText);
  }

  @Override
  public void onFailure(Throwable t) {
    t.printStackTrace();
  }
}, executor);

Multiple file input

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
GenerativeModel ai = FirebaseAI.getInstance(GenerativeBackend.googleAI())
    .generativeModel("gemini-2.5-flash");

// Use the GenerativeModelFutures Java compatibility layer which offers
// support for ListenableFuture and Publisher APIs
GenerativeModelFutures model = GenerativeModelFutures.from(ai);

Bitmap bitmap1 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky);
Bitmap bitmap2 = BitmapFactory.decodeResource(getResources(), R.drawable.sparky_eats_pizza);

// Provide a prompt that includes the images specified above and text
Content prompt = new Content.Builder()
    .addImage(bitmap1)
    .addImage(bitmap2)
    .addText("What's different between these pictures?")
    .build();

// To generate text output, call generateContent with the prompt
ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
  @Override
  public void onSuccess(GenerateContentResponse result) {
    String resultText = result.getText();
    System.out.println(resultText);
  }

  @Override
  public void onFailure(Throwable t) {
    t.printStackTrace();
  }
}, executor);

Web

You can call generateContent() to generate text from multimodal input of text and images.

Single file input

import { initializeApp } from "firebase/app";
import { getAI, getGenerativeModel, GoogleAIBackend } from "firebase/ai";

// TODO(developer) Replace the following with your app's Firebase configuration
// See: https://firebase.google.com/docs/web/learn-more#config-object
const firebaseConfig = {
  // ...
};

// Initialize FirebaseApp
const firebaseApp = initializeApp(firebaseConfig);

// Initialize the Gemini Developer API backend service
const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });

// Create a `GenerativeModel` instance with a model that supports your use case
const model = getGenerativeModel(ai, { model: "gemini-2.5-flash" });

// Converts a File object to a Part object.
async function fileToGenerativePart(file) {
  const base64EncodedDataPromise = new Promise((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result.split(',')[1]);
    reader.readAsDataURL(file);
  });
  return {
    inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
  };
}

async function run() {
  // Provide a text prompt to include with the image
  const prompt = "What do you see?";

  const fileInputEl = document.querySelector("input[type=file]");
  const imagePart = await fileToGenerativePart(fileInputEl.files[0]);

  // To generate text output, call generateContent with the text and image
  const result = await model.generateContent([prompt, imagePart]);

  const response = result.response;
  const text = response.text();
  console.log(text);
}

run();

Multiple file input

import { initializeApp } from "firebase/app";
import { getAI, getGenerativeModel, GoogleAIBackend } from "firebase/ai";

// TODO(developer) Replace the following with your app's Firebase configuration
// See: https://firebase.google.com/docs/web/learn-more#config-object
const firebaseConfig = {
  // ...
};

// Initialize FirebaseApp
const firebaseApp = initializeApp(firebaseConfig);

// Initialize the Gemini Developer API backend service
const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });

// Create a `GenerativeModel` instance with a model that supports your use case
const model = getGenerativeModel(ai, { model: "gemini-2.5-flash" });

// Converts a File object to a Part object.
async function fileToGenerativePart(file) {
  const base64EncodedDataPromise = new Promise((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result.split(',')[1]);
    reader.readAsDataURL(file);
  });
  return {
    inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
  };
}

async function run() {
  // Provide a text prompt to include with the images
  const prompt = "What's different between these pictures?";

  // Prepare images for input
  const fileInputEl = document.querySelector("input[type=file]");
  const imageParts = await Promise.all(
    [...fileInputEl.files].map(fileToGenerativePart)
  );

  // To generate text output, call generateContent with the text and images
  const result = await model.generateContent([prompt, ...imageParts]);

  const response = result.response;
  const text = response.text();
  console.log(text);
}

run();

Dart

You can call generateContent() to generate text from multimodal input of text and images.

Single file input

import 'dart:io';

import 'package:firebase_ai/firebase_ai.dart';
import 'package:firebase_core/firebase_core.dart';
import 'firebase_options.dart';

// Initialize FirebaseApp
await Firebase.initializeApp(
  options: DefaultFirebaseOptions.currentPlatform,
);

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
final model =
    FirebaseAI.googleAI().generativeModel(model: 'gemini-2.5-flash');

// Provide a text prompt to include with the image
final prompt = TextPart("What's in the picture?");

// Prepare images for input
final image = await File('image0.jpg').readAsBytes();
final imagePart = InlineDataPart('image/jpeg', image);

// To generate text output, call generateContent with the text and image
final response = await model.generateContent([
  Content.multi([prompt, imagePart])
]);
print(response.text);

Multiple file input

import 'dart:io';

import 'package:firebase_ai/firebase_ai.dart';
import 'package:firebase_core/firebase_core.dart';
import 'firebase_options.dart';

// Initialize FirebaseApp
await Firebase.initializeApp(
  options: DefaultFirebaseOptions.currentPlatform,
);

// Initialize the Gemini Developer API backend service
// Create a `GenerativeModel` instance with a model that supports your use case
final model =
    FirebaseAI.googleAI().generativeModel(model: 'gemini-2.5-flash');

final (firstImage, secondImage) = await (
  File('image0.jpg').readAsBytes(),
  File('image1.jpg').readAsBytes()
).wait;

// Provide a text prompt to include with the images
final prompt = TextPart("What's different between these pictures?");

// Prepare images for input
final imageParts = [
  InlineDataPart('image/jpeg', firstImage),
  InlineDataPart('image/jpeg', secondImage),
];

// To generate text output, call generateContent with the text and images
final response = await model.generateContent([
  Content.multi([prompt, ...imageParts])
]);
print(response.text);

Unity

You can call GenerateContentAsync() to generate text from multimodal input of text and images.

Single file input

using Firebase;
using Firebase.AI;

// Initialize the Gemini Developer API backend service
var ai = FirebaseAI.GetInstance(FirebaseAI.Backend.GoogleAI());

// Create a `GenerativeModel` instance with a model that supports your use case
var model = ai.GetGenerativeModel(modelName: "gemini-2.5-flash");

// Convert a Texture2D into InlineDataParts
var grayImage = ModelContent.InlineData("image/png",
    UnityEngine.ImageConversion.EncodeToPNG(UnityEngine.Texture2D.grayTexture));

// Provide a text prompt to include with the image
var prompt = ModelContent.Text("What's in this picture?");

// To generate text output, call GenerateContentAsync and pass in the prompt
var response = await model.GenerateContentAsync(new [] { grayImage, prompt });

UnityEngine.Debug.Log(response.Text ?? "No text in response.");

Multiple file input

using Firebase;
using Firebase.AI;

// Initialize the Gemini Developer API backend service
var ai = FirebaseAI.GetInstance(FirebaseAI.Backend.GoogleAI());

// Create a `GenerativeModel` instance with a model that supports your use case
var model = ai.GetGenerativeModel(modelName: "gemini-2.5-flash");

// Convert Texture2Ds into InlineDataParts
var blackImage = ModelContent.InlineData("image/png",
    UnityEngine.ImageConversion.EncodeToPNG(UnityEngine.Texture2D.blackTexture));
var whiteImage = ModelContent.InlineData("image/png",
    UnityEngine.ImageConversion.EncodeToPNG(UnityEngine.Texture2D.whiteTexture));

// Provide a text prompt to include with the images
var prompt = ModelContent.Text("What's different between these pictures?");

// To generate text output, call GenerateContentAsync and pass in the prompt
var response = await model.GenerateContentAsync(new [] { blackImage, whiteImage, prompt });

UnityEngine.Debug.Log(response.Text ?? "No text in response.");

Learn how to choose a model appropriate for your use case and app.

Stream the response

Before trying this sample, complete the Before you begin section of this guide to set up your project and app.
In that section, you'll also click a button for your chosen Gemini API provider so that you see provider-specific content on this page.

You can achieve faster interactions by not waiting for the entire result of the model's generation and instead using streaming to handle partial results. To stream the response, call generateContentStream.
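As an example, here is a minimal Kotlin sketch of streaming, assuming the same model and bitmap setup as in the Kotlin samples earlier on this page; the other SDKs offer an equivalent streaming call.

// Minimal streaming sketch (Kotlin), assuming `model` and `bitmap` are set up
// as in the Kotlin samples above. Call this from a coroutine scope.
val prompt = content {
  image(bitmap)
  text("What developer tool is this mascot from?")
}

// generateContentStream returns a flow of partial results; collect it to
// handle each chunk as it arrives
var fullResponse = ""
model.generateContentStream(prompt).collect { chunk ->
  chunk.text?.let {
    print(it)
    fullResponse += it
  }
}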



Requirements and recommendations for input image files

Note that a file provided as inline data is encoded to base64 in transit, which increases the size of the request. You get an HTTP 413 error if a request is too large.
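As a rough pre-flight check, note that base64 encoding inflates a file by about one third. The Kotlin sketch below estimates the encoded size of an image read from a hypothetical path; the 20 MB budget is an assumed figure for illustration only, not a documented limit.

import java.io.File

// Hypothetical pre-flight check: estimate the base64-encoded size of an image
// before sending it inline. The 20 MB budget is an assumed value for
// illustration; check the requirements page for the actual limits.
val imageBytes = File("image0.jpg").readBytes()
val estimatedEncodedBytes = imageBytes.size * 4L / 3L
val assumedRequestBudgetBytes = 20L * 1024 * 1024

if (estimatedEncodedBytes > assumedRequestBudgetBytes) {
  println("Image is likely too large to send inline; consider resizing or compressing it.")
}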

See the Supported input files and requirements page for detailed information about the following:

Supported image MIME types

Gemini multimodal models support the following image MIME types:

  • PNG - image/png
  • JPEG - image/jpeg
  • WebP - image/webp

Limits per request

There isn't a specific limit to the number of pixels in an image. However, larger images are scaled down and padded to fit a maximum resolution of 3072 x 3072 while preserving their original aspect ratio.
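To illustrate the arithmetic of that rule (the scaling is performed by the service, not by your app), the following Kotlin sketch computes the dimensions an oversized image would be scaled to:

// Illustration only: fit dimensions within 3072 x 3072 while preserving the
// aspect ratio, mirroring the scaling rule described above.
fun fittedSize(width: Int, height: Int, max: Int = 3072): Pair<Int, Int> {
  if (width <= max && height <= max) return width to height
  val scale = minOf(max.toDouble() / width, max.toDouble() / height)
  return (width * scale).toInt() to (height * scale).toInt()
}

// Example: a 6000 x 3000 image is scaled to 3072 x 1536
println(fittedSize(6000, 3000))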

Maximum files per request: 3,000 image files



What else can you do?

Try out other capabilities

Learn how to control content generation

You can also experiment with prompts and model configurations and even get a generated code snippet using Google AI Studio.

Learn more about the supported models

Learn about the models available for various use cases and their quotas and pricing.

