Platform For AI: Deploy an MLLM multimodal large language model application in EAS in 5 minutes

Last Updated: Jul 24, 2025

A multimodal large language model (MLLM) can process multiple data modalities simultaneously and integrate different types of information, such as text, images, and audio, to comprehensively understand complex contexts and tasks. MLLMs are suitable for scenarios that require cross-modal understanding and generation. With EAS, you can deploy an MLLM inference service with one click in 5 minutes and obtain large model inference capabilities. This topic describes how to deploy and call MLLM inference services in EAS with one click.

Background information

In recent years, large language models (LLMs) have achieved unprecedented results in language tasks. LLMs generate natural language text and demonstrate strong capabilities in many types of tasks, such as sentiment analysis, machine translation, and text summarization. However, these models are limited to text data and cannot process other forms of data, such as images, audio, or videos. Only models with multimodal comprehension can approach the cognitive abilities of the human brain.

Therefore, multimodal large language models (MLLMs) have sparked a research boom. With the widespread adoption of large models such as GPT-4o in the industry, MLLMs have become one of today's most popular applications. This new type of large language model can process multiple data modalities simultaneously and integrates different types of information, such as text, images, and audio, to comprehensively understand complex contexts and tasks.

To automate MLLM deployment, EAS provides a one-click solution. With EAS, you can deploy a popular MLLM inference service application in 5 minutes and obtain large model inference capabilities.

Prerequisites

Deploy a model service in EAS

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following key parameters. For information about other parameters, see Parameters for custom deployment in the console.

    Environment Context

    • Deployment Method: Select Image-based Deployment and turn on Enable Web App.

    • Image Configuration: Select Alibaba Cloud Image > chat-mllm-webui > chat-mllm-webui:1.0.

      Note: We recommend that you select the latest version of the image when you deploy the model service.

    • Command: After you select an image, the system automatically configures this parameter. You can modify the model_type parameter in the command to deploy a different model. For the supported model types, see the following table.

    Resource Information

    • Deploy Resources: Select a GPU type. We recommend the ml.gu7i.c16m60.1-gu30 instance type, which is the most cost-effective option.

    Models

    model_type                      Model link
    qwen_vl_chat                    qwen/Qwen-VL-Chat
    qwen_vl_chat_int4               qwen/Qwen-VL-Chat-Int4
    qwen_vl                         qwen/Qwen-VL
    glm4v_9b_chat                   ZhipuAI/glm-4v-9b
    llava1_5-7b-instruct            swift/llava-1___5-7b-hf
    llava1_5-13b-instruct           swift/llava-1___5-13b-hf
    internvl_chat_v1_5_int8         AI-ModelScope/InternVL-Chat-V1-5-int8
    internvl-chat-v1_5              AI-ModelScope/InternVL-Chat-V1-5
    mini-internvl-chat-2b-v1_5      OpenGVLab/Mini-InternVL-Chat-2B-V1-5
    mini-internvl-chat-4b-v1_5      OpenGVLab/Mini-InternVL-Chat-4B-V1-5
    internvl2-2b                    OpenGVLab/InternVL2-2B
    internvl2-4b                    OpenGVLab/InternVL2-4B
    internvl2-8b                    OpenGVLab/InternVL2-8B
    internvl2-26b                   OpenGVLab/InternVL2-26B
    internvl2-40b                   OpenGVLab/InternVL2-40B

  4. After you configure the parameters, click Deploy.

Call a service

Use the web UI to perform model inference

  1. On the Elastic Algorithm Service (EAS) page, click the name of the target service, click View Web App in the upper-right corner of the page, and then follow the instructions in the console to open the WebUI page.

  2. On the web UI page, perform model inference.

Call API operations to perform model inference

  1. Obtain the endpoint and token of the service.

    1. On the Elastic Algorithm Service (EAS) page, click the target service name. Then, in the Basic Information section, click View Invocation Information.

    2. In the Invocation Information pane, obtain the service Token and endpoint.

  2. Call API operations to perform model inference.

    PAI provides the following APIs:

    infer forward

    Obtain the inference result.

    Note

    The web UI and API calls cannot be used at the same time. If you have already used the web UI, first run the clear chat history code to clear the chat history, and then run the infer forward code to obtain the inference result.

    Replace the following key parameters in the sample code:

    • hosts: the service endpoint that you obtained in Step 1.

    • authorization: the service token that you obtained in Step 1.

    • prompt: the content of the question. A question in English is recommended.

    • image_path: the local path of the image.

    • The request supports the following input parameters:

      prompt (String, required, no default value)
        The content of the question.

      image (String in Base64 encoding format, default value: None)
        The image.

      chat_history (List[List], default value: [])
        The chat history.

      temperature (Float, default value: 0.2)
        The randomness of the model output. A larger value produces more random output, and a value of 0 produces a fixed output. Valid values: 0 to 1.

      top_p (Float, default value: 0.7)
        The nucleus sampling threshold. The output is sampled from the smallest set of candidate tokens whose cumulative probability reaches this value.

      max_output_tokens (Int, default value: 512)
        The maximum number of tokens in the output.

      use_stream (Bool, default value: True)
        Specifies whether to enable the streaming output mode. Valid values: True and False.

    • The output is an answer to the question and is of the STRING type.
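    For reference, a request body that matches the parameters above can be written as the following Python dict. The image value is a placeholder for a Base64 string, such as the one produced by the image_to_base64 helper in the sample code below.

    # Example infer_forward request body. Values other than prompt are the documented defaults,
    # except use_stream, which is set to False to receive the full answer in a single response.
    payload = {
        "prompt": "Please describe the image",
        "image": "<Base64-encoded image string>",
        "chat_history": [],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_output_tokens": 512,
        "use_stream": False,
    }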

    The following sample code provides an example on how to use Python to perform model inference:

    import requests
    import json
    import base64


    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        # Query the current chat history from the service.
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data


    def post_infer(prompt, image=None, chat_history=[], temperature=0.2, top_p=0.7,
                   max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers={}):
        # Build the request body for the infer_forward API.
        datas = {
            "prompt": prompt,
            "image": image,
            "chat_history": chat_history,
            "temperature": temperature,
            "top_p": top_p,
            "max_output_tokens": max_output_tokens,
            "use_stream": use_stream,
        }
        if use_stream:
            headers.update({'Accept': 'text/event-stream'})
            response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)
            if response.status_code != 200:
                print(f"Request failed with status code {response.status_code}")
                return
            process_stream(response)
        else:
            r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
            data = r.content.decode('utf-8')
            print(data)


    def image_to_base64(image_path):
        """
        Convert an image file to a Base64 encoded string.

        :param image_path: The file path to the image.
        :return: A Base64 encoded string representation of the image.
        """
        with open(image_path, "rb") as image_file:
            # Read the binary data of the image.
            image_data = image_file.read()
            # Encode the binary data to Base64.
            base64_encoded_data = base64.b64encode(image_data)
            # Convert bytes to string and remove any trailing newline characters.
            base64_string = base64_encoded_data.decode('utf-8').replace('\n', '')
        return base64_string


    def process_stream(response, previous_text=""):
        # Parse the streaming response. Segments are separated by the ##END marker.
        MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
        buffer = previous_text
        current_response = ""
        for chunk in response.iter_content(chunk_size=100):
            if chunk:
                text = chunk.decode('utf-8')
                current_response += text
                parts = current_response.split(MARK_RESPONSE_END)
                for part in parts[:-1]:
                    new_part = part[len(previous_text):]
                    if new_part:
                        print(new_part, end='', flush=True)
                    previous_text = part
                current_response = parts[-1]
        remaining_new_text = current_response[len(previous_text):]
        if remaining_new_text:
            print(remaining_new_text, end='', flush=True)


    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        # Get the chat history.
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        # The content of the question. A question in English is recommended.
        prompt = 'Please describe the image'
        # Replace path_to_your_image with the local path of the image.
        image_path = 'path_to_your_image'
        image_base_64 = image_to_base64(image_path)
        post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history,
                   use_stream=False, url=hosts, headers=head)
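    To receive the answer as a token stream instead, call the same function with use_stream=True. The process_stream helper in the sample code above then prints the output incrementally as it arrives:

    # Streaming variant of the call in the sample code above.
    post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history,
               use_stream=True, url=hosts, headers=head)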

    get chat history

    Obtain the chat history.

    • Replace the following key parameters in the sample code:

      • hosts: the service endpoint that you obtained in Step 1.

      • authorization: the service token that you obtained in Step 1.

    • No input parameters are required.

    • The following output parameter is returned:

      chat_history (List[List])
        The conversation history.
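    As an illustration only (the exact element format depends on the deployed model and is an assumption here, not part of the API documentation), each element of chat_history is typically a [question, answer] pair:

    # Hypothetical example of a returned chat_history value (format assumed for illustration).
    chat_history = [
        ["Please describe the image", "The image shows a cat sitting on a windowsill."],
    ]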

    The following sample code provides an example on how to use Python to obtain the chat history:

    import requests
    import json


    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data


    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        print(chat_history)

    clear chat history

    Clear the chat history.

    • Replace the following key parameters in the sample code:

      • hosts: the service endpoint that you obtained in Step 1.

      • authorization: the service token that you obtained in Step 1.

    • No input parameters are required.

    • The returned result is success.

    The following sample code provides an example on how to use Python to clear the chat history:

    import requests
    import json


    def post_clear_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data


    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        clear_info = post_clear_history(url=hosts, headers=head)
        print(clear_info)
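    As noted in the infer forward section, if you have already chatted through the web UI, clear the history before you call infer forward. The following sketch chains the sample functions above and assumes that post_clear_history and post_infer are defined as in the preceding examples:

    # Clear the server-side chat history, then send a fresh inference request.
    clear_info = post_clear_history(url=hosts, headers=head)
    print(clear_info)  # Expected to return success.

    post_infer(prompt='Please describe the image', chat_history=[],
               use_stream=False, url=hosts, headers=head)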