Platform For AI: Deploy an MLLM multimodal large language model application in EAS in 5 minutes

Last Updated: Jul 24, 2025

A multimodal large language model (MLLM) can process multiple data modalities simultaneously and integrate different types of information, such as text, images, and audio, to comprehensively understand complex contexts and tasks. MLLMs are suitable for scenarios that require cross-modal understanding and generation. With EAS, you can deploy an MLLM inference service with one click in 5 minutes and obtain large model inference capabilities. This topic describes how to deploy and call MLLM inference services in EAS with one click.

Background information

In recent years, large language models (LLMs) have achieved unprecedented results in language tasks. LLMs generate natural language text and demonstrate strong capabilities in many types of tasks, such as sentiment analysis, machine translation, and text summarization. However, these models are limited to text data and cannot process other forms of data, such as images, audio, or videos. Only models with multimodal comprehension can approach the cognitive abilities of the human brain.

Therefore, multimodal large language models (MLLMs) have sparked a research boom. With the widespread adoption of large models such as GPT-4o in the industry, MLLMs have become one of today's most popular applications. This new type of large language model can process multiple data modalities simultaneously and integrates different types of information, such as text, images, and audio, to comprehensively understand complex contexts and tasks.

To automate MLLM deployment, EAS provides a one-click solution. With EAS, you can deploy a popular MLLM inference service application in 5 minutes and obtain large model inference capabilities.

Prerequisites

Deploy a model service in EAS

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following key parameters. For information about other parameters, see Parameters for custom deployment in the console.

    Environment Context

    • Deployment Method: Select Image-based Deployment and turn on Enable Web App.

    • Image Configuration: Select Alibaba Cloud Image > chat-mllm-webui > chat-mllm-webui:1.0.

      Note: We recommend that you select the latest version of the image when you deploy the model service.

    • Command: After you select an image, the system automatically configures this parameter. You can modify the model_type parameter in the command to deploy a different model. For the supported model types, see the following table.

    Resource Information

    • Deploy Resources: Select a GPU type. We recommend the ml.gu7i.c16m60.1-gu30 instance type, which is the most cost-effective option.

    Models

    model_type                      Model link
    qwen_vl_chat                    qwen/Qwen-VL-Chat
    qwen_vl_chat_int4               qwen/Qwen-VL-Chat-Int4
    qwen_vl                         qwen/Qwen-VL
    glm4v_9b_chat                   ZhipuAI/glm-4v-9b
    llava1_5-7b-instruct            swift/llava-1___5-7b-hf
    llava1_5-13b-instruct           swift/llava-1___5-13b-hf
    internvl_chat_v1_5_int8         AI-ModelScope/InternVL-Chat-V1-5-int8
    internvl-chat-v1_5              AI-ModelScope/InternVL-Chat-V1-5
    mini-internvl-chat-2b-v1_5      OpenGVLab/Mini-InternVL-Chat-2B-V1-5
    mini-internvl-chat-4b-v1_5      OpenGVLab/Mini-InternVL-Chat-4B-V1-5
    internvl2-2b                    OpenGVLab/InternVL2-2B
    internvl2-4b                    OpenGVLab/InternVL2-4B
    internvl2-8b                    OpenGVLab/InternVL2-8B
    internvl2-26b                   OpenGVLab/InternVL2-26B
    internvl2-40b                   OpenGVLab/InternVL2-40B

  4. After you configure the parameters, click Deploy.

Call a service

Use the web UI to perform model inference

  1. On the Elastic Algorithm Service (EAS) page, click the name of the target service, click View Web App in the upper-right corner of the page, and then follow the instructions in the console to open the WebUI page.

  2. On the web UI page, perform model inference.

Call API operations to perform model inference

  1. Obtain the endpoint and token of the service.

    1. On the Elastic Algorithm Service (EAS) page, click the target service name. Then, in the Basic Information section, click View Invocation Information.

    2. In the Invocation Information pane, obtain the service Token and endpoint.

  2. Call API operations to perform model inference.

    PAI provides the following APIs:

    infer forward

    Obtain the inference result.

    Note

    The web UI and API calls cannot be used at the same time. If you have already used the web UI, first run the clear chat history code to clear the chat history, and then run the infer forward code to obtain the inference result.

    Replace the following key parameters in the sample code:

    • hosts: the service endpoint that you obtained in Step 1.

    • authorization: the service token that you obtained in Step 1.

    • prompt: the content of the question. A question in English is recommended.

    • image_path: the local path of the image.

    • The request supports the following input parameters:

      prompt (String, required, no default value)
        The content of the question.

      image (String in Base64 encoding format, default value: None)
        The image.

      chat_history (List[List], default value: [])
        The chat history.

      temperature (Float, default value: 0.2)
        The randomness of the model output. A larger value produces more random output, and a value of 0 produces a fixed output. Valid values: 0 to 1.

      top_p (Float, default value: 0.7)
        The nucleus sampling threshold. The output is sampled from the smallest set of candidate tokens whose cumulative probability reaches this value.

      max_output_tokens (Int, default value: 512)
        The maximum number of tokens in the output.

      use_stream (Bool, default value: True)
        Specifies whether to enable the streaming output mode. Valid values: True and False.

    • The output is an answer to the question and is of the STRING type.
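    For reference, a request body that matches the parameters above can be written as the following Python dict. The image value is a placeholder for a Base64 string, such as the one produced by the image_to_base64 helper in the sample code below.

    # Example infer_forward request body. Values other than prompt are the documented defaults,
    # except use_stream, which is set to False to receive the full answer in a single response.
    payload = {
        "prompt": "Please describe the image",
        "image": "<Base64-encoded image string>",
        "chat_history": [],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_output_tokens": 512,
        "use_stream": False,
    }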

    The following sample code provides an example on how to use Python to perform model inference:

    import requests
    import json
    import base64


    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        # Query the current chat history from the service.
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data


    def post_infer(prompt, image=None, chat_history=[], temperature=0.2, top_p=0.7,
                   max_output_tokens=512, use_stream=True, url='http://127.0.0.1:7860', headers={}):
        # Build the request body for the infer_forward API.
        datas = {
            "prompt": prompt,
            "image": image,
            "chat_history": chat_history,
            "temperature": temperature,
            "top_p": top_p,
            "max_output_tokens": max_output_tokens,
            "use_stream": use_stream,
        }
        if use_stream:
            headers.update({'Accept': 'text/event-stream'})
            response = requests.post(f'{url}/infer_forward', json=datas, headers=headers, stream=True, timeout=1500)
            if response.status_code != 200:
                print(f"Request failed with status code {response.status_code}")
                return
            process_stream(response)
        else:
            r = requests.post(f'{url}/infer_forward', json=datas, headers=headers, timeout=1500)
            data = r.content.decode('utf-8')
            print(data)


    def image_to_base64(image_path):
        """
        Convert an image file to a Base64 encoded string.

        :param image_path: The file path to the image.
        :return: A Base64 encoded string representation of the image.
        """
        with open(image_path, "rb") as image_file:
            # Read the binary data of the image.
            image_data = image_file.read()
            # Encode the binary data to Base64.
            base64_encoded_data = base64.b64encode(image_data)
            # Convert bytes to string and remove any trailing newline characters.
            base64_string = base64_encoded_data.decode('utf-8').replace('\n', '')
        return base64_string


    def process_stream(response, previous_text=""):
        # Parse the streaming response. Segments are separated by the ##END marker.
        MARK_RESPONSE_END = '##END'  # DO NOT CHANGE
        buffer = previous_text
        current_response = ""
        for chunk in response.iter_content(chunk_size=100):
            if chunk:
                text = chunk.decode('utf-8')
                current_response += text
                parts = current_response.split(MARK_RESPONSE_END)
                for part in parts[:-1]:
                    new_part = part[len(previous_text):]
                    if new_part:
                        print(new_part, end='', flush=True)
                    previous_text = part
                current_response = parts[-1]
        remaining_new_text = current_response[len(previous_text):]
        if remaining_new_text:
            print(remaining_new_text, end='', flush=True)


    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        # Get the chat history.
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        # The content of the question. A question in English is recommended.
        prompt = 'Please describe the image'
        # Replace path_to_your_image with the local path of the image.
        image_path = 'path_to_your_image'
        image_base_64 = image_to_base64(image_path)
        post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history,
                   use_stream=False, url=hosts, headers=head)
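    To receive the answer as a token stream instead, call the same function with use_stream=True. The process_stream helper in the sample code above then prints the output incrementally as it arrives:

    # Streaming variant of the call in the sample code above.
    post_infer(prompt=prompt, image=image_base_64, chat_history=chat_history,
               use_stream=True, url=hosts, headers=head)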

    get chat history

    Obtain the chat history.

    • Replace the following key parameters in the sample code:

      • hosts: the service endpoint that you obtained in Step 1.

      • authorization: the service token that you obtained in Step 1.

    • No input parameters are required.

    • The following output parameter is returned:

      chat_history (List[List])
        The conversation history.
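    As an illustration only (the exact element format depends on the deployed model and is an assumption here, not part of the API documentation), each element of chat_history is typically a [question, answer] pair:

    # Hypothetical example of a returned chat_history value (format assumed for illustration).
    chat_history = [
        ["Please describe the image", "The image shows a cat sitting on a windowsill."],
    ]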

    The following sample code provides an example on how to use Python to obtain the chat history:

    import requests
    import json


    def post_get_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/get_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data


    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        chat_history = json.loads(post_get_history(url=hosts, headers=head))['chat_history']
        print(chat_history)

    clear chat history

    Clear the chat history.

    • Replace the following key parameters in the sample code:

      • hosts: the service endpoint that you obtained in Step 1.

      • authorization: the service token that you obtained in Step 1.

    • No input parameters are required.

    • The returned result is success.

    The following sample code provides an example on how to use Python to clear the chat history:

    import requests
    import json


    def post_clear_history(url='http://127.0.0.1:7860', headers=None):
        r = requests.post(f'{url}/clear_history', headers=headers, timeout=1500)
        data = r.content.decode('utf-8')
        return data


    if __name__ == '__main__':
        # Replace <service_url> with the service endpoint.
        hosts = '<service_url>'
        # Replace <token> with the service token.
        head = {
            'Authorization': '<token>'
        }
        clear_info = post_clear_history(url=hosts, headers=head)
        print(clear_info)
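    As noted in the infer forward section, if you have already chatted through the web UI, clear the history before you call infer forward. The following sketch chains the sample functions above and assumes that post_clear_history and post_infer are defined as in the preceding examples:

    # Clear the server-side chat history, then send a fresh inference request.
    clear_info = post_clear_history(url=hosts, headers=head)
    print(clear_info)  # Expected to return success.

    post_infer(prompt='Please describe the image', chat_history=[],
               use_stream=False, url=hosts, headers=head)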