@@ -67,13 +67,13 @@ Legend:
6767<details class="admonition abstract" markdown="1">
6868<summary>Show more</summary>
6969
70- First start serving your model
70+ First start serving your model:
7171
7272``` bash
7373vllm serve NousResearch/Hermes-3-Llama-3.1-8B
7474```
7575
76- Then run the benchmarking script
76+ Then run the benchmarking script:
7777
7878``` bash
7979# download dataset
@@ -87,7 +87,7 @@ vllm bench serve \
8787 --num-prompts 10
8888```
8989
90- If successful, you will see the following output
90+ If successful, you will see the following output:
9191
9292``` text
9393============ Serving Benchmark Result ============
@@ -125,7 +125,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you
125125
126126``` bash
127127# start server
128- VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
128+ vllm serve meta-llama/Llama-3.1-8B-Instruct
129129```
130130
131131``` bash
@@ -167,7 +167,7 @@ vllm bench serve \
167167##### InstructCoder Benchmark with Speculative Decoding
168168
169169``` bash
170- VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
170+ vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
171171 --speculative-config $' {"method": "ngram",
172172 "num_speculative_tokens": 5, "prompt_lookup_max": 5,
173173 "prompt_lookup_min": 2}'
@@ -184,7 +184,7 @@ vllm bench serve \
184184##### Spec Bench Benchmark with Speculative Decoding
185185
186186``` bash
187- VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
187+ vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
188188 --speculative-config $' {"method": "ngram",
189189 "num_speculative_tokens": 5, "prompt_lookup_max": 5,
190190 "prompt_lookup_min": 2}'
@@ -366,7 +366,6 @@ Total num output tokens: 1280
366366
367367``` bash
368368VLLM_WORKER_MULTIPROC_METHOD=spawn \
369- VLLM_USE_V1=1 \
370369vllm bench throughput \
371370 --dataset-name=hf \
372371 --dataset-path=likaixin/InstructCoder \
@@ -781,6 +780,104 @@ This should be seen as an edge case, and if this behavior can be avoided by sett
781780
782781</details>
783782
783+ #### Embedding Benchmark
784+
785+ Benchmark the performance of embedding requests in vLLM.
786+
787+ <details class="admonition abstract" markdown="1">
788+ <summary>Show more</summary>
789+
790+ ##### Text Embeddings
791+
792+ Unlike generative models, which use the Completions API or Chat Completions API,
793+ embedding models use the Embeddings API: set `--backend openai-embeddings` and `--endpoint /v1/embeddings`.
794+
795+ You can use any text dataset to benchmark the model, such as ShareGPT.
796+
797+ Start the server:
798+
799+ ``` bash
800+ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
801+ ```
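Once the server is up, you can optionally sanity-check the endpoint before launching the benchmark. A minimal `curl` example against the default local address (the host, port, and sample input are assumptions; adjust them to your deployment):

```bash
# Send one request to the OpenAI-compatible Embeddings API
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "jinaai/jina-embeddings-v3",
        "input": "vLLM is a high-throughput inference engine."
      }'
```

The response should contain a `data` array with one embedding vector; if this call fails, the benchmark run will fail as well.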
802+
803+ Run the benchmark:
804+
805+ ``` bash
806+ # download dataset
807+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
808+ vllm bench serve \
809+ --model jinaai/jina-embeddings-v3 \
810+ --backend openai-embeddings \
811+ --endpoint /v1/embeddings \
812+ --dataset-name sharegpt \
813+ --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
814+ ```
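ShareGPT is only one option. If you would rather not download a dataset, the synthetic random dataset built into `vllm bench serve` also works for embedding models, since it only needs to generate input text. A sketch, assuming the random-dataset flags from the generative serving benchmark apply unchanged (verify with `vllm bench serve --help`):

```bash
vllm bench serve \
  --model jinaai/jina-embeddings-v3 \
  --backend openai-embeddings \
  --endpoint /v1/embeddings \
  --dataset-name random \
  --random-input-len 128 \
  --num-prompts 100
```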
815+
816+ ##### Multi-modal Embeddings
817+
818+ Unlike generative models, which use the Completions API or Chat Completions API,
819+ multi-modal embedding models also use the Embeddings API: set `--endpoint /v1/embeddings`. The backend to use depends on the model:
820+
821+ - CLIP: `--backend openai-embeddings-clip`
822+ - VLM2Vec: `--backend openai-embeddings-vlm2vec`
823+
824+ For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format.
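For reference, whichever backend you implement ultimately just POSTs a request to the Embeddings API. The sketch below shows the rough shape of a multi-modal request, modeled on vLLM's multi-modal embedding client examples; the image URL and instruction text are placeholders, and the exact fields accepted may vary across models and vLLM versions:

```bash
# Illustrative multi-modal embedding request: one image plus an instruction string
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "encoding_format": "float",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Represent the given image."}
          ]
        }]
      }'
```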
825+
826+ You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
827+ For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
828+
829+ Serve and benchmark CLIP:
830+
831+ ``` bash
832+ # Run this in another process
833+ vllm serve openai/clip-vit-base-patch32
834+
835+ # Run these one by one after the server is up
836+ # download dataset
837+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
838+ vllm bench serve \
839+ --model openai/clip-vit-base-patch32 \
840+ --backend openai-embeddings-clip \
841+ --endpoint /v1/embeddings \
842+ --dataset-name sharegpt \
843+ --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
844+
845+ vllm bench serve \
846+ --model openai/clip-vit-base-patch32 \
847+ --backend openai-embeddings-clip \
848+ --endpoint /v1/embeddings \
849+ --dataset-name hf \
850+ --dataset-path lmarena-ai/VisionArena-Chat
851+ ```
852+
853+ Serve and benchmark VLM2Vec:
854+
855+ ``` bash
856+ # Run this in another process
857+ vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
858+ --trust-remote-code \
859+ --chat-template examples/template_vlm2vec_phi3v.jinja
860+
861+ # Run these one by one after the server is up
862+ # download dataset
863+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
864+ vllm bench serve \
865+ --model TIGER-Lab/VLM2Vec-Full \
866+ --backend openai-embeddings-vlm2vec \
867+ --endpoint /v1/embeddings \
868+ --dataset-name sharegpt \
869+ --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
870+
871+ vllm bench serve \
872+ --model TIGER-Lab/VLM2Vec-Full \
873+ --backend openai-embeddings-vlm2vec \
874+ --endpoint /v1/embeddings \
875+ --dataset-name hf \
876+ --dataset-path lmarena-ai/VisionArena-Chat
877+ ```
878+
879+ </details>
880+
784881[](){ #performance-benchmarks }
785882
786883## Performance Benchmarks