@@ -67,13 +67,13 @@ Legend:
6767<details class="admonition abstract" markdown="1">
6868<summary>Show more</summary>
6969
70- First start serving your model
70+ First start serving your model:
7171
7272``` bash
7373vllm serve NousResearch/Hermes-3-Llama-3.1-8B
7474```
7575
76- Then run the benchmarking script
76+ Then run the benchmarking script:
7777
7878``` bash
7979# download dataset
@@ -87,7 +87,7 @@ vllm bench serve \
8787 --num-prompts 10
8888```
8989
90- If successful, you will see the following output
90+ If successful, you will see the following output:
9191
9292``` text
9393============ Serving Benchmark Result ============
@@ -125,7 +125,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you
125125
126126``` bash
127127# start server
128- VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
128+ vllm serve meta-llama/Llama-3.1-8B-Instruct
129129```
130130
131131``` bash
@@ -167,7 +167,7 @@ vllm bench serve \
167167##### InstructCoder Benchmark with Speculative Decoding
168168
169169``` bash
170- VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
170+ vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
171171 --speculative-config $' {"method": "ngram",
172172 "num_speculative_tokens": 5, "prompt_lookup_max": 5,
173173 "prompt_lookup_min": 2}'
@@ -184,7 +184,7 @@ vllm bench serve \
184184##### Spec Bench Benchmark with Speculative Decoding
185185
186186``` bash
187- VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
187+ vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
188188 --speculative-config $' {"method": "ngram",
189189 "num_speculative_tokens": 5, "prompt_lookup_max": 5,
190190 "prompt_lookup_min": 2}'
@@ -366,7 +366,6 @@ Total num output tokens: 1280
366366
367367``` bash
368368VLLM_WORKER_MULTIPROC_METHOD=spawn \
369- VLLM_USE_V1=1 \
370369vllm bench throughput \
371370 --dataset-name=hf \
372371 --dataset-path=likaixin/InstructCoder \
@@ -781,6 +780,104 @@ This should be seen as an edge case, and if this behavior can be avoided by sett
781780
782781</details>
783782
783+ #### Embedding Benchmark
784+
785+ Benchmark the performance of embedding requests in vLLM.
786+
787+ <details class="admonition abstract" markdown="1">
788+ <summary>Show more</summary>
789+
790+ ##### Text Embeddings
791+
792+ Unlike generative models, which use the Completions API or Chat Completions API,
793+ embedding models use the Embeddings API: set `--backend openai-embeddings` and `--endpoint /v1/embeddings`.
794+
795+ You can use any text dataset to benchmark the model, such as ShareGPT.
796+
797+ Start the server:
798+
799+ ``` bash
800+ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
801+ ```
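Once the server is up, you can optionally sanity-check the endpoint before launching the benchmark. A minimal `curl` example against the default local address (the host, port, and sample input are assumptions; adjust them to your deployment):

```bash
# Send one request to the OpenAI-compatible Embeddings API
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "jinaai/jina-embeddings-v3",
        "input": "vLLM is a high-throughput inference engine."
      }'
```

The response should contain a `data` array with one embedding vector; if this call fails, the benchmark run will fail as well.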
802+
803+ Run the benchmark:
804+
805+ ``` bash
806+ # download dataset
807+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
808+ vllm bench serve \
809+ --model jinaai/jina-embeddings-v3 \
810+ --backend openai-embeddings \
811+ --endpoint /v1/embeddings \
812+ --dataset-name sharegpt \
813+ --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
814+ ```
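ShareGPT is only one option. If you would rather not download a dataset, the synthetic random dataset built into `vllm bench serve` also works for embedding models, since it only needs to generate input text. A sketch, assuming the random-dataset flags from the generative serving benchmark apply unchanged (verify with `vllm bench serve --help`):

```bash
vllm bench serve \
  --model jinaai/jina-embeddings-v3 \
  --backend openai-embeddings \
  --endpoint /v1/embeddings \
  --dataset-name random \
  --random-input-len 128 \
  --num-prompts 100
```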
815+
816+ ##### Multi-modal Embeddings
817+
818+ Unlike generative models, which use the Completions API or Chat Completions API,
819+ multi-modal embedding models also use the Embeddings API: set `--endpoint /v1/embeddings`. The backend to use depends on the model:
820+
821+ - CLIP: `--backend openai-embeddings-clip`
822+ - VLM2Vec: `--backend openai-embeddings-vlm2vec`
823+
824+ For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format.
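For reference, whichever backend you implement ultimately just POSTs a request to the Embeddings API. The sketch below shows the rough shape of a multi-modal request, modeled on vLLM's multi-modal embedding client examples; the image URL and instruction text are placeholders, and the exact fields accepted may vary across models and vLLM versions:

```bash
# Illustrative multi-modal embedding request: one image plus an instruction string
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "encoding_format": "float",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Represent the given image."}
          ]
        }]
      }'
```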
825+
826+ You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
827+ For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
828+
829+ Serve and benchmark CLIP:
830+
831+ ``` bash
832+ # Run this in another process
833+ vllm serve openai/clip-vit-base-patch32
834+
835+ # Run these one by one after the server is up
836+ # download dataset
837+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
838+ vllm bench serve \
839+ --model openai/clip-vit-base-patch32 \
840+ --backend openai-embeddings-clip \
841+ --endpoint /v1/embeddings \
842+ --dataset-name sharegpt \
843+ --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
844+
845+ vllm bench serve \
846+ --model openai/clip-vit-base-patch32 \
847+ --backend openai-embeddings-clip \
848+ --endpoint /v1/embeddings \
849+ --dataset-name hf \
850+ --dataset-path lmarena-ai/VisionArena-Chat
851+ ```
852+
853+ Serve and benchmark VLM2Vec:
854+
855+ ``` bash
856+ # Run this in another process
857+ vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
858+ --trust-remote-code \
859+ --chat-template examples/template_vlm2vec_phi3v.jinja
860+
861+ # Run these one by one after the server is up
862+ # download dataset
863+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
864+ vllm bench serve \
865+ --model TIGER-Lab/VLM2Vec-Full \
866+ --backend openai-embeddings-vlm2vec \
867+ --endpoint /v1/embeddings \
868+ --dataset-name sharegpt \
869+ --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
870+
871+ vllm bench serve \
872+ --model TIGER-Lab/VLM2Vec-Full \
873+ --backend openai-embeddings-vlm2vec \
874+ --endpoint /v1/embeddings \
875+ --dataset-name hf \
876+ --dataset-path lmarena-ai/VisionArena-Chat
877+ ```
878+
879+ </details>
880+
784881[](){ #performance-benchmarks }
785882
786883## Performance Benchmarks