Commit 0b520cd

jsondai authored and copybara-github committed
chore: GenAI Eval: revise Eval SDK documentation and remove inconsistency
PiperOrigin-RevId: 743684708
1 parent 8ac7550 commit 0b520cd

File tree: 4 files changed, +46 -59 lines changed

vertexai/evaluation/eval_task.py

Lines changed: 14 additions & 11 deletions

@@ -60,12 +60,13 @@
 class EvalTask:
   """A class representing an EvalTask.
 
-  An Evaluation Tasks is defined to measure the model's ability to perform a
-  certain task in response to specific prompts or inputs. Evaluation tasks must
-  contain an evaluation dataset, and a list of metrics to evaluate. Evaluation
-  tasks help developers compare prompt templates, track experiments, compare
-  models and their settings, and assess the quality of the model's generated
-  text.
+  An evaluation task assesses the ability of a Gen AI model, agent or
+  application to perform a specific task in response to prompts.
+  Each evaluation task includes an evaluation dataset, which can be a set of
+  test cases and a set of metrics for assessment. These tasks provide the
+  framework for running evaluations in a standardized and repeatable way,
+  allowing for comparative assessment with varying run-specific parameters.
+
 
   Dataset Details:
 
@@ -74,6 +75,8 @@ class EvalTask:
       * reference_column_name: "reference"
       * response_column_name: "response"
       * baseline_model_response_column_name: "baseline_model_response"
+      * rubrics_column_name: "rubrics"
+
 
   Requirement for different use cases:
     * Bring-your-own-response (BYOR): You already have the data that you
@@ -86,14 +89,14 @@ class EvalTask:
         `baseline_model_response` column is present while the
         corresponding model is specified, an error will be raised.
 
-    * Perform model inference without a prompt template: You have a dataset
-      containing the input prompts to the model and want to perform model
+    * Perform model/agent inference without a prompt template: You have a dataset
+      containing the input prompts to the model/agent and want to perform
       inference before evaluation. A column named `prompt` is required
-      in the evaluation dataset and is used directly as input to the model.
+      in the evaluation dataset and is used directly as input to the model/agent.
 
-    * Perform model inference with a prompt template: You have a dataset
+    * Perform model/agent inference with a prompt template: You have a dataset
       containing the input variables to the prompt template and want to
-      assemble the prompts for model inference. Evaluation dataset
+      assemble the prompts for inference. Evaluation dataset
      must contain column names corresponding to the variable names in
      the prompt template. For example, if prompt template is
      "Instruction: {instruction}, context: {context}", the dataset must

vertexai/evaluation/metrics/_base.py

Lines changed: 2 additions & 2 deletions

@@ -43,8 +43,8 @@ class _ModelBasedMetric(_Metric):
   """A Model-based Metric.
 
   An evaluation metric that evaluates generative AI model responses with
-  another ML model (eg. Gemini) as a rater. It can be for a single model,
-  or two models.
+  another generative model as a judge. This metric can be used to evaluate a
+  single model, or two models side-by-side.
 
   For more details on when to use model-based metrics, see
   [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
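
To make the "generative model as a judge" wording concrete, here is a minimal sketch of a single-model (pointwise) model-based metric. `PointwiseMetric` and the `"fluency"` example template name are assumptions about the public SDK surface; only `MetricPromptTemplateExamples.get_prompt_template` appears elsewhere in this commit, and the data is illustrative.

```
import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PointwiseMetric

# Assumes vertexai.init(project=..., location=...) has already been called.
# A judge model scores each existing response against the "fluency" rubric;
# no candidate-model inference happens in this run.
fluency = PointwiseMetric(
    metric="fluency",
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template("fluency"),
)

eval_dataset = pd.DataFrame({
    "prompt": ["Describe what a model-based metric is."],
    "response": ["A model-based metric uses another generative model as a judge."],
})

result = EvalTask(dataset=eval_dataset, metrics=[fluency]).evaluate()
```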

vertexai/preview/evaluation/eval_task.py

Lines changed: 20 additions & 19 deletions

@@ -68,12 +68,13 @@
 class EvalTask:
   """A class representing an EvalTask.
 
-  An Evaluation Tasks is defined to measure the model's ability to perform a
-  certain task in response to specific prompts or inputs. Evaluation tasks must
-  contain an evaluation dataset, and a list of metrics to evaluate. Evaluation
-  tasks help developers compare prompt templates, track experiments, compare
-  models and their settings, and assess the quality of the model's generated
-  text.
+  An evaluation task assesses the ability of a Gen AI model, agent or
+  application to perform a specific task in response to prompts.
+  Each evaluation task includes an evaluation dataset, which can be a set of
+  test cases and a set of metrics for assessment. These tasks provide the
+  framework for running evaluations in a standardized and repeatable way,
+  allowing for comparative assessment with varying run-specific parameters.
+
 
   Dataset Details:
 
@@ -82,6 +83,8 @@ class EvalTask:
       * reference_column_name: "reference"
       * response_column_name: "response"
       * baseline_model_response_column_name: "baseline_model_response"
+      * rubrics_column_name: "rubrics"
+
 
   Requirement for different use cases:
     * Bring-your-own-response (BYOR): You already have the data that you
@@ -94,14 +97,14 @@ class EvalTask:
         `baseline_model_response` column is present while the
         corresponding model is specified, an error will be raised.
 
-    * Perform model inference without a prompt template: You have a dataset
-      containing the input prompts to the model and want to perform model
+    * Perform model/agent inference without a prompt template: You have a dataset
+      containing the input prompts to the model/agent and want to perform
      inference before evaluation. A column named `prompt` is required
-      in the evaluation dataset and is used directly as input to the model.
+      in the evaluation dataset and is used directly as input to the model/agent.
 
-    * Perform model inference with a prompt template: You have a dataset
+    * Perform model/agent inference with a prompt template: You have a dataset
      containing the input variables to the prompt template and want to
-      assemble the prompts for model inference. Evaluation dataset
+      assemble the prompts for inference. Evaluation dataset
      must contain column names corresponding to the variable names in
      the prompt template. For example, if prompt template is
      "Instruction: {instruction}, context: {context}", the dataset must
@@ -111,9 +114,7 @@ class EvalTask:
 
   The supported metrics descriptions, rating rubrics, and the required
   input variables can be found on the Vertex AI public documentation page.
-  [Evaluation methods and metrics](
-  https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval
-  ).
+  [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
 
   Usage Examples:
 
@@ -143,7 +144,7 @@ class EvalTask:
   ```
 
   2. To perform evaluation with Gemini model inference, specify the `model`
-  parameter with a GenerativeModel instance. The input column name to the
+  parameter with a `GenerativeModel` instance. The input column name to the
   model is `prompt` and must be present in the dataset.
 
   ```
@@ -209,8 +210,8 @@ def custom_model_fn(input: str) -> str:
   ```
 
   5. To perform pairwise metric evaluation with model inference step, specify
-  the `baseline_model` input to a PairwiseMetric instance and the candidate
-  `model` input to the EvalTask.evaluate() function. The input column name
+  the `baseline_model` input to a `PairwiseMetric` instance and the candidate
+  `model` input to the `EvalTask.evaluate()` function. The input column name
   to both models is `prompt` and must be present in the dataset.
 
   ```
@@ -221,7 +222,7 @@ def custom_model_fn(input: str) -> str:
       metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
           "pairwise_groundedness"
       ),
-      baseline_model=baseline_model
+      baseline_model=baseline_model,
   )
   eval_dataset = pd.DataFrame({
       "prompt" : [...],
@@ -232,7 +233,7 @@ def custom_model_fn(input: str) -> str:
      experiment="my-pairwise-experiment",
  ).evaluate(
      model=candidate_model,
-      experiment_run_name="gemini-pairwise-eval-run"
+      experiment_run_name="gemini-pairwise-eval-run",
  )
   ```
   """

vertexai/preview/evaluation/metrics/_base.py

Lines changed: 10 additions & 27 deletions

@@ -53,8 +53,8 @@ class _ModelBasedMetric(_Metric):
   """A Model-based Metric.
 
   An evaluation metric that evaluates generative AI model responses with
-  another ML model (eg. Gemini) as a rater. It can be for a single model,
-  or two models.
+  another generative model as a judge. This metric can be used to evaluate a
+  single model, or two models side-by-side.
 
   For more details on when to use model-based metrics, see
   [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
@@ -148,38 +148,21 @@ def __init__(
 
 
 class RubricGenerationConfig:
-  """The rubric generation config.
-
-  Attributes:
-    prompt_template: The prompt template for rubric generation.
-    model: The model to use for rubric generation.
-    parsing_fn: The function to parse the rubric generation response.
-  """
+  """The rubric generation config."""
 
   def __init__(
       self,
       prompt_template: str,
       model: Optional[_ModelType] = None,
       parsing_fn: Optional[Callable[[str], List[str]]] = None,
   ):
-    """Initializes the rubric generation config."""
+    """Initializes the rubric generation config.
+
+    Args:
+      prompt_template: The prompt template for rubric generation.
+      model: The model to use for rubric generation.
+      parsing_fn: The function to parse the rubric generation response.
+    """
     self.prompt_template = prompt_template
     self.model = model
     self.parsing_fn = parsing_fn
-
-
-def make_metric(
-    name: str, metric_function: Callable[[Dict[str, Any]], Dict[str, Any]]
-) -> CustomMetric:
-  """Makes a custom metric.
-
-  Args:
-    name: The name of the metric
-    metric_function: The evaluation function. Must use the dataset row/instance
-      as the metric_function input. Returns per-instance metric result as a
-      dictionary. The metric score must mapped to the CustomMetric.name as key.
-
-  Returns:
-    A CustomMetric instance, can be passed to evaluate() function.
-  """
-  return CustomMetric(name, metric_function)
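
With the `make_metric` helper removed here, the contract its docstring described is met by constructing `CustomMetric` directly. A minimal sketch, assuming `CustomMetric` remains importable from `vertexai.preview.evaluation`; the metric name and scoring function are illustrative only.

```
from vertexai.preview.evaluation import CustomMetric

def exact_match_fn(instance: dict) -> dict:
    """Scores one dataset row; the returned key must match CustomMetric.name."""
    score = 1.0 if instance["response"].strip() == instance["reference"].strip() else 0.0
    return {"my_exact_match": score}

# Per-instance custom metric; it can be passed in an EvalTask's `metrics` list.
my_exact_match = CustomMetric(name="my_exact_match", metric_function=exact_match_fn)
```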
