Loop

Loop is an AI assistant in Braintrust playgrounds, experiments, datasets, logs, and the BTQL sandbox.

In playgrounds, it helps you optimize and generate prompts, datasets, and evals. On the experiments page, it helps you read and interpret a project's experiments. In datasets, it helps you generate and edit datapoint rows at scale. In logs, it helps you find analytical insights about your project. In the BTQL sandbox, it helps you write and debug BTQL queries.

Loop

Loop is in public beta and is off by default. To turn it on, flip the feature flag in your settings. If you are on a hybrid deployment, Loop is available starting with v0.0.74.

Select a model

Loop uses the AI models available in your Braintrust account via the Braintrust API Proxy. We currently support the following models:

  • claude-4-sonnet
  • claude-4.1-opus
  • gpt-5
  • gpt-4.1
  • o3
  • o4-mini
  • claude-3-5-sonnet

To choose a model, navigate to the gear icon in the Loop chat window and select from the list of available models.

Available tools

Loop currently has the following tools. Tool availability changes based on the page you are viewing:

  • Search docs: Semantically search the Braintrust documentation site to find relevant information
  • Get summarized results: Fetch a summary of the current page's contents
  • Get detailed results: Retrieve detailed data from the current page (evaluation results, dataset rows, and more)
  • Edit prompt: Generate and modify prompts in the playground
  • Run eval: Execute evaluations in the playground
  • Edit data: Generate and modify datasets
  • Get scorers: Get all available scorers in the project
  • Edit scorers: Edit scorer selection in the playground
  • Create code scorer: Create or edit a code-based scorer
  • Create LLM judge scorer: Create or edit an LLM judge scorer
  • BTQL query: Generate and run a BTQL query against project logs to conduct analysis
  • Run BTQL: Generate and run a sandbox BTQL query against all data sources
  • Get data source: Prompt you to select a data source for a BTQL query
  • Infer schema: Inspect project logs and build an understanding of the shape of the data
  • Continue execution: Resume tasks after Loop has run out of iterations

You can remove any of these tools from your Loop workflow by selecting the gear icon and deselecting a tool from the available list.

Generate and optimize prompts

Loop can help you generate a prompt from scratch. To do so, make sure you have an empty task open, then ask Loop to generate one.

Generate prompt

If you have existing prompts, you can optimize them using Loop.

To optimize a prompt, ask Loop in the chat window, or select the Loop icon in the top bar of any existing task. From there, you can add the prompt to your chat, or quick optimize.

After Loop provides a suggested optimization, you can review and accept the suggestion or keep iterating.

Generate and optimize datasets

If no dataset exists, Loop can create one automatically. You must have a task defined for Loop to generate a dataset tailored to your evaluation.

Generate dataset

You can review the dataset and further refine it as needed.

After you run your playground, you can also ask Loop to optimize your dataset. The agent will suggest areas for optimization based on an analysis of your current dataset.

Optimize dataset

Loop can also modify datasets to a specific shape you define, and generate synthetic datasets based on existing patterns from your playgrounds, logs, experiments, and datasets.

Generate dataset from logs

Analyze project logs

Loop can infer the shape of your project's log data and run arbitrary queries to answer questions about it. Use this ability to find analytical insights, or combine it with Loop's other abilities.

For analytical insights, you can ask things like "what are the most common errors?", "what are the most common inputs from users?", and "what user retention trends do you see?", and Loop will gather the necessary data from your logs to answer your question.
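
Under the hood, a question like "what are the most common errors?" typically maps to an aggregation over your logs. As a rough, illustrative sketch of the kind of BTQL Loop might run (proj_789 is a placeholder, and the exact clause names and functions may differ; see the BTQL reference):

  dimensions: error
  measures: count(1) as occurrences
  from: project_logs('proj_789')
  sort: occurrences desc
  limit: 10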

Optimize eval

To use this in conjunction with Loop's other abilities, you might navigate to the dataset page and ask Loop, "Can you find the most common errors users face and generate dataset rows based on the findings? Follow the formatting of existing rows you see in this dataset." Loop will gather the necessary context from your logs and generate the dataset rows based on its findings.

Optimize eval

Write and debug BTQL queries

In the BTQL sandbox, Loop can:

  • Generate BTQL queries from natural language descriptions
  • Fix syntax, binder, and runtime errors
  • Explain query results and suggest follow-up analyses

BTQL sandbox

Specify a data source

When asking Loop to write or modify BTQL queries, you can specify the data source in several ways:

Explicitly specify entity type and ID

"Write a query to find errors in the experiment exp_123" "Show me the rows from dataset dataset_456 with the tag foo" "Analyze recent logs in proj_789"

Let Loop prompt you for a data source

If you don't specify a data source, Loop will ask you to select one from the available options in your workspace.

"Find the most common errors in the last week" "Show me experiments with high factuality scores"

BTQL sandbox

Reference the current query's data source

When you have an existing query in the sandbox, you can refer to it implicitly.

"Modify this query to include error rates" "Add a filter for the last 7 days" "Group this data by model"

Loop understands the context of your current query and will try to use the same data source unless you specify otherwise.

Write queries from scratch

Loop can create BTQL queries based on your natural language requests. Describe what data you want to analyze, and Loop will generate the appropriate query.

"Show me the most recent errors from the last 24 hours" "Find experiment rows with factuality scores above 0.8" "Aggregate token usage by model for this month"

Modify queries

Loop can rewrite existing queries to better match your analytical needs:

"Modify this to include error rates" "Add a filter for the last 7 days" "Group this by model and show average scores" "Convert this to show percentiles instead of averages" "Filter out rows where scores.Factuality is null"

BTQL query modification

Debug and fix errors

Loop can help you resolve various types of errors that occur when writing and running BTQL queries.

Fix with Loop

Parser errors

These occur when BTQL can't parse your query due to syntax issues:

  • Missing quotes around string literals
  • Unmatched parentheses or brackets
  • Invalid operators or keywords
  • Malformed expressions

Parser errors appear as you type and provide specific feedback about invalid syntax. Hovering on the red underline will show a popup with the error and a Fix with Loop button.
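
For example, an unquoted string literal is a common parser error (illustrative):

  filter: metadata.model = gpt-4.1

Fix with Loop would suggest quoting the literal:

  filter: metadata.model = 'gpt-4.1'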

Binder errors

These occur during validation when BTQL checks your query against the data schema:

  • References to non-existent fields (for example, metadata.nonexistent_field)
  • Type mismatches in comparisons
  • Invalid field access patterns

Binder errors appear as you type and provide specific feedback about which fields or operations are invalid. Hovering on the red underline will show a popup with the error and a Fix with Loop button.
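
For example, the following query references a field that does not exist in the schema (an illustrative sketch, reusing the placeholder from the list above):

  select: metadata.nonexistent_field
  from: project_logs('proj_789')

Here, Loop can compare the reference against the inferred schema and suggest the closest valid field, such as metadata.model.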

Runtime errors

These occur when executing your query against the actual data:

  • Data source not found or inaccessible
  • Query timeout due to complexity or data size
  • Permission or access control issues
  • Database connection problems

Runtime errors are displayed in the results panel after you run a query, along with a Fix with Loop button.

Loop analyzes the specific error type and context to provide targeted fixes, whether it's correcting syntax, suggesting the right field names, or helping optimize query performance.

Search documentation

If you need help using Braintrust or understanding concepts, Loop will semantically search the documentation to answer your questions.

Search docs with loop

Generate and edit scorers

If no scorers exist, Loop can create one for you. You must have a dataset and a task for Loop to generate a scorer specific to your use case. The agent begins by checking what data you have and what scorers already exist, then fetches sample results to understand the data structure.

Create new scorer

If you select Accept, the new scorer will be added to the playground.

Loop can also help you improve and edit existing scorers.

Edit existing scorer

You can create or edit scorers from experiment, dataset, or logs pages, and Loop will gather context from the resources on the page.

Generate scorer from logs

Tune scorers based on target classification

Loop can take manually labeled target classifications from evaluations in the playground and adjust the scorer's classification behavior to match.

Select the rows where the scorers did not perform as expected, then select Tune scorer.

tune scorer - step 1

Select the desired classification, provide optional additional instructions, and submit to Loop to tune the scorer. Loop will adjust the scorer based on the provided context.

tune scorer - step 2

Run and assess evals

After your tasks, dataset, and scorers are set up, Loop can run an evaluation for you, analyze it, and suggest further improvements.

Optimize eval

Analyze and interpret your experiments

Loop can read the results of your experiments, summarize them, and help you discover new insights.

Optimize eval

Settings

By default, Loop asks you for confirmation before executing certain tool calls, like running an evaluation. If you'd like Loop to freely create and edit resources and run evaluations, turn on auto-accept in the Settings dropdown menu.

Model allowlist

On the Settings page, administrators can customize which models are available to be used in Loop for the organization.

Loop model allowlist
