
Braintrust Query Language (BTQL)

Braintrust Query Language (BTQL) is a precise, SQL-like syntax for querying Braintrust experiments, logs, and datasets. Use BTQL to better analyze and understand your data.

Why use BTQL?

BTQL gives you precise control over your AI application data. You can:

  • Filter and search for relevant logs and experiments
  • Create consistent, reusable queries for monitoring
  • Build automated reporting and analysis pipelines
  • Write complex queries to analyze model performance

BTQL in Braintrust

Use BTQL when filtering logs and experiments, in the BTQL sandbox, and programmatically through the Braintrust API.

Filter logs and experiments

Use BTQL to filter logs and experiments based on specific criteria, such as tags, metadata, or any other field. Note that filtering in logs and experiments supports only filter clauses.
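For example, to show only logs tagged "triage" that came from production, you might enter a filter expression like this (a sketch; metadata.is_production assumes you log such a metadata field, as in the filter examples later on this page):

tags includes "triage" and metadata.is_production = true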

At the top of your experiment or log view, click Filter to open the filter editor, then click BTQL to switch to BTQL mode.

BTQL filter editor

BTQL sandbox

To test BTQL with autocomplete, validation, and a table of results, use the BTQL sandbox in the dashboard. In your project, click BTQL sandbox at the bottom of the sidebar.

BTQL sandbox

API access

Access BTQL programmatically with the Braintrust API:

curl -X POST https://api.braintrust.dev/btql \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"query": "select: * | from: project_logs('"'<YOUR_PROJECT_ID>'"') | filter: tags includes '"'triage'"'"}'

The API accepts these parameters:

  • query (required): your BTQL query string
  • fmt: response format (json or parquet, defaults to json)
  • tz_offset: timezone offset in minutes for time-based operations
  • audit_log: include audit log data

For correct day boundaries, set tz_offset to match your timezone. For example, use 480 for US Pacific Standard Time.
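For example, here is a sketch of a request that sets the optional parameters, formatting results as JSON with day boundaries aligned to US Pacific Standard Time:

curl -X POST https://api.braintrust.dev/btql \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"query": "select: * | from: project_logs('"'<YOUR_PROJECT_ID>'"')", "fmt": "json", "tz_offset": 480}'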

Query structure

BTQL queries follow a familiar SQL-like structure that lets you define what data you want, how you want it returned, and how to analyze it.

This example retrieves logs from a project where the Factuality score is greater than 0.8, randomly samples 25% of them, sorts by created date descending, and limits the result to 100 records.

select: *                           -- Fields to retrieve
from: project_logs('<PROJECT_ID>')  -- Data source (identifier or function call)
filter: scores.Factuality > 0.8     -- Filter conditions
sample: 25%                         -- Random sampling (optional)
sort: created desc                  -- Sort order
limit: 100                          -- Result size limit
cursor: '<CURSOR>'                  -- Pagination token
  • select: choose which fields to retrieve
  • from: specify the data source; this can be an identifier (like project_logs) or a function call (like experiment("id")). An optional designator controls the shape of the returned data: spans, traces, or summary (defaults to spans)
  • filter: define conditions to filter the data
  • sample: randomly sample a subset of the filtered data (rate or count-based)
  • sort: set the order of results (asc or desc)
  • limit and cursor: control result size and enable pagination

Retrieve records

When retrieving records with BTQL, you can use either select or dimensions and measures. Most features work with both methods, but you must use dimensions and measures if you want to apply aggregate functions to your results.

select

select in BTQL works like the SELECT clause in SQL. You can select specific fields, compute values, or use * to retrieve every field.

-- Get specific fields
select:
  metadata.model as model,
  scores.Factuality as score,
  created as timestamp
from: project_logs('my-project-id')

BTQL lets you transform data directly in the select clause. This query returns metadata.model, whether metrics.tokens exceeds 1000, and a quality label of "high" or "low" depending on whether the Factuality score exceeds 0.8.

select:
  -- Simple field access
  metadata.model,

  -- Computed values
  metrics.tokens > 1000 as is_long_response,

  -- Conditional logic
  (scores.Factuality > 0.8 ? "high" : "low") as quality
from: project_logs('my-project-id')

You can also use functions in the select clause to transform values and create meaningful aliases for your results. This query extracts the day and hour each log was created, along with the Factuality score rounded to two decimal places.

select:
  -- Date/time functions
  day(created) as date,
  hour(created) as hour,

  -- Numeric calculations
  round(scores.Factuality, 2) as rounded_score
from: project_logs('my-project-id')

dimensions and measures

Instead of select, you can use dimensions and measures to group and aggregate data. This query returns a row for each distinct model and day with the total number of calls, the average Factuality score, and the 95th-percentile latency.

-- Analyze model performance over time
dimensions:
  metadata.model as model,
  day(created) as date
measures:
  count(1) as total_calls,
  avg(scores.Factuality) as avg_score,
  percentile(latency, 0.95) as p95_latency
from: project_logs('my-project-id')

The available aggregate functions are:

  • count(expr): number of rows
  • sum(expr): sum of numeric values
  • avg(expr): mean (average) of numeric values
  • min(expr): minimum value
  • max(expr): maximum value
  • percentile(expr, p): a percentile where p is between 0 and 1
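For instance, a sketch that combines several aggregates to summarize token usage per model (this assumes your spans log metrics.tokens, as in the examples above):

dimensions: metadata.model as model
measures:
  count(1) as calls,
  sum(metrics.tokens) as total_tokens,
  max(metrics.tokens) as max_tokens,
  percentile(metrics.tokens, 0.5) as median_tokens
from: project_logs('<PROJECT_ID>')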

from

The from clause identifies where the records are coming from. This can be an identifier like project_logs or a function call like experiment("id").
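For example, to read from an experiment instead of project logs (placeholder ID assumed):

select: *
from: experiment('<EXPERIMENT_ID>')
limit: 10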

Data source shapes

You can add an optional parameter to the from clause that defines how the data is returned. The options are spans (default), traces, and summary.

spans

spans returns individual spans.

select: *
from: project_logs('my-project-id') spans
limit: 10

traces

traces returns all spans from traces that contain at least one matching span.

select: *
from: project_logs('my-project-id') traces
limit: 10

summary

summary provides a high-level overview of your experiment results by aggregating metrics across all spans in a trace. It returns one row per trace, making it ideal for analyzing overall performance and comparing results across experiments.

summary includes aggregated metrics like total tokens, total cost, and average scores, while also providing previews of the input, output, expected, and metadata fields. This allows faster loading of data and easier comparison between experiments.

select: *
from: project_logs('my-project-id') summary -- Returns one row per trace with aggregated metrics across all spans in that trace
preview_length: 1024 -- Optional; controls truncation of preview fields (default 124)
limit: 10

The preview fields (input, output, expected, error) show only data from the root span and are truncated to a default length of 124 characters, although you can change the length with the preview_length clause. Set preview_length: -1 to disable truncation entirely.
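For example, to return full, untruncated previews:

select: *
from: project_logs('my-project-id') summary
preview_length: -1 -- Disable preview truncation
limit: 10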

When using summary, scores are averaged across all spans in the trace, and metrics like tokens and costs are summed. The duration metrics are calculated from the span timings.

filter

The filter clause lets you specify conditions to narrow down results. Similar to the WHERE clause in SQL, it supports a wide range of operators and functions, including complex conditions.

This example filter retrieves only records where:

  • the Factuality score is greater than 0.8
  • the model is "gpt-4"
  • the tag list includes "triage"
  • the input contains the text "question" (case-insensitive)
  • the created date is later than January 1, 2024
  • more than 1000 tokens were used or the log was recorded in production

filter:
  -- Simple comparisons
  scores.Factuality > 0.8 and
  metadata.model = "gpt-4" and

  -- Array operations
  tags includes "triage" and

  -- Text search
  input ILIKE '%question%' and

  -- Date ranges
  created > '2024-01-01' and

  -- Complex conditions
  (
    metrics.tokens > 1000 or
    metadata.is_production = true
  )

Note: Negative filters on tags (e.g., NOT tags includes "resolved") may not work as expected. Since tags are only applied to the root span of a trace, and queries return complete traces, negative tag filters will match child spans (which don't have tags) and return the entire trace. We recommend using positive tag filters instead.
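For example, rather than excluding resolved traces with NOT tags includes "resolved", filter for the tag you do want:

-- Prefer a positive tag filter
filter: tags includes "needs-review"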

Time intervals

BTQL supports intervals for time-based operations.

This query returns all project logs from 'my-project-id' that were created in the last day.

-- Basic intervals
select: *
from: project_logs('my-project-id')
filter: created > now() - interval 1 day

This query returns all project logs from 'my-project-id' that were created within the last hour.

-- Multiple time conditions
select: *
from: project_logs('my-project-id')
filter:
  created > now() - interval 1 hour and
  created < now()

This query demonstrates intervals with different units. Both conditions must hold, so it returns project logs from 'my-project-id' created within the last 7 days.

-- Examples with different units
select: *
from: project_logs('my-project-id')
filter:
  created > now() - interval 7 day and -- Last week
  created > now() - interval 1 month   -- Last month

sort

The sort clause determines the order of results, either asc (ascending) or desc (descending). You can sort by a single field, multiple fields, or computed values.

-- Sort by single field
sort: created desc

-- Sort by multiple fields
sort: scores.Factuality desc, created asc

-- Sort by computed values
sort: len(tags) desc

pivot and unpivot

pivot and unpivot are advanced clauses that transform your results for easier analysis and comparison.

pivot

The pivot clause transforms your results to make comparisons easier by converting rows into columns. This is useful when comparing metrics across different categories or time periods.

Syntax:

pivot: <measure1>, <measure2>, ...

-- Compare model performance metrics across models
dimensions:
  metadata.model as model,
  day(created) as date
measures:
  avg(scores.Factuality) as avg_factuality,
  avg(metrics.tokens) as avg_tokens,
  count(1) as call_count
from: project_logs('my-project-id')
pivot: avg_factuality, avg_tokens, call_count

-- Results will look like:
-- {
--   "date": "2024-01-01",
--   "gpt-4_avg_factuality": 0.92,
--   "gpt-4_avg_tokens": 150,
--   "gpt-4_call_count": 1000,
--   "gpt-3.5-turbo_avg_factuality": 0.85,
--   "gpt-3.5-turbo_avg_tokens": 120,
--   "gpt-3.5-turbo_call_count": 2000
-- }

This query returns a record for each model with the average Factuality score and 95th-percentile latency pivoted across time periods.

-- Compare metrics across time periods
dimensions: metadata.model as model
measures:
  avg(scores.Factuality) as avg_score,
  percentile(latency, 0.95) as p95_latency
from: project_logs('my-project-id')
pivot: avg_score, p95_latency

-- Results will look like:
-- {
--   "model": "gpt-4",
--   "0_avg_score": 0.91,
--   "0_p95_latency": 2.5,
--   "1_avg_score": 0.89,
--   "1_p95_latency": 2.8,
--   ...
-- }

This query returns a record for each tag and aggregates the number of instances of that tag per model.

-- Compare tag distributions across models
dimensions:
  tags[0] as primary_tag,
  metadata.model as model
measures: count(1) as tag_count
from: project_logs('my-project-id')
pivot: tag_count

-- Results will look like:
-- {
--   "primary_tag": "quality",
--   "gpt-4_tag_count": 500,
--   "gpt-3.5-turbo_tag_count": 300
-- }

Pivot columns are automatically named by combining the dimension value and the measure name. For example, if you pivot the avg_score measure over metadata.model and one of the models is "gpt-4", the resulting column is named gpt-4_avg_score.

unpivot

The unpivot clause transforms columns into rows, which is useful when you need to analyze arbitrary scores and metrics without specifying each score name. This is helpful when working with dynamic sets of metrics or when you don't know all possible score names in advance.

-- Convert wide format to long format for arbitrary scores
dimensions: created as date
measures: count(1) as count
from: project_logs('my-project-id')
unpivot: count as (score_name, score_value)

-- Results will look like:
-- {
--   "date": "2024-01-01",
--   "score_name": "Factuality",
--   "score_value": 0.92
-- },
-- {
--   "date": "2024-01-01",
--   "score_name": "Coherence",
--   "score_value": 0.88
-- }

limit and cursor

limit

The limit clause caps the number of records returned.

-- Basic limit
limit: 100

cursor

The cursor clause enables pagination. Cursor tokens are returned automatically in BTQL responses. A query without an explicit limit clause uses a default limit, which you can override with limit. To paginate, run an initial query, then pass the cursor token from each response into the cursor clause of the next query. When the cursor reaches the end of the result set, the response's data array is empty and no cursor token is returned.

-- Pagination using cursor (only works without sort)
select: *
from: project_logs('<PROJECT_ID>')
limit: 100
cursor: '<CURSOR_TOKEN>' -- From previous query response

Cursors can only be used for pagination when no sort clause is specified. If you need sorted results, you'll need to implement offset-based pagination by using the last value from your sort field as a filter in the next query.

-- Offset-based pagination with sorting
-- Page 1 (first 100 results)
select: *
from: project_logs('<PROJECT_ID>')
sort: created desc
limit: 100

-- Page 2 (next 100 results)
select: *
from: project_logs('<PROJECT_ID>')
filter: created < '2024-01-15T10:30:00Z' -- Last created timestamp from previous page
sort: created desc
limit: 100

Expressions

BTQL operators

You can use the following operators in your BTQL queries.

-- Comparison operators
=            -- Equal to (alias for 'eq')
!=           -- Not equal to (alias for 'ne'; can also use '<>')
>            -- Greater than (alias for 'gt')
<            -- Less than (alias for 'lt')
>=           -- Greater than or equal (alias for 'ge')
<=           -- Less than or equal (alias for 'le')
IN           -- Check if value exists in a list of values

-- Null operators
IS NULL      -- Check if value is null
IS NOT NULL  -- Check if value is not null
ISNULL       -- Unary operator to check if null
ISNOTNULL    -- Unary operator to check if not null

-- Text matching
LIKE         -- Case-sensitive pattern matching with SQL wildcards
NOT LIKE     -- Negated case-sensitive pattern matching
ILIKE        -- Case-insensitive pattern matching with SQL wildcards
NOT ILIKE    -- Negated case-insensitive pattern matching
MATCH        -- Full-word semantic search (faster but requires exact word matches, e.g. 'apple' won't match 'app')
NOT MATCH    -- Negated full-word semantic search

-- Array operators
INCLUDES     -- Check if array/object contains value (alias: CONTAINS)
NOT INCLUDES -- Check if array/object does not contain value

-- Logical operators
AND          -- Both conditions must be true
OR           -- Either condition must be true
NOT          -- Unary operator to negate condition

-- Arithmetic operators
+            -- Addition (alias: add)
-            -- Subtraction (alias: sub)
*            -- Multiplication (alias: mul)
/            -- Division (alias: div)
%            -- Modulo (alias: mod)
-x           -- Unary negation (alias: neg)
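As an illustration, here is a sketch of a filter combining several of these operators (field names follow the examples elsewhere on this page):

filter:
  metadata.model IN ["gpt-4", "gpt-4-turbo"] and
  error IS NULL and
  input NOT ILIKE '%test%' and
  tags includes "triage"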

BTQL functions

You can use the following functions in select, filter, dimensions, and measures clauses.

-- Date/time functions
second(timestamp)         -- Extract second from timestamp
minute(timestamp)         -- Extract minute from timestamp
hour(timestamp)           -- Extract hour from timestamp
day(timestamp)            -- Extract day from timestamp
week(timestamp)           -- Extract week from timestamp
month(timestamp)          -- Extract month from timestamp
year(timestamp)           -- Extract year from timestamp
current_timestamp()       -- Get current timestamp (alias: now())
current_date()            -- Get current date

-- String functions
lower(text)               -- Convert text to lowercase
upper(text)               -- Convert text to uppercase
concat(text1, text2, ...) -- Concatenate strings

-- Array functions
len(array)                -- Get length of array
contains(array, value)    -- Check if array contains value (alias: includes)

-- Null handling functions
coalesce(val1, val2, ...) -- Return first non-null value
nullif(val1, val2)        -- Return null if val1 equals val2
least(val1, val2, ...)    -- Return smallest non-null value
greatest(val1, val2, ...) -- Return largest non-null value

-- Numeric functions
round(number, precision)  -- Round to specified precision

-- Cast functions
to_string(value)          -- Cast value to string
to_boolean(value)         -- Cast value to boolean
to_integer(value)         -- Cast value to integer
to_number(value)          -- Cast value to number
to_date(value)            -- Cast value to date
to_datetime(value)        -- Cast value to datetime
to_interval(value)        -- Cast value to interval

-- Aggregate functions (only in measures)
count(expr)               -- Count number of rows
sum(expr)                 -- Sum numeric values
avg(expr)                 -- Calculate mean of numeric values
min(expr)                 -- Find minimum value
max(expr)                 -- Find maximum value
percentile(expr, p)       -- Calculate percentile (p between 0 and 1)
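For example, a sketch combining string, date, and null-handling functions in a select clause (field names follow the examples elsewhere on this page):

select:
  concat(upper(metadata.model), ' / ', to_string(day(created))) as label,
  coalesce(scores.Factuality, 0) as factuality_or_zero,
  greatest(metrics.prompt_tokens, metrics.completion_tokens) as larger_token_count
from: project_logs('<PROJECT_ID>')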

Field access

BTQL provides flexible ways to access nested data in arrays and objects:

-- Object field access
metadata.model           -- Access nested object field, e.g. {"metadata": {"model": "value"}}
metadata."field name"    -- Access field with spaces, e.g. {"metadata": {"field name": "value"}}
metadata."field-name"    -- Access field with hyphens, e.g. {"metadata": {"field-name": "value"}}
metadata."field.name"    -- Access field with dots, e.g. {"metadata": {"field.name": "value"}}

-- Array access (0-based indexing)
tags[0]                  -- First element
tags[-1]                 -- Last element

-- Combined array and object access
metadata.models[0].name  -- Field in first array element
responses[-1].tokens     -- Field in last array element
spans[0].children[-1].id -- Nested array traversal

Array indices are 0-based, and negative indices count from the end (-1 is the last element).

Conditional expressions

BTQL supports conditional logic using the ternary operator (? :):

-- Basic conditions
select:
  (scores.Factuality > 0.8 ? "high" : "low") as quality,
  (error IS NOT NULL ? -1 : metrics.tokens) as valid_tokens
from: project_logs('my-project-id')

-- Nested conditions
select:
  (scores.Factuality > 0.9 ? "excellent" :
   scores.Factuality > 0.7 ? "good" :
   scores.Factuality > 0.5 ? "fair" : "poor") as rating
from: project_logs('my-project-id')

-- Use in calculations
select:
  (metadata.model = "gpt-4" ? metrics.tokens * 2 : metrics.tokens) as adjusted_tokens,
  (error IS NULL ? metrics.latency : 0) as valid_latency
from: project_logs('my-project-id')

Examples

Track token usage

This query helps you monitor token consumption across your application.

from: project_logs('<YOUR_PROJECT_ID>')
filter: created > '<ISO_8601_TIME>'
dimensions: day(created) as time
measures:
  sum(metrics.total_tokens) as total_tokens,
  sum(metrics.prompt_tokens) as input_tokens,
  sum(metrics.completion_tokens) as output_tokens
sort: time asc

The response shows daily token usage:

{
  "time": "2024-11-09T00:00:00Z",
  "total_tokens": 100000,
  "input_tokens": 50000,
  "output_tokens": 50000
}

Monitor model quality

Track model performance across different versions and configurations.

-- Compare factuality scores across models
dimensions:
  metadata.model as model,
  day(created) as date
measures:
  avg(scores.Factuality) as avg_factuality,
  percentile(scores.Factuality, 0.05) as p05_factuality,
  percentile(scores.Factuality, 0.95) as p95_factuality,
  count(1) as total_calls
from: project_logs('<PROJECT_ID>')
filter: created > '2024-01-01'
sort: date desc, model asc
-- Find potentially problematic responses
select: *
from: project_logs('<PROJECT_ID>')
filter:
  scores.Factuality < 0.5 and
  metadata.is_production = true and
  created > now() - interval 1 day
sort: scores.Factuality asc
limit: 100

-- Compare performance across specific models
select: *
from: project_logs('<PROJECT_ID>')
filter:
  metadata.model IN ["gpt-4", "gpt-4-turbo", "claude-3-opus"] and
  scores.Factuality IS NOT NULL and
  created > now() - interval 7 day
sort: scores.Factuality desc
limit: 500

Analyze errors

Identify and investigate errors in your application.

-- Error rate by model
dimensions:
  metadata.model as model,
  hour(created) as hour
measures:
  count(1) as total,
  count(error) as errors,
  count(error) / count(1) as error_rate
from: project_logs('<PROJECT_ID>')
filter: created > now() - interval 1 day
sort: error_rate desc
-- Find common error patterns
dimensions:
  error.type as error_type,
  metadata.model as model
measures:
  count(1) as error_count,
  avg(metrics.latency) as avg_latency
from: project_logs('<PROJECT_ID>')
filter:
  error IS NOT NULL and
  created > now() - interval 7 day
sort: error_count desc
-- Exclude known error types from analysis
select: *
from: project_logs('<PROJECT_ID>')
filter:
  error IS NOT NULL and
  error.type NOT IN ["rate_limit", "timeout", "network_error"] and
  metadata.is_production = true and
  created > now() - interval 1 day
sort: created desc
limit: 100

Analyze latency

Monitor and optimize response times.

-- Track p95 latency by endpoint
dimensions:
  metadata.endpoint as endpoint,
  hour(created) as hour
measures:
  percentile(metrics.latency, 0.95) as p95_latency,
  percentile(metrics.latency, 0.50) as median_latency,
  count(1) as requests
from: project_logs('<PROJECT_ID>')
filter: created > now() - interval 1 day
sort: hour desc, p95_latency desc
-- Find slow requests
select:
  metadata.endpoint,
  metrics.latency,
  metrics.tokens,
  input,
  created
from: project_logs('<PROJECT_ID>')
filter:
  metrics.latency > 5000 and -- Requests over 5 seconds
  created > now() - interval 1 hour
sort: metrics.latency desc
limit: 20

Analyze prompts

Analyze prompt effectiveness and patterns.

-- Track prompt token efficiency
dimensions:
  metadata.prompt_template as template,
  day(created) as date
measures:
  avg(metrics.prompt_tokens) as avg_prompt_tokens,
  avg(metrics.completion_tokens) as avg_completion_tokens,
  avg(metrics.completion_tokens) / avg(metrics.prompt_tokens) as token_efficiency,
  avg(scores.Factuality) as avg_factuality
from: project_logs('<PROJECT_ID>')
filter: created > now() - interval 7 day
sort: date desc, token_efficiency desc
-- Find similar prompts
select: *
from: project_logs('<PROJECT_ID>')
filter:
  input MATCH 'explain the concept of recursion' and
  scores.Factuality > 0.8
sort: created desc
limit: 10

Analyze based on tags

Use tags to track and analyze specific behaviors.

-- Monitor feedback patterns
dimensions:
  tags[0] as primary_tag,
  metadata.model as model
measures:
  count(1) as feedback_count,
  avg(scores.Factuality > 0.8 ? 1 : 0) as high_quality_rate
from: project_logs('<PROJECT_ID>')
filter:
  tags includes 'feedback' and
  created > now() - interval 30 day
sort: feedback_count desc
-- Track issue resolution
select:
  created,
  tags,
  metadata.model,
  scores.Factuality,
  response
from: project_logs('<PROJECT_ID>')
filter:
  tags includes 'needs-review' and
  NOT tags includes 'resolved' and -- Caveat: negative tag filters can match whole traces (see the note in the filter section)
  created > now() - interval 1 day
sort: scores.Factuality asc