Evaluators
Evaluators define the logic for analyzing messages and generating evaluation metrics. Each evaluator takes individual messages from a dataset and optionally a generated response, then outputs structured results in a table.
Each evaluator has an evaluation mode — either message-level or session-level — that must match the dataset it is used with. When configuring an evaluation run, evaluators whose mode is incompatible with the selected dataset are automatically disabled.
Evaluator Types
LLM Evaluator
The LLM Evaluator uses language models to evaluate responses based on a custom prompt. It can act as an LLM-as-judge to evaluate the performance of a chatbot, or to gain insight into properties of both the user and assistant messages.
Example prompt:
Rate the helpfulness and accuracy of this response on a scale of 1-5:
User question: {input.content}
Reference answer: {output.content}
Generated answer: {generated_response}
Consider the conversation context: {context.topic}
Template Variables:
The available variables depend on the evaluator's evaluation mode.
Message-level variables
| Variable | Description |
|---|---|
| {input.content} | The human message content |
| {output.content} | The dataset message's AI response content. This may be an expected/reference answer (for manually created datasets) or the actual AI response (for session-cloned datasets). |
| {generated_response} | The generated response from your chatbot (if generation is enabled) |
| {context.[parameter]} | Any context variable, e.g. {context.topic} |
| {full_history} | Complete conversation history as formatted text |
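To illustrate how placeholders such as {input.content} map onto message data, here is a minimal sketch of template substitution. The `render_prompt` function and the shape of the `values` dict are hypothetical and only mirror the variable names in the table above; the platform's actual rendering logic may differ.

```python
import re


def render_prompt(template: str, values: dict) -> str:
    """Replace {dotted.name} placeholders with values from a nested dict.

    Unknown placeholders are left untouched so that a typo stays visible.
    """
    def lookup(match: re.Match) -> str:
        node = values
        for part in match.group(1).split("."):
            if not isinstance(node, dict) or part not in node:
                return match.group(0)  # leave unresolved placeholders as-is
            node = node[part]
        return str(node)

    return re.sub(r"\{([A-Za-z_][\w.]*)\}", lookup, template)


# Hypothetical message data, shaped after the variables in the table above.
values = {
    "input": {"content": "What is 2+2?"},
    "output": {"content": "2+2 equals 4"},
    "generated_response": "The answer is 4.",
    "context": {"topic": "math"},
}
template = (
    "User question: {input.content}\n"
    "Reference answer: {output.content}\n"
    "Generated answer: {generated_response}\n"
    "Topic: {context.topic}"
)
rendered = render_prompt(template, values)
print(rendered)
```

Leaving unknown placeholders in place is a design choice for this sketch: it makes a misspelled variable easy to spot in the rendered prompt.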
Session-level variables
In session-level mode, {input.content} and {output.content} are empty. Use the following variables instead:
| Variable | Description |
|---|---|
| {summary} | The session snapshot: the full conversation context captured at the time of the last AI message |
| {context.[parameter]} | Any context variable, e.g. {context.current_datetime} |
Note
Generation is not available for session-level datasets, so {generated_response} is not applicable in session-level prompts.
See Evaluation Datasets for how data is mapped into these fields.
Output Schema
The output schema defines the metrics that the LLM should attempt to output. Each item in the schema will become a column in the output table. You can specify the data type for each field to ensure structured, validated output.
Available Types:
- string: Text output (default behavior)
- integer: Whole numbers (e.g., counts, ratings)
- float: Decimal numbers (e.g., confidence scores, percentages)
- choices (enum): Predefined options from a list
The system automatically validates the LLM's output against the specified types using a dynamically generated schema. If the output doesn't match the expected format, the system will retry up to 3 times before failing, ensuring reliable structured data.
Example Output Schema:
| Column Name | Type | Description |
|---|---|---|
| expected_helpfulness | integer | The helpfulness of the expected assistant message, on a scale of 1-5 |
| actual_helpfulness | integer | The helpfulness of the actual assistant message, on a scale of 1-5 |
| user_sentiment | choices | The sentiment of the user message (options: positive, neutral, negative) |
| confidence_score | float | Confidence in the evaluation, from 0.0 to 1.0 |
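The validate-and-retry behavior described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the system's implementation: the `SCHEMA` dict format, `validate`, and `evaluate_with_retries` are hypothetical names, and "retry up to 3 times" is interpreted here as one initial attempt plus at most three retries.

```python
from typing import Callable

# Hypothetical schema format: field name -> type name, or a list of allowed choices.
SCHEMA = {
    "expected_helpfulness": "integer",
    "actual_helpfulness": "integer",
    "user_sentiment": ["positive", "neutral", "negative"],
    "confidence_score": "float",
}


def validate(output: dict, schema: dict) -> list[str]:
    """Return a list of validation errors (empty when the output is valid)."""
    errors = []
    for field, spec in schema.items():
        if field not in output:
            errors.append(f"{field}: missing")
            continue
        value = output[field]
        if isinstance(spec, list):  # choices (enum)
            ok = value in spec
        elif spec == "integer":  # bool is a subclass of int, so exclude it
            ok = isinstance(value, int) and not isinstance(value, bool)
        elif spec == "float":
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:  # "string"
            ok = isinstance(value, str)
        if not ok:
            errors.append(f"{field}: invalid value {value!r}")
    return errors


def evaluate_with_retries(call_llm: Callable[[], dict], schema: dict,
                          max_retries: int = 3) -> dict:
    """Call the model, asking again up to max_retries times if validation fails."""
    errors: list[str] = []
    for _ in range(1 + max_retries):
        output = call_llm()
        errors = validate(output, schema)
        if not errors:
            return output
    raise ValueError(f"output failed validation: {errors}")


# Stand-in for a model call: the first answer is malformed, the second is valid.
attempts = iter([
    {"expected_helpfulness": "high"},
    {"expected_helpfulness": 4, "actual_helpfulness": 5,
     "user_sentiment": "neutral", "confidence_score": 0.9},
])
result = evaluate_with_retries(lambda: next(attempts), SCHEMA)
```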
Tag Rules
Tag Rules let you automatically apply tags to sessions or messages when an evaluator output field matches a condition. This makes it easy to surface and filter results — for example, flagging all sessions where the evaluator detected negative sentiment, or marking messages that scored below a threshold.
Tag Rules are available only on LLM evaluators. They run automatically on every non-preview evaluation run. Preview runs do not trigger tag application.
How tags are applied
The target of the tag depends on the evaluator's mode:
- Message mode: the tag is applied to the specific chat message being evaluated.
- Session mode: the tag is applied to the session's chat.
Each tag application is recorded in an audit log and displayed in the Applied Tags column on the run results page.
Tag reconciliation on rerun
The philosophy behind tag rules is simple: a message or session either meets the criteria for a tag, or it does not. To keep tag state consistent with the latest evaluation results, rerunning an evaluation reconciles the tags managed by its evaluators' tag rules:
- For each message or session in the rerun's scope, any tag named by an evaluator's tag rules that was not applied by the latest run is removed from that message or session.
- Reconciliation removes a tag regardless of who applied it. If a human had previously added a tag that is also managed by a tag rule, and the new evaluator output does not satisfy that rule, the human-applied tag is removed.
- FULL reruns reconcile every row in the dataset. DELTA reruns only reconcile the rows in their scope.
- PREVIEW runs neither apply nor remove tags.
Tags whose names are not referenced by any tag rule on the run's evaluators are never touched by reconciliation.
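The reconciliation rules above amount to a few set operations per message or session. The sketch below assumes a hypothetical `reconcile_tags` helper and plain string tag names; it is meant to show the logic, not the system's actual code.

```python
def reconcile_tags(current_tags: set[str], rule_tags: set[str],
                   applied_now: set[str]) -> set[str]:
    """Return the tag set for one message or session after a rerun.

    current_tags: tags on the item before the rerun (from any source)
    rule_tags:    every tag name referenced by the run's tag rules
    applied_now:  tags whose rules matched in the latest run
    """
    # Tags managed by a rule but not satisfied by the latest run are removed,
    # regardless of whether a human or an evaluator applied them.
    stale = (rule_tags - applied_now) & current_tags
    return (current_tags - stale) | applied_now


current = {"negative-sentiment", "low-confidence", "vip"}  # "vip" was added by hand
managed = {"negative-sentiment", "low-helpfulness", "low-confidence"}
latest = {"low-helpfulness"}

result = reconcile_tags(current, managed, latest)
print(sorted(result))  # ['low-helpfulness', 'vip']
```

The hand-applied "vip" tag survives because no tag rule references it, while the two managed tags that no longer match are dropped.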
Defining a rule
Each rule has three parts:
| Field | Description |
|---|---|
| Output field | The output schema field whose value is tested |
| Tag name | The tag to apply when the condition is met |
| Condition | The value or range that triggers the tag |
The condition options vary by the type of the output field:
| Output field type | Condition behaviour |
|---|---|
| choices (enum) | Apply the tag when the field equals one of the defined choice values (e.g. sentiment == "negative") |
| integer / float | Apply the tag when the field equals a specific value, or falls within a min..max range |
| string | Apply the tag when the field equals a specific value |
Example Tag Rules
Given the output schema from the example above, you could define the following rules:
| Output field | Tag name | Condition |
|---|---|---|
| user_sentiment | negative-sentiment | equals negative |
| expected_helpfulness | low-helpfulness | range 1..2 |
| confidence_score | low-confidence | range 0.0..0.4 |
With these rules, any evaluation run will automatically tag messages or sessions that meet the conditions, without requiring manual review of every row.
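As a rough illustration of how such rules could be evaluated against one row of evaluator output, consider the sketch below. The `TagRule` class and `tags_for` function are hypothetical, and treating the min..max range as inclusive at both ends is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TagRule:
    output_field: str
    tag_name: str
    equals: Any = None               # exact-match condition
    minimum: Optional[float] = None  # lower bound of a range (assumed inclusive)
    maximum: Optional[float] = None  # upper bound of a range (assumed inclusive)

    def matches(self, output: dict) -> bool:
        value = output.get(self.output_field)
        if value is None:
            return False
        if self.minimum is not None and self.maximum is not None:
            return self.minimum <= value <= self.maximum
        return value == self.equals


# The three example rules from the table above.
RULES = [
    TagRule("user_sentiment", "negative-sentiment", equals="negative"),
    TagRule("expected_helpfulness", "low-helpfulness", minimum=1, maximum=2),
    TagRule("confidence_score", "low-confidence", minimum=0.0, maximum=0.4),
]


def tags_for(output: dict) -> list[str]:
    """Return the tag names whose rules match this row of evaluator output."""
    return [rule.tag_name for rule in RULES if rule.matches(output)]


row = {"user_sentiment": "negative", "expected_helpfulness": 4, "confidence_score": 0.35}
tags = tags_for(row)
print(tags)  # ['negative-sentiment', 'low-confidence']
```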
Python Evaluator
The Python Evaluator allows custom code execution against each message.
The code must define a main function that takes the input, output, context, full_history, and generated_response arguments. It should return a dict whose keys become columns in the output table.
Function Arguments:
| Argument | Type | Description | Example |
|---|---|---|---|
| input | dict | The human message data with content and role keys | {'content': 'What is 2+2?', 'role': 'human'} |
| output | dict | The expected AI response with content and role keys | {'content': '2+2 equals 4', 'role': 'ai'} |
| context | dict | Additional metadata and variables | {'topic': 'math', 'difficulty': 'easy', 'user_id': '123'} |
| full_history | str | Complete conversation history | "user: Hello\nassistant: Hi there!\nuser: What is 2+2?" |
| generated_response | str | AI-generated response being evaluated | "The answer is 4. Is there anything else I can help with?" |
Example:
def main(input: dict, output: dict, context: dict, full_history: str, generated_response: str, **kwargs) -> dict:
    """Evaluates response quality based on accuracy, length, and politeness."""
    # Compare case-insensitively
    expected_answer = output['content'].lower()
    actual_answer = generated_response.lower()
    # Accuracy: the expected answer must appear verbatim in the generated response
    has_correct_answer = expected_answer in actual_answer
    # Length: number of whitespace-separated words
    response_length = len(generated_response.split())
    # Politeness: simple keyword heuristic
    is_polite = any(word in actual_answer for word in ['please', 'thank', 'help', 'happy'])
    # Each key becomes a column in the output table
    return {
        'correct_answer': has_correct_answer,
        'response_length': response_length,
        'politeness_score': 1.0 if is_polite else 0.0,
        'topic': context.get('topic', 'unknown')
    }
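A second, more defensive sketch of the same interface follows. It assumes that generated_response may be None or empty when generation is not enabled; that behavior is a guess based on the "(if generation is enabled)" note in the variables table, not something confirmed here. The metric names are illustrative only.

```python
def main(input: dict, output: dict, context: dict, full_history: str,
         generated_response: str, **kwargs) -> dict:
    """Measures how many words of the dataset answer appear in the generated response."""
    reference_words = set(output.get('content', '').lower().split())
    # Assumption: generated_response can be None or '' when generation is disabled.
    generated_words = set((generated_response or '').lower().split())
    overlap = reference_words & generated_words
    coverage = len(overlap) / len(reference_words) if reference_words else 0.0
    return {
        'has_generated_response': bool(generated_response),
        'word_coverage': round(coverage, 2),
        'history_lines': len(full_history.splitlines()),
    }


# Local test call using the example argument values from the table above.
result = main(
    input={'content': 'What is 2+2?', 'role': 'human'},
    output={'content': '2+2 equals 4', 'role': 'ai'},
    context={'topic': 'math'},
    full_history="user: Hello\nassistant: Hi there!\nuser: What is 2+2?",
    generated_response="2+2 equals 4, of course",
)
print(result)
```

Calling main directly with sample arguments like this is a convenient way to check an evaluator locally before adding it to a run.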