Evaluators

Evaluators define the logic for analyzing messages and generating evaluation metrics. Each evaluator takes individual messages (or, in session-level mode, whole sessions) from a dataset and optionally a generated response, then outputs structured results in a table.

Each evaluator has an evaluation mode — either message-level or session-level — that must match the dataset it is used with. When configuring an evaluation run, evaluators whose mode is incompatible with the selected dataset are automatically disabled.

Evaluator Types

LLM Evaluator

The LLM Evaluator uses a language model to evaluate responses based on a custom prompt. It can act as an LLM-as-judge to assess the performance of a chatbot, or to gain insight into properties of both the user and assistant messages.

Example prompt:

Rate the helpfulness and accuracy of this response on a scale of 1-5:

User question: {input.content}
Reference answer: {output.content}
Generated answer: {generated_response}

Consider the conversation context: {context.topic}

Template Variables:

The available variables depend on the evaluator's evaluation mode.

Message-level variables

Variable	Description
`{input.content}`	The human message content
`{output.content}`	The dataset message's AI response content. This may be an expected/reference answer (for manually created datasets) or the actual AI response (for session-cloned datasets).
`{generated_response}`	The generated response from your chatbot (if generation is enabled)
`{context.[parameter]}`	Any context variable, e.g. `{context.topic}`
`{full_history}`	Complete conversation history as formatted text
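
To make the substitution concrete, here is a minimal sketch of how placeholders such as `{input.content}` and `{context.topic}` could be filled in. The `render_prompt` helper, its regular expression, and the choice to render unknown variables as empty text are all assumptions made for illustration; this is not the platform's actual renderer.

```python
import re

# Hypothetical renderer: resolves {name} and {name.key} placeholders against
# a nested dict of values. Illustration only, not the platform's implementation.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)\}")


def render_prompt(template: str, variables: dict) -> str:
    def resolve(match: re.Match) -> str:
        value = variables
        # Walk the dotted path one key at a time, e.g. "context" then "topic".
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                return ""  # assumption: unknown variables render as empty text
            value = value[part]
        return str(value)

    return _PLACEHOLDER.sub(resolve, template)


variables = {
    "input": {"content": "What is 2+2?"},
    "output": {"content": "2+2 equals 4"},
    "generated_response": "The answer is 4.",
    "context": {"topic": "math"},
    "full_history": "user: Hello\nassistant: Hi there!\nuser: What is 2+2?",
}

template = (
    "User question: {input.content}\n"
    "Reference answer: {output.content}\n"
    "Generated answer: {generated_response}\n"
    "Topic: {context.topic}"
)

rendered = render_prompt(template, variables)
print(rendered)
```

A regular expression is used rather than `str.format` because `str.format` treats `{input.content}` as attribute access, which does not work on plain dicts.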

Session-level variables

In session-level mode, {input.content} and {output.content} are empty. Use the following variables instead:

Variable	Description
`{summary}`	The session snapshot — the full conversation context captured at the time of the last AI message
`{context.[parameter]}`	Any context variable, e.g. `{context.current_datetime}`

Note

Generation is not available for session-level datasets, so {generated_response} is not applicable in session-level prompts.

See Evaluation Datasets for how data is mapped into these fields.

Output Schema

The output schema defines the metrics that the LLM should attempt to output. Each item in the schema will become a column in the output table. You can specify the data type for each field to ensure structured, validated output.

Available Types:

string: Text output (default behavior)
integer: Whole numbers (e.g., counts, ratings)
float: Decimal numbers (e.g., confidence scores, percentages)
choices (enum): Predefined options from a list

The system automatically validates the LLM's output against the specified types using a dynamically generated schema. If the output doesn't match the expected format, the system will retry up to 3 times before failing, ensuring reliable structured data.

Example Output Schema:

Column Name	Type	Description
expected_helpfulness	integer	The helpfulness of the expected assistant message, on a scale of 1-5
actual_helpfulness	integer	The helpfulness of the actual assistant message, on a scale of 1-5
user_sentiment	choices	The sentiment of the user message (options: positive, neutral, negative)
confidence_score	float	Confidence in the evaluation, from 0.0 to 1.0
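
The validate-and-retry behaviour described above can be sketched roughly as follows, using the example schema. The schema dict format, the `validate` and `evaluate_with_retries` helpers, and the stubbed model call are hypothetical. The sketch also assumes "up to 3 times" means three attempts in total, which the text does not spell out.

```python
from typing import Callable

# Hypothetical schema format mirroring the example table above.
SCHEMA = {
    "expected_helpfulness": {"type": "integer"},
    "actual_helpfulness": {"type": "integer"},
    "user_sentiment": {"type": "choices", "options": ["positive", "neutral", "negative"]},
    "confidence_score": {"type": "float"},
}


def validate(row: dict, schema: dict) -> list:
    """Return a list of validation errors; an empty list means the row is valid."""
    errors = []
    for name, spec in schema.items():
        if name not in row:
            errors.append(f"{name}: missing")
            continue
        value, kind = row[name], spec["type"]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_bool = isinstance(value, bool)
        if kind == "integer" and (is_bool or not isinstance(value, int)):
            errors.append(f"{name}: expected integer")
        elif kind == "float" and (is_bool or not isinstance(value, (int, float))):
            errors.append(f"{name}: expected float")
        elif kind == "choices" and value not in spec["options"]:
            errors.append(f"{name}: expected one of {spec['options']}")
        elif kind == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
    return errors


def evaluate_with_retries(call_llm: Callable[[], dict], schema: dict, max_attempts: int = 3) -> dict:
    """Call the model until its output validates, at most max_attempts times."""
    errors = []
    for _ in range(max_attempts):  # assumption: 3 attempts in total
        row = call_llm()
        errors = validate(row, schema)
        if not errors:
            return row
    raise ValueError(f"output failed validation after {max_attempts} attempts: {errors}")


# Stub standing in for the model: the first reply is malformed, the second is valid.
responses = iter([
    {"expected_helpfulness": "four", "actual_helpfulness": 3,
     "user_sentiment": "angry", "confidence_score": 0.8},
    {"expected_helpfulness": 4, "actual_helpfulness": 3,
     "user_sentiment": "negative", "confidence_score": 0.8},
])
result = evaluate_with_retries(lambda: next(responses), SCHEMA)
print(result)
```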

Tag Rules

Tag Rules let you automatically apply tags to sessions or messages when an evaluator output field matches a condition. This makes it easy to surface and filter results — for example, flagging all sessions where the evaluator detected negative sentiment, or marking messages that scored below a threshold.

Tag Rules are available only on LLM evaluators. They run automatically on every non-preview evaluation run. Preview runs do not trigger tag application.

How tags are applied

The target of the tag depends on the evaluator's mode:

Message mode: the tag is applied to the specific chat message being evaluated.
Session mode: the tag is applied to the session's chat.

Each tag application is recorded in an audit log and displayed in the Applied Tags column on the run results page.

Tag reconciliation on rerun

The philosophy behind tag rules is simple: a message or session either meets the criteria for a tag, or it does not. To keep tag state consistent with the latest evaluation results, rerunning an evaluation reconciles the tags managed by its evaluators' tag rules:

For each message or session in the rerun's scope, any tag named by an evaluator's tag rules that was not applied by the latest run is removed from that message or session.
Reconciliation removes a tag regardless of who applied it. If a human had previously added a tag that is also managed by a tag rule, and the new evaluator output does not satisfy that rule, the human-applied tag is removed.
FULL reruns reconcile every row in the dataset. DELTA reruns only reconcile the rows in their scope.
PREVIEW runs neither apply nor remove tags.

Tags whose names are not referenced by any tag rule on the run's evaluators are never touched by reconciliation.
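
The reconciliation rules above amount to a small piece of set arithmetic for each message or session in scope. The sketch below is one way to express it. The `reconcile_tags` function and the `vip` tag are hypothetical, and it assumes every tag applied by the run is also named by one of the run's tag rules.

```python
def reconcile_tags(existing: set, managed: set, applied_now: set) -> set:
    """Return the tags an item should carry after a non-preview rerun.

    existing:    tags on the item before the rerun, from any source
    managed:     tag names referenced by the run's evaluators' tag rules
    applied_now: tags the latest run applied to this item
    """
    # Managed tags the latest run did not apply are removed,
    # regardless of whether a rule or a human added them.
    stale = managed - applied_now
    return (existing - stale) | applied_now


existing = {"negative-sentiment", "low-confidence", "vip"}
managed = {"negative-sentiment", "low-helpfulness", "low-confidence"}
applied_now = {"low-helpfulness"}

result = reconcile_tags(existing, managed, applied_now)
print(sorted(result))
```

Here `vip` survives because no tag rule names it, while `negative-sentiment` and `low-confidence` are removed because the latest run did not apply them.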

Defining a rule

Each rule has three parts:

Field	Description
Output field	The output schema field whose value is tested
Tag name	The tag to apply when the condition is met
Condition	The value or range that triggers the tag

The condition options vary by the type of the output field:

Output field type	Condition behaviour
choices (enum)	Apply the tag when the field equals one of the defined choice values (e.g. `sentiment == "negative"`)
integer / float	Apply the tag when the field equals a specific value, or falls within a `min..max` range
string	Apply the tag when the field equals a specific value

Example Tag Rules

Given the output schema from the example above, you could define the following rules:

Output field	Tag name	Condition
user_sentiment	negative-sentiment	equals `negative`
expected_helpfulness	low-helpfulness	range `1..2`
confidence_score	low-confidence	range `0.0..0.4`

With these rules, any evaluation run will automatically tag messages or sessions that meet the conditions, without requiring manual review of every row.
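
As an illustration of how these example rules could be evaluated against a single evaluator output row, consider the sketch below. The `TagRule` class and `tags_for` helper are hypothetical. Treating the `min..max` range as inclusive at both ends is an assumption the text does not confirm.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TagRule:
    """Hypothetical representation of a tag rule: field, tag name, condition."""
    field: str
    tag: str
    equals: Any = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def matches(self, row: dict) -> bool:
        value = row.get(self.field)
        if value is None:
            return False
        if self.minimum is not None and self.maximum is not None:
            # Range conditions only make sense for numeric fields.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            # Assumption: both ends of the range are inclusive.
            return self.minimum <= value <= self.maximum
        return value == self.equals


RULES = [
    TagRule(field="user_sentiment", tag="negative-sentiment", equals="negative"),
    TagRule(field="expected_helpfulness", tag="low-helpfulness", minimum=1, maximum=2),
    TagRule(field="confidence_score", tag="low-confidence", minimum=0.0, maximum=0.4),
]


def tags_for(row: dict, rules: list) -> list:
    """Return the tag names whose rules match this output row, in rule order."""
    return [rule.tag for rule in rules if rule.matches(row)]


row = {
    "expected_helpfulness": 2,
    "actual_helpfulness": 4,
    "user_sentiment": "negative",
    "confidence_score": 0.9,
}
tags = tags_for(row, RULES)
print(tags)
```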

Python Evaluator

The Python Evaluator allows custom code execution against each message.

The code must define a `main` function that takes `input`, `output`, `context`, `full_history`, and `generated_response`. It should return a dict whose keys become columns in the output table.

Function Arguments:

Argument	Type	Description	Example
`input`	dict	The human message data with `content` and `role` keys	`{'content': 'What is 2+2?', 'role': 'human'}`
`output`	dict	The expected AI response with `content` and `role` keys	`{'content': '2+2 equals 4', 'role': 'ai'}`
`context`	dict	Additional metadata and variables	`{'topic': 'math', 'difficulty': 'easy', 'user_id': '123'}`
`full_history`	str	Complete conversation history	`"user: Hello\nassistant: Hi there!\nuser: What is 2+2?"`
`generated_response`	str	AI-generated response being evaluated	`"The answer is 4. Is there anything else I can help with?"`

Example:

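def main(input: dict, output: dict, context: dict, full_history: str, generated_response: str, **kwargs) -> dict:
    """Evaluates response quality based on accuracy, length, and politeness."""

    # Assumption: generated_response may be empty or None when generation is disabled.
    generated_response = generated_response or ''

    # Accuracy: case-insensitive check that the reference answer appears in the generated one.
    expected_answer = output['content'].lower()
    actual_answer = generated_response.lower()
    has_correct_answer = expected_answer in actual_answer

    # Length in whitespace-separated words.
    response_length = len(generated_response.split())
    # Politeness: crude keyword match against the lowercased response.
    is_polite = any(word in actual_answer for word in ['please', 'thank', 'help', 'happy'])

    # Each key becomes a column in the output table.
    return {
        'correct_answer': has_correct_answer,
        'response_length': response_length,
        'politeness_score': 1.0 if is_polite else 0.0,
        'topic': context.get('topic', 'unknown')
    }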
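
As a further sketch, an evaluator can also derive metrics from `full_history`. The example below counts turns and calls `main` directly with the sample values from the arguments table, which is a convenient way to try an evaluator locally. It assumes each line of `full_history` has the `role: text` form shown in that table, so multi-line messages would be miscounted. The column names are made up for illustration.

```python
def main(input: dict, output: dict, context: dict, full_history: str,
         generated_response: str, **kwargs) -> dict:
    """Reports simple conversation statistics derived from full_history."""
    # Assumption: one message per line, prefixed with "user:" or "assistant:".
    lines = [line for line in full_history.splitlines() if line.strip()]
    user_turns = sum(1 for line in lines if line.startswith("user:"))
    assistant_turns = sum(1 for line in lines if line.startswith("assistant:"))

    return {
        "user_turns": user_turns,
        "assistant_turns": assistant_turns,
        "question_length": len(input["content"].split()),
    }


# Local call using the example values from the arguments table.
result = main(
    input={"content": "What is 2+2?", "role": "human"},
    output={"content": "2+2 equals 4", "role": "ai"},
    context={"topic": "math", "difficulty": "easy", "user_id": "123"},
    full_history="user: Hello\nassistant: Hi there!\nuser: What is 2+2?",
    generated_response="The answer is 4. Is there anything else I can help with?",
)
print(result)
```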