# Evaluators
Evaluators define the logic for analyzing messages and generating evaluation metrics. Each evaluator takes individual messages from a dataset and optionally a generated response, then outputs structured results in a table.
## Evaluator Types

### LLM Evaluator
The LLM Evaluator uses a language model to evaluate responses based on a custom prompt. It can be used as an LLM-as-judge to assess the performance of a chatbot, or to gain insight into properties of both the user and assistant messages.
Example prompt:
```
Rate the helpfulness and accuracy of this response on a scale of 1-5:
User question: {input.content}
Reference answer: {output.content}
Generated answer: {generated_response}
Consider the conversation context: {context.topic}
```
Template Variables: The following variables are available for use in the LLM prompt.

- `{input.content}`: The human message content
- `{output.content}`: The dataset message's AI response content. This may be an expected/reference answer (for manually created datasets) or the actual AI response (for session-cloned datasets).
- `{generated_response}`: The generated response from your chatbot (if generation is enabled)
- `{context.[parameter]}`: Access any context variable, e.g., `{context.topic}`
- `{full_history}`: The complete conversation history as formatted text
See Evaluation Datasets for how data is mapped into these fields.
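The templating engine itself is internal, but conceptually each placeholder is filled from the corresponding field of the dataset row. Here is a rough, hypothetical sketch of that substitution (the `render_prompt` helper and its behavior are illustrative only, not part of the actual API):

```python
# Hypothetical sketch of template-variable substitution; names are illustrative only.
def render_prompt(template: str, row: dict) -> str:
    """Fill {input.content}-style placeholders from one dataset row."""
    values = {
        "{input.content}": row["input"]["content"],
        "{output.content}": row["output"]["content"],
        "{generated_response}": row.get("generated_response", ""),
        "{full_history}": row.get("full_history", ""),
    }
    # Context variables are exposed as {context.<key>}.
    for key, value in row.get("context", {}).items():
        values[f"{{context.{key}}}"] = str(value)
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template

prompt = render_prompt(
    "User question: {input.content}\nGenerated answer: {generated_response}\nTopic: {context.topic}",
    {
        "input": {"content": "What is 2+2?", "role": "human"},
        "output": {"content": "2+2 equals 4", "role": "ai"},
        "generated_response": "The answer is 4.",
        "context": {"topic": "math"},
    },
)
```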
#### Output Schema
The output schema defines the metrics that the LLM should attempt to output. Each item in the schema will become a column in the output table. You can specify the data type for each field to ensure structured, validated output.
Available Types:
- string: Text output (default behavior)
- integer: Whole numbers (e.g., counts, ratings)
- float: Decimal numbers (e.g., confidence scores, percentages)
- choices (enum): Predefined options from a list
The system automatically validates the LLM's output against the specified types using a dynamically generated schema. If the output doesn't match the expected format, the system will retry up to 3 times before failing, ensuring reliable structured data.
Example Output Schema:
| Column Name | Type | Description |
|---|---|---|
| expected_helpfulness | integer | The helpfulness of the expected assistant message, on a scale of 1-5 |
| actual_helpfulness | integer | The helpfulness of the actual assistant message, on a scale of 1-5 |
| user_sentiment | choices | The sentiment of the user message (options: positive, neutral, negative) |
| confidence_score | float | Confidence in the evaluation, from 0.0 to 1.0 |
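The validation loop described above is internal to the system, but as a rough illustration of the idea, here is a minimal sketch assuming a pydantic-style dynamically generated model built from the example schema (the `call_llm` helper is a stand-in for the real model call, not an actual API):

```python
# Sketch of type-validated LLM output with retries, assuming pydantic v2.
from typing import Literal

from pydantic import ValidationError, create_model

# Build a model dynamically from the schema's column names and types.
EvalOutput = create_model(
    "EvalOutput",
    expected_helpfulness=(int, ...),
    actual_helpfulness=(int, ...),
    user_sentiment=(Literal["positive", "neutral", "negative"], ...),
    confidence_score=(float, ...),
)

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a canned JSON response here.
    return ('{"expected_helpfulness": 5, "actual_helpfulness": 4, '
            '"user_sentiment": "positive", "confidence_score": 0.9}')

def evaluate(prompt: str, max_retries: int = 3) -> EvalOutput:
    for _ in range(max_retries):
        raw_json = call_llm(prompt)
        try:
            return EvalOutput.model_validate_json(raw_json)
        except ValidationError:
            continue  # re-ask the model on malformed output
    raise RuntimeError("LLM output failed validation after 3 attempts")
```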
### Python Evaluator
The Python Evaluator allows custom code execution against each message.
The code must define a main function that takes input, output, context, full_history, and generated_response. It should return a dict whose keys become columns in the output table.
Function Arguments:
| Argument | Type | Description | Example |
|---|---|---|---|
| input | dict | The human message data with content and role keys | `{'content': 'What is 2+2?', 'role': 'human'}` |
| output | dict | The expected AI response with content and role keys | `{'content': '2+2 equals 4', 'role': 'ai'}` |
| context | dict | Additional metadata and variables | `{'topic': 'math', 'difficulty': 'easy', 'user_id': '123'}` |
| full_history | str | Complete conversation history | `"user: Hello\nassistant: Hi there!\nuser: What is 2+2?"` |
| generated_response | str | The AI-generated response being evaluated | `"The answer is 4. Is there anything else I can help with?"` |
Example:
```python
def main(input: dict, output: dict, context: dict, full_history: str, generated_response: str, **kwargs) -> dict:
    """Evaluates response quality based on accuracy, length, and politeness."""
    expected_answer = output['content'].lower()
    actual_answer = generated_response.lower()

    # Simple containment check: does the generated answer include the reference answer?
    has_correct_answer = expected_answer in actual_answer
    response_length = len(generated_response.split())
    is_polite = any(word in actual_answer for word in ['please', 'thank', 'help', 'happy'])

    return {
        'correct_answer': has_correct_answer,
        'response_length': response_length,
        'politeness_score': 1.0 if is_polite else 0.0,
        'topic': context.get('topic', 'unknown'),
    }
```
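The platform calls `main` for you on each dataset row; purely to illustrate the calling convention, a hypothetical invocation might look like this:

```python
# Hypothetical harness showing the calling convention for one dataset row.
row_result = main(
    input={'content': 'What is 2+2?', 'role': 'human'},
    output={'content': '2+2 equals 4', 'role': 'ai'},
    context={'topic': 'math', 'difficulty': 'easy'},
    full_history="user: What is 2+2?",
    generated_response="2+2 equals 4. Is there anything else I can help with?",
)
# Each key becomes a column in the output table:
# {'correct_answer': True, 'response_length': 11, 'politeness_score': 1.0, 'topic': 'math'}
```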