Eval Engine

Overview

The Eval Engine is used to evaluate the performance of an LLM's output against a given input. The engine returns a score between 0 and 1, along with a reason for the score.

Eval Engines require either criteria or evaluation_steps to be set. If criteria is set, Griptape will generate evaluation_steps for you; this is useful for getting started, but for more complex evaluations you may want to set evaluation_steps explicitly (see the sketch after the example output below). Exactly one of criteria or evaluation_steps must be set, never both.

import json

from griptape.engines import EvalEngine

engine = EvalEngine(
    criteria="Determine whether the actual output is factually correct based on the expected output.",
)

score, reason = engine.evaluate(
    input="If you have a red house made of red bricks, a blue house made of blue bricks, what is a greenhouse made of?",
    expected_output="Glass",
    actual_output="Glass",
)

print("Eval Steps", json.dumps(engine.evaluation_steps, indent=2))
print(f"Score: {score}")
print(f"Reason: {reason}")

Eval Steps [
  "Compare the actual output to the expected output to identify any discrepancies.",
  "Verify the factual accuracy of the actual output by cross-referencing with the expected output.",
  "Assess whether the actual output meets the criteria outlined in the expected output.",
  "Determine if any information in the actual output contradicts the expected output."
]
Score: 1.0
Reason: The actual output 'Glass' matches the expected output 'Glass', with no discrepancies or contradictions.
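
If you need full control over how the evaluation is performed, you can pass evaluation_steps directly instead of criteria. The sketch below assumes the same EvalEngine constructor and evaluate signature shown above; the steps and the question are illustrative, not part of the library.

from griptape.engines import EvalEngine

# Supplying explicit steps means Griptape does not generate them from criteria.
engine = EvalEngine(
    evaluation_steps=[
        "Check whether the actual output conveys the same answer as the expected output.",
        "Penalize any statements in the actual output that contradict the expected output.",
    ],
)

score, reason = engine.evaluate(
    input="What gas do plants primarily absorb during photosynthesis?",
    expected_output="Carbon dioxide",
    actual_output="Carbon dioxide",
)

print(f"Score: {score}")
print(f"Reason: {reason}")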