Eval Engine

Overview

The Eval Engine is used to evaluate the performance of an LLM's output against a given input. The engine returns a score between 0 and 1, along with a reason for the score.

Eval Engines require either criteria or evaluation_steps to be set, but not both. If criteria is set, Griptape will generate evaluation_steps for you. This is useful for getting started, but you may want to set evaluation_steps explicitly for more complex evaluations (see the example at the end of this section).

import json

from griptape.engines import EvalEngine

# Griptape generates evaluation_steps from the criteria automatically.
engine = EvalEngine(
    criteria="Determine whether the actual output is factually correct based on the expected output.",
)

score, reason = engine.evaluate(
    input="If you have a red house made of red bricks, a blue house made of blue bricks, what is a greenhouse made of?",
    expected_output="Glass",
    actual_output="Glass",
)

print("Eval Steps", json.dumps(engine.evaluation_steps, indent=2))
print(f"Score: {score}")
print(f"Reason: {reason}")
Output:

Eval Steps [
  "Compare the actual output to the expected output to identify any discrepancies in factual information.",
  "Verify that all key facts present in the expected output are accurately reflected in the actual output.",
  "Check for any additional information in the actual output that is not present in the expected output and assess its factual correctness.",
  "Ensure that the actual output does not omit any critical facts that are present in the expected output."
]
Score: 1.0
Reason: The actual output 'Glass' matches the expected output 'Glass' exactly, with no discrepancies or omissions in factual information.
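
If you need full control over how the evaluation is performed, pass evaluation_steps explicitly instead of criteria. The sketch below assumes the same EvalEngine and evaluate signature shown above; the specific steps are illustrative.

from griptape.engines import EvalEngine

# Explicit steps skip Griptape's step generation and are used as-is.
engine = EvalEngine(
    evaluation_steps=[
        "Check whether the actual output answers the question asked in the input.",
        "Compare the actual output against the expected output for factual agreement.",
        "Penalize contradictions or unsupported claims in the actual output.",
    ],
)

score, reason = engine.evaluate(
    input="If you have a red house made of red bricks, a blue house made of blue bricks, what is a greenhouse made of?",
    expected_output="Glass",
    actual_output="Greenhouses are typically made of glass.",
)

print(f"Score: {score}")
print(f"Reason: {reason}")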