Dev diary - 24 July 2025

Runtime LLM evaluation with Promptfoo & Echo


At work, we've been tasked with generating content using an LLM — but generating is only half the job. We also need to evaluate the content to ensure it meets a range of criteria relevant to our business logic (e.g. positive sentiment and tone). For this, we decided to use Promptfoo (https://promptfoo.dev/).

What is Promptfoo?

Promptfoo is a framework designed to systematically evaluate LLM outputs. It's commonly used via YAML configuration files to define providers, prompts, and assertions (tests) that help assess how well your LLM prompts perform across different models.

But Promptfoo isn't limited to static YAML configurations. It also exposes a JavaScript/TypeScript API, which is ideal if you want to run evaluations programmatically — such as during runtime in a production app, which is exactly our use case.

A particularity of our workflow: the "echo" provider

Typically, Promptfoo sends prompts to an LLM provider (like OpenAI, Anthropic, or VertexAI), gathers responses, and then runs tests on the results. However, in our case, we already have the generated content beforehand, so we don't want Promptfoo to re-generate anything.

This is where Promptfoo's echo provider comes in. As documented in the Promptfoo providers guide (https://promptfoo.dev/docs/providers/echo), the echo provider simply returns the prompt text as-is instead of sending it to an LLM. That way, Promptfoo can proceed directly to the evaluation phase without re-generating content.

In our implementation, we configure Promptfoo like this:

typescript

>// promptfooEvaluate wraps promptfoo's Node API, e.g. `import promptfoo from 'promptfoo'`
>// and then calling promptfoo.evaluate() under the hood.
>const evaluationResults = await promptfooEvaluate({
>  providers: ['echo'], // return the prompt text as-is instead of calling an LLM
>  prompts: [JSON.stringify(previouslyGeneratedContent, null, 2)],
>  tests: getTestCases(), // our assertion definitions (see below)
>});
>

Using LLMs as judges with llm-rubric

Once the content is passed through the echo provider, we define our test assertions using Promptfoo's llm-rubric assertion type. This tells Promptfoo to send a scoring rubric to an LLM, which then acts as the judge, evaluating the content based on the provided rubric.

Here's a simplified example of an llm-rubric test case:

typescript

>{
>  assert: [
>    {
>      type: 'llm-rubric',
>      value: `
>              Evaluate the sentiment and tone of the content.
>              Requirements:
>              - Positive, engaging tone
>              - Professional and trustworthy language
>              - Clear value proposition
>              - Compelling call-to-action language
>              - Avoids negative or discouraging language
>              Instructions:
>              1. Analyze the overall sentiment (positive, neutral, negative)
>              2. Check for professional and engaging tone
>              3. Look for clear value propositions
>              4. Identify any negative language that might discourage users
>              Scoring:
>              - PASS: Predominantly positive, professional, and engaging
>              - FAIL: Negative sentiment, unprofessional tone, or discouraging language
>              Output format:
>              - Overall sentiment: [positive/neutral/negative]
>              - Tone assessment: [professional/casual/unprofessional]
>              - Result: PASS/FAIL
>              - Key issues: [if any]
>              - Strengths: [positive aspects]
>      `,
>    },
>  ],
>}
>
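
For reference, here's a minimal sketch of what a helper like getTestCases() could return. The grading provider and variable names are placeholders rather than our exact implementation; Promptfoo lets you override the judge model through the test's options.

typescript

>// Sketch of a test-case helper (names and the grading provider are placeholders).
>const sentimentAndToneRubric = `Evaluate the sentiment and tone of the content...`; // full rubric shown above
>
>export const getTestCases = () => [
>  {
>    description: 'Sentiment and tone check',
>    // Optional: choose which model acts as the judge for llm-rubric grading.
>    options: { provider: 'openai:gpt-4o-mini' },
>    assert: [
>      {
>        type: 'llm-rubric',
>        value: sentimentAndToneRubric,
>      },
>    ],
>  },
>];
>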

Getting the results

For the moment, we just run a simple check that every assertion passed:

typescript

>const passesEvaluation = evaluationResults.results.every(
>  ({ gradingResult }) => gradingResult.pass,
>);
>
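
When a check fails, it's also useful to surface the judge's reasoning. Here's a small sketch of how that could look, assuming the gradingResult shape promptfoo returns (a pass flag plus a reason string); verify the exact fields against your promptfoo version.

typescript

>// Sketch: collect the judge's explanations for any failed assertions.
>const failureReasons = evaluationResults.results
>  .filter(({ gradingResult }) => gradingResult && !gradingResult.pass)
>  .map(({ gradingResult }) => gradingResult?.reason ?? 'No reason provided');
>
>if (failureReasons.length > 0) {
>  console.warn('Content failed evaluation:', failureReasons);
>}
>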

Conclusion

Using Promptfoo with the echo provider has allowed us to integrate runtime evaluations seamlessly, without re-generating content. Combined with llm-rubric, we leverage an LLM to objectively grade our content against well-defined quality checks — ensuring consistent, high-quality outputs in production.

If you're automating LLM evaluations and need flexibility at runtime, Promptfoo's API combined with echo and llm-rubric is a powerful, production-ready solution. As a bonus, the same setup is easy to reuse for E2E tests, as sketched below!
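
Here's a minimal sketch of what that could look like, assuming a Jest/Vitest-style runner; generateContent() and evaluateGeneratedContent() are hypothetical wrappers around our generation step and the Promptfoo setup above.

typescript

>// Sketch only: generateContent() and evaluateGeneratedContent() are hypothetical
>// wrappers around the generation step and the promptfooEvaluate() call shown earlier.
>it('generated content passes the quality rubric', async () => {
>  const content = await generateContent();
>  const { passesEvaluation } = await evaluateGeneratedContent(content);
>  expect(passesEvaluation).toBe(true);
>});
>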

Author
Germán Distel

I'm a Fullstack JavaScript Developer who enjoys bringing ideas to life using NestJS, React, and Angular—plus a bit of Python when needed. Outside of coding, I like trying out new technologies, reading, and spending quality time playing with my daughter.
