To Eval or Not to Eval

All generalisations are false. Including this one.

Evals have become the de facto standard for LLM products over the last few years. Going by the broader consensus, if you’re building with LLMs and don’t have an eval setup, you’re setting your product up for failure.

And this makes sense; evals make total sense. Why wouldn’t they work? We write unit, integration, and e2e tests for our code. Aren’t evals just their equivalent for prompts?
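
To make the analogy concrete, here’s a minimal sketch of an eval written like a test case. It’s illustrative only: call_llm() is a hypothetical stand-in for a real model call, and the cases are made up.

```python
# A rough sketch of "evals as unit tests for prompts".
# call_llm() is a hypothetical stand-in for your actual model call.
def call_llm(prompt: str) -> str:
    ...

EVAL_CASES = [
    # (input prompt, substring the output must contain)
    ("Summarise: the meeting moved from Tuesday to Thursday.", "Thursday"),
    ("Extract the order ID from: 'Order #4821 was delayed.'", "4821"),
]

def run_evals() -> float:
    passed = sum(
        expected.lower() in call_llm(prompt).lower()
        for prompt, expected in EVAL_CASES
    )
    return passed / len(EVAL_CASES)  # a pass rate, much like a test suite going green or red
```

Define the input, define what a passing output looks like, assert. Familiar and comfortable.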

In my experience, coming up on five years of working with LMs, there’s a blind spot in this consensus.

Quick Context

I lead the engineering team at Pepper. To keep it short: content is at the very core of Pepper. We have tens of thousands of creators who work with us to deliver this content. Note that the content we’ve produced over the years is extremely varied: deep technical white papers, push notifications for one of India’s top e-commerce companies, entire brand campaign videos for flagship smartphones. We’ve worked across the board.

Now, most of the software I’ve worked on over the years has focused on augmenting this content production pipeline: tools that create initial drafts, audits that give creators feedback on their work, and everything in between. When GPT came about, it fit really nicely into the whole picture, as you can imagine.

Broadly, you can bucket LM usage into two categories: decision making (intelligence) and content generation. One could argue that content generation intrinsically involves decision making, but we’ll put semantics aside for now.

Let’s focus on content generation, and within that, just text.

Problem Statement

Consider this: if a company asked you to make them 500 web pages, including product/feature pages, support pages, listing pages, etc., what would the ground truth in your eval set look like?

  • Would you go about collecting or scraping similar pages of its competitors?
  • Would you create a synthetic dataset?
  • Would you pay your creators out of your own pocket to make you a sample set?

Mind you, originality, depth, and quality of writing are what your clients are paying for.

  • They want consistency in tone, voice, and style but not so much that it feels programmatically generated.
  • They want you to take inspiration from their competitors but not so much that one starts to resemble the other.
  • They want you to be CREATIVE.

“Well, why didn’t you try fine-tuning?” I hear you asking. We did. It worked for some, for some time.

The Problem

It’s the classic chicken-and-egg problem.
You need the data that you are supposed to create.

It’s the classic triple-constraint problem.
You are required to optimise for cost, speed, and quality.

It’s non-generalisable.
I can’t use the same eval set even for a client’s direct competitor.

Evals don’t work here. No matter how you approach it, the problem is practically circular and extremely path-dependent. Evals, by design, solve for bounded scopes where it’s easy to define and outline the expectations; creative content generation at this scale is anything but bounded.

What we realised was that in the time it took us to turn all the knobs and levers (fine-tune, prompt, eval) and make any amount of good progress, an AI lab had come up with the next iteration of its SOTA model, which, combined with really good prompt and context engineering, took us 90% of the way there.

Do I want to deliver value to the customer or make my test cases pass? The choice was pretty straightforward.

What We Do

So, we ended up adopting an optional-evals policy. We still use evals for our text-to-SQL pipelines, format regressions, etc., but for creative quality we rely on our fellow humans and live user metrics.
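
For those bounded cases, the checks stay almost boringly mechanical. As a rough illustration (the keys and limits below are hypothetical, not our actual pipeline), a format-regression eval can be as plain as a JSON-shape check:

```python
# Illustrative format-regression check for a bounded, well-specified output:
# the model must return JSON with a fixed set of keys. All names and limits
# here are hypothetical, for the sake of the sketch.
import json

REQUIRED_KEYS = {"title", "meta_description", "body"}

def check_page_format(model_output: str) -> bool:
    """Pass/fail is unambiguous here, which is exactly why an eval works."""
    try:
        page = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(page, dict) or not REQUIRED_KEYS.issubset(page):
        return False
    return len(page["meta_description"]) <= 160  # e.g. an SEO-style length cap
```

There’s no equivalent check for “is this page original, on-brand, and worth publishing”, which is exactly where we fall back to humans and live metrics.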

Keep it simple, stupid.