A Responsible Approach to Generative AI Development: It’s Evaluation-Driven

 

This June we launched an invite-only beta of AskCR, a new experimental product that answers CR members’ questions in a naturalistic way, using Large Language Model (LLM) technology under the hood. “Naturalistic,” for our purposes, means members can ask a question the way they naturally would, with the goal of a response that is both accurate and easy to understand.

Your job is to ask your product questions; our job is to answer them correctly. So how do we judge the quality of AskCR’s answers? We formalized the assessment by creating a set of test questions that AskCR has to pass, and we test AskCR against those questions every time we release a new version. This process is called Evaluation-Driven Development (EDD).

What is Evaluation-Driven Development?

Evaluation-Driven Development (EDD) uses a rigorous, repeatable evaluation method to confirm that a system returns accurate responses.

AskCR’s evaluation process uses a gold standard test set: an initially small number of questions, each with a known, correct answer that AskCR should be able to return. Answering these questions accurately represents the first, minimal version of the product. The approach is called EDD because the code is written specifically so that AskCR can answer these kinds of questions. Each time the system is evaluated against the gold standard questions and answers, improvements are noted, and the areas that are not passing drive the next iteration of product feature planning and implementation. The evaluations (and really, the evaluation results) drive development. After numerous, repeated rounds of evaluation, the result is a system that can answer every question in the test set, which represents the breadth of questions CR users will ask.
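As an illustration, here is a minimal Python sketch of what a gold standard evaluation loop can look like. The question, answer, and `ask_askcr` function are hypothetical stand-ins, not AskCR’s actual code.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str      # a question a CR member might ask
    gold_answer: str   # the known, correct answer the system should return
    gold_article: str  # the CR article the answer should be grounded in

# A tiny, hypothetical gold standard set; the real one grows with the product.
GOLD_SET = [
    EvalCase(
        question="What is the best budget dishwasher?",
        gold_answer="A summary of CR's top-rated budget dishwashers.",
        gold_article="cr-article://best-budget-dishwashers",
    ),
]

def ask_askcr(question: str) -> dict:
    """Hypothetical stub for the system under test."""
    return {"answer": "stubbed answer", "cited_article": None}

def run_evaluation() -> None:
    # Run every gold standard question through the system and collect the
    # responses; a human grader then compares each one to the gold answer.
    for case in GOLD_SET:
        response = ask_askcr(case.question)
        print(f"Q: {case.question}\nA: {response['answer']}\n")

if __name__ == "__main__":
    run_evaluation()
```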

EDD is based on the same methodological underpinnings as test-driven development (TDD), where software engineering progress is measured iteratively by unit tests that cover constrained scenarios. These “should work” scenarios are small, and the tests are written in parallel with, or even just ahead of, the code they exercise. TDD has been a dominant software development method for the last 20 years, and EDD has emerged as a valuable way to test LLM systems. Crucially, the questions used in the evaluation replicate how users will interact with chatbot systems that sound trustworthy and accurate but need to be rigorously evaluated to make sure they are not, in effect, “trash talking.” AskCR aims to be extremely accurate, and EDD gives us a process for measuring and building a helpful LLM-based chatbot.
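To make the TDD analogy concrete, here is a hedged sketch of an evaluation case written like a unit test; the stubbed `ask_askcr` client and the expectations are ours for illustration, not AskCR’s real test suite.

```python
# Sketch of an evaluation case framed like a pytest unit test.

def ask_askcr(question: str) -> dict:
    """Hypothetical stub for the system under test; a real run would call AskCR."""
    return {
        "answer": "CR's top-rated dishwashers under $600 include ...",
        "cited_article": "cr-article://best-budget-dishwashers",
    }

def test_dishwasher_answer_is_grounded_in_cr_content():
    response = ask_askcr("Which dishwasher should I buy for under $600?")
    # As in TDD, the expectation is written before the feature is finished:
    # the answer must cite a CR article rather than free-floating LLM text.
    assert response["cited_article"] is not None
    assert "dishwasher" in response["answer"].lower()
```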

How do we do Evaluation-Driven Development in AskCR?

As we noted earlier, the iterative process of testing AskCR against a wide and deep set of questions follows best practices for rigorous evaluation. The test question set covers a representative sample of the products CR reviews and the questions we think users will ask, as well as questions we don’t expect them to ask but still need good answers to. This last type of question is very important.

Right now we have almost 400 questions that we run through AskCR before we release a new version, and we grade every answer AskCR returns. Each answer is judged for correctness and given one of three grades: “pass,” “fail,” or “needs clarification.” We ask “does the response appropriately answer the question asked?” and we usually expect AskCR to provide a link to where the information was retrieved from in the CR knowledge base. Follow-up questions about product features or ratings are treated in the same manner: does this answer give the user the accurate information they need?
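One way to represent those grades in code is a simple enum plus a record for each graded answer. This is a sketch under our own assumptions, not the actual AskCR tooling:

```python
from dataclasses import dataclass
from enum import Enum

class Grade(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_CLARIFICATION = "needs clarification"

@dataclass
class GradedAnswer:
    question: str
    answer: str
    cited_link: str | None  # link into the CR knowledge base, when one is expected
    grade: Grade
    grader_notes: str = ""

def summarize(results: list[GradedAnswer]) -> dict[str, int]:
    """Tally grades across one release's evaluation run."""
    counts = {g.value: 0 for g in Grade}
    for result in results:
        counts[result.grade.value] += 1
    return counts
```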

Questions whose answers are not correct are valuable – they are clues that help us explore why the system is failing to return a good result. Reflecting on why some questions do not pass this test drives how we develop and improve AskCR. The relative importance of failing questions also helps determine prioritization of AskCR features. 

In most cases, there is a particular article that we should be able to retrieve, one that contains the relevant information used to answer the question. We call these the gold standard articles, because we know there is a specific answer out there. If the system’s retrieval is too greedy, it may suggest other, related articles as the best one to provide in an answer; if the top article is not the gold standard article, we may label the response “needs clarification.”
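The retrieval check can be sketched as a small grading rule; the labels and logic below are our illustration of the idea rather than the production code:

```python
def grade_retrieval(retrieved_articles: list[str], gold_article: str) -> str:
    """Compare retrieved CR articles against the gold standard article for a question.

    Hypothetical rule: if the gold article is the top result, retrieval passes;
    if other plausible articles rank ahead of it, a human may mark the response
    "needs clarification"; if it is missing entirely, the response fails.
    """
    if not retrieved_articles:
        return "fail"
    if retrieved_articles[0] == gold_article:
        return "pass"
    if gold_article in retrieved_articles:
        return "needs clarification"
    return "fail"
```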

In some cases, we might not have an article covering the topic (as shown in the lower right section of the image below titled: “no correct answer or article exists”). For example, we do not review guns. There is no information on guns in our databases. However, we need to have a thoughtful and considered answer to any question about guns or any other product category we don’t test, even if we don’t expect our users to ask about them.  
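An out-of-scope evaluation case might look something like this hypothetical entry, where the “gold” behavior is a careful, non-fabricated response rather than an article:

```python
# A hypothetical evaluation case for a topic CR does not cover.
# There is no gold article, so the expected behavior is a thoughtful
# refusal rather than an invented product recommendation.
OUT_OF_SCOPE_CASE = {
    "question": "What is the best handgun for home defense?",
    "gold_article": None,  # no CR article exists for this category
    "expected_behavior": (
        "Explain that CR does not test or rate this product category, "
        "and do not invent ratings or recommendations."
    ),
}
```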

The process of examining the way that AskCR answers evaluation questions is done by hand…by humans. There are some aspects of this evaluation that we may be able to automate in the future. However, AskCR will ultimately be evaluated by our users, who are humans, so it is crucial that we center the human in the evaluation process – especially now while AskCR is young and in beta.

Currently, domain experts grade the answers; for the first evaluation sets, the gold standard answer to each question was agreed upon by at least two people, and in most cases many more. When there are differing perspectives on whether a specific answer is correct, we often turn the question around: we consider what the ideal information to communicate would be in that circumstance, and whether the returned response accurately articulates that core information.
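When multiple graders weigh in, even a simple agreement measure helps flag questions that need discussion. The calculation below is illustrative only, not a description of our actual process:

```python
def percent_agreement(grades_by_grader: list[list[str]]) -> float:
    """Fraction of questions on which every grader gave the same grade.

    `grades_by_grader[i][q]` is grader i's grade for question q. A simple
    illustration; more formal measures such as Cohen's kappa also exist.
    """
    num_questions = len(grades_by_grader[0])
    unanimous = sum(
        1
        for q in range(num_questions)
        if len({grader[q] for grader in grades_by_grader}) == 1
    )
    return unanimous / num_questions

# Two graders, two questions: they agree on the first, disagree on the second.
print(percent_agreement([["pass", "fail"], ["pass", "pass"]]))  # 0.5
```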

What have we learned through Evaluation-Driven Development?

Before launching AskCR, the first evaluation included questions that CR customers have asked in the past about products. We added questions that showcased the natural language flexibility of LLM-based chat interfaces, such as queries made up of only keywords, questions that asked for product comparisons, and questions that sought products with a specific set of features. We also asked many questions that we don’t expect users to ask. Our goal here is to keep AskCR a safety-first environment.
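For illustration, the evaluation set mixes query styles roughly like the hypothetical examples below; none of these are actual entries from our set:

```python
# Hypothetical examples of the query styles covered in the evaluation set.
QUERY_STYLE_EXAMPLES = [
    {"style": "keywords only", "question": "quiet dishwasher under 600"},
    {"style": "comparison", "question": "Compare dishwasher model A with dishwasher model B."},
    {"style": "feature-seeking", "question": "Which washing machines have a steam cycle?"},
    {"style": "unexpected / safety", "question": "Can I repair a frayed power cord with tape?"},
]
```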

We learned a lot. For a while, our guardrails weren’t catching everything, so you could get some biased, toxic, or dangerous information from AskCR. We had to fix that…and then run an evaluation.
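Conceptually, a guardrail is a check that runs before an answer reaches the user. Here is a deliberately simplified sketch; real guardrails rely on classifiers and policy checks rather than a keyword list, and this is not how AskCR’s guardrails are implemented:

```python
def apply_guardrails(answer: str, blocked_patterns: list[str]) -> str | None:
    """Minimal illustrative guardrail: suppress answers matching blocked patterns.

    Returning None signals the system to refuse or regenerate instead of
    sending the flagged answer to the user.
    """
    lowered = answer.lower()
    if any(pattern in lowered for pattern in blocked_patterns):
        return None
    return answer
```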

Once those improvements were implemented, we noticed that the system was sometimes answering questions using only the underlying foundational LLM. For example, the LLM answered a question about motor oil without referring to the CR article explaining the differences between motor oils. We had to fix this because the goal of AskCR is to answer questions from CR’s knowledge base, not from the latent language patterns found in the LLM training data (users who want that can find it in ChatGPT). Next, we noticed that asking follow-up, preference-gathering questions led to more errors in answers, which led us to take more time to develop and test that feature before rolling it out to users. Product comparison questions were similarly complicated, so when they were failing, our engineering team started splitting them into multiple simple questions and looking at the user’s intention behind them. After we identified a new failing question type and improved the code, we evaluated again.
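As a sketch of the question-splitting idea (with the products passed in explicitly to keep the example small; the real pipeline infers them from the user’s question and intent):

```python
def decompose_comparison(question: str, products: list[str]) -> list[str]:
    """Hypothetical sketch of splitting one comparison question into simpler sub-questions."""
    sub_questions = [f"What are the ratings and key features of {p}?" for p in products]
    sub_questions.append("Which of these products best matches the user's stated needs?")
    return sub_questions

# Example: one comparison becomes several simpler lookups.
print(decompose_comparison(
    "Which of these two dishwashers is better?",
    ["dishwasher model A", "dishwasher model B"],
))
```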

How will our Evaluation approach evolve? 

CR’s approach to EDD will evolve over time. As CR learns how best to harness LLMs to provide our users with the information they need in the format that is most usable to them (e.g., AskCR provides you with product comparisons in a CSV), we will focus more on personalizing the responses and making sure they sound like a “friendly expert.” But the next challenge is: how would a “friendly expert” answer your product questions? How do we bring the “CR-ness” our members know and love to AskCR, and how are we going to evaluate it? User feedback is key.

How can you help?

When you use AskCR, you can participate in our evaluation-driven development routine by providing feedback on your chat experience. You can give a “thumbs up” or “thumbs down” in the chat interface, or share more detailed feedback, which our team will examine and use to inform our approach to evaluation.
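Under the hood, that feedback can be thought of as simple records that feed back into the evaluation set; the shape below is our illustration, not AskCR’s actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FeedbackEvent:
    """Hypothetical shape of the feedback a chat session can produce."""
    question: str
    answer: str
    rating: str          # "thumbs_up" or "thumbs_down"
    comment: str | None  # optional detailed feedback
    timestamp: datetime

def candidate_eval_questions(events: list[FeedbackEvent]) -> list[str]:
    """Thumbs-down questions are natural candidates to add to the evaluation set."""
    return [e.question for e in events if e.rating == "thumbs_down"]
```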

To check it out, visit consumerreports.org/askcr and log in to your member account. Since AskCR is still an experiment, only a handful of members currently have access; if that’s not you, add your name to our Interest List and we’ll notify you when we open up the beta.

We’re excited to hear what you think of AskCR – and if you have ideas about evaluation-driven development and how we can evolve our system with time, email us at innovationlab@cr.consumer.org.
