Students at Carnegie Mellon University Enhance AI Interactions Through User Trust

Editor’s note: this is a guest post from students in the Master of Human-Computer Interaction (MHCI) program at Carnegie Mellon University, who have been working with CR to imagine and design AI agents that help solve common consumer problems. Read on to learn about their design process and check out other installments in the series.

Our previous blog post explored the many pain points consumers face in the digital marketplace, such as difficulty identifying company contact channels, organizing case information, tracking issue resolution, and getting help building persuasive arguments when advocating for themselves in business interactions. We also introduced our initial design concepts: The Negotiation Helper, Policy Assistant, and CR Wallet. Among these, The Negotiation Helper was particularly novel and risky; it generated excitement from stakeholders and consumers alike and served as the starting point for our concept validation.

This week, our student team at the Human-Computer Interaction Institute at Carnegie Mellon University will discuss how the core elements of The Negotiation Helper evolved through three rounds of concept validation testing. These rounds allowed us to test the riskiest assumptions in our design and converge on a design for an agentic AI that users would find highly desirable.

Concept Validation Testing: Round 1

In our first round of concept validation testing, we aimed to evaluate the riskiest assumptions of our design by focusing on key concepts such as compiling company contact information, customer service ratings, community reviews, simultaneous multi-modal interaction, and building arguments using AI-generated tips.

Using a Wizard of Oz testing protocol, we engaged five participants in a simulated customer service scenario where they called an Xfinity customer service “agent” and received real-time text tips and policy insights from a “Consumer Reports agent.” This approach allowed us to gather insights on user preferences, identify challenges, and assess potential differences in user experience, particularly concerning multi-modal interactions and unconventional features.

The findings from this initial round revealed:

  • Information overload: Simultaneous text messages during calls overwhelmed all users; reading and processing text tips while on the call proved challenging.
  • Information before the call: Many users expressed a desire to receive information and actionable tips before initiating the call with customer service. This could include insights into company policies, scripts for navigating the conversation, or summaries of relevant reviews.
  • Concise information presentation: Users preferred clear, concise information before engaging in the call; they found the combination of star ratings, percentages, and written reviews overwhelming to review beforehand.
  • Limited use cases: A few users preferred to use a company’s official app for customer service issues.
  • Tips deemed valuable: Most users found evidence-based policy arguments and tips valuable.

Concept Validation Testing: Round 2

Based on these findings from the first round of testing, in our second round of concept validation we reconfigured how real-time, text-based policy tips were combined with voice interactions with a customer service agent. Specifically, we evaluated several concepts: asynchronous multi-modal interaction (tips before a customer service call), synchronous single-modal interaction (in-chat tips during a customer service call), building arguments using AI-generated tips, actionable AI-summarized tips from customer service reviews, and an AI agent acting on the user’s behalf.

Reconfiguring LLM Policies Across Different Modalities

We conducted tests with 10 participants using a Wizard of Oz protocol. Participants dealt with a TV replacement issue and either called customer service, receiving pre-call tips, or engaged in a live chat, receiving real-time tips. Additionally, we tested a “CR Wizard” (AI agent) calling customer service on behalf of the participants, with tests including a regular-sounding AI agent that made no mistakes and a “friendly” AI agent that made intentional errors, which participants could correct via text or by taking over the call. The objectives were to determine the value and effectiveness of pre-call and real-time tips, gather feedback on AI agents acting on users’ behalf, evaluate AI error discoverability, assess reactions to AI errors and agent tone, and determine the value of community-sourced tips.

Findings revealed that tips provided before customer service calls were generally well-received, with 60% of users favoring them and 40% neutral. However, some participants, like P2, were uncomfortable citing policy information because they did not fully trust it and feared confrontation or using the information incorrectly. In contrast, in-chat tips during customer service interactions were viewed overwhelmingly favorably and will be retained. Actionable AI-summarized tips from customer service reviews before a call were eliminated, as most participants felt neutral about their value.

CR Wizard Acting on Behalf of Users

The “CR Wizard” concept was well-received, with 60% of users favoring it and an additional 20% feeling neutral. However, only 50% of participants noticed the AI’s errors, indicating a need to improve error discoverability. When errors were identified, 60% of users preferred to correct them via text rather than by voice interjection, suggesting that both options should be available to serve different user archetypes. Prompt repair of AI errors gave participants a sense of relief and control. Preferences for the AI agent’s tone were split 50-50, indicating the need for customizable persona options.

Based on these insights, we learned that users need to be able to customize their desired outcomes, such as opting for a refund instead of a replacement. We also need to provide live transcription to improve error discoverability and allow participants to multitask more effectively. This round of testing also highlighted the need to assess users’ tolerance for mistakes so we can maintain trust after an AI error, and to improve repair interactions by offering both text and takeover options.

Concept Validation Testing: Round 3

In our third round of concept testing, we continued exploring novel AI interactions by focusing on 1) assessing users’ tolerance of AI mistakes and 2) evaluating the ideal intervention interactions for the CR Wizard once an error was identified. Our team strove to understand reactions to AI agent mistakes and to identify the most effective post-error responses for maintaining user trust in the product. Specifically, we evaluated the ease and experience of a swipe interaction to take over the call, the transition time for a takeover, and the utility of and options for pre-written text interjections.

Evaluating and Maintaining Consumer Trust Post AI Errors

Participants were presented with scenarios involving a TV replacement issue to gauge their tolerance for AI errors. After establishing the CR Wizard’s potential value (Storyboard #1), they viewed scenarios where the Wizard made mistakes. Participants generally forgave date errors (Storyboard #2) but held the CR Wizard and Consumer Reports accountable for more significant errors, like hallucinated policy clauses (Storyboard #3). Despite the mistakes, many were willing to retry the product once it improved. They expected apologies, feedback channels, and compensation, though some felt that nothing short of complete accuracy could repair the error.

Error Repair Interactions

Following storyboard sessions, participants were asked to resolve a TV issue using the CR Wizard. During a pre-recorded conversation, the CR Wizard made a mistake, allowing participants to interject via text or take over the call. This helped us observe error repair interactions and gather feedback.

Seventy percent of participants wanted the AI agent to handle calls independently, expressing frustration with the need to actively monitor the call. They felt it defeated the purpose of having an AI agent, as they still had to invest time and attention, which made them feel they might as well handle the call themselves. Additionally, some participants felt discomfort from the power imbalance of listening to a human agent talk to an AI agent while they were on the call, with one noting it felt like “playing God” and expressing a desire to be completely separate from the call and only receive a report afterward. The high cognitive load, discomfort, and anxiety associated with interjecting, calling out errors, and monitoring the AI’s interactions further underscored the need for the AI to operate autonomously to enhance user experience and trust.

To address these pain points, we redesigned the CR Wizard to eliminate the need for users to monitor calls. Now, users receive verification prompts before the call and can opt for updates during the call. Before the call, they verify personal information and review the AI’s script. During the call, pop-up notifications require user verification for key actions. If the user does not respond to the Wizard during the call, the rep is put on hold, and feedback is provided to the customer service rep with a timed delay before ending the call. After the call, users receive a summary and transcript, enhancing the user experience and trust in the CR Wizard.
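
To make the redesigned flow concrete, here is a minimal sketch of the pre-call, during-call, and post-call logic described above. The class, method names, and timeout value are hypothetical illustrations we use for clarity, not part of an actual CR system; the sketch simply shows the verification prompts, the hold-and-delay behavior when the user is unreachable, and the post-call report.

```python
# Hypothetical sketch of the redesigned CR Wizard call flow (illustration only).
from dataclasses import dataclass, field
from typing import Optional

USER_RESPONSE_TIMEOUT_SEC = 60  # assumed grace period before ending a stalled call


@dataclass
class WizardCallFlow:
    personal_info_verified: bool = False
    script_approved: bool = False
    transcript: list = field(default_factory=list)

    def pre_call_verification(self, confirm) -> bool:
        """Before the call: the user verifies personal info and reviews the AI's script."""
        self.personal_info_verified = confirm("Is your personal information correct?")
        self.script_approved = confirm("Do you approve the call script?")
        return self.personal_info_verified and self.script_approved

    def request_key_action(self, ask_with_timeout, action: str) -> bool:
        """During the call: key actions pop up a verification prompt.

        If the user does not respond in time, the rep is put on hold and told why,
        and the call ends after a timed delay.
        """
        approved: Optional[bool] = ask_with_timeout(f"Approve: {action}?", USER_RESPONSE_TIMEOUT_SEC)
        if approved is None:
            self.transcript.append("Rep placed on hold; user unreachable; call ended after delay.")
            return False
        self.transcript.append(f"Action '{action}' {'approved' if approved else 'declined'} by user.")
        return approved

    def post_call_report(self) -> dict:
        """After the call: the user receives a summary and the full transcript."""
        return {"summary": "Call outcome summary", "transcript": list(self.transcript)}


# Example usage with stubbed user responses:
flow = WizardCallFlow()
if flow.pre_call_verification(confirm=lambda prompt: True):
    flow.request_key_action(lambda prompt, timeout: True, "Request TV replacement")
    print(flow.post_call_report())
```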

Next time…

In brief, through three rounds of concept validation testing, we identified challenges such as information overload, user discomfort with AI interactions, and the need for better error repair mechanisms. These insights led to significant design changes that enhance user experience and trust: removing consumers from direct call monitoring and implementing pre-call verifications, optional during-call updates, and post-call summaries. Through this research-through-design process, our initial concept of the Negotiation Helper has evolved substantially into the AI-powered CR Wizard.

Tune in next month for our final blog post, where we showcase our final design (perhaps a potential new product name?), aligning user desirability, technical feasibility, and business viability into a comprehensive, innovative product.
