Why Build a Virtual Assistant?
At Wayfair, we strive to continually provide our customers with best-in-class customer service, and we know that it is a key differentiator between us and our competitors. Part of our approach to achieving this involves meeting customers where they are, consistently delivering a delightful experience, and continuously improving our processes. To deliver on these goals, data science partnered with product and engineering teams to build Wayfair’s first Virtual Assistant (VA). The goal is to provide our customers 24/7 support that combines the accessibility of a conversational interface with the efficiency of self-service. The Virtual Assistant can fully automate some contacts, and it can also reduce the time to resolution when customers are speaking with a service consultant by identifying the customer’s intent and the specific product or order in question.
Building the Virtual Assistant poses several interesting data science, product, and engineering challenges. From the product perspective, there are customer-facing challenges: how do we resolve the customer’s issue in an automated and user-friendly way, and what is the most logical conversation flow for each potential intent? From the data science perspective, we need a customer intent taxonomy with comprehensive and mutually exclusive categories. Once we have the taxonomy, we need to label data against it, and finally select a modeling approach with a fast inference time (~100 ms) so that interactions with customers flow as seamlessly as chatting with a live agent. From the engineering perspective, we need to design each component to be plug-and-play for easy testing and development. We also need to consider how our infrastructure and databases will scale as Wayfair’s business continues to grow, and the monitoring necessary to detect potential errors proactively.
Predicting Customer Intent
Data Collection
The first step in this process was to define a taxonomy that covers the many different intentions behind why customers contact customer service. Previous taxonomies used for contact tagging were quite broad and would often combine the intent of the customer with the resolution provided by the customer service representative. To begin, we analyzed approximately 10,000 chats between customers and our representatives to identify the various intents customers have when contacting customer service, as well as the nuances between requests. From this analysis, we derived a hierarchical taxonomy with 11 broad categories and 70 specific subcategories, spanning questions related to item delivery, returns, replacement items, and changes to order information. A hierarchical taxonomy enables flexibility for modeling approaches, taxonomy updates, and different use cases depending on the granularity of intent required. A subset of our taxonomy is shown in Figure 1.
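For illustration, a hierarchical taxonomy like this can be represented as a simple mapping from broad categories to specific subcategories. The sketch below is not our production schema; the “Returns” subcategories mirror the examples shown later in this post, and the rest are placeholders.

```python
# A minimal sketch of a hierarchical intent taxonomy as a mapping from broad
# categories to specific subcategories. The "Returns" entries mirror the examples
# shown later in this post; the rest are illustrative placeholders.
INTENT_TAXONOMY = {
    "Returns": [
        "Return Setup",
        "Return Policy",
        "Return Confirmation",
        "Return Process",
        "Return Label",
    ],
    "Delivery": ["Delivery Status"],  # illustrative subset only
    # ... 11 broad categories and 70 specific subcategories in total
}

# Flattening the broad and specific categories gives the model's label space.
ALL_LABELS = list(INTENT_TAXONOMY) + [
    sub for subs in INTENT_TAXONOMY.values() for sub in subs
]
```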
After defining the intent taxonomy, we needed to collect a dataset to send for manual labeling. Given the frequency distribution of the intents in the taxonomy, we estimated we would need to label ~200,000 chats to have at least 1,000 examples for the lower-frequency intents. Once we collected the dataset, we trained a team to label and QA each of the customer chat samples, or utterances, with each utterance reviewed by three labelers. Approximately 85% of the labeled utterances had agreement between at least two labelers.
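As a quick worked example of that agreement statistic, the check below counts an utterance as having agreement when at least two of its three labels match. The annotations here are made up for illustration; this is not our labeling pipeline.

```python
from collections import Counter

def has_majority_agreement(labels):
    """True if at least two of the three labelers chose the same intent."""
    return Counter(labels).most_common(1)[0][1] >= 2

# Hypothetical annotations: three labeler votes per utterance.
annotations = [
    ["Return Setup", "Return Setup", "Return Process"],    # two labelers agree
    ["Return Policy", "Return Label", "Delivery Status"],  # all three disagree
]

agreement_rate = sum(has_majority_agreement(a) for a in annotations) / len(annotations)
print(f"Share of utterances with majority agreement: {agreement_rate:.0%}")
```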
Model Training
With labeled data in hand, we moved on to training our model. We considered several different language models and modeling frameworks for this problem. Given the complexity of our taxonomy (70 specific categories), the high accuracy needed to ensure a positive customer experience (> 90% accuracy across all intents), and the success another Data Science team at Wayfair had on a similar classification problem, we decided to start by fine-tuning base BERT on our intent classification task. Because the original BERT model was already pretrained on the BookCorpus (800M words) and English Wikipedia (2.5B words), we required far fewer training examples for our classification task. A schematic of the BERT architecture can be found in Figure 2.
We approached this task as a multi-label, multi-class classification problem, because each of our utterances has two labels associated with it (one broad and one specific category). This approach also allowed us to use all of the labels we received from the manual labelers, weighted by the level of agreement. In situations where all the labelers disagree, the labels receive less weight, but the model can still learn from these examples. This also effectively boosted our training data 3x, from ~200,000 to ~600,000 examples.
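One way to realize this weighting, sketched below with hypothetical helper names rather than our production code, is to turn each labeler’s vote into its own multi-hot training example (one bit for the broad intent, one for the specific intent) and weight it by how many of the three labelers cast that same vote.

```python
import numpy as np
from collections import Counter

def build_training_examples(utterance, votes, all_labels):
    """Turn one utterance with three labeler votes into weighted multi-hot targets.

    `votes` is a list of (broad_intent, specific_intent) pairs, one per labeler
    (hypothetical structure for illustration). Each vote becomes its own training
    example, so ~200,000 utterances yield ~600,000 examples, and each example is
    weighted by how many of the three labelers cast that same vote.
    """
    label_index = {label: i for i, label in enumerate(all_labels)}
    vote_counts = Counter(votes)
    examples = []
    for broad, specific in votes:
        target = np.zeros(len(all_labels), dtype=np.float32)
        target[label_index[broad]] = 1.0      # broad category bit
        target[label_index[specific]] = 1.0   # specific category bit
        weight = vote_counts[(broad, specific)] / len(votes)  # 3/3, 2/3, or 1/3
        examples.append((utterance, target, weight))
    return examples
```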
We began with the uncased BERT-base model provided by Google on a TensorFlow backend. We added a single fully connected layer on top of BERT to map the hidden size of 768 to 81 outputs (the number of broad and specific intent classes). We used a binary cross entropy with logits loss function, the Adam optimizer with weight decay (as in the original BERT paper), and a sigmoid activation to convert the model’s output logits to probabilities. We set thresholds within each class to optimize the F1 score. Initial performance using base BERT was promising, with an 80% weighted F1 score across all classes.
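In outline, the classification head and loss look like the following. We originally built this on a TensorFlow backend; the PyTorch sketch below is purely illustrative of the setup described above, and the hyperparameter values shown are assumptions rather than our production settings.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_INTENTS = 81  # 11 broad + 70 specific intent classes

class IntentClassifier(nn.Module):
    """BERT encoder plus a single fully connected layer: 768 hidden units -> 81 logits."""

    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, NUM_INTENTS)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = hidden.last_hidden_state[:, 0, :]  # [CLS] token representation
        return self.classifier(cls_embedding)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = IntentClassifier()

# Binary cross entropy with logits treats each of the 81 classes independently,
# which is what makes the multi-label (broad + specific) setup work; per-example
# reduction lets us apply the agreement weights described above.
loss_fn = nn.BCEWithLogitsLoss(reduction="none")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption

# At inference time, a sigmoid converts the logits to per-class probabilities, and
# per-class thresholds tuned on a validation set to maximize F1 decide the final labels.
```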
While the performance of the base BERT model was promising, its 350 ms inference time was potentially too slow to ensure a seamless customer experience. We decided to transition to distilBERT, which is reported to retain 97% of full BERT’s language understanding while being 60% faster. Using the same modeling approach described above, we leveraged the pretrained model available in Hugging Face’s transformers package on a PyTorch backend. At the time of testing we saw a ~2% drop in weighted F1 score and no change in unweighted F1 or accuracy. Along with comparable performance between the two language models, we saw a large decrease in inference time, down to 88 ms, which fell within the SLAs provided by our product and engineering partners to ensure a seamless customer experience.
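Swapping encoders is largely a one-line change with the transformers package. The snippet below is a rough, illustrative way to sanity-check encoder latency using the public distilbert-base-uncased checkpoint; absolute numbers depend heavily on hardware and the serving stack.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()

utterance = "I was told I would get my return label emailed and I have not received it"
inputs = tokenizer(utterance, return_tensors="pt", truncation=True)

# Rough single-threaded latency check; production numbers depend on hardware,
# batching, and the serving stack, so treat this only as a relative comparison.
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        encoder(**inputs)
    avg_ms = (time.perf_counter() - start) / 100 * 1000

print(f"Average encoder latency: {avg_ms:.1f} ms")
```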
Evaluating Model Performance on Chat Data
To evaluate the performance of the intent prediction model, we used a holdout test set of chat data and performed manual evaluation of the model outputs. Part of the manual evaluation included confirmation that the model was able to understand the nuance between different subcategories. The table below shows some sample predictions the model makes across each unique “Returns” subcategory.
| Sample Customer Utterances | Intent Model Prediction |
| --- | --- |
| “Hi, I am trying to return this duvet set due to quality issues” | Return Setup |
| “I bought chairs from your company and they look very nice in the box, but I am not moving until mid June so i don't want to take them out of the box and put them together until then. How long do I have to return them in the event I don't like them or if they are damaged” | Return Policy |
| “Hi I'm just checking in the return status of an item that I shipped around May 8” | Return Confirmation |
| “I would like to return this item but I do not have the box anymore. How can I return it. here is my order # Wayfair Order Number” | Return Process |
| “I was told I would get my return label emailed and I have not received it” | Return Label |
Because we do not have true ground truth for the intent prediction task, we also evaluated the model on three test sets of varying difficulty, which differ only in how a correct prediction is defined (the matching logic is sketched after the table below): utterances where all three labelers agreed (our “gold” labels); predictions allowed to match any agent label; and predictions required to match a single agent label, chosen by majority vote or at random when no majority existed (our “bronze” labels). The table below shows the model performance across each group, with the model performing best on the gold-labeled test set and worst when randomly selecting among the bronze labels. Although we are most confident in the gold-labeled data, performance there is inflated: these utterances are likely simpler ones where the customer describes their intent clearly, which is exactly what makes it more likely that three agents would agree. In practice, we expect the model performance in production to fall somewhere in the range shown below.
| Test Set | micro-F1 |
| --- | --- |
| Utterances where all three agents agree | 0.946 |
| Predictions match any agent label | 0.895 |
| Match only one agent label via majority vote or random selection | 0.808 |
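The three scoring schemes differ only in how a correct prediction is defined against the three agent labels. The function below is a minimal sketch of that matching logic, not our evaluation code.

```python
import random
from collections import Counter

def is_correct(prediction, agent_labels, scheme):
    """Score one prediction against three agent labels under the schemes in the table."""
    counts = Counter(agent_labels)
    if scheme == "gold":
        # The gold set keeps only utterances where all three agents agree;
        # on that subset the prediction must match the unanimous label.
        if counts.most_common(1)[0][1] < 3:
            return None  # excluded from the gold evaluation set
        return prediction == agent_labels[0]
    if scheme == "any":
        # Credit the model if it matches any of the three agent labels.
        return prediction in agent_labels
    if scheme == "single":
        # Match a single reference label: the majority vote, or a random
        # agent label when no majority exists.
        top_label, top_count = counts.most_common(1)[0]
        reference = top_label if top_count >= 2 else random.choice(agent_labels)
        return prediction == reference
    raise ValueError(f"unknown scheme: {scheme}")
```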
Evaluating Model Performance in Production
After achieving sufficient accuracy on our holdout test set, we launched an A/B test of the Virtual Assistant (VA). This test allowed us to measure customer engagement with the VA, our ability to predict customer intent correctly, and ultimately our ability to automate contacts for our customers. During the A/B test, we noticed a slight decrease in prediction accuracy of ~2% compared to the holdout test set. We attribute the decrease in performance to two factors: a shift in the intent distribution relative to chat utterances, and a change in utterance length.
Figure 3 shows the distribution of intents used to train the intent prediction model (orange) and the distribution of intents predicted by the model during the Virtual Assistant A/B test (blue).
You can see that damage/defect was the most common intent in the chat training data, whereas delivery-related intents (primarily delivery) were the most common for the Virtual Assistant. Additionally, in backtesting, damage, the most common specific intent, had a 0.96 F1 score, compared to 0.87 for delivery, leading to a decrease in overall model performance. This shift in intent distribution highlights the change in intent types our customers have when interacting with a bot versus a live agent. These changes could also be attributed to the specific page types from which the Virtual Assistant is accessible.
Another factor that likely contributed to the decrease in performance from the holdout test set to production was the change in average utterance length between the chat training data and the Virtual Assistant A/B test. Figure 3 shows the utterance word count distribution for the chat training data (orange) and the Virtual Assistant utterances (blue). The data was clipped at a length of 31 words, which is the 75th percentile of the Virtual Assistant utterances.
This chart highlights the difference in communication styles our customers have when interacting with a bot compared to a live agent. Approximately half of the VA utterances are 10 words or fewer, compared to only 25% of the chat data. An additional challenge posed by this shift in utterance length is the set of intents associated with shorter utterances: in the chat training data, a majority of utterances of 10 words or fewer were labeled as “not an intent,” compared to “delivery status” for Virtual Assistant utterances.
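Both of these shifts, the intent mix and the utterance length, are straightforward to track with a simple comparison between the training chats and production traffic. The snippet below is an illustrative sketch of such a check, not our production monitoring.

```python
import numpy as np

def drift_report(train_utterances, va_utterances, train_intents, va_intents):
    """Compare utterance lengths and intent mix between chat training data and VA traffic.

    All arguments are plain lists of strings; this is an illustrative check,
    not our production monitoring.
    """
    train_lengths = np.array([len(u.split()) for u in train_utterances])
    va_lengths = np.array([len(u.split()) for u in va_utterances])

    report = {
        "share_short_train": float((train_lengths <= 10).mean()),  # ~25% in our chat data
        "share_short_va": float((va_lengths <= 10).mean()),        # ~50% for VA utterances
        "length_p75_va": float(np.percentile(va_lengths, 75)),     # ~31 words in the A/B test
    }

    # Change in relative frequency of each intent between the two populations.
    for intent in set(train_intents) | set(va_intents):
        report[f"freq_shift:{intent}"] = (
            va_intents.count(intent) / len(va_intents)
            - train_intents.count(intent) / len(train_intents)
        )
    return report
```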
Launching the Virtual Assistant
To launch the Virtual Assistant, we also needed to build the backend that handles conversation state, sends requests to the intent prediction model, and calls the resolution engine. Initially we tried off-the-shelf solutions from third-party vendors, but found that they led to slow page loads and were not customizable to our unique use case and data. In response, we created the HaloAI platform, a system for executing end-to-end ML lifecycles and software. HaloAI is built on a microservice architecture for scalability and separation of concerns, so artifacts can be released independently. The HaloAI platform allowed us to meet faster go-to-market timelines, learn from past data and experience, iterate faster, and establish a working model between software and ML engineers, Data Science, Product, Analytics, and QA.
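As an illustration of that separation of concerns, the intent prediction model can sit behind a thin HTTP service that the conversation backend calls. The sketch below is hypothetical (FastAPI and the helper names are assumptions, not the HaloAI implementation), but it shows the kind of contract that keeps components independently testable and releasable.

```python
# Hypothetical sketch of an intent-prediction microservice; not the HaloAI implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    utterance: str

class PredictResponse(BaseModel):
    broad_intent: str
    specific_intent: str
    confidence: float

def run_intent_model(utterance):
    """Hypothetical stub standing in for the distilBERT model described above."""
    return "Returns", "Return Setup", 0.93

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    # The conversation backend and resolution engine depend only on this contract,
    # which is what lets each component be tested and released independently.
    broad, specific, confidence = run_intent_model(request.utterance)
    return PredictResponse(broad_intent=broad, specific_intent=specific, confidence=confidence)
```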
What Comes Next?
In the video below you can see an example of the completed Virtual Assistant where a customer is asking for assembly instructions for a recently purchased item.
Now that the Virtual Assistant is live on multiple pages of our site, our next plans are to continue improving prediction accuracy and to add new functionality. We hope to add multilingual capabilities to the Virtual Assistant to support our non-English-speaking customers in the US and in other countries. We’re also looking to incorporate entity recognition to automate the item selection process and support a more seamless customer experience.
Stay tuned for more updates as this product progresses!