Why Build a Virtual Assistant?
At Wayfair, we strive to continually provide our customers with best-in-class customer service, and we know that it is a key differentiator between us and our competitors. Part of our approach to achieving this involves meeting customers where they are, consistently delivering a delightful experience, and continuously improving our processes. To deliver on these goals, data science partnered with product and engineering teams to build Wayfair’s first Virtual Assistant (VA). The goal is to provide our customers 24/7 support that combines the accessibility of a conversational interface with the efficiency of self-service. The Virtual Assistant can fully automate some contacts, and it can also reduce the time to resolution when customers are speaking with a service consultant by identifying the customer’s intent and the specific product or order in question.
Building the Virtual Assistant poses several interesting data science, product, and engineering challenges. From the product perspective, there are customer-facing challenges: how do we resolve the customer’s issue in an automated and user-friendly way, and what is the most logical conversation flow for each potential intent? From the data science perspective, we need a customer intent taxonomy with comprehensive and mutually exclusive categories. Once we have the taxonomy, we need to label data against it, and finally select a modeling approach with a fast inference time (~100 ms) so that interactions with customers flow as seamlessly as chatting with a live agent. From the engineering perspective, we need to design each component to be plug-and-play for easy testing and development. We also need to consider how our infrastructure and databases will scale as Wayfair’s business continues to grow, and the monitoring necessary to detect potential errors proactively.
Predicting Customer Intent
Data Collection
The first step in this process was to define a taxonomy that covers the many different intentions behind why customers contact customer service. Previous taxonomies used for contact tagging were quite broad and would often combine the intent of the customer with the resolution provided by the customer service representative. To begin, we analyzed approximately 10,000 chats between customers and our representatives to identify the various intents customers have when contacting customer service, as well as the nuances between requests. From this analysis, we derived a hierarchical taxonomy with 11 broad categories and 70 specific subcategories, spanning questions related to item delivery, returns, replacement items, and changes to order information. A hierarchical taxonomy enables flexibility for modeling approaches, taxonomy updates, and different use cases depending on the granularity of intent required. A subset of our taxonomy is shown in Figure 1.
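For illustration, a hierarchical taxonomy like this can be represented as a simple mapping from broad categories to specific subcategories. The sketch below is not our production schema; the “Returns” subcategories mirror the examples shown later in this post, and the rest are placeholders.

```python
# A minimal sketch of a hierarchical intent taxonomy as a mapping from broad
# categories to specific subcategories. The "Returns" entries mirror the examples
# shown later in this post; the rest are illustrative placeholders.
INTENT_TAXONOMY = {
    "Returns": [
        "Return Setup",
        "Return Policy",
        "Return Confirmation",
        "Return Process",
        "Return Label",
    ],
    "Delivery": ["Delivery Status"],  # illustrative subset only
    # ... 11 broad categories and 70 specific subcategories in total
}

# Flattening the broad and specific categories gives the model's label space.
ALL_LABELS = list(INTENT_TAXONOMY) + [
    sub for subs in INTENT_TAXONOMY.values() for sub in subs
]
```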
After defining the intent taxonomy, we needed to collect a dataset to send for manual labeling. Given the frequency distribution of the intents in the taxonomy, we estimated we would need to label ~200,000 chats to have at least 1,000 examples for the lower-frequency intents. Once we collected the dataset, we trained a team to label and QA each of the customer chat samples, or utterances, with each utterance reviewed by three labelers. Approximately 85% of the labeled utterances had agreement between at least two labelers.
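As a quick worked example of that agreement statistic, the check below counts an utterance as having agreement when at least two of its three labels match. The annotations here are made up for illustration; this is not our labeling pipeline.

```python
from collections import Counter

def has_majority_agreement(labels):
    """True if at least two of the three labelers chose the same intent."""
    return Counter(labels).most_common(1)[0][1] >= 2

# Hypothetical annotations: three labeler votes per utterance.
annotations = [
    ["Return Setup", "Return Setup", "Return Process"],    # two labelers agree
    ["Return Policy", "Return Label", "Delivery Status"],  # all three disagree
]

agreement_rate = sum(has_majority_agreement(a) for a in annotations) / len(annotations)
print(f"Share of utterances with majority agreement: {agreement_rate:.0%}")
```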
Model Training
With labeled data in hand, we moved on to training our model. We considered several different language models and modeling frameworks for this problem. Given the complexity of our taxonomy (70 specific categories), the high accuracy needed to ensure a positive customer experience (> 90% accuracy across all intents), and the success another Data Science team at Wayfair had on a similar classification problem, we decided to start by fine-tuning base BERT on our intent classification task. Because the original BERT model was already pretrained on the BookCorpus (800M words) and English Wikipedia (2.5B words), we required far fewer training examples for our classification task. A schematic of the BERT architecture can be found in Figure 2.
We approached this task as a multi-label, multi-class classification problem, because each of our utterances has two labels associated with it (one broad and one specific category). This approach also allowed us to use all of the labels we received from the manual labelers, weighted by the level of agreement. In situations where all the labelers disagree, the labels receive less weight, but the model can still learn from these examples. This also effectively boosted our training data 3x, from ~200,000 to ~600,000 examples.
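One way to realize this weighting, sketched below with hypothetical helper names rather than our production code, is to turn each labeler’s vote into its own multi-hot training example (one bit for the broad intent, one for the specific intent) and weight it by how many of the three labelers cast that same vote.

```python
import numpy as np
from collections import Counter

def build_training_examples(utterance, votes, all_labels):
    """Turn one utterance with three labeler votes into weighted multi-hot targets.

    `votes` is a list of (broad_intent, specific_intent) pairs, one per labeler
    (hypothetical structure for illustration). Each vote becomes its own training
    example, so ~200,000 utterances yield ~600,000 examples, and each example is
    weighted by how many of the three labelers cast that same vote.
    """
    label_index = {label: i for i, label in enumerate(all_labels)}
    vote_counts = Counter(votes)
    examples = []
    for broad, specific in votes:
        target = np.zeros(len(all_labels), dtype=np.float32)
        target[label_index[broad]] = 1.0      # broad category bit
        target[label_index[specific]] = 1.0   # specific category bit
        weight = vote_counts[(broad, specific)] / len(votes)  # 3/3, 2/3, or 1/3
        examples.append((utterance, target, weight))
    return examples
```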
We began with the uncased BERT-base model provided by Google on a TensorFlow backend. We added a single fully connected layer on top of BERT to map the hidden size of 768 to 81 outputs (the number of broad and specific intent classes). We used a binary cross entropy with logits loss function, the Adam optimizer with weight decay (as in the original BERT paper), and a sigmoid activation to convert the model’s output logits to probabilities. We set thresholds within each class to optimize the F1 score. Initial performance using base BERT was promising, with an 80% weighted F1 score across all classes.
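In outline, the classification head and loss look like the following. We originally built this on a TensorFlow backend; the PyTorch sketch below is purely illustrative of the setup described above, and the hyperparameter values shown are assumptions rather than our production settings.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_INTENTS = 81  # 11 broad + 70 specific intent classes

class IntentClassifier(nn.Module):
    """BERT encoder plus a single fully connected layer: 768 hidden units -> 81 logits."""

    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, NUM_INTENTS)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = hidden.last_hidden_state[:, 0, :]  # [CLS] token representation
        return self.classifier(cls_embedding)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = IntentClassifier()

# Binary cross entropy with logits treats each of the 81 classes independently,
# which is what makes the multi-label (broad + specific) setup work; per-example
# reduction lets us apply the agreement weights described above.
loss_fn = nn.BCEWithLogitsLoss(reduction="none")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption

# At inference time, a sigmoid converts the logits to per-class probabilities, and
# per-class thresholds tuned on a validation set to maximize F1 decide the final labels.
```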
While the performance of the base BERT model was promising, its 350 ms inference time was potentially too slow to ensure a seamless customer experience. We decided to transition to distilBERT, which is reported to retain 97% of full BERT’s language understanding while being 60% faster. Using the same modeling approach described above, we leveraged the pretrained model available in Hugging Face’s transformers package on a PyTorch backend. At the time of testing we saw a ~2% drop in weighted F1 score and no change in unweighted F1 or accuracy. Along with comparable performance between the two language models, we saw a large decrease in inference time, down to 88 ms, which fell within the SLAs provided by our product and engineering partners to ensure a seamless customer experience.
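Swapping encoders is largely a one-line change with the transformers package. The snippet below is a rough, illustrative way to sanity-check encoder latency using the public distilbert-base-uncased checkpoint; absolute numbers depend heavily on hardware and the serving stack.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()

utterance = "I was told I would get my return label emailed and I have not received it"
inputs = tokenizer(utterance, return_tensors="pt", truncation=True)

# Rough single-threaded latency check; production numbers depend on hardware,
# batching, and the serving stack, so treat this only as a relative comparison.
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        encoder(**inputs)
    avg_ms = (time.perf_counter() - start) / 100 * 1000

print(f"Average encoder latency: {avg_ms:.1f} ms")
```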
Evaluating Model Performance on Chat Data
To evaluate the performance of the intent prediction model, we used a holdout test set of chat data and performed manual evaluation of the model outputs. Part of the manual evaluation included confirmation that the model was able to understand the nuance between different subcategories. The table below shows some sample predictions the model makes across each unique “Returns” subcategory.
| Sample Customer Utterances | Intent Model Prediction |
| --- | --- |
| “Hi, I am trying to return this duvet set due to quality issues” | Return Setup |
| “I bought chairs from your company and they look very nice in the box, but I am not moving until mid June so i don't want to take them out of the box and put them together until then. How long do I have to return them in the event I don't like them or if they are damaged” | Return Policy |
| “Hi I'm just checking in the return status of an item that I shipped around May 8” | Return Confirmation |
| “I would like to return this item but I do not have the box anymore. How can I return it. here is my order # Wayfair Order Number” | Return Process |
| “I was told I would get my return label emailed and I have not received it” | Return Label |
Because we do not have true ground truth for the intent prediction task, we also evaluated the model on three test sets of varying difficulty, which differ only in how a correct prediction is defined (the matching logic is sketched after the table below): utterances where all three labelers agreed (our “gold” labels); predictions allowed to match any agent label; and predictions required to match a single agent label, chosen by majority vote or at random when no majority existed (our “bronze” labels). The table below shows the model performance across each group, with the model performing best on the gold-labeled test set and worst when randomly selecting among the bronze labels. Although we are most confident in the gold-labeled data, performance there is inflated: these utterances are likely simpler ones where the customer describes their intent clearly, which is exactly what makes it more likely that three agents would agree. In practice, we expect the model performance in production to fall somewhere in the range shown below.
| Test Set | micro-F1 |
| --- | --- |
| Utterances where all three agents agree | 0.946 |
| Predictions match any agent label | 0.895 |
| Match only one agent label via majority vote or random selection | 0.808 |
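The three scoring schemes differ only in how a correct prediction is defined against the three agent labels. The function below is a minimal sketch of that matching logic, not our evaluation code.

```python
import random
from collections import Counter

def is_correct(prediction, agent_labels, scheme):
    """Score one prediction against three agent labels under the schemes in the table."""
    counts = Counter(agent_labels)
    if scheme == "gold":
        # The gold set keeps only utterances where all three agents agree;
        # on that subset the prediction must match the unanimous label.
        if counts.most_common(1)[0][1] < 3:
            return None  # excluded from the gold evaluation set
        return prediction == agent_labels[0]
    if scheme == "any":
        # Credit the model if it matches any of the three agent labels.
        return prediction in agent_labels
    if scheme == "single":
        # Match a single reference label: the majority vote, or a random
        # agent label when no majority exists.
        top_label, top_count = counts.most_common(1)[0]
        reference = top_label if top_count >= 2 else random.choice(agent_labels)
        return prediction == reference
    raise ValueError(f"unknown scheme: {scheme}")
```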
Evaluating Model Performance in Production
After achieving sufficient accuracy on our holdout test set, we launched an A/B test of the Virtual Assistant (VA). This test allowed us to measure customer engagement with the VA, our ability to predict customer intent correctly, and ultimately our ability to automate contacts for our customers. During the A/B test, we noticed a slight decrease in prediction accuracy of ~2% compared to the holdout test set. We attribute the decrease in performance to two factors: a shift in the intent distribution relative to chat utterances, and a change in utterance length.
Figure 3 shows the distribution of intents used to train the intent prediction model (orange) and the distribution of intents predicted by the model during the Virtual Assistant A/B test (blue).
You can see that damage/defect was the most common intent in the chat training data, whereas delivery-related intents (primarily delivery) were the most common for the Virtual Assistant. Additionally, in backtesting, damage, the most common specific intent, had a 0.96 F1 score, compared to 0.87 for delivery, leading to a decrease in overall model performance. This shift in intent distribution highlights the change in intent types our customers have when interacting with a bot versus a live agent. These changes could also be attributed to the specific page types from which the Virtual Assistant is accessible.
Another factor that likely contributed to the decrease in performance from the holdout test set to production was the change in average utterance length between the chat training data and the Virtual Assistant A/B test. Figure 3 shows the utterance word count distribution for the chat training data (orange) and the Virtual Assistant utterances (blue). The data was clipped at a length of 31 words, which is the 75th percentile of the Virtual Assistant utterances.
This chart highlights the difference in communication styles our customers have when interacting with a bot compared to a live agent. Approximately half of the VA utterances are 10 words or fewer, compared to only 25% of the chat data. An additional challenge posed by this shift in utterance length is the set of intents associated with shorter utterances: in the chat training data, a majority of utterances of 10 words or fewer were labeled as “not an intent,” compared to “delivery status” for Virtual Assistant utterances.
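Both of these shifts, the intent mix and the utterance length, are straightforward to track with a simple comparison between the training chats and production traffic. The snippet below is an illustrative sketch of such a check, not our production monitoring.

```python
import numpy as np

def drift_report(train_utterances, va_utterances, train_intents, va_intents):
    """Compare utterance lengths and intent mix between chat training data and VA traffic.

    All arguments are plain lists of strings; this is an illustrative check,
    not our production monitoring.
    """
    train_lengths = np.array([len(u.split()) for u in train_utterances])
    va_lengths = np.array([len(u.split()) for u in va_utterances])

    report = {
        "share_short_train": float((train_lengths <= 10).mean()),  # ~25% in our chat data
        "share_short_va": float((va_lengths <= 10).mean()),        # ~50% for VA utterances
        "length_p75_va": float(np.percentile(va_lengths, 75)),     # ~31 words in the A/B test
    }

    # Change in relative frequency of each intent between the two populations.
    for intent in set(train_intents) | set(va_intents):
        report[f"freq_shift:{intent}"] = (
            va_intents.count(intent) / len(va_intents)
            - train_intents.count(intent) / len(train_intents)
        )
    return report
```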
Launching the Virtual Assistant
To launch the Virtual Assistant, we also needed to build the backend that handles conversation state, sends requests to the intent prediction model, and calls the resolution engine. Initially we tried off-the-shelf solutions from third-party vendors, but found that they led to slow page loads and were not customizable to our unique use case and data. In response, we created the HaloAI platform, a system for executing end-to-end ML lifecycles and software. HaloAI is built on a microservice architecture for scalability and separation of concerns, so artifacts can be released independently. The HaloAI platform allowed us to meet faster go-to-market timelines, learn from past data and experience, iterate faster, and establish a working model between software and ML engineers, Data Science, Product, Analytics, and QA.
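As an illustration of that separation of concerns, the intent prediction model can sit behind a thin HTTP service that the conversation backend calls. The sketch below is hypothetical (FastAPI and the helper names are assumptions, not the HaloAI implementation), but it shows the kind of contract that keeps components independently testable and releasable.

```python
# Hypothetical sketch of an intent-prediction microservice; not the HaloAI implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    utterance: str

class PredictResponse(BaseModel):
    broad_intent: str
    specific_intent: str
    confidence: float

def run_intent_model(utterance):
    """Hypothetical stub standing in for the distilBERT model described above."""
    return "Returns", "Return Setup", 0.93

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    # The conversation backend and resolution engine depend only on this contract,
    # which is what lets each component be tested and released independently.
    broad, specific, confidence = run_intent_model(request.utterance)
    return PredictResponse(broad_intent=broad, specific_intent=specific, confidence=confidence)
```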
What Comes Next?
In the video below you can see an example of the completed Virtual Assistant where a customer is asking for assembly instructions for a recently purchased item.
Now that the Virtual Assistant is live on multiple pages of our site, our next plans are to continue improving prediction accuracy and to add new functionality. We hope to add multilingual capabilities to the Virtual Assistant to support our non-English-speaking customers in the US and in other countries. We’re also looking to incorporate entity recognition to automate the item selection process and support a more seamless customer experience.
Stay tuned for more updates as this product progresses!