Introduction
Wayfair sells more than 33 million products from over 23 thousand suppliers. We categorize these products into over a thousand “product classes.” Some examples of product classes include coffee tables, bathroom storage, beds, and table lamps. The Query Classifier predicts the classes of products that customers are most likely to engage with, given their search queries.
The figure above illustrates how the Query Classifier fits into search. When a customer submits a search query, the Query Classifier predicts relevant product classes. The system then uses this information to display related product filters and to boost search results from the appropriate product classes.
The Challenge
Deciphering which types of products are relevant to a search query is a difficult task. Wayfair serves millions of unique search queries every day. While a small percentage of queries explicitly mention a product type and are thus easy to classify, most do not, and we need to rely on user activity to infer the intended product class. Some queries are ambiguous. For example, the query “desk bed” could refer to bunk beds with a desk underneath, or it could mean tray desks used in bed. Some queries require the system to understand both Wayfair’s catalog and our competitors’. The query “Azema dining table” may look like a dining table query, but the customer was in fact looking for patio dining tables. Some queries are broad and refer to multiple classes, while others mean different things for different customers.
This isn’t the first time Wayfair Search has attempted to build a query classifier. Our previous version of the classifier, a set of logistic regression models based on n-grams, did not perform well on some of the complexities above: it could not capture the nuances and implications of search queries beyond their surface expressions. These complexities prompted us to develop a deep-learning approach to the problem.
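To make the baseline concrete, here is a minimal scikit-learn sketch of an n-gram logistic regression setup of this kind, with one binary classifier per product class; the toy data, feature settings, and one-vs-rest framing are illustrative assumptions, not the exact production configuration.

```python
# Minimal sketch of an n-gram logistic regression baseline: unigram/bigram
# counts feeding one binary logistic regression per product class.
# Toy data and settings are assumptions for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

queries = ["coffee table", "small coffee table", "office chair", "desk chair"]
labels = [["Coffee Tables"], ["Coffee Tables"], ["Office Chairs"], ["Office Chairs"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # binary indicator matrix: queries x classes

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),                      # unigram + bigram features
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),   # one binary model per class
)
baseline.fit(queries, y)

pred = baseline.predict(["coffee table with storage"])[0].astype(bool)
print(mlb.classes_[pred])  # predicted product classes for the query
```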
Our Approach
At Wayfair SearchTech, we developed the Search Interpretation Service to perform live search query understanding. When a customer submits a search, the Interpretation Service runs the query through several microservices that provide spelling correction, predictions of the query’s relevant classes, and so on. The outputs of the Interpretation Service are then consumed by downstream Search applications to retrieve relevant products to present to customers. The Query Classifier model powers one of these Interpretation microservices. In this article, we focus on the offline model development part of the Query Classifier project.
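As a purely illustrative sketch of how such interpretation output might be structured and consumed by a downstream application, consider the snippet below; the field names, types, and confidence threshold are assumptions, not the actual service contract.

```python
# Hypothetical shape of a query-interpretation result and how a downstream
# consumer might pick the classes to boost or filter on. Field names and the
# confidence threshold are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class QueryInterpretation:
    raw_query: str
    corrected_query: str                                           # spelling-corrected query
    class_scores: Dict[str, float] = field(default_factory=dict)  # class name -> confidence

def classes_to_boost(interp: QueryInterpretation, threshold: float = 0.5) -> List[str]:
    """Return classes whose confidence clears a (hypothetical) threshold, highest first."""
    ranked = sorted(interp.class_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [cls for cls, score in ranked if score >= threshold]

interp = QueryInterpretation(
    raw_query="ofice chair",
    corrected_query="office chair",
    class_scores={"Office Chairs": 0.93, "Dining Chairs": 0.12},
)
print(classes_to_boost(interp))  # ['Office Chairs']
```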
Data Collection
The Query Classifier learns from historical customer behavior data. We assume that customers generally add relevant products to their shopping carts during a search. Thus, we collected historical user search queries and the classes of the added-to-cart (ATC) products as the model development dataset. For this, we define a search experience as the sequence of customer activities after submitting a search query and before moving on to a different activity. For each search experience, we collected the search query and the classes of the ATC products. Each data point in our dataset corresponds to one unique search experience from one unique customer. Some example data points are shown below:
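As a rough illustration of how such (query, ATC classes) pairs might be assembled from raw event logs, consider the sketch below; the schema, column names, and example rows are assumptions rather than Wayfair’s actual data.

```python
# Sketch: aggregate add-to-cart events into one (query, ATC classes) data point
# per search experience. Table schema and example values are hypothetical.
import pandas as pd

# Hypothetical event log: one row per add-to-cart event within a search experience.
events = pd.DataFrame({
    "search_experience_id": [1, 1, 2, 3],
    "query": ["coffee table", "coffee table", "desk bed", "table lamp"],
    "atc_product_class": ["Coffee Tables", "End Tables", "Loft Beds", "Table Lamps"],
})

# One data point per unique search experience: the query plus the set of ATC classes.
dataset = (
    events.groupby(["search_experience_id", "query"])["atc_product_class"]
    .apply(lambda classes: sorted(set(classes)))
    .reset_index(name="label_classes")
)
print(dataset)
```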
We collected ten months of historical behavioral data for the training set and two weeks for the dev and test sets. We further refined the test set to include only queries for which we have historical model and user data from the aforementioned logistic regression baseline model. Here are some statistics of our datasets:
Model Details
The model needs to perform live query classification within a few milliseconds at a peak traffic of 500+ requests per second. We adopted a Convolutional Neural Network (CNN) model to solve this multi-label classification problem within the inference-time budget. The model architecture is shown below.
The model input is the raw text of the customer search query. We developed a custom tokenizer that specializes in search query tokenization. No custom normalization is performed other than digit normalization. After tokenization, we pad the search query to a uniform length. We use pre-trained word embeddings to initialize the embedding layer weights. We then apply a convolution layer and global max pooling. The output layer corresponds to the 931 classes from the Wayfair product taxonomy.
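A minimal Keras sketch of this architecture follows; the vocabulary size, sequence length, embedding dimension, and convolution settings are illustrative assumptions, not the production values.

```python
# Sketch of the described architecture: token IDs -> pre-trained embeddings ->
# 1D convolution -> global max pooling -> independent sigmoid outputs per class.
# All hyperparameters here are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 50_000      # assumption
MAX_QUERY_LEN = 16       # queries padded to a uniform length
EMBED_DIM = 300          # typical pre-trained word embedding size
NUM_CLASSES = 931        # classes from the Wayfair product taxonomy

model = models.Sequential([
    layers.Input(shape=(MAX_QUERY_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(filters=256, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(NUM_CLASSES, activation="sigmoid"),   # one independent score per class
])

# Initialize the embedding layer from a pre-trained matrix (random values stand
# in here for embeddings loaded from e.g. fastText or GloVe).
pretrained_embeddings = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")
model.layers[0].set_weights([pretrained_embeddings])

# Binary cross-entropy treats each class as an independent yes/no decision,
# which is what makes this a multi-label classifier.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```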
Evaluation
We compared the CNN-based Query Classifier against the existing logistic regression n-gram models (baseline). The baseline models are a collection of binary classifiers whose total parameter count exceeds that of the CNN model. The model performance on the test set is shown in the table below.
The CNN model showed an 18.6% relative gain in F1 compared to the baseline. The F1 increase was driven by a drastic improvement in recall despite a slight decline in precision. We further validated the model on data spanning over six months and observed consistent performance. The CNN model outperformed the baseline because 1) despite having a comparable total parameter count, the CNN model shared its parameters across all classes while the baseline models did not; and 2) the CNN model demonstrated better semantic understanding thanks to pre-trained embeddings and much more training data.
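For reference, here is a small scikit-learn sketch of how multi-label precision, recall, and F1 can be computed from binary indicator matrices; the averaging scheme (micro) and the toy matrices are assumptions for illustration, not the exact evaluation setup.

```python
# Sketch of multi-label precision/recall/F1 over binary indicator matrices
# (rows = queries, columns = product classes). Toy data and micro averaging
# are illustrative assumptions.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```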
Furthermore, the CNN model is better confidence-calibrated than the baseline model. For a classifier to be useful, it not only needs to make accurate predictions but also needs to reliably indicate how likely it is to be correct. Guo et al. defined model confidence as the probability of correctness [1]. For a model to be confidence-calibrated, its prediction score should approximate the probability of the prediction being true. For example, if a model makes 100 class predictions and each prediction scores 0.5, we would expect 50 of those predictions to be correct.

Following Guo et al.’s method, we plot the reliability diagrams of the baseline and CNN models on the Office Chair class below. The yellow lines plot the model output prediction (x-axis) against the observed sample truth ratio (y-axis). If a model is well calibrated, its output prediction should align with the observed sample truth ratio, i.e., the yellow line should fall within the region between the diagonal red and blue lines. A yellow line above the red/blue boundary indicates that the model is underconfident; conversely, a yellow line below the boundary indicates an overconfident model. In the baseline model’s diagram, when the model predicts a likelihood between 0.3 and 0.4 that a query refers to Office Chair, about 80% of those queries actually do. This deviation of the yellow line from the diagonal lines indicates poor calibration. The diagrams show that the baseline model is highly underconfident while the CNN model is much better calibrated.
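Below is a small sketch of the reliability-diagram computation behind these plots, following Guo et al. [1]: bin the model’s scores, then compare each bin’s average confidence with its empirical truth ratio. The binning scheme and the synthetic single-class data are illustrative assumptions.

```python
# Sketch of a reliability curve for one class: bin predictions by confidence
# and compare each bin's mean confidence to its observed truth ratio.
# The synthetic data below is for illustration only.
import numpy as np

def reliability_curve(scores, labels, n_bins=10):
    """Return (mean confidence, observed truth ratio, count) per confidence bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores >= lo) & ((scores <= hi) if hi == 1.0 else (scores < hi))
        if in_bin.any():
            rows.append((scores[in_bin].mean(), labels[in_bin].mean(), int(in_bin.sum())))
    return rows

# Synthetic scores/labels for a single class (e.g. Office Chairs); a calibrated
# model's truth ratio should track its confidence in every bin.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=1000)
labels = rng.uniform(0.0, 1.0, size=1000) < scores
for conf, truth_ratio, n in reliability_curve(scores, labels):
    print(f"confidence={conf:.2f}  truth_ratio={truth_ratio:.2f}  n={n}")
```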
Here are some specific examples demonstrating the improved performance. Each result is shown as the champion class name with its corresponding confidence score.
Online Testing
We launched three online A/B tests to validate the offline performance observations: 1) boosting the rankings of class-appropriate products in search results; 2) displaying class-specific filters; and 3) redirecting queries with high class confidence to prebuilt class pages. We designed this phased testing approach to measure the Query Classifier’s impact on each downstream application. The results show that the improved Query Classifier had the most significant impact on the product-boosting feature and the second-largest impact on the class-specific filter feature. We have since launched the Query Classifier model on the Wayfair US website to provide customers with a more relevant search experience.
Below are some example screenshots comparing the control search results against the Query-Classifier-powered search results.
What's Next
The production Query Classifier only considers the search query itself and leaves any customer information (e.g., customer preferences, geolocation, previous shopping behavior) on the table when making predictions. The same search query may mean different things for different customers. For example, a customer who searches for “dining table for 4” while decorating her dining room is likely looking for indoor dining tables, while a customer who uses the same query after purchasing some outdoor seat cushions is more likely looking for patio dining tables. We are working on incorporating customer search context into the Query Classifier to further improve our customers’ search experience.
Acknowledgment
The success of this CNN-based Query Classifier is the result of a joint effort across the Data Science, Engineering, Product Management, and Analytics teams, from building the host service for productization to reading out the final test results. We look forward to working together and conquering many more challenges in the future.
References
[1] Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. "On calibration of modern neural networks." In International Conference on Machine Learning, pp. 1321-1330. PMLR, 2017.