Suo Lu

Thinking in Data


Document Grouping Problem

Suppose you own a clothing company and sell products online. Consider the following case:

There are a large number of T-shirts covering many occasions, styles, colors, and so on. You record each of them on its own page. Through the front-end search system, users can browse or search for any T-shirt they like and purchase it.

Sounds good, right? Unfortunately, the search experience may be pretty bad. Because each kind of clothing is likely to come in many colors and sizes, your search results are filled with the "same" product.

This is what I call the "Document Grouping Problem".
How to solve it? Quick answer: redundant data fields on each SKU, combined with aggregation search.

Data Structure

First of all, treating each color/size combination as a different product is reasonable. Each one has its own price and stock status; they are SKUs.

The search index stores one document per SKU, that is, the Cartesian product of colors and sizes:

SKU#   NAME                 KEYWORD (color, size)  FAMILY_ID
10001  Sweetly T-shirt 1    Blue, M                1
10002  Sweetly T-shirt 2    Blue, XL               1
10003  Sweetly T-shirt 3    Red, M                 1
10004  Sweetly T-shirt 4    Red, XL                1
10005  Easy-chic T-shirt 1  Blue, M                2
10006  Easy-chic T-shirt 2  Blue, XL               2
10007  Easy-chic T-shirt 3  Red, M                 2
10008  Easy-chic T-shirt 4  Red, XL                2

To build the index, prepare the family data in the database, using a separate table for family info.

SKU#   FAMILY_ID  ATTR_NAME  ATTR_VALUE  ATTR_TYPE
10001  1          Color      Blue        Swatch
10001  1          Size       M           Dropdown
...    ...        ...        ...         ...
10008  2          Color      Red         Swatch
10008  2          Size       XL          Dropdown

(The DB table above is just an example; the actual schema depends on your product management system and may need optimization.)
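As a rough illustration of the index-building step, the attribute rows can be folded into one flat document per SKU, with the family ID copied onto each document. This is only a sketch: the tuple layout, the `SKU_NAMES` lookup, and the document field names are assumptions made for the example, not a real schema.

```python
# Hypothetical rows from the family attribute table:
# (sku, family_id, attr_name, attr_value, attr_type)
ATTR_ROWS = [
    (10001, 1, "Color", "Blue", "Swatch"),
    (10001, 1, "Size", "M", "Dropdown"),
    (10008, 2, "Color", "Red", "Swatch"),
    (10008, 2, "Size", "XL", "Dropdown"),
]

# Hypothetical SKU -> product name lookup (would come from the product table).
SKU_NAMES = {10001: "Sweetly T-shirt 1", 10008: "Easy-chic T-shirt 4"}


def build_index_docs(attr_rows, sku_names):
    """Fold attribute rows into one flat search document per SKU."""
    docs = {}
    for sku, family_id, _attr_name, attr_value, _attr_type in attr_rows:
        doc = docs.setdefault(
            sku,
            {"sku": sku, "name": sku_names[sku], "keywords": [], "family_id": family_id},
        )
        # The attribute value becomes a searchable keyword on the SKU document.
        doc["keywords"].append(attr_value)
    return [docs[sku] for sku in sorted(docs)]


if __name__ == "__main__":
    for doc in build_index_docs(ATTR_ROWS, SKU_NAMES):
        print(doc)
```

The redundancy is deliberate: every SKU document carries its `family_id`, so the search engine can group on it later without a join.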

Querying

Now let's use aggregation at search time.
Aggregation (called faceting in some search engines) is similar to "GROUP BY" in a relational database. Send a search with the query term and specify the fields to aggregate on, and you get all matched results gathered by family.
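To make the "GROUP BY" analogy concrete, here is a small in-memory sketch of what the engine does for you: match documents against the query term, then bucket the hits by `family_id`. The `INDEX` list and the `search_grouped` function are hypothetical stand-ins for illustration; a real engine performs the matching and grouping server-side.

```python
# Hypothetical in-memory stand-in for the search index (field names are assumptions).
INDEX = [
    {"sku": 10001, "name": "Sweetly T-shirt 1", "keywords": ["Blue", "M"], "family_id": 1},
    {"sku": 10002, "name": "Sweetly T-shirt 2", "keywords": ["Blue", "XL"], "family_id": 1},
    {"sku": 10003, "name": "Sweetly T-shirt 3", "keywords": ["Red", "M"], "family_id": 1},
    {"sku": 10004, "name": "Sweetly T-shirt 4", "keywords": ["Red", "XL"], "family_id": 1},
    {"sku": 10005, "name": "Easy-chic T-shirt 1", "keywords": ["Blue", "M"], "family_id": 2},
    {"sku": 10006, "name": "Easy-chic T-shirt 2", "keywords": ["Blue", "XL"], "family_id": 2},
    {"sku": 10007, "name": "Easy-chic T-shirt 3", "keywords": ["Red", "M"], "family_id": 2},
    {"sku": 10008, "name": "Easy-chic T-shirt 4", "keywords": ["Red", "XL"], "family_id": 2},
]


def search_grouped(index, term):
    """Return matching SKUs bucketed by family_id, like a GROUP BY on the hits."""
    needle = term.lower()
    groups = {}
    for doc in index:
        haystack = [doc["name"].lower()] + [k.lower() for k in doc["keywords"]]
        if any(needle in text for text in haystack):
            groups.setdefault(doc["family_id"], []).append(doc["sku"])
    return groups


if __name__ == "__main__":
    # Four blue SKUs match, but they collapse into two family buckets.
    print(search_grouped(INDEX, "blue"))
```

The result page can then show one entry per family, with the matching SKUs listed as variants under it.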

WARNING
Please note that the above solution only works when you don't need exact record pagination info (for example, you can fetch 20 records on every scroll down, but you can't jump to a specific page). Otherwise you have to fetch all records at once and "group by" manually.
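The manual fallback can be sketched as follows, assuming all matching hits have already been fetched in rank order. The `paginate_families` function and the hit layout are hypothetical; the point is that the page count is known only after every hit has been grouped.

```python
import math


def paginate_families(hits, page, page_size):
    """Group all hits by family, then return one page of families and the page count."""
    groups = {}
    for hit in hits:
        # Dicts keep insertion order, so families stay in first-seen (rank) order.
        groups.setdefault(hit["family_id"], []).append(hit["sku"])
    families = list(groups.items())
    total_pages = math.ceil(len(families) / page_size)
    start = (page - 1) * page_size
    return families[start:start + page_size], total_pages


if __name__ == "__main__":
    # Hypothetical hits, already fetched in full from the search engine.
    hits = [
        {"sku": 10001, "family_id": 1},
        {"sku": 10005, "family_id": 2},
        {"sku": 10002, "family_id": 1},
        {"sku": 10009, "family_id": 3},
    ]
    print(paginate_families(hits, page=2, page_size=2))
```

This costs a full fetch per query, which is why it is worth avoiding unless jumping to a specific page is a hard requirement.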

An alternative solution is to create a secondary index for product families. First query as normal; then, if you want to show all products of a family in the same zone, use the returned family ID to query all the others.
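A minimal sketch of that two-step lookup, with a plain dictionary standing in for the secondary index. The names `build_family_index` and `family_siblings` are made up for this example, and a real system would keep the secondary index in the search engine or a key-value store.

```python
def build_family_index(docs):
    """Build the secondary index: family_id -> list of SKUs in that family."""
    family_index = {}
    for doc in docs:
        family_index.setdefault(doc["family_id"], []).append(doc["sku"])
    return family_index


def family_siblings(hit, family_index):
    """Given one hit from the normal query, return the other SKUs of its family."""
    return [sku for sku in family_index[hit["family_id"]] if sku != hit["sku"]]


if __name__ == "__main__":
    # Hypothetical SKU documents; only the fields needed for the lookup.
    docs = [
        {"sku": 10001, "family_id": 1},
        {"sku": 10002, "family_id": 1},
        {"sku": 10003, "family_id": 1},
        {"sku": 10005, "family_id": 2},
    ]
    family_index = build_family_index(docs)
    first_hit = docs[0]  # pretend this came back from the normal query
    print(family_siblings(first_hit, family_index))
```

Because the primary query is untouched, normal pagination keeps working; the extra lookup runs only for the hits that are actually displayed.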

I won't go through the details of the aggregation search API here; please refer to SimpleFacetParameters for Solr and Search Aggregations for Elasticsearch.
Endeca's Nu parameter natively supports returning aggregated records.

In Conclusion

Search experience is very important. As a search engineer, you need to keep looking into it and try to make some improvement every single day.
Beyond everything above, there are still many things worth doing, such as AJAX, caching, CDNs, and data mining.

24 Nov 2014