Suo Lu

Thinking in Data


Architecture of Search Platform

This post collects design notes on search platforms from my last few years of experience. It does not go into the details of search algorithms or indexing.

Record properties

One entity per index.

Generally speaking, record properties should contain only the fields common to all records. Putting different record types together for search, however, can result in one big table with dozens of columns. For the source data that the index is built from, consider a column-oriented NoSQL store.

As an eCommerce example, SKU properties would include:

Note: properties must not include price (although a price range can be included, as one of the record's dimensions) or stock status.
In other words, do not include properties that change frequently.

Some search projects, such as Solr, support generating indexes via a relational database mapping, which is very useful when a product is starting out. It becomes less useful once the product grows into a large-scale system.
At that point a new index loader for huge and complex data is required. The loader should be well designed for data splitting, scheduling, and push and/or poll delivery.

Facet (dimension)

A system may have thousands of categories. A single record may belong to more than one, and it should be mapped to every category it relates to.

A laptop, for example, has a general product category (Product -> Technology -> Computers) and some particular details (I call them "Features") such as memory and CPU.

Category and feature are logically the same thing, which I call a Dimension. The dimensions form a tree, usually saved in a single database table, and can be represented as a hash of arrays in the program's memory.
Leaf-node dimensions are saved in the record properties to serve facet queries.
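
As a minimal sketch of the "hash of arrays" idea, the tree can be held as a dict that maps each dimension id to the list of its child ids. The dimension ids, the record layout, and the helper names below are all made up for illustration.

```python
from collections import Counter

# Hypothetical dimension ids. A real system would load these rows
# (id, parent id) from the dimension table in the database.
DIMENSION_CHILDREN = {
    "product": ["technology"],
    "technology": ["computers"],
    "computers": ["laptops", "desktops"],
    "feature": ["memory", "cpu"],
    "memory": ["memory-8gb", "memory-16gb"],
    "cpu": ["cpu-i5", "cpu-i7"],
}


def is_leaf(dimension_id):
    """A dimension with no children in the hash is a leaf node."""
    return not DIMENSION_CHILDREN.get(dimension_id)


# Assumed record layout: only leaf-node dimension ids are stored.
laptop_record = {
    "id": "sku-1",
    "name": "Example laptop",
    "dimensions": ["laptops", "memory-8gb", "cpu-i5"],
}


def facet_counts(records):
    """Count how many records carry each leaf dimension (a facet query)."""
    counts = Counter()
    for record in records:
        counts.update(record["dimensions"])
    return counts
```

Categories and features live in the same structure here, which is the point of treating both as dimensions.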

Search and Browse

Web-style applications usually have two entrances: Search and Browse.

Search Mode: an HTML text input.
The system parses the user's term, then queries the record properties. There should be a place to predefine and modify the behavior logic and ranking (Server Side Algorithm).

Browse Mode: a friendly navigation guide for the user.
List dimension ids and links (and the hit records as well, of course) layer by layer, down to the leaf nodes. Fetch records using all the leaf ids computed from the current layer.
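
Browse mode can be sketched roughly like this, assuming the dimension tree is held in memory as a dict of child lists and each record stores its leaf dimension ids. The tree contents and function names are hypothetical.

```python
# Hypothetical dimension tree: dimension id -> list of child ids.
CHILDREN = {
    "product": ["technology"],
    "technology": ["computers"],
    "computers": ["laptops", "desktops"],
}


def leaf_ids(dimension_id, children):
    """Collect every leaf id underneath (or equal to) the given dimension."""
    kids = children.get(dimension_id, [])
    if not kids:
        return [dimension_id]
    leaves = []
    for kid in kids:
        leaves.extend(leaf_ids(kid, children))
    return leaves


def browse(dimension_id, children, records):
    """Return the next-layer links plus the records hit at this layer."""
    leaves = set(leaf_ids(dimension_id, children))
    hits = [r for r in records if leaves & set(r["dimensions"])]
    return {"links": children.get(dimension_id, []), "records": hits}
```

At a leaf node the list of links is empty and only the records of that leaf are returned.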

Filter (refinement)

Filter the search result by dimensions or properties.

Order and Sort

Sort the search result on a specific property, ascending or descending.
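
Filtering and sorting together might look like the sketch below. It works on an in-memory list of hits purely for illustration; a real search engine would do this inside the index. The record layout and the `refine` name are assumptions.

```python
def refine(records, dimension_ids=(), properties=None, sort_by=None, descending=False):
    """Filter hits by dimensions and exact property values, then sort them."""
    properties = properties or {}
    hits = [
        r for r in records
        # Keep a record only if it carries every requested dimension id ...
        if all(d in r["dimensions"] for d in dimension_ids)
        # ... and matches every requested property value exactly.
        and all(r.get(key) == value for key, value in properties.items())
    ]
    if sort_by is not None:
        hits = sorted(hits, key=lambda r: r[sort_by], reverse=descending)
    return hits
```

For example, `refine(hits, dimension_ids=["laptops"], sort_by="weight_kg")` would keep only laptops and order them from lightest to heaviest, where `weight_kg` is a made-up property.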

Business Rule

Besides the records themselves, a search result is better served together with some schema-free data (for example from a K/V store) used for page rendering. This includes, but is not limited to, the following.

Rules can be enabled or disabled and scheduled. They are bound to a URI (a navigation location or a search term), saved in the database, and hot-built into the index.
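
A rule of this kind could be modelled roughly as follows. This is only a sketch: the field names, the free-form `payload`, and the exact-match lookup on the URI are assumptions, not how any particular product does it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Rule:
    uri: str                              # nav location or search term URI
    payload: dict                         # schema-free data for page rendering
    enabled: bool = True
    starts_at: Optional[datetime] = None  # None means "no start limit"
    ends_at: Optional[datetime] = None    # None means "no end limit"

    def is_active(self, now):
        """A rule fires only when enabled and inside its schedule window."""
        if not self.enabled:
            return False
        if self.starts_at is not None and now < self.starts_at:
            return False
        if self.ends_at is not None and now >= self.ends_at:
            return False
        return True


def rules_for(uri, rules, now):
    """Return the active rules bound to the given URI."""
    return [rule for rule in rules if rule.uri == uri and rule.is_active(now)]
```

The page renderer would call `rules_for` with the current URI and merge the returned payloads into the page.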

Endeca has a powerful experience tool called Page Builder (now renamed Experience Manager), which can build dynamic page content.
Although it is not open source, I think it is implemented with some kind of rules.

Predictive Search

While the user types in the search box, the system suggests searches based on the current term, drawing on keyword matches, search history, promotions, etc.
The search is triggered by an input timeout (e.g. 1s), and the term is then passed through a chain of validations: length limit, blacklist filter, script injection cleaner.
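
The validation chain can be sketched as a list of small functions, each returning the (possibly cleaned) term or `None` to reject it. The length limit, the blacklist content, and the cleaning regexes are placeholder assumptions, and the cleaner here is not a complete defence against script injection on its own.

```python
import re

MAX_TERM_LENGTH = 64        # assumed limit, tune for your site
BLACKLIST = {"forbidden"}   # hypothetical blacklisted words


def check_length(term):
    """Reject empty or overlong terms."""
    term = term.strip()
    return term if 0 < len(term) <= MAX_TERM_LENGTH else None


def filter_blacklist(term):
    """Reject terms that contain a blacklisted word."""
    words = term.lower().split()
    return None if any(word in BLACKLIST for word in words) else term


def clean_script(term):
    """Strip tag-like fragments and characters often used in injections."""
    cleaned = re.sub(r"<[^>]*>", "", term)
    cleaned = re.sub(r"[<>\"'`;]", "", cleaned).strip()
    return cleaned or None


VALIDATORS = [check_length, filter_blacklist, clean_script]


def validate(term):
    """Run the term through every validator; None means 'do not search'."""
    for validator in VALIDATORS:
        term = validator(term)
        if term is None:
            return None
    return term
```

Keeping the validators in a list makes it easy to add, remove, or reorder steps without touching the caller.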

The dictionary is collected from three sources: record refining, users' input history, and manually added words. Priority runs from low to high.

The dictionary is initialized from the records.
First pick 2-3 fields and get all of their values. Then generate two sets: one of whole phrases (the original data) and one of minimal terms (the strings split on spaces). Finally apply two adjustments: white/black word lists and stop words.
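
The initialization steps above can be sketched as follows. The word lists and the choice to keep a frequency count per entry are assumptions made for the example.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "with"}   # hypothetical stop words
BLACK_WORDS = {"refurbished"}         # hypothetical words to drop
WHITE_WORDS = {"ultrabook"}           # hypothetical words to always include


def build_dictionary(records, fields):
    """Build the whole-phrase set and the minimal-term set, with counts."""
    phrases = Counter()  # whole values, as they appear in the records
    terms = Counter()    # single words, split on whitespace
    for record in records:
        for field in fields:
            value = str(record.get(field, "")).strip().lower()
            if not value:
                continue
            phrases[value] += 1
            for word in value.split():
                if word not in STOP_WORDS:
                    terms[word] += 1
    # Black words are removed wherever they appear as a whole entry.
    for word in BLACK_WORDS:
        phrases.pop(word, None)
        terms.pop(word, None)
    # White words are added even if no record mentions them.
    for word in WHITE_WORDS:
        terms.setdefault(word, 1)
    return phrases, terms
```

The counts give a simple frequency signal that can later be used to rank suggestions.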

If your application has a user behavior system, collect users' search term input history.

Term matching algorithm: whole-sentence matching, then longest-first ("max firstly") matching, then partial matching. Priority runs from high to low.
Among dictionary items, more frequent is better and shorter is better.
Limit the number of matches returned, but make sure the result includes at least one match of each kind.
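
One possible reading of this matching scheme, as a sketch: tier 0 is an exact match, tier 1 is a prefix match (my interpretation of the middle step, which is an assumption), and tier 2 is a substring match. Within a tier, items are ordered by higher frequency first, then shorter length. The dictionary is assumed to map each item to its frequency.

```python
def suggest(term, dictionary, limit=5):
    """Return up to `limit` suggestions, with at least one per non-empty tier."""
    term = term.strip().lower()
    if not term:
        return []
    tiers = [[], [], []]  # exact, prefix, substring
    for item, frequency in dictionary.items():
        if item == term:
            tier = 0
        elif item.startswith(term):
            tier = 1
        elif term in item:
            tier = 2
        else:
            continue
        tiers[tier].append((item, frequency))
    for matches in tiers:
        # More frequent first, then shorter, then alphabetical for stability.
        matches.sort(key=lambda pair: (-pair[1], len(pair[0]), pair[0]))
    # Guarantee the best match of every non-empty tier ...
    keep = {matches[0][0] for matches in tiers if matches}
    # ... then fill the remaining slots in tier priority order.
    for matches in tiers:
        for item, _ in matches:
            if len(keep) >= limit:
                break
            keep.add(item)
    return [item for matches in tiers for item, _ in matches if item in keep]
```

Note that the per-tier guarantee wins over the limit here, so a `limit` below the number of non-empty tiers can still return up to three items.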

Caching is a must.

If PV (page views) is very large, querying on every request is challenging.
Forecasting the user's input from the term's prefix and returning an estimated result gives a remarkable performance improvement.
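
One way to read the prefix idea, as a sketch: cache the suggestions for a short prefix once, then answer longer inputs by narrowing the cached list locally instead of querying again. The class name, the fixed prefix length, and the `lookup` callable (standing in for the expensive backend query) are assumptions.

```python
class PrefixCache:
    """Serve suggestions from a per-prefix cache instead of the backend."""

    def __init__(self, lookup, prefix_length=3):
        self._lookup = lookup              # expensive backend query (assumed)
        self._prefix_length = prefix_length
        self._cache = {}

    def suggest(self, term):
        term = term.strip().lower()
        if not term:
            return []
        prefix = term[: self._prefix_length]
        if prefix not in self._cache:
            # Only the first request for a prefix reaches the backend.
            self._cache[prefix] = self._lookup(prefix)
        # Narrow the cached candidates to what the user has typed so far.
        return [item for item in self._cache[prefix] if item.startswith(term)]
```

The result is an estimate: if the backend truncates its answer for the prefix, some valid suggestions for the longer term may be missing. A production cache would also need expiry and a size bound, which are left out here.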

Others

There are some other search features and management tools I want to mention.

Keyword 301/302
Internally replacing the input term (called "Thesaurus") and externally redirecting (called "Hijack") both serve to aggregate traffic and improve the search experience.
Here is a trick: say you mainly sell mobile phones but not the iPhone, yet you carry various smart covers. There you go...
Please do not mix these up with spell check or auto suggestion; they are totally different.
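
The difference between the two can be sketched as a single resolver that either rewrites the term internally or answers with a redirect. The mappings, the target path, and the choice of status 302 are all made up for the example.

```python
# Hypothetical mappings; in practice these would be managed by editors.
THESAURUS = {"cellphone": "mobile phone"}           # internal term replacement
REDIRECTS = {"iphone": "/accessories/smart-covers"}  # external redirect


def resolve(term):
    """Decide whether a term is redirected, rewritten, or searched as-is."""
    key = term.strip().lower()
    if key in REDIRECTS:
        # "Hijack": the browser is sent to another location.
        return {"action": "redirect", "status": 302, "location": REDIRECTS[key]}
    # "Thesaurus": the user stays on the search page, the term is swapped.
    return {"action": "search", "term": THESAURUS.get(key, key)}
```

With these sample mappings, a search for "iPhone" lands on the smart cover page, while "cellphone" silently searches for "mobile phone".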

User Segment
Many features can be designed to support user profiles that extend from a default group. For example, you may want visitors from Eastern and Western countries to have different experiences, or you may not want web crawlers to search everything on your site.

Administrator
Feature management authorization, resource locking, preview, reports and statistics.

20 Aug 2014