Introduction to
Full Text Search and Frameworks

@atealxt

Agenda


Full Text Search


Serial Scanning

Full Text Search

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (text specified by a user).

—— Wikipedia
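
To make the contrast between serial scanning and full text search concrete, here is a minimal sketch. The corpus and function names are hypothetical, and the inverted index shown is only a toy version of the structure that engines such as Lucene build; it is not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
DOCS = {
    1: "full text search with Lucene",
    2: "serial scanning reads every document",
    3: "Solr and Elasticsearch build on Lucene",
}


def serial_scan(docs, word):
    """Check every document at query time (grep-style)."""
    word = word.lower()
    return sorted(doc_id for doc_id, text in docs.items()
                  if word in text.lower().split())


def build_inverted_index(docs):
    """Map each word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def index_search(index, word):
    """Look the word up directly instead of rescanning the documents."""
    return sorted(index.get(word.lower(), set()))


INDEX = build_inverted_index(DOCS)
print(serial_scan(DOCS, "Lucene"))    # [1, 3]
print(index_search(INDEX, "Lucene"))  # [1, 3]
```

Both calls return the same documents; the difference is that the scan touches every document on every query, while the index pays that cost once at indexing time.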

Full Text Search


—— Lucene in Action

Full Text Search


—— Elasticsearch from the Bottom Up, Part 1

Frameworks


Demo First


Endeca

Solr

Elasticsearch

Structure


Indexing & Search


Field options

Indexing & Search


Indexing & Search


Tokenizer

Indexing & Search


Filter
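
The Tokenizer and Filter slides describe an analysis chain: a tokenizer splits the text, then filters transform the token stream. A rough sketch of that idea follows; the stop-word list and function names are assumptions for illustration, not the Solr or Elasticsearch API.

```python
import re

# Hypothetical stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}


def tokenize(text):
    """Tokenizer: split the raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter: normalise case so 'Search' and 'search' match."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Filter: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text):
    """Run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Power of Full-Text Search"))
# ['power', 'full', 'text', 'search']
```

The same chain should normally run at index time and at query time, otherwise the query terms may not match the indexed terms.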

Indexing & Search


Solr schema.xml

Solr analysis tool

Elasticsearch analysis tool

Indexing & Search


Query Syntax

Indexing & Search


Query Parameter

Indexing & Search


Scoring

Term Frequency (tf)
How many times the term t occurs in the document.

Document Frequency (df)
The number of documents that contain the term t, which indicates how “unique” the term is. Very common terms have a high df; very rare terms have a low df.

Term weight
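
The formula on this slide did not survive, so the sketch below assumes the common tf-idf form, weight = tf * log(N / df), where N is the number of documents. Actual engines use refined variants of this; the corpus and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical corpus of already-analysed documents.
DOCS = [
    ["solr", "search", "search"],
    ["elasticsearch", "search"],
    ["endeca", "navigation"],
]


def term_frequency(term, doc):
    """tf: how many times the term occurs in the document."""
    return Counter(doc)[term]


def document_frequency(term, docs):
    """df: how many documents contain the term."""
    return sum(1 for doc in docs if term in doc)


def term_weight(term, doc, docs):
    """Assumed tf-idf form: tf * log(N / df); 0.0 if df is zero."""
    df = document_frequency(term, docs)
    if df == 0:
        return 0.0
    return term_frequency(term, doc) * math.log(len(docs) / df)


for term in ("search", "solr"):
    print(term, round(term_weight(term, DOCS[0], DOCS), 3))
# search 0.811
# solr 1.099
```

Although "search" occurs twice in the first document, the rarer term "solr" still gets the higher weight because its df is lower.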

Indexing & Search


Examples

Solr
Get docs whose name contains 'Test' and whose features contain '功能' ("function")
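
One way this Solr request could be built is sketched below. The host, port, and core name are assumptions, as is the helper name; `q` and `wt` are standard Solr query parameters, and the query string uses Lucene field:term syntax.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr instance and a core named "collection1";
# adjust both to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_solr_url(base_url, name_term, features_term):
    """Build a select URL that requires both field conditions."""
    params = {
        "q": f"name:{name_term} AND features:{features_term}",
        "wt": "json",  # ask for a JSON response
    }
    # urlencode percent-encodes the non-ASCII term as UTF-8.
    return base_url + "?" + urlencode(params)


url = build_solr_url(SOLR_SELECT_URL, "Test", "功能")
print(url)
```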

Elasticsearch
Get the play name, text, and speech number of 20 records whose text contains 'Tempest', ordered by play name
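
A possible request body for this Elasticsearch query is sketched below. The field names `play_name`, `text_entry`, and `speech_number` are assumptions based on the Shakespeare sample data set, and the helper name is hypothetical. The body would be sent to the index's `_search` endpoint.

```python
import json


def build_search_body(term, size=20):
    """Match on the text, return three fields, sort by play name."""
    return {
        "size": size,
        # Assumed field names; adjust to the actual mapping.
        "_source": ["play_name", "text_entry", "speech_number"],
        "query": {"match": {"text_entry": term}},
        "sort": [{"play_name": {"order": "asc"}}],
    }


body = build_search_body("Tempest")
print(json.dumps(body, indent=2))
```

Sorting by play name generally requires that field to be indexed without analysis, so whether this works as written depends on the mapping.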

Features


Features


Data Import

Solr:
JDBC, CSV, XML, Tika, URL, Flat File

Elasticsearch:
CouchDB, Dropbox, FileSystem, Git, JDBC, JMS, Kafka, LDAP, MongoDB, Redis, RSS, Solr, Subversion, Twitter, Wikipedia

Features


Aggregation (Faceting)

Endeca
Get black pens

Solr

Elasticsearch
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html
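
Conceptually, faceting groups the matching records by a field and counts each group. The sketch below shows that idea in memory for the "black pens" example; the records and helper names are hypothetical, and none of this is Endeca, Solr, or Elasticsearch API.

```python
from collections import Counter

# Hypothetical product records.
PRODUCTS = [
    {"type": "pen", "color": "black", "brand": "A"},
    {"type": "pen", "color": "black", "brand": "B"},
    {"type": "pen", "color": "blue", "brand": "A"},
    {"type": "pencil", "color": "black", "brand": "A"},
]


def search(products, **criteria):
    """Keep the records whose fields equal every given criterion."""
    return [p for p in products
            if all(p.get(k) == v for k, v in criteria.items())]


def facet_counts(results, field):
    """Count how many results fall into each value of the field."""
    return Counter(r[field] for r in results)


black_pens = search(PRODUCTS, type="pen", color="black")
print(len(black_pens))                          # 2
print(dict(facet_counts(black_pens, "brand")))  # {'A': 1, 'B': 1}
```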

Features


Navigation

Database Mapping

Hierarchical Faceting

Features


Search Suggester (Predictive Search, Auto complete)

Triggered by input timeout

Chain of validations: length, blacklist, script injection

Dictionary sources: terms refined from records, search history, manual entries (priority low to high)

Matches:
Whole, then forward maximum (prefix), then partial (suffix, infix) (priority high to low).
More frequent is better; shorter is better.
Limit the result count, keeping at least one result of each match type.
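
The validation chain and matching rules above can be sketched roughly as follows. The dictionary contents, blacklist, limits, and function names are all hypothetical, and reading "at least one per each" as "one per match type" is an assumption. The input timeout (debounce) belongs to the client side and is not shown.

```python
import re

# Hypothetical dictionary: term -> how often it was searched.
DICTIONARY = {
    "pen": 50, "pencil": 30, "pen holder": 5,
    "open box": 8, "carpenter": 2, "ballpen": 12,
}
BLACKLIST = {"forbidden"}  # hypothetical blocked terms
MAX_LENGTH = 30            # hypothetical length limit


def is_valid(text):
    """Chain of validations: length, blacklist, script injection."""
    checks = (
        lambda s: 0 < len(s) <= MAX_LENGTH,
        lambda s: s not in BLACKLIST,
        lambda s: re.search(r"[<>]", s) is None,
    )
    return all(check(text) for check in checks)


def match_type(term, text):
    """0 = whole, 1 = prefix, 2 = partial (suffix/infix), None = no match."""
    if term == text:
        return 0
    if term.startswith(text):
        return 1
    if text in term:
        return 2
    return None


def suggest(text, limit=4):
    text = text.strip().lower()
    if not is_valid(text):
        return []
    groups = {0: [], 1: [], 2: []}
    for term, freq in DICTIONARY.items():
        kind = match_type(term, text)
        if kind is not None:
            groups[kind].append((-freq, len(term), term))
    for group in groups.values():
        group.sort()  # more frequent first, then shorter first
    # Take the best hit of every non-empty group first, so each match
    # type gets a slot when the limit allows, then fill up by priority.
    first = [g[0] for g in groups.values() if g]
    rest = [hit for g in groups.values() for hit in g[1:]]
    picked = (first + rest)[:limit]
    # Present the picked hits by match priority, then by the sort key.
    order = {hit[2]: (k, hit) for k, g in groups.items() for hit in g}
    return sorted((h[2] for h in picked), key=lambda t: order[t])


print(suggest("pen"))
# ['pen', 'pencil', 'pen holder', 'ballpen']
```

With a limit of 4, "ballpen" is kept as the best partial match even though "open box" and "carpenter" are dropped.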

Features


Search Suggester (Predictive Search, Auto complete)

Solr

Endeca

Features


Stemming
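
Stemming reduces inflected words to a common stem so that, for example, "searching" and "searched" match the same indexed term. The sketch below is a deliberately naive suffix stripper with a hypothetical suffix list; it is not the Porter algorithm or any stemmer shipped with Solr or Elasticsearch.

```python
# Hypothetical suffix list, longest first so "ing" wins over "s".
SUFFIXES = ("ing", "ed", "es", "s")
MIN_STEM_LENGTH = 3


def naive_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for word in ("searching", "searched", "searches", "search"):
    print(word, "->", naive_stem(word))
# every line ends in "-> search"
```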

Features


Keyword Redirect (HTTP 301/302)
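
Assuming this slide refers to keyword redirects, where certain search keywords send the user straight to a fixed page instead of a result list, a minimal sketch could look like this. The rules, paths, and function name are hypothetical.

```python
# Hypothetical redirect rules: keyword -> (HTTP status, target URL).
REDIRECTS = {
    "contact": (301, "/contact-us"),      # permanent redirect
    "sale": (302, "/campaigns/current"),  # temporary redirect
}


def resolve(query):
    """Return (status, location) for a redirect keyword, else None."""
    return REDIRECTS.get(query.strip().lower())


print(resolve(" Sale "))   # (302, '/campaigns/current')
print(resolve("pencil"))   # None
```

When `resolve` returns None, the query would go to the normal search path.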

Evaluating


Fast, Exact, Full, New

Steps

  1. Prepare Data
  2. Pick Use Cases
  3. Testing
  4. Statistics
  5. Training

Evaluating


Methods

Evaluating


Recall-Precision

A: Relevant Document Collection
B: Search Result
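
With A and B defined as above, the standard definitions are precision = |A ∩ B| / |B| and recall = |A ∩ B| / |A|. A small sketch with hypothetical document ids:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


A = {"d1", "d2", "d3", "d4"}  # relevant documents (hypothetical ids)
B = {"d2", "d3", "d9"}        # documents returned by the search
precision, recall = precision_recall(A, B)
print(precision, recall)      # precision is 2/3, recall is 2/4
```

Returning more documents tends to raise recall and lower precision, which is why the two are usually reported together.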



Evaluating


Discounted Cumulative Gain (DCG)

rel: Very Good: 2 / Good: 1 / Fair: 0 / Bad: -1 / Very Bad: -2
rel_i: relevance score of the result at position i
p: number of documents on the result page
IDCG: Ideal DCG
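
The DCG formula itself is missing from this slide, so the sketch below assumes one common form, DCG_p = sum over i = 1..p of rel_i / log2(i + 1), with nDCG = DCG / IDCG. Other variants exist, and the judgement values are hypothetical.

```python
import math


def dcg(rels):
    """Sum of rel_i / log2(i + 1) over positions i = 1..p."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))


def ndcg(rels):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    # With negative scores the ideal DCG can be zero or negative;
    # this sketch simply reports 0.0 in that case.
    return dcg(rels) / ideal if ideal > 0 else 0.0


# Hypothetical judgements for one result page, top result first.
page = [2, 0, 1, -1]
print(round(dcg(page), 3), round(ndcg(page), 3))
```

A page that already lists its results from best to worst gets an nDCG of 1.0; misplaced results lower it.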

It is just the start


Spider, Distribute, Shard, NLP, HMM, PageRank, HITS, Bloom Filter, Trie, SEO, Lexicon, Dynamic Fields, Server-Side Algorithm, Performance, FST, Bayesian

Q&A


Thank You!
