Introduction to
Full Text Search and Frameworks
Serial Scanning
Full Text Search
In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (text specified by a user).
—— wikipedia
—— Lucene in Action
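To make the contrast between serial scanning and indexed full-text search concrete, here is a minimal sketch. The sample documents and the function names are illustrative assumptions, not taken from the slides, and a real engine such as Lucene stores far more than a word-to-document map.

```python
from collections import defaultdict

# Tiny illustrative corpus (made up for this sketch).
docs = {
    1: "full text search examines all of the words",
    2: "serial scanning reads every document at query time",
    3: "an inverted index maps words to documents",
}

def serial_scan(docs, word):
    """Serial scanning: read every document again on every query."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())

def build_inverted_index(docs):
    """Index once: map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def index_lookup(index, word):
    """Answer a query with a single dictionary lookup."""
    return sorted(index.get(word, set()))

index = build_inverted_index(docs)
print(serial_scan(docs, "words"), index_lookup(index, "words"))
```

Both calls return the same documents; the difference is that the scan cost grows with the corpus on every query, while the index pays that cost once at build time.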
Endeca
Solr
Elasticsearch
Field options
Tokenizer
Filter
Query Syntax
Query Parameter
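The tokenizer and filter items above form an analysis chain: a tokenizer splits text into tokens, and each filter transforms the token stream in turn. The sketch below imitates that idea in plain Python; the stop-word list and function names are assumptions for illustration and do not correspond to any Solr or Elasticsearch class.

```python
import re

# Illustrative stop words only; real analyzers ship much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "in"}

def tokenize(text):
    """Tokenizer: split on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text) if t]

def lowercase_filter(tokens):
    """Filter: normalize case so 'The' and 'the' match."""
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    """Filter: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text, filters=(lowercase_filter, stop_filter)):
    """Run the tokenizer, then apply each filter in order."""
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("Lucene in Action: The Basics"))
```

Filter order matters here: lowercasing runs first so that the stop filter also catches capitalized stop words.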
Scoring
Term Frequency (tf)
How many times the term t occurs in the document.
Document Frequency (df)
The number of documents that contain the term t, which indicates how common the term is. Very common terms have a high df; very rare terms have a low df.
Term weight
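The term-weight formula from the slide did not survive, so the sketch below uses one common textbook weighting, tf × log(N / df), as an assumption. Lucene's actual scoring formula has additional factors and is not reproduced here; the sample documents are made up.

```python
import math

# Each document is already analyzed into a list of terms (illustrative data).
docs = [
    ["solr", "search", "search", "engine"],
    ["elasticsearch", "search", "engine"],
    ["endeca", "commerce", "platform"],
]

def tf(term, doc):
    """Term frequency: how many times the term occurs in the document."""
    return doc.count(term)

def df(term, docs):
    """Document frequency: how many documents contain the term."""
    return sum(1 for doc in docs if term in doc)

def weight(term, doc, docs):
    """Assumed weighting: tf * log(N / df); 0.0 if the term is in no document."""
    d = df(term, docs)
    if d == 0:
        return 0.0
    return tf(term, doc) * math.log(len(docs) / d)

print(weight("search", docs[0], docs), weight("endeca", docs[2], docs))
```

A rare term such as "endeca" gets a higher weight per occurrence than a common term such as "search", which matches the intuition behind tf and df above.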
Solr
Get docs whose name contains 'Test' and whose features contain '功能'
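The original query for this slide is not in the text, so the following is a sketch of how such a request could be built against Solr's standard select handler. The host, port, and core name (`collection1`) are assumptions, as is the premise that `name` and `features` are tokenized text fields. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a core named "collection1".
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

params = {
    # Lucene query syntax: both field clauses must match.
    "q": "name:Test AND features:功能",
    # Ask for a JSON response.
    "wt": "json",
}

# urlencode percent-encodes the query, including the non-ASCII term.
url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```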
Elasticsearch
Get the play name, text, and speech number of 20 entries whose text contains 'Tempest', ordered by play name
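The request body shown on the slide is missing from the text. A hedged sketch of an equivalent search body is given below as a Python dictionary. The field names (`play_name`, `text_entry`, `speech_number`) are assumptions modeled on the Shakespeare sample data set, and sorting by `play_name` assumes that field is not analyzed. The body would be sent to an index's `_search` endpoint; the index name is left open.

```python
import json

search_body = {
    # Return at most 20 hits.
    "size": 20,
    # Only return these fields from each hit (assumed field names).
    "_source": ["play_name", "text_entry", "speech_number"],
    # Full-text match on the line text.
    "query": {"match": {"text_entry": "Tempest"}},
    # Order results by play name, ascending.
    "sort": [{"play_name": "asc"}],
}

print(json.dumps(search_body, indent=2))
```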
Solr:
JDBC, CSV, XML, Tika, URL, Flat File
Elasticsearch:
CouchDB, Dropbox, FileSystem, Git, JDBC, JMS, Kafka, LDAP, MongoDB, Redis, RSS, Solr, Subversion, Twitter, Wikipedia
Endeca
Get black pens
Solr
Elasticsearch
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html
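The example queries for this slide are not in the text. As a rough, engine-independent illustration of what a "black pens" request with facets (aggregations) computes, the sketch below filters an in-memory product list and counts values of another field over the hits. The product data, field names, and function names are all made up for this example and do not reflect the Endeca, Solr, or Elasticsearch APIs.

```python
from collections import Counter

# Hypothetical catalog records.
products = [
    {"name": "Gel pen", "category": "pen", "color": "black", "brand": "A"},
    {"name": "Ballpoint pen", "category": "pen", "color": "black", "brand": "B"},
    {"name": "Fountain pen", "category": "pen", "color": "blue", "brand": "A"},
    {"name": "Notebook", "category": "paper", "color": "black", "brand": "C"},
]

def search(products, **filters):
    """Keep products whose fields equal every given filter value."""
    return [p for p in products if all(p.get(k) == v for k, v in filters.items())]

def facet_counts(hits, field):
    """Count how many hits fall under each value of the given field."""
    return Counter(p[field] for p in hits)

black_pens = search(products, category="pen", color="black")
brand_facets = facet_counts(black_pens, "brand")
print([p["name"] for p in black_pens], dict(brand_facets))
```

The facet counts are computed over the filtered hits, which is what lets a UI show how many black pens each brand offers.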
Trigger on input timeout
Chain of validations: length, blacklist, script injection
Dictionary sources: record refinement, search history, manual entries (priority low to high)
Matches:
Whole match + forward maximum (prefix) match + partial (suffix, infix) match (priority high to low)
Higher frequency ranks better; shorter length ranks better.
Limit the result count, keeping at least one result from each match type.
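The suggestion flow above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the length limit, blacklist entries, and injection check are placeholders, the dictionary is a plain term-to-frequency map, and the "at least one per match type" rule is not implemented.

```python
import re

MAX_LEN = 50                       # assumed length limit
BLACKLIST = {"admin"}              # hypothetical blacklist entries
ANGLE_BRACKETS = re.compile(r"[<>]")  # crude stand-in for a script-injection check

def is_valid(query):
    """Chain of validations: length, blacklist, script injection."""
    checks = (
        lambda q: 0 < len(q) <= MAX_LEN,
        lambda q: q.lower() not in BLACKLIST,
        lambda q: ANGLE_BRACKETS.search(q) is None,
    )
    return all(check(query) for check in checks)

def match_rank(term, query):
    """0 = whole match, 1 = prefix match, 2 = infix or suffix match, None = no match."""
    if term == query:
        return 0
    if term.startswith(query):
        return 1
    if query in term:
        return 2
    return None

def suggest(query, dictionary, limit=5):
    """Rank by match type, then higher frequency, then shorter length."""
    query = query.strip().lower()
    if not is_valid(query):
        return []
    scored = []
    for term, freq in dictionary.items():
        rank = match_rank(term, query)
        if rank is not None:
            scored.append((rank, -freq, len(term), term))
    scored.sort()
    return [term for _, _, _, term in scored[:limit]]

dictionary = {"pen": 50, "pencil": 30, "pens": 40, "open": 60, "paper": 10}
print(suggest("pen", dictionary))
```

Note that "open" is the most frequent term yet comes last, because match type takes priority over frequency in this ordering.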
Solr
Endeca
Steps
Methods
Recall-Precision
A: Relevant Document Collection
B: Search Result
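With A as the set of relevant documents and B as the search result, precision is |A ∩ B| / |B| and recall is |A ∩ B| / |A|. A minimal sketch, with the function name and the sample document IDs chosen only for illustration:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a relevant set A and a result set B."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved
    # Precision: share of returned documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

A = {1, 2, 3, 4}   # relevant documents
B = {3, 4, 5}      # search result
print(precision_recall(A, B))
```

Returning every document pushes recall to 1 while precision drops, which is why the two are reported together.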
Discounted Cumulative Gain (DCG)
rel: relevance grade (Very Good: 2 / Good: 1 / Fair: 0 / Bad: -1 / Very Bad: -2)
rel_i: relevance score of result #i
p: number of documents on the result page
IDCG: ideal DCG
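The DCG formula itself is not in the text, so the sketch below assumes one common variant, DCG_p = Σ rel_i / log2(i + 1) for i = 1..p, and normalizes by the ideal ordering (nDCG = DCG / IDCG). Another widely used variant leaves the first result undiscounted. As a simplification, the ideal ordering here is taken from the same p results rather than from all judged documents.

```python
import math

def dcg(rels):
    """DCG over graded relevance scores listed in result order (position 1 first)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    """Normalize by the DCG of the same scores sorted best-first (IDCG)."""
    idcg = dcg(sorted(rels, reverse=True))
    # Guard against a zero or negative ideal score, possible with this grading scale.
    return dcg(rels) / idcg if idcg > 0 else 0.0

page = [1, 2, 0, -1]   # hypothetical grades for a page with p = 4 results
print(dcg(page), ndcg(page))
```

Because the discount grows with position, placing the "Very Good" result second instead of first lowers the score even though the same documents are returned.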
Thank You!