UnsplitFilter

From JDONREF Wiki

Revision as of 2 December 2014, 23:55

The unsplit token filter detokenizes a token stream: it joins tokens into combined terms so that queries hit terms with a much lower frequency, which improves search performance.

Sample

For example, the document:

 { "fullName": "BOULEVARD PARIS 07 HOPITAL" }

with the following mapping:

 "fullName" : {"type": "string", "index_analyzer":"myAnalyzer", "search_analyzer":"myAnalyzer"}

and settings such as:

 {
   "index": {
     "analysis": {
       "analyzer": {
         "myAnalyzer": {
           "type": "custom",
           "tokenizer": "whitespace",
           "filter": ["lowercase", "unsplit_filter"]
         }
       },
       "filter": {
         "unsplit_filter": {
           "type": "unsplit",
           "min_words_unsplitted": 3,
           "keep_originals": false
         }
       }
     }
   }
 }

will be indexed as these terms:

  • 07BOULEVARDHOPITALPARIS
  • 07BOULEVARDHOPITAL
  • 07BOULEVARDPARIS
  • 07HOPITALPARIS
  • BOULEVARDHOPITALPARIS

so that the request:

 {
   "query_string" :
   {
       "default_field":"fullName",
       "query":"BOULEVARD PARIS 07",
       "analyzer" : "myAnalyzer"
   }
 }

will match the document.

Note that 07BOULEVARDPARIS has a much lower term frequency than 07, BOULEVARD, or PARIS taken individually.

How it works

The unsplit filter considers all ordered combinations of the tokens it receives. This means it can be used together with synonyms and ngrams, and with nearly all other token filters (please report any that do not work!).

Combination tokens are given the 'UNSPLITED' token type.
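
As a rough illustration of the combination step, the Python sketch below reproduces the terms listed in the sample. It is not the plugin's code: the function name unsplit_terms is made up for this page, and sorting the tokens before joining them is an assumption inferred from the sample output. Case is kept as printed in the sample.

```python
from itertools import combinations


def unsplit_terms(tokens, min_words=2):
    """Return every concatenation of at least `min_words` tokens.

    Assumption: tokens are sorted before being joined, as the sample
    output on this page suggests. The plugin may order them differently.
    """
    ordered = sorted(tokens)
    terms = []
    # Largest combinations first, down to min_words tokens per combination.
    for size in range(len(ordered), min_words - 1, -1):
        for combo in combinations(ordered, size):
            terms.append("".join(combo))
    return terms


doc_terms = unsplit_terms("BOULEVARD PARIS 07 HOPITAL".split(), min_words=3)
query_terms = unsplit_terms("BOULEVARD PARIS 07".split(), min_words=3)

print(doc_terms)
print(query_terms)
# The query matches when it shares at least one term with the document.
print(set(query_terms) & set(doc_terms))
```

Under these assumptions the query yields the single term 07BOULEVARDPARIS, which is also one of the five document terms, so the document matches.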

A few parameters control these combinations.

Features
Setting (default): description

  • keep_originals (true): Keep the original tokens in the token stream, so that they remain searchable.
  • frequent_terms_only (true): Use the file given by path as a filter on the tokens that take part in combinations.
  • path (/usr/share/elasticsearch/plugins/jdonrefv4-0.2/word84.txt): Path to the file in which the unsplit token filter finds the tokens that take part in combinations. See word84.txt in the plugin for an example.
  • min_words_unsplitted (2): Minimum number of tokens in each combination. Setting it to 1 with keep_originals set to true indexes each original token twice (with different token types).
  • max_subwords (1): Maximum number of subwords in each combination. Tokens that share the same position and have the token type "NGRAM" are considered subwords. This means that 07BOULEPAR is not indexed.
  • span_payload (none): Payload given to each combination token.
  • delimiter (none): Single-character delimiter placed between the tokens inside a combination.
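
To make the interaction between keep_originals, min_words_unsplitted and delimiter more concrete, here is a small Python sketch that emits (term, token type) pairs. It only models the behaviour described above and is not taken from the plugin: the function name unsplit, the sorted ordering and the "word" type used for original tokens are all assumptions, and frequent_terms_only, path, max_subwords and span_payload are not modelled.

```python
from itertools import combinations


def unsplit(tokens, min_words_unsplitted=2, keep_originals=True, delimiter=""):
    """Return (term, token_type) pairs, loosely modelled on the settings above."""
    out = []
    if keep_originals:
        # "word" is a placeholder for whatever type the tokenizer assigned.
        out.extend((tok, "word") for tok in tokens)
    ordered = sorted(tokens)  # assumed ordering, inferred from the sample
    for size in range(min_words_unsplitted, len(ordered) + 1):
        for combo in combinations(ordered, size):
            # Every combination carries the 'UNSPLITED' token type.
            out.append((delimiter.join(combo), "UNSPLITED"))
    return out


pairs = unsplit(["PARIS", "07"], min_words_unsplitted=1,
                keep_originals=True, delimiter="_")
for term, token_type in pairs:
    print(term, token_type)
```

With min_words_unsplitted set to 1 and keep_originals set to true, each original token appears twice in this sketch, once with its original type and once as an 'UNSPLITED' token, which is the duplication the min_words_unsplitted entry warns about.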