UnsplitFilter

The unsplit token filter for Elasticsearch recombines ("unsplits") tokens into single terms in order to reduce term query frequencies, for performance reasons.

Sample

For example, the document:

 { "fullName": "BOULEVARD PARIS 07 HOPITAL" }

with the following mapping:

 "fullName" : {"type": "string", "index_analyzer":"myAnalyzer", "search_analyzer":"myAnalyzer"}

and settings like:

 {
   "index" : {
      "analysis" : {
          "analyzer": {
              "myAnalyzer" : {
                  "type" : "custom",
                  "tokenizer" : "whitespace",
                  "filter" : ["lowercase", "unsplit_filter"]
              }
          },
          "filter" : {
              "unsplit_filter" : {
                 "type": "unsplit",
                 "min_words_unsplitted" : 3,
                 "keep_originals" : false
              }
          }
     }
   }
 } 

will index these terms:

  • 07BOULEVARDHOPITALPARIS
  • 07BOULEVARDHOPITAL
  • 07BOULEVARDPARIS
  • 07HOPITALPARIS
  • BOULEVARDHOPITALPARIS

so that the request:

 {
   "query_string" :
   {
       "default_field":"fullName",
       "query":"BOULEVARD PARIS 07",
       "analyzer" : "myAnalyzer"
   }
 }

will match the document.

Note that 07BOULEVARDPARIS has a much lower frequency than 07, BOULEVARD, or PARIS.
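
To check which terms the analyzer emits for a given input, the standard _analyze API can be used. This is only a sketch: it assumes the index is named myindex and an Elasticsearch 1.x installation (the version range this plugin targets).

 curl -XGET 'localhost:9200/myindex/_analyze?analyzer=myAnalyzer' -d 'BOULEVARD PARIS 07 HOPITAL'

The response lists the emitted tokens with their types; the combination tokens show the 'UNSPLITED' type described below.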

How it works

All ordered combinations of the tokens given to the unsplit filter are considered. This means the unsplit filter can be used with synonyms, ngrams, and nearly all other token filters (tell me which one does not work!).

Combination tokens are given the 'UNSPLITED' token type.

The unsplit token filter sorts the tokens of each combination (as strings).
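
The following Java sketch is not the plugin's code, but illustrates the idea on the sample above: it enumerates every combination of at least min_words_unsplitted tokens, sorts each combination as strings, and concatenates it into a single term.

 import java.util.*;
 
 public class UnsplitSketch {
 
     // Build every combination of at least minWords tokens (here via a bit mask),
     // sort each combination as strings, and concatenate it into a single term.
     static Set<String> unsplit(List<String> tokens, int minWords) {
         Set<String> result = new TreeSet<>();
         int n = tokens.size();
         for (int mask = 1; mask < (1 << n); mask++) {
             if (Integer.bitCount(mask) < minWords) continue;
             List<String> combination = new ArrayList<>();
             for (int i = 0; i < n; i++) {
                 if ((mask & (1 << i)) != 0) combination.add(tokens.get(i));
             }
             Collections.sort(combination);            // tokens are sorted as strings
             result.add(String.join("", combination)); // no delimiter by default
         }
         return result;
     }
 
     public static void main(String[] args) {
         // Reproduces the five terms of the sample (min_words_unsplitted = 3).
         // In the real analyzer the lowercase filter runs first; uppercase is kept
         // here only to match the terms listed above.
         System.out.println(unsplit(Arrays.asList("BOULEVARD", "PARIS", "07", "HOPITAL"), 3));
     }
 }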

A few parameters control these combinations; they are listed below.

Features
Settings (default values in parentheses):

  • keep_originals (true): keep the original tokens in the token stream, so that they remain searchable.
  • frequent_terms_only (true): use the path parameter as a filter on the tokens that are involved in combinations.
  • path (/usr/share/elasticsearch/plugins/jdonrefv4-0.2/word84.txt): path to the file in which the unsplit token filter looks for the tokens involved in combinations. See word84.txt in the plugin for an example.
  • min_words_unsplitted (2): minimum number of tokens in each combination. Setting it to 1 with keep_originals set to true will index each original token twice (with different token types). 0 is a special value meaning the maximum value for each token stream; in other words, only 07BOULEVARDHOPITALPARIS would be generated.
  • max_subwords (1): maximum number of subwords in each combination. Tokens at the same position with the token type "NGRAM" are considered subwords. With max_subwords=1 (the default value), 07BOULEPAR will not be indexed, but 07BOULEVARDPAR and 07BOULEPARIS will.
  • span_payload (none): each combination token will be given this payload.
  • delimiter (none): tokens inside a combination will be separated by this one-character delimiter.
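
As an illustration, a filter definition combining several of these settings could look like the following sketch. The path value is the example file shipped with the plugin; the '|' delimiter is just an illustrative one-character choice.

 "filter" : {
     "unsplit_filter" : {
        "type": "unsplit",
        "min_words_unsplitted" : 2,
        "keep_originals" : true,
        "frequent_terms_only" : true,
        "path" : "/usr/share/elasticsearch/plugins/jdonrefv4-0.2/word84.txt",
        "delimiter" : "|"
     }
 }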