UnsplitFilter

UnsplitFilter is an unsplit token filter for Elasticsearch: it concatenates tokens back together in order to reduce the term frequencies involved in queries, which improves performance.

Sample

For example, the document:

 { "fullName": "BOULEVARD PARIS 07 HOPITAL" }

with the following mapping:

 "fullName" : {"type": "string", "index_analyzer":"myAnalyzer", "search_analyzer":"myAnalyzer"}

and settings like:

 {
   "index" : {
      "analysis" : {
          "analyzer": {
              "myAnalyzer" : {
                  "type" : "custom",
                  "tokenizer" : "whitespace",
                  "filter" : ["lowercase", "unsplit_filter"]
              }
          },
          "filter" : {
              "unsplit_filter" : {
                 "type": "unsplit",
                 "min_words_unsplitted" : 3,
                 "keep_originals" : false
              }
          }
     }
   }
 } 

will index these terms:

  • 07BOULEVARDHOPITALPARIS
  • 07BOULEVARDHOPITAL
  • 07BOULEVARDPARIS
  • 07HOPITALPARIS
  • BOULEVARDHOPITALPARIS

so that the query:

 {
   "query_string" :
   {
       "default_field":"fullName",
       "query":"BOULEVARD PARIS 07",
       "analyzer" : "myAnalyzer"
   }
 }

will match the document.

The query is analyzed the same way and produces the term 07BOULEVARDPARIS, which is among the indexed terms. Note that 07BOULEVARDPARIS has a much lower document frequency than 07, BOULEVARD, and PARIS.
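
To check this, the term vectors API can be used (a minimal sketch, assuming a recent enough Elasticsearch 1.x, which computes term vectors on the fly from 1.4; the index name myindex, the type address and the document id 1 are hypothetical):

 curl -XGET 'localhost:9200/myindex/address/1/_termvector?fields=fullName&term_statistics=true'

With term_statistics=true, the response reports the document frequency of each indexed term of fullName, which makes the low frequency of the combination terms easy to verify.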

How it works

For each token stream given to the unsplit filter, all ordered combinations of its tokens are considered. This means the unsplit filter can be combined with synonyms, ngrams, and nearly all other token filters (tell me which one does not work!).

The combination tokens are given the 'UNSPLITED' token type.

The unsplit token filter sorts the tokens of each combination (as strings).

A few parameters are available to control these combinations.
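
The emitted combinations can be inspected with the analyze API (a minimal sketch; the index name myindex is hypothetical and is assumed to have been created with the settings from the sample above):

 curl -XGET 'localhost:9200/myindex/_analyze?analyzer=myAnalyzer' -d 'BOULEVARD PARIS 07 HOPITAL'

Each token of the response should carry the UNSPLITED type and hold the sorted, concatenated text of one combination, as listed in the sample above.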

Features
Each setting is listed below with its default value in parentheses, followed by its description.
keep_originals (true): Keep the original tokens in the token stream (so that they remain searchable).
frequent_terms_only (true): Uses the path parameter as a filter on the tokens involved in combinations. Set it to false only when path is not used.
path (/usr/share/elasticsearch/plugins/jdonrefv4-0.2/word84.txt): Path to the file in which the unsplit token filter will find the tokens involved in combinations. See word84.txt in the plugin for an example.
min_words_unsplitted (2): Minimum number of tokens in each combination. Setting it to 1 with keep_originals set to true will index each original token twice (with different token types). 0 is a special value that means the maximum value for each token stream: in the sample above, only 07BOULEVARDHOPITALPARIS would be generated. min_words_unsplitted can also be a percentage like "70%".
max_subwords (1): Maximum number of subwords allowed in each combination. Tokens within the same position that have the token type "NGRAM" are considered subwords. With max_subwords=1 (the default value), 07BOULEPAR won't be indexed, but 07BOULEVARDPAR and 07BOULEPARIS will.
subwords_types (NGRAM): Since 0.4. Changes the token types that the max_subwords parameter treats as subwords.
required_payloads (none): Tokens carrying one of these payloads must appear in every combination. With "required_payloads":[11], the token stream BOULEVARD|2 PARIS|2 07|3 HOPITAL|5 24|11 will include 24 in every combination, because the token 24 is associated with the payload 11. These tokens can't also be alone (alone_payloads) at the same time for now.
ordered_payloads (none): Since 0.4. Groups of tokens with the same payload are taken in order, so that only the last one can be a subword. With "ordered_payloads":[2], 07BOULEHOPITAL won't be indexed, but 07BOULEVARDHOP will, because HOPITAL is the last word among the tokens with payload 2.
alone_payloads (none): Since 0.4. Only one token per group with the same payload appears in each combination. With "alone_payloads":[2], the combination BOULEHOPITAL won't be indexed because BOULE and HOPITAL share the same payload, 2. These tokens can't also be required (required_payloads) at the same time for now.
at_least_one_payloads (none): Since 0.4. At least one token per group with the same payload must appear in each combination. With "at_least_one_payloads":[2], the combination 07PARIS won't be indexed because BOULE and HOPITAL carry the payload 2 and at least one of them is mandatory.
span_payload (none): Each combination token will be given this payload.
span_score_payload (false): Since 0.4. Computes a score for each combination and stores it as the combination's payload. The score formula is built from score_items and score_value, and the result is relative to score_maximum.
score_maximum (200): Since 0.4. The score made by span_score_payload is scaled against this maximum: if it reaches the maximum, the combination matches the query at 100%.
score_items (none): Since 0.4. score_items and score_value work together: each payload integer value in score_items is associated with a score in score_value. The formula follows alone_payloads, at_least_one_payloads and required_payloads: for payloads referenced in alone_payloads, only one token is needed to get the score_value; combinations without the required payloads won't be generated, which means a score of 0. Other payloads are cumulative: one score_value per token of each payload. The total score is computed by cross-multiplication. See the sketch after this list.
score_value (none): Since 0.4. See score_items.
frequent_terms_limit (0): Since 0.4. Limits the combinations made only of tokens from the path file: no more than frequent_terms_limit such combinations will be indexed. 0 means no limit.
minimum_score (100): Since 0.4. Used with span_score_payload only. Combinations with a score under minimum_score won't be indexed.
delimiter (none): Tokens inside a combination will be separated by this one-character-wide delimiter.
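
As an illustration, a filter definition combining several of these settings might look like the following (a minimal sketch, not taken from the plugin documentation: the payload values and scores are made up, the parallel-array form of score_items and score_value follows the descriptions above, and the payloads are assumed to be set earlier in the analysis chain, for instance by a delimited_payload_filter):

 "filter" : {
     "unsplit_filter" : {
         "type" : "unsplit",
         "min_words_unsplitted" : "70%",
         "keep_originals" : false,
         "delimiter" : " ",
         "alone_payloads" : [2],
         "at_least_one_payloads" : [3],
         "span_score_payload" : true,
         "score_maximum" : 200,
         "minimum_score" : 100,
         "score_items" : [2, 3],
         "score_value" : [50, 100]
     }
 }

With such a definition, at most one token carrying payload 2 appears in each combination and scores 50, at least one token carrying payload 3 is mandatory and each of them scores 100, and combinations whose cross-multiplied score falls under 100 out of 200 are dropped.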