UnsplitFilter

An unsplit token filter for Elasticsearch detokenizes tokens, recombining them into single terms, in order to reduce term query frequencies for performance reasons.

Sample

For example, the document :

 { "fullName": "BOULEVARD PARIS 07 HOPITAL" }

with the following mapping :

 "fullName" : {"type": "string", "index_analyzer":"myAnalyzer", "search_analyzer":"myAnalyzer"}

and settings like :

 {
   "index" : {
      "analysis" : {
          "analyzer": {
              "myAnalyzer" : {
                  "type" : "custom",
                  "tokenizer" : "whitespace",
                  "filter" : ["lowercase", "unsplit_filter"]
              }
          },
          "filter" : {
              "unsplit_filter" : {
                 "type": "unsplit",
                 "min_words_unsplitted" : 3,
                 "keep_originals" : false
              }
          }
     }
   }
 } 

will index these terms :

  • 07BOULEVARDHOPITALPARIS
  • 07BOULEVARDHOPITAL
  • 07BOULEVARDPARIS
  • 07HOPITALPARIS
  • BOULEVARDHOPITALPARIS

so that the request :

 {
   "query_string" :
   {
       "default_field":"fullName",
       "query":"BOULEVARD PARIS 07",
       "analyzer" : "myAnalyzer"
   }
 }

will match the document.

Note that 07BOULEVARDPARIS has a much lower frequency than 07, BOULEVARD, and PARIS taken individually.
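
One way to inspect the terms the analyzer actually produces is the _analyze API, for example (a sketch only: the index name test and the Elasticsearch 1.x query-string form of the call are assumptions):

 GET /test/_analyze?analyzer=myAnalyzer&text=BOULEVARD+PARIS+07+HOPITAL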

How it works

For each token stream given to the unsplit filter, all ordered combinations of tokens are considered. This means the unsplit filter can be used with synonyms, ngrams, and nearly all other token filters (tell me which ones do not work!).

The combination tokens are given the 'UNSPLITED' token type.

The unsplit token filter sorts the tokens of each combination (as strings) before joining them.
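
As a rough illustration of these two steps, here is a minimal Python sketch (not the plugin's actual Java implementation; it ignores positions, payloads, subwords, and most of the parameters described below):

 from itertools import combinations

 def unsplit(tokens, min_words=3, keep_originals=False):
     # Sketch only (assumed behaviour, not the plugin's Java code):
     # every combination of at least min_words tokens is sorted as
     # strings and joined into a single token of type UNSPLITED.
     out = [(t, "word") for t in tokens] if keep_originals else []
     for size in range(min_words, len(tokens) + 1):
         for combo in combinations(tokens, size):
             out.append(("".join(sorted(combo)), "UNSPLITED"))
     return out

 for term, token_type in unsplit(["BOULEVARD", "PARIS", "07", "HOPITAL"]):
     print(term, token_type)

With min_words set to 3 and keep_originals left false, this reproduces the five terms of the sample above (lowercasing left aside).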

A few parameters control these combinations.

Features
Setting (default value) : Description

keep_originals (true) : Keep the original tokens in the token stream, so that they remain searchable.
frequent_terms_only (true) : Use the path parameter as a filter on the tokens involved in combinations. Set it to false only when path is not used.
path (/usr/share/elasticsearch/plugins/jdonrefv4-0.2/word84.txt) : Path to the file in which the unsplit token filter will find the tokens involved in combinations. See word84.txt in the plugin as an example.
min_words_unsplitted (2) : Minimum number of tokens in each combination. Setting it to 1 with keep_originals set to true indexes each original token twice (with different token types). 0 is a special value meaning the maximum value for each token stream; in other words, only 07BOULEVARDHOPITALPARIS would be generated.
max_subwords (1) : Maximum number of subwords in each combination. Tokens within the same position and with the token type "NGRAM" are considered to be subwords. With max_subwords=1 (the default value), 07BOULEPAR will not be indexed, but 07BOULEVARDPAR and 07BOULEPARIS will.
subwords_types (NGRAM) : Since 0.4. Changes the token types that the max_subwords parameter applies to.
required_payloads (none) : Since 0.4. Tokens with these payloads will appear in each combination. With "required_payloads":[11], each combination built from 24|11 BOULEVARD|2 PARIS|2 07|3 HOPITAL|5 will include 24, because the token 24 is associated with the payload 11.
ordered_payloads (none) : Since 0.4. Groups of tokens with the same payload are taken in order, so that only the last one can be a subword. With "ordered_payloads":[2], 07BOULEHOPITAL will not be indexed, but 07BOULEVARDHOP will, because HOPITAL is the last word among the tokens with payload 2.
span_payload (none) : Each combination token will be given this payload.
delimiter (none) : Tokens inside a combination will be separated by this one-character-wide delimiter.
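
For illustration only, a filter definition combining several of these settings might look like the following; the exact value formats (arrays versus single values) and the chosen delimiter are assumptions, not taken from the plugin documentation:

 "filter" : {
     "unsplit_filter" : {
        "type" : "unsplit",
        "min_words_unsplitted" : 2,
        "keep_originals" : true,
        "max_subwords" : 1,
        "subwords_types" : ["NGRAM"],
        "required_payloads" : [11],
        "ordered_payloads" : [2],
        "delimiter" : "_"
     }
 }

With a delimiter set this way, the terms of the sample above would presumably be indexed as 07_BOULEVARD_PARIS and so on, instead of 07BOULEVARDPARIS.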