PayloadCheckerSpanFilter
Checks results against grammar rules based on payload values.
Sample
For example, the document:
{ "fullName": "BOULEVARD|1 DE|1 PARIS|1 07|2 L|3 HOPITAL|3" }
with the following mapping:
"fullName" : {"type": "string", "term_vector" : "with_positions_offsets_payloads", "index_analyzer":"myAnalyzer"}
and settings like:
{ "index" : { "analysis" : { "analyzer": { "myAnalyzer" : { "type" : "custom", "tokenizer" : "whitespace", "filter" : ["delimited_payload_filter", "lowercase", "tokencount_payload_filter"] } }, "filter" : { "delimited_payload_filter" : { "type": "delimited_payload_filter", "delimiter" : "|", "encoding" : "int" }, "tokencount_payload_filter" : { "type": "tokencountpayloads", "factor": 1000 } } } } }
will match the query:
{ "span_payloadchecker" : { "clauses": [ { "span_multipayloadterm" : { "fullName": "BOULEVARD" } }, { "span_multipayloadterm" : { "fullName": "PARIS" } }, { "span_multipayloadterm" : { "fullName": "HOPITAL" } } ], "checker": { "type":"Grouped" } } }
but does not match the following:
{ "span_payloadchecker" : { "clauses": [ { "span_multipayloadterm" : { "fullName": "BOULEVARD" } }, { "span_multipayloadterm" : { "fullName": "HOPITAL" } }, { "span_multipayloadterm" : { "fullName": "PARIS" } } ], "checker": { "type":"Grouped" } } }
because, even though they carry the same payload, BOULEVARD and PARIS are not grouped together in the second query.
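Putting the mapping and settings above together, an index supporting this sample could be created in a single request. This is a sketch: the index name myindex and the type name myType are placeholders, not part of the plugin.

```json
PUT /myindex
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "myAnalyzer" : {
            "type" : "custom",
            "tokenizer" : "whitespace",
            "filter" : ["delimited_payload_filter", "lowercase", "tokencount_payload_filter"]
          }
        },
        "filter" : {
          "delimited_payload_filter" : {
            "type" : "delimited_payload_filter",
            "delimiter" : "|",
            "encoding" : "int"
          },
          "tokencount_payload_filter" : {
            "type" : "tokencountpayloads",
            "factor" : 1000
          }
        }
      }
    }
  },
  "mappings" : {
    "myType" : {
      "properties" : {
        "fullName" : {
          "type" : "string",
          "term_vector" : "with_positions_offsets_payloads",
          "index_analyzer" : "myAnalyzer"
        }
      }
    }
  }
}
```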
Checkers
The following checkers are available:
| checker type | description | sample |
|---|---|---|
| And | All of the given checkers in the array must validate for the document to validate | { "type":"And", "checkers" : [ ... ] } |
| Or | At least one of the given checkers in the array must validate for the document to validate | { "type":"Or", "checkers" : [ ... ] } |
| Xor | Exactly one of the given checkers in the array must validate for the document to validate | { "type":"Xor", "checkers" : [ ... ] } |
| Not | The given checker must not validate for the document to validate | { "type":"Not", "checker" : { ... } } |
| All | All tokens with the given payload (bytes[] or int) must match for the document to validate. Requires the tokencountpayloads token filter. | { "type":"All", "payload" : "AAAD8Q==" } |
| One | At least one token with the given payload (bytes[] or int) must match for the document to validate | { "type":"One", "payload" : "AAAD8Q==" } |
| Field | The field value (defaults to _type) must match for the document to validate | { "type":"Field", "field" : "type_de_voie", "value" : "BOULEVARD" } |
| Grouped | Groups of tokens sharing the same payload must not interleave for the document to validate | { "type":"Grouped" } |
| BeforeAnother | Checks whether tokens with a given payload appear just before another group of payloads | { "type":"BeforeAnother", "payloadbefore" : "AAAD8Q==", "another" : "AAAD8Q==" } |
| If | If the condition matches, the then clause must match for the document to validate. Otherwise the document validates. Be aware of performance. | { "type":"If", "condition" : {...}, "then" : {...} } |
| IfElse | If the condition matches, the then clause must match for the document to validate. If the condition does not match, the else clause must match instead. Be aware of performance (see Switch). | { "type":"IfElse", "condition" : {...}, "then" : {...}, "else" : {...} } |
| Switch | If a switch condition clause matches, the corresponding clause must match for the document to validate. The default case is not customizable and is set to true (Null). | { "type":"Switch", "field" : "_type", "clauses" : { "adresse": { ... }, "commune": { ... } } } |
| Limit | No longer available (>=0.4). Validates the first "limit" documents. | { "type":"Limit", "limit" : 100 } |
| Null | Validates all documents | { "type":"Null" } |
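Checkers can be nested. As an illustration (the field name type_de_voie comes from the Field sample above; the combination itself is hypothetical), an And checker can require both a field value and grouped payloads:

```json
{
  "span_payloadchecker" : {
    "clauses": [
      { "span_multipayloadterm" : { "fullName": "BOULEVARD" } },
      { "span_multipayloadterm" : { "fullName": "PARIS" } }
    ],
    "checker": {
      "type": "And",
      "checkers": [
        { "type": "Field", "field": "type_de_voie", "value": "BOULEVARD" },
        { "type": "Grouped" }
      ]
    }
  }
}
```

A document then validates only if both inner checkers validate.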
How it works
Elasticsearch stores documents in a flat inverted index: it does not natively distinguish one term from another. For example, it relies on term frequencies to rank terms in a corpus.
However, in a simple sentence such as an address, terms can play very different roles, and frequency may not be usable in such cases.
For example, for the request:
57 BD DE L HOPITAL 75 PARIS
We do not want to get results such as:
75 BD DE L HOPITAL 75013 PARIS (where 75 is read as the street number)
75 rue de paris 57 L HOPITAL (the town L HOPITAL in Moselle; it does not exist in real life, it is just an example)
The exactness of the street number matters more here than its frequency in the corpus.
To get the most relevant results, Elasticsearch uses the Lucene engine to compute a score for each result. PayloadCheckerSpanQuery overrides that scorer with its own score for each result. In practice, it lowers the score of bad results.
Performance considerations
Note that span checks may be slower than a simple match query! Although the results will obey your grammar rules, the query scans every document matching your query terms. Here are two tips to improve performance:
- terms are internally reordered by term frequency. That means if your request is 'RUE EIFFEL', the internal request becomes 'EIFFEL RUE': the number of documents checked for the 'RUE' term drops to the 'EIFFEL' document count. Of course, the GroupedPayload and OneBeforeAnother checkers support this reordering.
- you can reduce functionality when terms are too frequent. That means a request for 'RUE' alone, which matches about 600,000 documents in FRANCE, will bypass the payload checker and return an empty response. With more terms (like 'RUE EIFFEL'), the checker is enabled again. The query has a "limit" parameter to set the number of documents not to exceed. This parameter is useful when you query multiple indexes with different levels of detail.
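For illustration, the "limit" parameter mentioned above might be set as follows. This is a sketch: the parameter name comes from the text, but its exact placement in the query body is an assumption and should be checked against your plugin version.

```json
{
  "span_payloadchecker" : {
    "clauses": [
      { "span_multipayloadterm" : { "fullName": "RUE" } },
      { "span_multipayloadterm" : { "fullName": "EIFFEL" } }
    ],
    "limit": 100000,
    "checker": { "type": "Grouped" }
  }
}
```

With this setting, if the matched term set would exceed 100,000 documents, the payload checker is bypassed and an empty response is returned.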
Scoring (soon)
The scoring system of PayloadCheckerSpanQuery has two objectives:
- return the most representative results
- provide an absolute rating, which means an AI could pick the right choice among the results (see bulk mode)
The scoring concept follows the SpanScorer and the BooleanScorer, and is inspired by the scoring in JDONREFv2 and JDONREFv3 (adapted to an inverted index). It was necessary to change the DefaultSimilarity class.
Bulk mode gives each score a ceiling. The ceiling for a result is the maximum score that result could possibly reach. The bulk score is then simply the score divided by that maximum score, and a rule of three is applied for convenience (to express it as something like 100% or 20/20).
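In other words, with hypothetical numbers:

```
bulk_score = score / ceiling              (a value between 0 and 1)

example:  score = 15, ceiling = 20
          bulk_score = 15 / 20 = 0.75  ->  75%  (or 15/20)
```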
Bulk mode performs worse than the traditional way, so it is not recommended for autocomplete and is provided as an optional mode. It is not necessary when a human being can choose between results, but an AI can automatically pick a result when its score is, for example, above 19/20.
Customise Scoring
(soon)