The Apache Lucene project, which Elasticsearch builds on, began life as a pure text search engine, indexing tokens (words) from a document to build an on-disk inverted index so you could later quickly search for documents containing a specific token. This hard problem was already challenging enough!

The project became very successful with time, and naturally users wanted to index numbers too, to apply numeric range filters to their textual searches, such as "find all digital cameras that cost less than $150".

The problem was that, to Lucene, everything had to be a simple text token. The obvious way to work with numbers was therefore to index each number as its own text token and then use the already existing TermRangeQuery, which accepts all tokens in a single alphabetic range, to filter on the numeric range.

The immediate challenge with this approach is that Lucene sorts all tokens alphabetically (in Unicode code point order), which means simple numbers in decimal form won't be in the right order. Imagine indexing numbers such as 2 and 17: in Lucene's index order, the token 17 sorts before the token 2. Now if you want to find all numbers in the range of 17-23, inclusive, TermRangeQuery will incorrectly include the number 2, shocking your users!

Fortunately the fix, way back when, was simple: just left-zero-pad your numbers to the length of the longest number, so that 2 becomes 002 and 17 becomes 017. See how 002 now sorts (correctly) before 017. Those leading 0s do not change the numeric value, but they do cause Lucene to sort in the correct order, so that TermRangeQuery matches only the numbers in the requested numeric range.

It turns out this was a viable approach, used by Lucene users for years! However, it was a bit wasteful, with all these extra 0s in the index (although they do compress well!).
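The sorting problem and the zero-padding fix can be reproduced in a few lines of Python. This is only a sketch and does not use Lucene: Python's string comparison, which also orders by Unicode code point, stands in for Lucene's term order. The helper `term_range` is a hypothetical stand-in for an inclusive TermRangeQuery, and the sample numbers are illustrative.

```python
def term_range(terms, lower, upper):
    """Hypothetical stand-in for an inclusive term range query:
    return every term t with lower <= t <= upper, compared as strings."""
    return [t for t in sorted(terms) if lower <= t <= upper]


# Illustrative sample values (an assumption, not taken from a real index).
numbers = [2, 17, 23, 5, 150]

# Naive approach: index each number as its plain decimal string.
plain_terms = [str(n) for n in numbers]
plain_hits = term_range(plain_terms, "17", "23")

# Fix: left-zero-pad every number to the length of the longest one.
width = max(len(str(n)) for n in numbers)
padded_terms = [str(n).zfill(width) for n in numbers]
padded_hits = term_range(padded_terms, "17".zfill(width), "23".zfill(width))

print(sorted(plain_terms))   # ['150', '17', '2', '23', '5']
print(plain_hits)            # ['17', '2', '23']  (2 is wrongly included)
print(sorted(padded_terms))  # ['002', '005', '017', '023', '150']
print(padded_hits)           # ['017', '023']
```

Note that the query bounds must be padded to the same width as the indexed terms, otherwise the string comparison goes wrong again.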