Working with Elasticsearch feels like sitting in the cockpit of a rocket ship: empowering and terrifying all at the same time. I’ve been learning a lot about the product, and the documentation has been the best resource on my educational journey. In this post, I hope to explain how to index a U.S. phone number using a custom analyzer I cobbled together. This post assumes you have a single field that contains only a phone number.
Understanding a User’s Search Experience
Before constructing the analyzer, I needed to understand how a user might search for a phone number. Let’s look at a simple phone number.
(555) 789 1234
How might a user start typing into a search input to find this phone number?
555… 555789… 5557891234
A user is unlikely to waste time on superfluous characters like dashes or parentheses. They will also start by searching from the beginning of the phone number and continue to the end. Let’s look at another phone number.
1 800 555 1234
Damn, we have a 1 at the beginning of that number, but it is pointless. We’ve been conditioned to think of 800 numbers as 1-800. I guess we’ll need to account for that too.
Game Plan
So we need a game plan when writing an analyzer. We already know the values we’ll be indexing, and we know what the user is going to be typing into the search input. What is important about our data?
- Digits are significant; everything else is noise.
- We are looking for ten digits at a minimum.
- The leading 1 is not helpful.
- Users search from left to right (Arabic numerals).
Elasticsearch Know-how
What is an analyzer and what are its parts?
An analyzer — whether built-in or custom — is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters. –Elasticsearch - Anatomy of an analyzer
I’ll analyze the phone numbers using all three: a character filter, a tokenizer, and token filters. The order is significant because each stage feeds the next, which lets us pipeline our data into appropriate tokens for both the index and the user search.
Character Filters
Let’s start with a phone number.
1-(800) 555 1234
The first thing we’ll want to do is strip all non-digit characters. We can do that with a character filter. Our digits_only character filter matches all non-digit characters and replaces them with an empty string. After this is applied, we have the following result.
18005551234
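To make that step concrete, here is a rough Python sketch of what a character filter like digits_only does. It only mimics the behavior with the standard re module and is not Elasticsearch itself; expressing "non-digit" as the regex \D is an assumption.

```python
import re


def digits_only(text: str) -> str:
    # Assumption: "non-digit" is expressed as the regex \D.
    # This mimics the character filter in plain Python; it does not call Elasticsearch.
    return re.sub(r"\D", "", text)


print(digits_only("1-(800) 555 1234"))  # prints 18005551234
```

Dashes, parentheses, and spaces all disappear in one pass, so every later stage only has to reason about digits.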
Tokenizing
The next important phase is to tokenize our input. Here we’ll use the keyword tokenizer. It is a no-op for this use case: it emits the entire input as a single token, so it does not affect our current value. We still have our phone number.
18005551234
Token Filter
Token filters allow you to increase or decrease the number of tokens produced for your index. In our case, we want two filters: a U.S. phone number pattern capture, and a ten-digit minimum filter. The second one will make more sense after I explain the U.S. phone number filter.
U.S. Phone Number Filter
That pesky leading 1 is a real pain, so we need to pick our number out. This is where we apply a pattern_capture filter.
"us_phone_number": {
  "type": "pattern_capture",
  "preserve_original": true,
  "patterns": [
    "1?(1)(\\d*)"
  ]
}
Applying the filter to our current number, we get the following tokens.
1 8005551234 18005551234
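The results may be confusing at first. The pattern capture filter above matches at the first 1 it encounters in our data and emits one token per capture group: the 1 itself and the digits that follow it. Because preserve_original is true, we also keep the original token, since it makes for a decent search token.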
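If the capture groups still feel mysterious, this rough Python model may help. It approximates the filter with re.finditer and is not Lucene's implementation: the helper name pattern_capture is made up for illustration, and Elasticsearch may order or de-duplicate tokens differently.

```python
import re


def pattern_capture(token, patterns, preserve_original=True):
    """Rough model of a pattern_capture token filter (not the real one).

    Emits the original token (optionally) plus every non-empty capture
    group from every match of every pattern, skipping duplicates.
    """
    tokens = [token] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            for group in match.groups():
                if group and group not in tokens:
                    tokens.append(group)
    return tokens


# Same pattern as the filter above; in JSON the backslash has to be doubled.
US_PHONE_PATTERN = r"1?(1)(\d*)"

print(pattern_capture("18005551234", [US_PHONE_PATTERN]))
# prints ['18005551234', '1', '8005551234']
```

For this input the optional 1? ends up matching nothing: the regex engine backtracks so that (1) can capture the leading 1, and (\d*) takes the remaining ten digits.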
Ten Digit Minimum Filter
You can tell now that the value of 1 is a bad search token. Given an index with 100,000 1-800 numbers, we’d end up with 100,000 hits! Not what we want. Let’s apply a length filter.
After this, we have the following tokens.
8005551234 18005551234
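The effect of a minimum-length filter is easy to mimic in Python. This sketch assumes a minimum of ten characters, matching the ten-digit rule from the game plan; the function name min_length is only illustrative.

```python
def min_length(tokens, minimum=10):
    """Keep only tokens with at least `minimum` characters."""
    return [token for token in tokens if len(token) >= minimum]


print(min_length(["1", "8005551234", "18005551234"]))
# prints ['8005551234', '18005551234']
```

The lone 1 is dropped, and so is any other short fragment the capture pattern might produce.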
The Whole Cannoli
Putting our character filters, tokenizer, and token filters together, we have the phone_number analyzer.
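A minimal sketch of how the pieces might be wired together, written as a Python dict so it can be serialized with json.dumps. The name ten_digits_min for the length filter is a placeholder (assumption); digits_only and us_phone_number are the names used earlier.

```python
import json

# Sketch of a custom analyzer entry. "ten_digits_min" is an assumed name
# for the length filter; the other names come from the sections above.
phone_number_analyzer = {
    "type": "custom",
    "char_filter": ["digits_only"],  # 1. strip non-digits
    "tokenizer": "keyword",          # 2. keep the whole input as one token
    "filter": ["us_phone_number", "ten_digits_min"],  # 3. capture, then prune
}

print(json.dumps({"phone_number": phone_number_analyzer}, indent=2))
```

The order of the filter list matters: the capture filter has to run before the length filter so the stray 1 exists before it is pruned.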
Running some _analyze queries against Elasticsearch gets us the results we want.
Even when the text input has whitespace in all the wrong places, we still get the token we wanted!
You can now use a prefix query to give your users results as they type.
IMPORTANT: For prefix queries to work properly, we have to provide a different search_analyzer than our indexing analyzer. Our indexing analyzer only keeps tokens of at least ten digits. If the search text went through that same analyzer, a partial input such as 555 would be dropped entirely, which would limit our search to exact matches.
I’ve updated the next section accordingly.
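To see why the search side needs gentler treatment, here is a small Python simulation. It is only an approximation of the analyzers using re, and the search-side behavior (strip non-digits, no length limit) is an assumption about what a suitable search analyzer would do; all function names are illustrative.

```python
import re


def index_tokens(number):
    """Approximate the indexing analyzer: digits only, capture, min length 10."""
    digits = re.sub(r"\D", "", number)
    tokens = [digits]
    match = re.search(r"1?(1)(\d*)", digits)  # simplified: first match only
    if match:
        tokens.extend(group for group in match.groups() if group)
    return [token for token in tokens if len(token) >= 10]


def search_term(text):
    """Hypothetical search-side analysis: strip non-digits, no length limit."""
    return re.sub(r"\D", "", text)


def prefix_match(typed, number):
    """True if the typed digits are a prefix of any indexed token."""
    term = search_term(typed)
    return bool(term) and any(t.startswith(term) for t in index_tokens(number))


indexed = "1-(800) 555 1234"
print(index_tokens("800 555"))             # prints [] (too short for the index analyzer)
print(prefix_match("800 555", indexed))    # prints True
print(prefix_match("1 (800) 5", indexed))  # prints True
print(prefix_match("555 12", indexed))     # prints False
```

Partial input survives only when the length filter is left out at search time, and a match still requires typing from the start of the number, with or without the leading 1.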
Putting It All Together
I’ve included the entire index and mapping so that you can test this in your instance of Elasticsearch and Kibana.
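As a starting point, here is a sketch of what a complete index body could look like, built as a Python dict and printed as JSON. Several things here are assumptions rather than the exact original: the typeless mappings layout (Elasticsearch 7+), the field name phone, the filter name ten_digits_min, the search analyzer name phone_number_search, and the \D pattern for the character filter.

```python
import json

# Sketch of a full index body. Assumed names: ten_digits_min,
# phone_number_search, and the field "phone". Assumes Elasticsearch 7+.
index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "digits_only": {
                    "type": "pattern_replace",
                    "pattern": "\\D",
                    "replacement": "",
                }
            },
            "filter": {
                "us_phone_number": {
                    "type": "pattern_capture",
                    "preserve_original": True,
                    "patterns": ["1?(1)(\\d*)"],
                },
                "ten_digits_min": {"type": "length", "min": 10},
            },
            "analyzer": {
                # Index side: capture the number, drop short tokens.
                "phone_number": {
                    "type": "custom",
                    "char_filter": ["digits_only"],
                    "tokenizer": "keyword",
                    "filter": ["us_phone_number", "ten_digits_min"],
                },
                # Search side: digits only, no length limit.
                "phone_number_search": {
                    "type": "custom",
                    "char_filter": ["digits_only"],
                    "tokenizer": "keyword",
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "phone": {
                "type": "text",
                "analyzer": "phone_number",
                "search_analyzer": "phone_number_search",
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The printed JSON could serve as the body of a create-index request in the Kibana console; adjust the field name and analyzer names to match your own data.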
Conclusion
I’m still pretty new to Elasticsearch analyzers. My current employer has been using Elasticsearch for more than a year now, but this is my first time doing a deep dive like this. I think I’m going to need a bigger boat.
P.S. Let me know if I can do this a better way or ask me questions and I’ll do my best to answer.