Working with Elasticsearch feels like sitting in the cockpit of a rocket ship: empowering and terrifying all at the same time. I've been learning a lot about the product, and the documentation has been the best resource on my educational journey. In this post, I hope to explain how to index a U.S. phone number using a custom analyzer I cobbled together. This post assumes you have a single field that holds just a phone number.
Understanding a User’s Search Experience
Before constructing the analyzer, I needed to understand how a user might search for a phone number. Let’s look at a simple phone number.
(555) 789 1234
How might a user start typing into a search input to find this phone number?
555… 555789… 5557891234
A user will not likely waste their time on superfluous characters like dashes or parentheses. They will also start by searching from the beginning of the phone number and continue to the end. Let's look at another phone number.
1 800 555 1234
Damn, we have a 1 at the beginning of that number, but it is pointless. We've been conditioned to think of 800 numbers as 1-800. I guess we'll need to account for that too.
Gameplan
So we need a game plan for writing our analyzer. We already know the values we'll be indexing, and we know what the user will type into the search input. What is important about our data?
- Digits are significant; everything else is noise.
- We are looking for ten digits at a minimum.
- The leading 1 is not helpful.
- Users search from left to right (Arabic numerals).
Elasticsearch Know-how
What is an analyzer and what are its parts?
An analyzer — whether built-in or custom — is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters. –Elasticsearch - Anatomy of an analyzer
I'll analyze the phone numbers with a character filter, a tokenizer, and token filters. The order is significant, as it allows us to pipeline our data to produce appropriate tokens for both indexing and user search.
Character Filters
Let’s start with a phone number.
1-(800) 555 1234
The first thing we'll want to do is strip all non-digit characters. We can do that with a character filter.
"digits_only": {
"type": "pattern_replace",
"pattern": "[^\\d]"
}
Our digits_only character filter matches every non-digit character and replaces it with an empty string. After it is applied, we have the following result.
18005551234
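If you want to verify this step in isolation, the _analyze API accepts inline definitions (in Elasticsearch 5.x and later), so you can exercise the character filter without creating an index first. A quick sanity-check sketch:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[^\\d]"
    }
  ],
  "text": "1-(800) 555 1234"
}

The response should contain a single 18005551234 token.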
Tokenizing
The next important phase is to tokenize our input. Here we'll use the keyword tokenizer. It is a no-op for this use case, since it emits the input as one unchanged token. We still have our phone number.
18005551234
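The keyword tokenizer's pass-through behavior is easy to confirm with another _analyze request; the response should hold exactly one token, identical to the input.

POST _analyze
{
  "tokenizer": "keyword",
  "text": "18005551234"
}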
Token Filter
Token filters allow you to increase or decrease the number of tokens produced for your index. In our case, we want two filters: a U.S. phone number pattern capture, and a ten-digit minimum filter. The second one will make more sense after I explain the U.S. phone number filter.
U.S. Phone Number Filter
That pesky leading 1 is a real pain, so we need to pick the actual number out. For that, we apply a pattern_capture token filter.
"us_phone_number": {
"type": "pattern_capture",
"preserve_original": true,
"patterns": [
"1?(1)(\\d*)"
]
}
Applying the filter to our current number, we get the following tokens.
1 8005551234 18005551234
The results may be confusing, but the pattern capture filter above captures the first 1 it encounters as its own token and the digits after it as another. We also keep the original token, since it makes for a decent search token.
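As before, you can exercise just this filter with an inline _analyze request (again assuming 5.x or later) and watch all three tokens come back:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pattern_capture",
      "preserve_original": true,
      "patterns": ["1?(1)(\\d*)"]
    }
  ],
  "text": "18005551234"
}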
Ten Digit Minimum Filter
You can tell by now that 1 is a bad search token. Given an index with 100,000 1-800 numbers, we'd end up with 100,000 hits! Not what we want. Let's apply a length filter.
"ten_digits_min": {
"type": "length",
"min": 10
}
After this, we have the following tokens.
8005551234 18005551234
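One more inline _analyze sketch (same 5.x+ assumption) chains the two filters and shows the length filter dropping the lone 1 while both ten-plus-digit tokens survive:

POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pattern_capture",
      "preserve_original": true,
      "patterns": ["1?(1)(\\d*)"]
    },
    {
      "type": "length",
      "min": 10
    }
  ],
  "text": "18005551234"
}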
The Whole Cannoli
Putting our character filter, tokenizer, and token filters together, we have the phone_number analyzer.
"phone_number": {
"char_filter": "digits_only",
"tokenizer": "keyword",
"filter": [
"us_phone_number",
"ten_digits_min"
]
}
Running some _analyze queries against Elasticsearch gets us the results we want.
POST phone-numbers/_analyze
{
  "text": " 555 3 2 1 8286",
  "analyzer": "phone_number"
}
The text input above has whitespace in all the wrong places, but…
{
  "tokens": [
    {
      "token": "5553218286",
      "start_offset": 6,
      "end_offset": 21,
      "type": "word",
      "position": 0
    }
  ]
}
We still get the token we wanted!
You can now use a prefix query to give your users results as they type.
IMPORTANT: For prefix queries to work properly, we have to provide a different search_analyzer than our indexing analyzer. Our indexing analyzer limits tokens to at least ten digits, which would restrict searches to exact matches. I've updated the next section appropriately.
POST phone-numbers/_search
{
  "query": {
    "prefix": {
      "phone_number": {
        "value": "555321"
      }
    }
  }
}
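To try this end to end, you could index a document and then run the prefix search against it. The contact type name here is hypothetical; the _default_ mapping in the next section applies to whatever type you use on this pre-6.x index:

PUT phone-numbers/contact/1
{
  "phone_number": "1-(800) 555 1234"
}

POST phone-numbers/_search
{
  "query": {
    "prefix": {
      "phone_number": {
        "value": "1800555"
      }
    }
  }
}

The indexed tokens are 8005551234 and 18005551234, so the 1800555 prefix matches the second one.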
Putting It All Together
I’ve included the entire index and mapping so that you can test this in your instance of Elasticsearch and Kibana.
DELETE phone-numbers

PUT phone-numbers
{
  "settings": {
    "analysis": {
      "char_filter": {
        "digits_only": {
          "type": "pattern_replace",
          "pattern": "[^\\d]"
        }
      },
      "filter": {
        "us_phone_number": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "1?(1)(\\d*)"
          ]
        },
        "ten_digits_min": {
          "type": "length",
          "min": 10
        },
        "not_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "phone_number": {
          "char_filter": "digits_only",
          "tokenizer": "keyword",
          "filter": [
            "us_phone_number",
            "ten_digits_min"
          ]
        },
        "phone_number_search": {
          "char_filter": "digits_only",
          "tokenizer": "keyword",
          "filter": [
            "not_empty"
          ]
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "phone_number": {
          "type": "text",
          "analyzer": "phone_number",
          "search_analyzer": "phone_number_search"
        }
      }
    }
  }
}
POST phone-numbers/_analyze
{
  "text": " 555 3 2 1 8286",
  "analyzer": "phone_number"
}

POST phone-numbers/_analyze
{
  "text": "1(855) 555-5344",
  "analyzer": "phone_number"
}

POST phone-numbers/_analyze
{
  "text": "1(855) 555",
  "analyzer": "phone_number_search"
}
Conclusion
I’m still pretty new to Elasticsearch analyzers. My current employer has been using Elasticsearch for more than a year now, but this is my first time doing a deep dive like this. I think I’m going to need a bigger boat.
P.S. Let me know if I can do this a better way or ask me questions and I’ll do my best to answer.