Prefix Search with ArangoSearch

Search for strings or tokens that start with one or more substrings

A typical use case for matching prefixes is to provide word completion (auto-complete). If the requirement is to find exact prefix matches only then indexing strings with the identity Analyzer is sufficient. The search term needs to have the original capitalization to match (case-sensitive) in that case.

Prefix matching can be used together with normalizing Analyzers (norm, text) for case-insensitive and accent-insensitive searches. This is often preferable over exact prefix matching in auto-complete scenarios, because users may type everything in lower case and not use characters with diacritical marks but just the base characters.

The fastest option for prefix matching is to use edge n-grams, a feature supported by text Analyzers. They make it possible for Views and inverted indexes to simply look up prefixes in the index. The downsides are that the minimum and maximum n-gram length needs to be defined upfront and only prefixes in this range will be found, and that edge n-grams can take up more space.

Exact Prefix Matching

In its basic form without case normalization and accent removal, prefix matching can be done if a field is indexed with just the identity Analyzer. It creates the necessary index data to perform prefix queries with STARTS_WITH() but also wildcard queries with LIKE().

Dataset

IMDB movie dataset

View definition

search-alias View

db.imdb_vertices.ensureIndex({ name: "inv-exact", type: "inverted", fields: [ "title" ] });
db._createView("imdb", "search-alias", { indexes: [ { collection: "imdb_vertices", index: "inv-exact" } ] });

arangosearch View

{
  "links": {
    "imdb_vertices": {
      "fields": {
        "title": {
          "analyzers": [
            "identity"
          ]
        }
      }
    }
  }
}

AQL queries

Match all movie titles that start with "The Matri" (case-sensitive):

FOR doc IN imdb
  SEARCH STARTS_WITH(doc.title, "The Matr")
  RETURN doc.title
Result
The Matrix
The Matrix Reloaded
The Matrix Revolutions
The Matrix Trilogy
The Matrix Revisited

Match movie titles that start with either "The Matr" or "Harry Pot" using OR:

FOR doc IN imdb
  SEARCH STARTS_WITH(doc.title, "The Matr") OR STARTS_WITH(doc.title, "Harry Pot")
  RETURN doc.title
Result
The Matrix Revisited
The Matrix
The Matrix Reloaded
The Matrix Revolutions
Harry Potter and the Sorcerer’s Stone
Harry Potter and the Chamber Of Secrets
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Goblet Of Fire
Harry Potter and the Order of the Phoenix
Harry Potter and the Half-Blood Prince
Harry Potter Collection
The Matrix Trilogy
Harry Potter and the Deathly Hallows: Part I
Harry Potter and the Deathly Hallows: Part II

Match movie titles that start with either "The Matr" or "Harry Pot" utilizing the feature of the STARTS_WITH() function that allows you to pass multiple possible prefixes as array of strings, of which one must match:

FOR doc IN imdb
  SEARCH STARTS_WITH(doc.title, ["The Matr", "Harry Pot"])
  RETURN doc.title
Result
The Matrix Revisited
The Matrix
The Matrix Reloaded
The Matrix Revolutions
Harry Potter and the Sorcerer’s Stone
Harry Potter and the Chamber Of Secrets
Harry Potter and the Prisoner of Azkaban
Harry Potter and the Goblet Of Fire
Harry Potter and the Order of the Phoenix
Harry Potter and the Half-Blood Prince
Harry Potter Collection
The Matrix Trilogy
Harry Potter and the Deathly Hallows: Part I
Harry Potter and the Deathly Hallows: Part II

Match Multiple Token Prefixes

It is possible to match strings if one out of multiple prefix conditions is fulfilled with a single call to STARTS_WITH(), as it supports an array of prefixes. The STARTS_WITH() function also accepts an optional third argument to define how many of the given prefixes must match. This is handy if the input is full-text tokenized by a text Analyzer or an array of strings, where conditions for different tokens can be fulfilled.

Dataset

IMDB movie dataset

View definition

search-alias View

db.imdb_vertices.ensureIndex({ name: "inv-text", type: "inverted", fields: [ { name: "title", analyzer: "text_en" } ] });
db._createView("imdb_alias", "search-alias", { indexes: [ { collection: "imdb_vertices", index: "inv-text" } ] });

arangosearch View

{
  "links": {
    "imdb_vertices": {
      "fields": {
        "title": {
          "analyzers": [
            "text_en"
          ]
        }
      }
    }
  }
}

AQL queries

Match movie titles that contain three out of five prefixes.

search-alias View:

FOR doc IN imdb_alias
  SEARCH STARTS_WITH(doc.title, TOKENS("Sec Cham Har Pot Phoe", "text_en"), 3)
  RETURN doc.title

ArangoSearch View:

FOR doc IN imdb
  SEARCH ANALYZER(STARTS_WITH(doc.title, TOKENS("Sec Cham Har Pot Phoe", "text_en"), 3), "text_en")
  RETURN doc.title
Result
Harry Potter and the Chamber Of Secrets
Harry Potter and the Order of the Phoenix

You can calculate the number of prefixes that need to match dynamically, for example to require that all prefixes must match.

search-alias View:

LET prefixes = TOKENS("Brot Blu", "text_en")
FOR doc IN imdb_alias
  SEARCH STARTS_WITH(doc.title, prefixes, LENGTH(prefixes))
  RETURN doc.title

ArangoSearch View:

LET prefixes = TOKENS("Brot Blu", "text_en")
FOR doc IN imdb
  SEARCH ANALYZER(STARTS_WITH(doc.title, prefixes, LENGTH(prefixes)), "text_en")
  RETURN doc.title
Result
The Blues Brothers
Blues Brothers 2000

Edge N-Grams

Edge n-grams is a feature of the text Analyzer type. It generates n-grams for each token (usually a word), but anchored to the beginning of the token. It thus creates prefix n-grams only and not all n-grams for the input.

Dataset

IMDB movie dataset

Custom Analyzer

Create a text Analyzer in arangosh to normalize case to all lowercase, remove diacritics, with no stemming, with edge n-grams of size 3 to 6 for example and including the original string as well:

//db._useDatabase("your_database"); // Analyzer will be created in current database
var analyzers = require("@arangodb/analyzers");
analyzers.save("edge_ngram", "text", { locale: "en", accent: false, case: "lower", stemming: false, edgeNgram: { min: 3, max: 6, preserveOriginal: true } }, ["frequency", "norm", "position"]);

Test the Analyzer:

db._query(`RETURN TOKENS("Ocean Equilibrium", "edge_ngram")`);
[
  [
    "oce",
    "ocea",
    "ocean",
    "equ",
    "equi",
    "equil",
    "equili",
    "equilibrium"
  ]
]

View definition

search-alias View

db.imdb_vertices.ensureIndex({ name: "inv-ngram", type: "inverted", fields: [ { name: "title", analyzer: "edge_ngram" } ] });
db._createView("imdb_alias", "search-alias", { indexes: [ { collection: "imdb_vertices", index: "inv-ngram" } ] });

arangosearch View

{
  "links": {
    "imdb_vertices": {
      "fields": {
        "title": {
          "analyzers": [
            "edge_ngram"
          ]
        }
      }
    }
  }
}

AQL queries

Match movie titles that have a word starting with "ocea"

search-alias View:

FOR doc IN imdb_alias
  SEARCH doc.title == "ocea"
  RETURN doc.title

arangosearch View:

FOR doc IN imdb
  SEARCH ANALYZER(doc.title == "ocea", "edge_ngram")
  RETURN doc.title
Result
Ocean Voyagers
Ocean’s Eleven
Ocean’s Twelve
Ocean’s Thirteen
Ocean’s Eleven
Ocean’s Collection

Note that the search term must be normalized in order to match something. You can create a text Analyzer that matches the configuration of the edge n-gram text Analyzer to pre-process the search terms in the same way, but without creating any n-grams:

//db._useDatabase("your_database"); // Analyzer will be created in current database
var analyzers = require("@arangodb/analyzers");
analyzers.save("match_edge_ngram", "text", { locale: "en", accent: false, case: "lower", stemming: false }, ["frequency", "norm", "position"]);

Now we can also match movie titles that start with "Oceä" (normalized to "ocea"):

search-alias View:

FOR doc IN imdb_alias
  SEARCH doc.title == TOKENS("Oceä", "match_edge_ngram")[0]
  RETURN doc.title

ArangoSearch View:

FOR doc IN imdb
  SEARCH ANALYZER(doc.title == TOKENS("Oceä", "match_edge_ngram")[0], "edge_ngram")
  RETURN doc.title
Result
Ocean Voyagers
Ocean’s Eleven
Ocean’s Twelve
Ocean’s Thirteen
Ocean’s Eleven
Ocean’s Collection

What we cannot match search terms that are longer than the maximum edge n-gram size (or shorter than the minimum edge n-gram size), except for full tokens if preserveOriginal is enabled. For example, this query does not match anything because the longest indexed edge n-gram is "equili" but the search term is nine characters long:

search-alias View:

FOR doc IN imdb_alias
  SEARCH doc.title == TOKENS("Equilibri", "match_edge_ngram")[0]
  RETURN doc.title

ArangoSearch View:

FOR doc IN imdb
  SEARCH ANALYZER(doc.title == TOKENS("Equilibri", "match_edge_ngram")[0], "edge_ngram")
  RETURN doc.title

Searching for "Equilibrium" does match because the full token "equilibrium" is indexed by our custom Analyzer thanks to preserveOriginal. We can take advantage of the full token being indexed with the STARTS_WITH() function:

search-alias View:

FOR doc IN imdb_alias
  SEARCH STARTS_WITH(doc.title, TOKENS("Equilibri", "match_edge_ngram"))
  RETURN doc.title

ArangoSearch View:

FOR doc IN imdb
  SEARCH ANALYZER(STARTS_WITH(doc.title, TOKENS("Equilibri", "match_edge_ngram")), "edge_ngram")
  RETURN doc.title
Result
Equilibrium

Note however, that this will not be as fast as matching an edge n-gram directly.