Implementing a Searchbox with Neo4j

A customer recently asked me how to implement a Google-like search box. Users should be able to enter a search string and get meaningful auto-complete suggestions back. The text to search came from a single label/property. Meaningful in this context meant that:

  • results that start with the search string should come first

  • results that contain the search string should come after the first group

  • relevant results should come first in each group

  • the search should cope with spelling errors

Neo4j indexes have improved significantly with the 3.5 release, and we have all we need available. With an index on the property, STARTS WITH is fast enough for our purpose. For the second criterion, we could use the CONTAINS operator, but that is not particularly fast. Neither operator supports 'sounds like' (spelling error) searches. Ignoring that for a moment, how can we achieve the ordering of the results? A simple UNION to combine two searches works well (pseudo code):

// n.name stands in for the searched property
MATCH (n:Label) WHERE n.name STARTS WITH $searchterm
  RETURN n.name AS result ORDER BY result LIMIT 5
UNION
MATCH (n:Label) WHERE n.name CONTAINS $searchterm AND NOT n.name STARTS WITH $searchterm
  RETURN n.name AS result ORDER BY result LIMIT 5

But, as mentioned, the second part of the query is not that fast, and so far we only have exact matching.
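
To make the intended ordering concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j evaluates the query; the function name `suggest` and the sample values are made up for illustration. Like the Cypher operators, the matching is case-sensitive.

```python
def suggest(values, search_term, limit=5):
    """Return prefix matches first, then other substring matches.

    Each group is sorted alphabetically and capped at `limit`,
    mirroring the two halves of the UNION query.
    """
    # Group 1: values that start with the search term
    starts = sorted(v for v in values if v.startswith(search_term))
    # Group 2: values that contain the term but do not start with it
    contains = sorted(
        v for v in values
        if search_term in v and not v.startswith(search_term)
    )
    return starts[:limit] + contains[:limit]


if __name__ == "__main__":
    sample = ["big data", "data science", "database", "graphs", "open data"]
    print(suggest(sample, "data"))
```

The sketch scans every value, which is exactly the cost an index is meant to avoid; it only documents the ordering rules.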

Luckily, Neo4j supports full-text search, backed by Apache Lucene. Lucene provides performant contains searches as well as fuzzy search. The disadvantage is that full-text indexes are not integrated into Cypher; instead, one has to work with the db.index.fulltext.* group of procedures. Querying such an index returns each matched node together with a score indicating how relevant Lucene thinks the result is. Unfortunately, I could not find a way to combine a starts-with search with fuzziness in Lucene.

To demo the idea, I used data I parsed from Twitter for the Neo4j filter bubble. The model is simple, and for our search we only need the User node and the FOLLOWS relationship:

Figure 1. Twitter schema

There are currently about 4 million user nodes and the database is roughly 50GB in total. Big enough to prove the search works.

I created a Lucene index via:

CALL db.index.fulltext.createNodeIndex('desc', ['User'], ['description'])

Lucene indexes can span multiple labels and properties.

What would our search using that index look like?

// u.name / node.name is assumed to hold the display name
MATCH (u:User) WHERE u.description STARTS WITH $searchterm
  RETURN u.name AS user, u.description AS description ORDER BY description LIMIT 5
UNION
CALL db.index.fulltext.queryNodes('desc', $searchterm + '~') YIELD node, score
  WITH node, score ORDER BY score DESC LIMIT 5
  RETURN node.name AS user, node.description AS description

The first part uses a Neo4j index to find anything starting with $searchterm, and the second part does the fuzzy contains search.
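
One caveat with appending '~' to the raw input: in Lucene's query syntax the fuzzy operator applies to a single term, and characters such as + or : have a special meaning. The following Python sketch shows one way the query string could be prepared on the client before it is passed as a parameter. The helper name `fuzzy_query` is hypothetical, and the choice to make every whitespace-separated word fuzzy is an assumption, not something the original setup does.

```python
import re

# Characters with special meaning in Lucene's classic query syntax
LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def fuzzy_query(search_term):
    """Build a Lucene query string where every word is matched fuzzily."""
    words = search_term.split()
    # Escape special characters with a backslash, then append the fuzzy operator
    escaped = [LUCENE_SPECIAL.sub(r"\\\1", word) for word in words]
    return " ".join(word + "~" for word in escaped)


if __name__ == "__main__":
    print(fuzzy_query("graph databse"))
```

The resulting string would replace the `$searchterm + '~'` expression, for example by passing it as its own query parameter.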

The result when searching for data (user, then description):

  • (name missing): data & design @ spotify

  • Seth Goldman: data & insights at @DiscoveryEd | @mcps & @UofMaryland alum | #rstats & #PyData acolyte

  • Mariella de Crouy Chanel: data & interactive graphics • #d3js #python #dataviz • #luxembourg • ☀️

  • Alberto Gonzalez Paje: data + design @fjord

  • The MarcLab for Connectomics: In the data.

  • All the Mais Christian: Data are.

  • Brice Richardson: Data | Data | Data. Father of 2.

  • Daya Insan Sidhmukh: (description missing)

So far so good, but what about the relevance criterion? Well, since we are using a graph database, this is easy. Neo4j provides a Graph Algorithms library which lets us calculate a score (relevance) for a node. In this example, I chose the PageRank algorithm, using the FOLLOWS relationships:

CALL algo.pageRank('User', 'FOLLOWS',
  {iterations: 20, dampingFactor: 0.85, write: true, writeProperty: "pagerank"})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty

This calculates a ranking of users and writes the result back as a pagerank property on the User nodes.
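
For readers who have not met PageRank before, here is a textbook power-iteration sketch in Python using the same iteration count and damping factor. It is not the Graph Algorithms library's implementation, whose scores may be scaled differently; the tiny follower graph and the function name are made up for illustration.

```python
def pagerank(follows, iterations=20, damping=0.85):
    """Textbook PageRank on a dict mapping each user to the users they follow."""
    nodes = set(follows) | {t for targets in follows.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same base share from the random jump
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = follows.get(source, [])
            if targets:
                # A user passes rank on to the accounts they follow
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A user who follows nobody spreads rank evenly over all nodes
                share = damping * rank[source] / n
                for node in nodes:
                    new_rank[node] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"alice": ["neo4j"], "bob": ["neo4j", "alice"], "carol": ["neo4j"]}
    for user, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(user, round(score, 3))
```

The account followed by the most (and best-ranked) users ends up with the highest score, which is the property the search relies on.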

To use that pagerank, the query now looks like this:

// u.name / node.name is assumed to hold the display name
MATCH (u:User) WHERE u.description STARTS WITH $searchterm
  RETURN u.name AS user, u.description AS description, u.pagerank AS rank ORDER BY rank DESC LIMIT 5
UNION
CALL db.index.fulltext.queryNodes('desc', $searchterm + '~') YIELD node, score
  RETURN node.name AS user, node.description AS description, node.pagerank AS rank ORDER BY rank DESC LIMIT 5

This gives a much better result (for data again; user, then description):

  • (name missing): data analyze & visualize product planner, maybe… interested in web analytics, ML, AI & energy harvesting

  • fred datchary: dataviz for competitive intelligence prospective & innovation

  • Marko Plahuta: data visualization & machine learning, owner of

  • Doyle Groves: data scientist, independent variable

  • (name missing): data || :@mecanosaurio

  • Neo4j: The #1 platform for connected data. Developers start here:

  • Dew Wardah: I like data, web, semantic web| Random tweets,suka suka ;)|was born on Oct 26th|Indonesian|Moslem|Javanese

  • (name missing): Graph Cloud at global scale, secured and trusted by Enterprise and Government to deliver ROI through Connected Data

  • (name missing): No more a '(personal #data protection and #information architect). I’m a #graph lover, a casual #Rstats/#Lisp/#Smalltalk/#Prolog dev and a #kickscooter addict

  • Aleksander Stensby: COO at Software developer, architect and data geek.

As you can see, Neo4j has the highest relevance in my dataset, but because its description does not start with data, it only appears after the starts-with matches.

There are many UI components for a Google-like search available. I chose EasyAutocomplete, as it only needs jQuery as a dependency and comes with good documentation. Even without much configuration, it can use images in the list:

Figure 2. Autocomplete example

The data is provided via a REST endpoint, which in turn calls the query above. With Spring Boot, this is just a few lines of code. You can check my example on GitHub.
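
The Spring Boot code itself is in the linked example. As a framework-neutral illustration of what such an endpoint has to do, here is a minimal Python sketch of the handler logic. Everything in it is an assumption: `run_query` is a hypothetical callable that executes the search query with the given parameters and returns rows as dicts, and the `min_length` guard and the JSON field names are illustrative choices.

```python
import json


def suggestions_json(run_query, search_term, min_length=2):
    """Return autocomplete suggestions for search_term as a JSON array string."""
    term = search_term.strip()
    # Skip the database round trip for empty or very short input
    if len(term) < min_length:
        return "[]"
    rows = run_query({"searchterm": term})
    seen = set()
    suggestions = []
    for row in rows:
        # Drop repeated rows while keeping the order delivered by the query
        key = (row["user"], row["description"])
        if key in seen:
            continue
        seen.add(key)
        suggestions.append({
            "user": row["user"],
            "description": row["description"],
            "rank": row.get("rank"),
        })
    return json.dumps(suggestions)


if __name__ == "__main__":
    def fake_query(params):
        # Stand-in for a real database call
        return [{"user": "Neo4j", "description": "connected data", "rank": 9.5}]

    print(suggestions_json(fake_query, "data"))
```

Passing the search term as a query parameter, rather than concatenating it into the Cypher text, keeps user input out of the query string.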