Lucene

Full-text search module for Crux making use of Apache Lucene.

crux-lucene runs in-process as a Crux module within the Crux node. The Lucene index is updated synchronously as the node processes Crux transactions.

This module is in alpha and likely to change. In particular, we might rationalize/combine the different query functions available (see https://github.com/juxt/crux/issues/1318).

Setup

First, add the crux-lucene dependency to your project:

deps.edn

pro.juxt.crux/crux-lucene {:mvn/version "1.17.1"}

pom.xml

<dependency>
    <groupId>pro.juxt.crux</groupId>
    <artifactId>crux-lucene</artifactId>
    <version>1.17.1</version>
</dependency>

Add the following to your node configuration:

JSON

{
  "crux.lucene/lucene-store": {
    "db-dir": "lucene"
  }
}

Clojure

{...
 :crux.lucene/lucene-store {:db-dir "lucene-dir"}}

EDN

{...
 :crux.lucene/lucene-store {:db-dir "lucene-dir"}}

Lucene configuration can only be added to a fresh node. If you add Lucene configuration to an already populated Crux node, the node will throw an exception on startup: Lucene store latest tx mismatch. To remedy this, wipe the index directories for the Crux node and restart. The Lucene index will then be populated as part of the normal Crux node ingestion process.

Querying

All text fields in a document are indexed automatically. You can then use the built-in text-search function in your Datalog queries:

{:find '[?e]
 :where '[[(text-search :name "Ivan") [[?e]]]
          [?e :crux.db/id]]}
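For context, here is a minimal end-to-end sketch of running the query above from a REPL. It assumes crux-core and crux-lucene are on the classpath. The in-memory node, the temporary index directory, the `result` var and the `:ivan` document are illustrative choices, not part of the module.

```clojure
(require '[crux.api :as c])
(import '(java.nio.file Files)
        '(java.nio.file.attribute FileAttribute))

;; Assumption: a throwaway directory, so the Lucene index starts empty
;; alongside the fresh in-memory node.
(def lucene-dir
  (str (Files/createTempDirectory "lucene" (make-array FileAttribute 0))))

(def result
  (with-open [node (c/start-node {:crux.lucene/lucene-store {:db-dir lucene-dir}})]
    ;; Block until the transaction has been processed; the Lucene index
    ;; is updated as part of that processing.
    (c/await-tx node
                (c/submit-tx node [[:crux.tx/put {:crux.db/id :ivan, :name "Ivan"}]]))
    ;; Run the text search against the resulting database snapshot.
    (c/q (c/db node)
         '{:find [?e]
           :where [[(text-search :name "Ivan") [[?e]]]
                   [?e :crux.db/id]]})))
```

Using a temporary directory keeps the sketch repeatable: reusing an old Lucene directory with a new in-memory node is the kind of index mismatch described in the setup section.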

The destructuring available is entity-id, matched-value and score. For example, to return the full search results tuple:

{:find '[?e ?v ?s]
 :where '[[(text-search :name "Ivan") [[?e ?v ?s]]]
          [?e :crux.db/id]]}

In the above example, ?e is the entity ID of the matched search result, ?v is the matched value and ?s is the match score.

You can use standard Lucene query syntax, such as wildcards:

{:find '[?e]
 :where '[[(text-search :name "Iva*") [[?e]]]
          [?e :crux.db/id]]}

This will return all entities with a :name attribute that starts with "Iva".

All query functions implemented in crux-lucene pass your query string directly to Lucene’s QueryParser.parse using the StandardAnalyzer, without any escaping or other modifications. See the Lucene documentation for more information.
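Because the string is handed to Lucene's classic query parser unchanged, other operators from that syntax should work as well. As a sketch, the `~` fuzzy operator below is standard Lucene query syntax; the misspelled search term "Ivon" is a made-up example, and which documents match depends on your data and on Lucene's default edit distance.

```clojure
;; Fuzzy match: "Ivon~" asks Lucene for terms within a small edit
;; distance of "ivon", so a :name of "Ivan" is a candidate match.
{:find '[?e ?v]
 :where '[[(text-search :name "Ivon~") [[?e ?v]]]
          [?e :crux.db/id]]}
```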

It’s possible to supply var bindings to use in text-search:

(c/q db '{:find  [?v]
          :in    [input]
          :where [[(text-search :name input) [[?e ?v]]]]}
     "Ivan")

Wildcard Attributes

There is an experimental wildcard search function that searches across all attributes:

{:find '[?e ?v ?a ?s]
 :where '[[(wildcard-text-search "Iva*") [[?e ?v ?a ?s]]]
          [?e :crux.db/id]]}

This returns all entities that have any attribute whose value matches "Iva*". The destructured binding also contains ?a, which is the matched attribute.

Multi-field searches

A separate search function, lucene-text-search, is available for multi-field searches using Lucene in Crux:

{:find '[?e]
 :where '[[(lucene-text-search "firstname:James OR surname:preston") [[?e]]]]}

This lucene-text-search takes a Lucene query string.

If you use lucene-text-search, you cannot use the search functions listed above - wildcard-text-search and text-search. This is because the way we index documents into Lucene is different.

In the normal case for text-search and wildcard-text-search, each A/V pair in a Crux document is indexed as an individual document in Lucene. This allows for a large degree of structural sharing, which helps when there is a lot of historical data in Crux. The aim is to reduce the disk space taken up by Lucene and to improve query efficiency.

lucene-text-search indexes a single Lucene document per document version in Crux. The downside is the loss of structural sharing, which increases disk usage. The upside is access to more of the Lucene query language, including queries that take multiple fields into account.

To enable lucene-text-search, you must configure the Lucene indexer, like so:

{...
 :crux.lucene/lucene-store {:db-dir "lucene-dir" :indexer 'crux.lucene.multi-field/->indexer}}
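To illustrate the configuration above, here is a minimal sketch that starts a node with the multi-field indexer and runs a lucene-text-search query. It assumes crux-core and crux-lucene are on the classpath. The in-memory node, the temporary directory, the `multi-field-result` var and the `:james` document are illustrative assumptions.

```clojure
(require '[crux.api :as c])
(import '(java.nio.file Files)
        '(java.nio.file.attribute FileAttribute))

;; Assumption: a throwaway index directory for this sketch.
(def multi-field-dir
  (str (Files/createTempDirectory "lucene-multi" (make-array FileAttribute 0))))

(def multi-field-result
  (with-open [node (c/start-node
                    {:crux.lucene/lucene-store
                     {:db-dir multi-field-dir
                      :indexer 'crux.lucene.multi-field/->indexer}})]
    ;; Each document version becomes a single Lucene document,
    ;; so both fields can be referenced in one query string.
    (c/await-tx node
                (c/submit-tx node [[:crux.tx/put {:crux.db/id :james
                                                  :firstname "James"
                                                  :surname "Preston"}]]))
    (c/q (c/db node)
         '{:find [?e]
           :where [[(lucene-text-search "firstname:James AND surname:preston") [[?e]]]]})))
```

The query string names the fields directly, which is what distinguishes this function from text-search, where the attribute is passed as a separate argument.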

Bindings

You can also supply var bindings, which are substituted into the query string using format once the vars are bound.

{:find '[?e]
 :in '[?surname ?firstname]
 :where '[[(lucene-text-search "surname: %s AND firstname: %s" ?surname ?firstname) [[?e]]]]}

String Escaping

You can escape your input strings when constructing Lucene query strings by calling org.apache.lucene.queryparser.classic.QueryParser/escape. For example, this method would transform "|&hello&|" to "\\|\\&hello\\&\\|".

This helps guard against injection attacks and query parsing errors caused by special characters in user input.
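As a sketch of how escaping might be combined with the bindings shown earlier, the helper below escapes untrusted input before passing it to lucene-text-search as a bound var. The name `find-by-surname` is hypothetical, and the example assumes a node configured with the multi-field indexer.

```clojure
(require '[crux.api :as c])
(import '(org.apache.lucene.queryparser.classic QueryParser))

(defn find-by-surname
  "Hypothetical helper: searches the surname field for raw-surname,
  escaping Lucene query syntax characters in the input first."
  [db raw-surname]
  (c/q db
       '{:find [?e]
         :in [?surname]
         :where [[(lucene-text-search "surname: %s" ?surname) [[?e]]]]}
       ;; Escape before the value is substituted into the query string.
       (QueryParser/escape raw-surname)))
```

Calling `(find-by-surname (c/db node) "O'Brien AND x")` would then treat any special characters in the input as literal text rather than as query operators. Note that escaping only neutralizes special characters; words such as AND in the input are still seen by the query parser.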

Reference