Full-text Lucene search

Full-text search module for Crux making use of Apache Lucene.

crux-lucene runs in-process as part of the Crux node as a Crux module. The Lucene index is kept up-to-date asynchronously as a "secondary index" that hooks into the underlying Crux transaction processing.

Setup

First, add the crux-lucene dependency to your project:

deps.edn

pro.juxt.crux/crux-lucene {:mvn/version "1.18.0"}

pom.xml

<dependency>
    <groupId>pro.juxt.crux</groupId>
    <artifactId>crux-lucene</artifactId>
    <version>1.18.0</version>
</dependency>

Add the following to your node configuration:

JSON

Omit "db-dir" to start an in-memory-only Lucene store.

{
  "crux.lucene/lucene-store": {
    "db-dir": "lucene-dir"
  }
}

Clojure

{...
 ; omit `:db-dir` to start an in-memory-only Lucene store.
 :crux.lucene/lucene-store {:db-dir "lucene-dir"}}

EDN

{...
 ; omit `:db-dir` to start an in-memory-only Lucene store.
 :crux.lucene/lucene-store {:db-dir "lucene-dir"}}

Indexing

All top-level text fields in a document are automatically indexed.

Datalog Querying

You can use the built-in text-search function within your Datalog queries:

{:find '[?e]
 :where '[[(text-search :name "Ivan") [[?e]]]
          [?e :crux.db/id]]}

The destructuring available is entity-id, matched-value and score. For example, to return the complete search results tuples:

{:find '[?e ?v ?s]
 :where '[[(text-search :name "Ivan") [[?e ?v ?s]]]
          [?e :crux.db/id]]}

In the above example, ?e is the entity ID of the matched search result. ?v is the matched value and ?s is the matched score.

You can use standard Lucene query syntax for inexact matches, for example a trailing wildcard:

{:find '[?e]
 :where '[[(text-search :name "Iva*") [[?e]]]
          [?e :crux.db/id]]}

This will return all entities with a :name attribute that starts with "Iva". Note that large result sets will be fully realized and there is currently no means to specify a limit.

This module does not support "leading wildcard" searches, e.g. "*van", even though Lucene is technically capable of performing these queries.

All query functions implemented in crux-lucene pass your query string directly to Lucene’s QueryParser.parse using the StandardAnalyzer, without any escaping or other modifications. See the Lucene documentation for more information.

It’s possible to supply var bindings to use in text-search:

(c/q db '{:find [?v]
          :in [input]
          :where [[(text-search :name input) [[?e ?v]]]]}
     "Ivan")

Wildcard Attributes

There is a wildcard search function, where you can search across all attributes:

{:find '[?e ?v ?a ?s]
 :where '[[(wildcard-text-search "Iva*") [[?e ?v ?a ?s]]]
          [?e :crux.db/id]]}

This will return all entities that have an attribute with a value that matches "Iva*". The destructured binding also contains ?a, which is the matched attribute.

Multi-field searches

There is an entirely different Lucene search function available for multi-field searches: lucene-text-search:

{:find '[?e]
 :where '[[(lucene-text-search "firstname:James OR surname:preston") [[?e]]]]}

lucene-text-search accepts a Lucene query string, whose format is described in the Lucene query parser documentation.

To enable this document-oriented query model, the structure of the indexes stored in Lucene is fundamentally different from the structure required for the EAV-oriented features discussed above. Using both querying and indexing approaches within the same node is supported and is discussed below.

In the normal case for text-search and wildcard-text-search, we index each EAV in a Crux document as an individual document in Lucene. This allows for some degree of structural sharing, which should help in the case where there is a lot of historical data in Crux.

By contrast, lucene-text-search indexes a single document per document-version in Crux. The downside of this is that there is no structural sharing (besides whatever tricks Lucene employs under the hood), but the upside is that you can take advantage of more of the Lucene query language, e.g. to perform queries that take multiple fields into account.

To enable lucene-text-search, you must configure the Lucene indexer like so:

{...
 :crux.lucene/lucene-store {:indexer 'crux.lucene.multi-field/->indexer}}
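
As a rough, conceptual sketch of the trade-off between the two indexing approaches described above, consider two versions of the same Crux document in which only the surname changes. This is plain Clojure data manipulation and does not reflect the actual Lucene document layout used by crux-lucene; the helper names eav-entries and multi-field-entry are hypothetical.

```clojure
;; Conceptual illustration only: models which index entries each approach
;; would need, not how crux-lucene actually encodes Lucene documents.

(def v1 {:crux.db/id :ivan, :firstname "Ivan", :surname "Petrov"})
(def v2 {:crux.db/id :ivan, :firstname "Ivan", :surname "Ivanov"})

(defn eav-entries
  "EAV-style: one entry per top-level string attribute (hypothetical helper)."
  [doc]
  (set (for [[a v] (dissoc doc :crux.db/id)
             :when (string? v)]
         [(:crux.db/id doc) a v])))

(defn multi-field-entry
  "Document-style: one entry per document version, carrying every string
  field (hypothetical helper)."
  [doc]
  (into {} (filter (comp string? val)) (dissoc doc :crux.db/id)))

;; EAV-style entries can be shared between versions when a value is unchanged.
(def eav-index (into (eav-entries v1) (eav-entries v2)))
(count eav-index)
;; => 3 (the [:ivan :firstname "Ivan"] entry appears once)

;; Document-style indexing keeps one entry per version, repeating "Ivan".
(def multi-field-index (mapv multi-field-entry [v1 v2]))
(count multi-field-index)
;; => 2
```

Each multi-field entry holds all of its fields together, which is what allows a query such as "firstname:James OR surname:preston" to be evaluated against a single Lucene document.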

Bindings

It’s possible to supply var bindings here too. They are spliced into the query string using java.lang.String.format once the vars are bound.

{:find [?e]
 :in [?surname ?firstname]
 :where [[(lucene-text-search "surname: %s AND firstname: %s" ?surname ?firstname) [[?e]]]]}
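
To see what this substitution produces, the sketch below applies clojure.core/format (which delegates to java.lang.String.format) to the same template outside of Crux. The helper name expand-query and the sample values are illustrative assumptions, not part of the crux-lucene API.

```clojure
;; Illustration only: plain clojure.core/format, not a call into crux-lucene.
(defn expand-query
  "Hypothetical helper: returns the string produced by filling the %s slots."
  [template & args]
  (apply format template args))

(def expanded
  (expand-query "surname: %s AND firstname: %s" "Preston" "James"))
;; => "surname: Preston AND firstname: James"

;; Values are spliced in verbatim, so Lucene syntax inside a bound value
;; becomes part of the query string:
(def expanded-wildcard
  (expand-query "surname: %s" "Pres*"))
;; => "surname: Pres*"
```

Because bound values end up inside the query string unchanged, any characters that are special to Lucene will be interpreted by the query parser.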

String Escaping

You can escape your input strings when constructing Lucene query strings by calling org.apache.lucene.queryparser.classic.QueryParser/escape. For example, this method would transform "|&hello&|" to "\\|\\&hello\\&\\|".

This helps to mitigate injection attacks and other query-parsing errors.
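
A minimal sketch of escaping a value before binding it into a query, assuming the Lucene query parser classes are on the classpath (they are expected to be when crux-lucene is a dependency) and that c is an alias for the Crux API namespace, as in the earlier examples. The var names raw-input, escaped and query are illustrative.

```clojure
(import '(org.apache.lucene.queryparser.classic QueryParser))

;; Untrusted input containing characters that are special to Lucene.
(def raw-input "|&hello&|")

;; QueryParser/escape prefixes each special character with a backslash.
(def escaped (QueryParser/escape raw-input))
;; printed as a Clojure string literal: "\\|\\&hello\\&\\|"

;; The escaped value can then be supplied as an ordinary query argument.
(def query
  '{:find [?e]
    :in [?surname]
    :where [[(lucene-text-search "surname: %s" ?surname) [[?e]]]]})

;; Given an open db value (not created here):
;; (c/q db query escaped)
```

Note that escape only handles special characters. Operator keywords such as AND and OR appearing in the input are left as they are, so you may want to validate input further if that matters for your use case.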

Custom searching outside of Datalog

The more direct crux.lucene/search function is available to lazily return results, without the temporal filtering or other constraints of using Lucene via the q API.

The function accepts up to 3 parameters (node, query and, optionally, opts) and returns an iterable cursor of results that must be closed.

The query parameter can be either a Lucene query string or an org.apache.lucene.search.Query object.

The opts parameter accepts a map with a single :default-field entry. The value of this entry will be supplied to the Lucene QueryParser in the cases where the supplied query parameter is a Lucene query string.

(with-open [search-results (crux.lucene/search node "Ivan")]
  (into [] (iterator-seq search-results)))

Each item returned will be a vector of org.apache.lucene.document.Document and a Double representing the matched score.

See the extension tests for examples of decoding the contents of the result document and performing userspace temporal filtering.

Custom Indexer

It is possible to implement a custom indexer based on the crux.lucene/LuceneIndexer protocol, which may be necessary to address more complex requirements. See the extension tests for examples.

Custom Analyzer

Lucene provides a huge amount of capability beyond the default StandardAnalyzer. See the extension tests for examples.

Multiple Lucene modules

The built-in search functions all accept an additional opts map parameter as the last argument. This can be included in your Datalog query as a literal or passed in using a logic variable. The value under :lucene-store-k in this map can be set to specify that a search function should be run against a particular module (i.e. a specific Lucene secondary index, if many are configured), otherwise the search function will attempt to execute against the default :crux.lucene/lucene-store module.

See the extension tests for an example of configuring multiple Lucene modules to run on the same node.
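
As a sketch of the opts map described above, the query below passes it as a literal in the last argument position of text-search. The key :other-lucene-store is a hypothetical module key; substitute whatever key your additional Lucene store is registered under in the node configuration.

```clojure
;; :other-lucene-store is a hypothetical module key used for illustration.
(def multi-store-query
  {:find '[?e]
   :where '[[(text-search :name "Ivan" {:lucene-store-k :other-lucene-store}) [[?e]]]
            [?e :crux.db/id]]})
```

Without the opts map, the same query would run against the default :crux.lucene/lucene-store module.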

Checkpointing Lucene

For more details about checkpointing in Crux, see the main Checkpointing docs.

You can set up checkpointing on your Lucene store too, in addition to the main Crux query indices. This means that a new node starting up will be able to download a checkpoint of a reasonably recent Lucene store from a central location rather than having to replay all of the transactions.

The parameters are the same as for the main Crux query indices, except applied to your Lucene store component:

JSON

{
  "crux.lucene/lucene-store": {
    "db-dir": "lucene-dir",
    "checkpointer": {
      "crux/module": "crux.checkpoint/->checkpointer",
      "store": {
        "crux/module": "crux.checkpoint/->filesystem-checkpoint-store",
        "path": "/path/to/cp-store"
      },
      "approx-frequency": "PT6H"
    }
  },
  ...
}

Clojure

{:crux.lucene/lucene-store {:db-dir "lucene-dir"
                            :checkpointer {:crux/module 'crux.checkpoint/->checkpointer
                                           :store {:crux/module 'crux.checkpoint/->filesystem-checkpoint-store
                                                   :path "/path/to/cp-store"}
                                           :approx-frequency (Duration/ofHours 6)}}
 ...}

EDN

{:crux.lucene/lucene-store {:db-dir "lucene-dir"
                            :checkpointer {:crux/module crux.checkpoint/->checkpointer
                                           :store {:crux/module crux.checkpoint/->filesystem-checkpoint-store
                                                   :path "/path/to/cp-store"}
                                           :approx-frequency "PT6H"}}
 ...}

Reference