Full-text Lucene search
Full-text search module for Crux making use of Apache Lucene.
crux-lucene
runs in-process as part of the Crux node as a Crux module.
The Lucene index is kept up-to-date asynchronously as a "secondary index" that hooks into the underlying Crux transaction processing.
Setup
First, add the crux-lucene
dependency to your project:
pro.juxt.crux/crux-lucene {:mvn/version "1.18.0"}
<dependency>
<groupId>pro.juxt.crux</groupId>
<artifactId>crux-lucene</artifactId>
<version>1.18.0</version>
</dependency>
Add the following to your node configuration:
{
"crux.lucene/lucene-store": {
// omit `"db-dir"` to start an in-memory-only Lucene store.
"db-dir": "lucene",
}
}
{...
; omit `:db-dir` to start an in-memory-only Lucene store.
:crux.lucene/lucene-store {:db-dir "lucene-dir"}}
{...
; omit `:db-dir` to start an in-memory-only Lucene store.
:crux.lucene/lucene-store {:db-dir "lucene-dir"}}
Datalog Querying
You can use the built-in text-search
function within in your Datalog queries:
{:find '[?e]
:where '[[(text-search :name "Ivan") [[?e]]]
[?e :crux.db/id]]}
The destructuring available is entity-id
, matched-value
and score
.
For example, to return the complete search results tuples:
{:find '[?e ?v ?s]
:where '[[(text-search :name "Ivan") [[?e ?v ?s]]]
[?e :crux.db/id]]}
In the above example, ?e
is the entity ID of the matched search result.
?v
is the matched value and ?s
is the matched score.
You can use standard Lucene fuzzy textual search capabilities:
{:find '[?e]
:where '[[(text-search :name "Iva*") [[?e]]]
[?e :crux.db/id]]}
This will return all entities with a :name
attribute that starts with "Iva". Note that large result sets will be fully realized and there is currently no means to specify a limit.
This module does not support "leading wildcard" searches, e.g. "*van", even though Lucene is technically capable of performing these queries.
All query functions implemented in crux-lucene
pass your query string directly to Lucene’s QueryParser.parse
using the StandardAnalyzer
, without any escaping or other modifications.
See the Lucene documentation for more information.
It’s possible to supply var bindings to use in text-search
:
(c/q db '{:find [?v]
:in [input]
:where [[(text-search :name input) [[?e ?v]]]]}
"Ivan")
Wildcard Attributes
There is a wildcard search function, where you can search across all attributes:
{:find '[?e ?v ?a ?s]
:where '[[(wildcard-text-search "Iva*") [[?e ?v ?a ?s]]]
[?e :crux.db/id]]}
This will return all entities that have an attribute with a value that matches "Iva".
The destructured binding also contains a
which is the matched attribute.
Multi-field searches
There is an entirely different Lucene search function available for multi-field searches: lucene-text-search
:
{:find '[?e]
:where '[[(lucene-text-search "firstname:James OR surname:preston") [[?e]]]]}
lucene-text-search
accepts a Lucene query string whose format is documented extensively elsewhere.
To enable this document-oriented query model, the structure of the indexes stored in Lucene is fundamentally different to the structure required for the EAV-oriented features discussed above, Using both querying and indexing approaches within the same node is supported and discussed discussed below.
In the normal case for text-search
and wildcard-text-search
, we index each EAV in a Crux document as individual documents in Lucene.
This allows for some degree of structural sharing, which should help in the case where there is a lot historical data in Crux.
By contrast, lucene-text-search
indexes a single document per document-version in Crux.
The downside of this is there is no structural sharing (besides whatever tricks Lucene employs under the hood), but the upside is taking advantage of more of the Lucene query language capability, e.g. to perform queries that multiple fields into account.
To enable lucene-text-search
, you must configure the Lucene Indexer, such like:
{...
:crux.lucene/lucene-store {:indexer 'crux.lucene.multi-field/->indexer}}
Bindings
It’s possible to supply var bindings also, which are wired in using java.lang.String.format
when the vars are bound.
{:find [?e]
:in [?surname ?firstname]
:where [[(lucene-text-search "surname: %s AND firstname: %s" ?surname ?firstname) [[?e]]]]}
String Escaping
You can escape your input strings when constructing Lucene query strings by calling org.apache.lucene.queryparser.classic.QueryParser/escape
. For example, this method would transform "|&hello&|"
to "\\|\\&hello\\&\\|"
.
This is helpful to mitigate against injection attacks and other errors.
Custom searching outside of Datalog
The more direct crux.lucene/search
function is available to lazily return results, without the temporal filtering or other constraints of using Lucene via the q
API.
The function accepts 3 parameters (node
, query
and opts
) and returns an iterable cursor of results that must be closed.
The query
parameter can be either a Lucene query string or an org.apache.lucene.search.Query
object.
The opts
parameter accepts a map with a single :default-field
entry.
The value of this entry will be supplied to the Lucene QueryParser
in the cases where the supplied query
parameter is a Lucene query string.
(with-open [search-results (crux.lucene/search node "Ivan")]
(into [] (iterator-seq search-results)))
Each item returned will be a vector of org.apache.lucene.document.Document
and a Double representing the matched score.
See the extension tests for examples of decoding the contents of the result document and performing userspace temporal filtering.
Custom Indexer
It is possible to implementing a custom indexer based on the crux.lucene/LuceneIndexer
protocol, which will be necessary to address complex requirements.
See the extension tests for examples.
Custom Analyzer
Lucene provides a huge amount of capability beyond the default StandardAnalyzer
.
See the extension tests for examples.
Multiple Lucene modules
The built-in search functions all accept an additional opts map parameter as the last argument.
This can be included in your Datalog query as a literal or passed in using a logic variable.
The value under :lucene-store-k
in this map can be set to specify that a search function should be run against a particular module (i.e. a specific Lucene secondary index, if many are configured), otherwise the search function will attempt to execute against the default :crux.lucene/lucene-store
module.
See the extension tests for an example of configuring multiple Lucene modules to run on the same node.
Checkpointing Lucene
For more details about checkpointing in Crux, see the main Checkpointing docs.
You can set up checkpointing on your Lucene store too, in addition to the main Crux query indices. This means that a new node starting up will be able to download a checkpoint of a reasonably recent Lucene store from a central location rather than having to replay all of the transactions.
The parameters are the same as for the main Crux query indices, except applied to your Lucene store component:
{
"crux.lucene/index-store": {
"db-dir": "lucene-dir",
"checkpointer": {
"crux/module": "crux.checkpoint/->checkpointer",
"store": {
"crux/module": "crux.checkpoint/->filesystem-checkpoint-store",
"path": "/path/to/cp-store"
},
"approx-frequency": "PT6H"
}
},
...
}
{:crux.lucene/lucene-store {:db-dir "lucene-dir"
:checkpointer {:crux/module 'crux.checkpoint/->checkpointer
:store {:crux/module 'crux.checkpoint/->filesystem-checkpoint-store
:path "/path/to/cp-store"}
:approx-frequency (Duration/ofHours 6)}}}
...}
{:crux.lucene/lucene-store {:db-dir "lucene-dir"
:checkpointer {:crux/module crux.checkpoint/->checkpointer
:store {:crux/module crux.checkpoint/->filesystem-checkpoint-store
:path "/path/to/cp-store"}
:approx-frequency "PT6H"}}}
...}