Lucene
Full-text search module for Crux making use of Apache Lucene.
crux-lucene
runs in-process as part of the Crux node as a Crux
module. The Lucene index is kept up to date synchronously when Crux
transactions are processed on a node.
This module is in alpha and likely to change. In particular, we might rationalize/combine the different query functions available (see https://github.com/juxt/crux/issues/1318). |
Setup
First, add the crux-lucene
dependency to your project:
pro.juxt.crux/crux-lucene {:mvn/version "1.17.1"}
<dependency>
<groupId>pro.juxt.crux</groupId>
<artifactId>crux-lucene</artifactId>
<version>1.17.1</version>
</dependency>
Add the following to your node configuration:
{
"crux.lucene/lucene-store": {
"db-dir": "lucene",
}
}
{...
:crux.lucene/lucene-store {:db-dir "lucene-dir"}}
{...
:crux.lucene/lucene-store {:db-dir "lucene-dir"}}
You must have a fresh node to add Lucene configuration to. If
you add Lucene configuration to a populated Crux node, you will
receive an exception when the node starts up: Lucene store latest tx
mismatch . To remedy this, you must wipe the index directories for the
Crux node and restart. The Lucene index will then be populated as part
of the normal Crux node ingestion process.
|
Querying
All text fields in a document will be automatically indexed. You can
then you use the in-built text-search
fn in your datalog:
{:find '[?e]
:where '[[(text-search :name "Ivan") [[?e]]]
[?e :crux.db/id]]}
The destructuring available is entity-id
, matched-value
and
score
. For example, to return the full search results tuple:
{:find '[?e ?v ?s]
:where '[[(text-search :name "Ivan") [[?e ?v ?s]]]
[?e :crux.db/id]]}
In the above example, ?e
is the entity ID of the matched search
rsult. ?v
is the matched value and ?s
is the matched score.
You can use standard Lucene fuzzy textual search capabilities:
{:find '[?e]
:where '[[(text-search :name "Iva*") [[?e]]]
[?e :crux.db/id]]}
This will return all entities with a :name
attribute that starts with "Iva".
All query functions implemented in crux-lucene
pass your query string directly
to Lucene’s QueryParser.parse
using the StandardAnalyzer
, without any
escaping or other modifications. See the Lucene documentation for more
information.
It’s possible to supply var bindings to use in text-search
:
(c/q db '{:find [?v]
:in [input]
:where [[(text-search :name input) [[?e ?v]]]]}
"Ivan")
Wildcard Attributes
There is an an experimental wildcard search function, where you can search across all attributes:
{:find '[?e ?v ?a ?s]
:where '[[(wildcard-text-search "Iva*") [[?e ?v ?a ?s]]]
[?e :crux.db/id]]}
Will return all entities that have an attribute with a value that
matches "Iva". The destructured binding also contains a
which is the
matched attribute.
Multi-field searches
There is an entirely different search-function available for
multi-field searches using Lucene in Crux: lucene-text-search
:
{:find '[?e]
:where '[[(lucene-text-search "firstname:James OR surname:preston") [[?e]]]]}
This lucene-text-search
takes a Lucene query string.
If you use lucene-text-search , you cannot use the search
functions listed above - wildcard-text-search and
text-search . This is because the way we index documents into Lucene
is different.
|
In the normal case for text-search and wildcard-text-search ,
we index each A/V pair in a Crux document as individual documents in
Lucene. This allows for a large degree of structural sharing, which
will help in the case where there is a lot historical data in
Crux. This is targeted to ease the disk-space taken up by Lucene, but
also for query efficiency reasons.
|
lucene-text-search indexs a single document per
document-version in Crux. The downside of this is structural sharing
impacting disk space, but the upside is taking advantage of more of
the Lucene query language capability, and to perform queries taking
into account multiple fields.
|
To enable lucene-text-search
, you must configure the Lucene Indexer, such like:
{...
:crux.lucene/lucene-store {:db-dir "lucene-dir" :indexer 'crux.lucene.multi-field/->indexer}}
Bindings
It’s possible to supply var bindings also, that are wired in using
format
when the vars are bound.
{:find [?e]
:in [?surname ?firstname]
:where [[(lucene-text-search "surname: %s AND firstname: %s" ?surname ?firstname) [[?e]]]]}
String Escaping
You can escape your input strings when constructing Lucene query strings by calling org.apache.lucene.queryparser.classic.QueryParser/escape
. For example, this method would transform "|&hello&|"
to "\\|\\&hello\\&\\|"
.
This is helpful to mitigate against injection attacks and other errors.