Skip to content

Latest commit

 

History

History
283 lines (175 loc) · 8.58 KB

README.rdoc

File metadata and controls

283 lines (175 loc) · 8.58 KB

Lucene.rb

Lucene.rb is JRuby wrapper for the Lucene document database.

Status

This library was once included in the neo4j.rb gem (version <= 0.4.6). The neo4j wrapper has now replaced this library with the included Java neo4j-lucene component. The main reason for this split was that neo4j now uses two phase commit to keep the databases in sync.

TODO

  • Upgrade to newer lucene 3.x

  • Check the thread synchronization and locking - is this needed ?

Installation

Install JRuby

The easiest way to install JRuby is by using RVM, see rvm.beginrescueend.com. Otherwise check: kenai.com/projects/jruby/pages/GettingStarted#Installing_JRuby

Lucene.rb

Lucene provides:

  • Flexible Queries - Phrases, Wildcards, Compound boolean expressions etc…

  • Field-specific Queries eg. title, artist, album

  • Sorting

  • Ranked Searching

The Lucene index will be updated after the transaction commits. It is not possible to query for something that has been created inside the same transaction as where the query is performed.

Lucene Document

In Lucene everything is a Document. A document can represent anything textual: A Word Document, a DVD (the textual metadata only), or a Neo4j.rb node. A document is like a record or row in a relationship database.

The following example shows how a document can be created by using the ”<<” operator on the Lucene::Index class and found using the Lucene::Index#find method.

Example of how to write a document and find it:

require 'lucene'

include Lucene

# the var/myindex parameter is either a path where to store the index or
# just a key if index is kept in memory (see below)
index = Index.new('var/myindex')

# add one document (a document is like a record or row in a relationship database)
index << {:id=>'1', :name=>'foo'}

# write to the index file
index.commit

# find a document with name foo
# hits is a ruby Enumeration of documents
hits = index.find{name == 'foo'}

# show the id of the first document (document 0) found
# (the document contains all stored fields - see below)
hits[0][:id]   # => '1'

Notice that you have to call the commit method in order to update the index (both disk and in memory indexes). Performing several update and delete operations before a commit will give much better performance than committing after each operation.

Keep indexing on disk

By default Neo4j::Lucene keeps indexes in memory. That means that when the application restarts the index will be gone and you have to reindex everything again.

To store indexes on file:

Lucene::Config[:store_on_file] = true
Lucene::Config[:storage_path] => '/home/neo/lucene-db'

When creating a new index the location of the index will be the Lucene::Config + index path Example:

Lucene::Config[:store_on_file] = true
Lucene::Config[:storage_path] => '/home/neo/lucene-db'
index = Index.new('/foo/lucene')

The example above will store the index at /home/neo/lucene-db/foo/lucene

Indexing several values with the same key

Let say a person can have several phone numbers. How do we index that?

index << {:id=>'1', :name=>'adam', :phone => ['987-654', '1234-5678']}

Id field

All Documents must have one id field. If an id is not specified, the default will be: :id of type String. A different id can be specified using the field_infos id_field property on the index:

index = Index.new('some/path/to/the/index')
index.field_infos.id_field = :my_id

To change the type of the my_id from String to a different type see below.

Conversion of types

Lucene.rb can handle type conversion for you. (The Java Lucene library stores all the fields as Strings) For example if you want the id field to be a Fixnum

require 'lucene'
include Lucene

index = Index.new('var/myindex')  # store the index at dir: var/myindex
index.field_infos[:id][:type] = Fixnum

index << {:id=>1, :name=>'foo'} # notice 1 is not a string now

index.commit

# find that document, hits is a ruby Enumeration of documents
hits = index.find(:name => 'foo')

# show the id of the first document (document 0) found
# (the document contains all stored fields - see below)
doc[0][:id]   # => 1

If the field_info type parameter is not set then it has a default value of String.

Storage of fields

By default only the id field will be stored. That means that in the example above the :name field will not be included in the document.

Example

doc = index.find('name' => 'foo')
doc[:id]   # => 1
doc[:name] # => nil

Use the field info :store=true if you want a field to be stored in the index (otherwise it will only be searchable).

Example

require 'lucene'
include Lucene

index = Index.new('var/myindex')  # store the index at dir: var/myindex
index.field_infos[:id][:type] = Fixnum
index.field_infos[:name][:store] = true # store this field

index << {:id=>1, :name=>'foo'} # notice 1 is not a string now

index.commit

# find that document, hits is a ruby Enumeration of documents
hits = index.find('name' => 'foo')

# let say hits only contains one document so we can use doc[0] for that one
# that document contains all stored fields (see below)
doc[0][:id]   # => 1
doc[0][:name] # => 'foo'

Setting field infos

As shown above you can set field infos like this

index.field_infos[:id][:type] = Fixnum

Or you can set several properties like this:

index.field_infos[:id] = {:type => Fixnum, :store => true}

Tokenized

Field infos can be used to specify if the should be tokenized. If this value is not set then the entire content of the field will be considered as a single term.

Example

index.field_infos[:text][:tokenized] = true

If not specified, the default is ‘false’

Analyzer

Field infos can also be used to set which analyzer should be used. If none is specified, the default analyzer - org.apache.lucene.analysis.standard.StandardAnalyzer (:standard) will be used.

index.field_infos[:code][:tokenized] = false
index.field_infos[:code][:analyzer] = :standard

The following analyzer is supported

* :standard (default) - org.apache.lucene.analysis.standard.StandardAnalyzer
* :keyword - org.apache.lucene.analysis.KeywordAnalyzer
* :simple  - org.apache.lucene.analysis.SimpleAnalyzer
* :whitespace - org.apache.lucene.analysis.WhitespaceAnalyzer
* :stop       - org.apache.lucene.analysis.StopAnalyzer

For more info, check the Lucene documentation, lucene.apache.org/java/docs/

Simple Queries

Lucene.rb support search in several fields: Example:

# finds all document having both name 'foo' and age 42
hits = index.find('name' => 'foo', :age=>42)

Range queries:

# finds all document having both name 'foo' and age between 3 and 30
hits = index.find('name' => 'foo', :age=>3..30)

Lucene Queries

If the query is string then the string is a Lucene query.

hits = index.find('name:foo')

For more information see: lucene.apache.org/java/2_4_0/queryparsersyntax.html

Advanced Queries (DSL)

The queries above can also be written in a lucene.rb DSL:

hits = index.find { (name == 'andreas') & (foo == 'bar')}

Expression with OR (|) is supported, example

# find all documents with name 'andreas' or age between 30 and 40
 hits = index.find { (name == 'andreas') | (age == 30..40)}

Sorting

Sorting is specified by the ‘sort_by’ parameter Example:

hits = index.find(:name => 'foo', :sort_by=>:category)

To sort by several fields:

hits = index.find(:name => 'foo', :sort_by=>[:category, :country])

Example sort order:

hits = index.find(:name => 'foo', :sort_by=>[Desc[:category, :country], Asc[:city]])

Thread-safety

The Lucene::Index is thread safe. It guarantees that an index is not updated from two threads at the same time.

Lucene Transactions

Use the Lucene::Transaction in order to do atomic commits. By using a transaction you do not need to call the Index.commit method.

Example:

Transaction.run do |t|
  index = Index.new('var/index/foo')
  index << { id=>42, :name=>'andreas'}
  t.failure  # rollback
end

result = index.find('name' => 'andreas')
result.size.should == 0

You can find uncommitted documents with the uncommitted index property.

Example:

index = Index.new('var/index/foo')
index.uncommited #=> [document1, document2]

Notice that even if it looks like a new Index instance object was created the index.uncommitted may return a non-empty array. This is because Index.new is a singleton - a new instance object is not created.