
django-native-search implements a basic full-text search engine for Django models without additional dependencies.


kmierzeje/django-native-search

django-native-search


django-native-search implements a basic full-text search engine for Django models.

The engine uses the Django ORM to manage its index, so no additional search backend is needed. Create a model for the index, run makemigrations and migrate, and you are ready to feed it with data and search.

Installation

Install the package from PyPI:

pip install django-native-search

The package will be installed with all its dependencies including django-expression-index.

Setup

Setting up search in a basic configuration is quite simple.

1. Register the app

Add django_native_search to INSTALLED_APPS in your settings:

INSTALLED_APPS = [
    ...
    'django_native_search.apps.DjangoNativeSearch',
    ...
]

2. Define your Index Model

In the models.py of a new or existing app, define an index model. This example creates a simple index for the books.Book model:

from django.db import models
from django_native_search.models import IndexEntry

class BookIndexEntry(IndexEntry):
    object = models.OneToOneField('books.Book', on_delete=models.CASCADE)
    search_template = 'search_index/book.txt'

The object field defines the relation to the model being indexed. The engine renders search_template with the indexed instance available as the object variable in the template context. By default the rendered text is tokenized with re.findall(r'[^\s"]+', text). You can change this behavior by overriding the tokenize class method in your index model. All extracted tokens are stored in the index entry of the respective model instance.

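import re

class BookIndexEntry(IndexEntry):
    ...
    @classmethod
    def tokenize(cls, text):
        # Non-capturing group, so re.findall returns whole tokens
        # rather than only the contents of the group.
        return re.findall(r"[^\W_]+(?:['_]?[^\W_]+)*", text)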
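To get a feel for the difference between the default tokenizer and a custom one, here is a minimal standalone sketch that runs both patterns on a sample string outside Django. The sample text and the helper names are made up for illustration; only the two regular expressions are taken from this section. The word pattern uses a non-capturing group `(?:...)`, because with a capturing group `re.findall` returns the contents of the group instead of the whole token.

```python
import re

# Default pattern: runs of characters that are neither whitespace nor double quotes.
DEFAULT_PATTERN = r'[^\s"]+'
# Custom pattern: alphanumeric words, optionally joined by a single apostrophe or underscore.
WORD_PATTERN = r"[^\W_]+(?:['_]?[^\W_]+)*"


def default_tokens(text):
    """Tokenize the way the default tokenizer is described (hypothetical helper)."""
    return re.findall(DEFAULT_PATTERN, text)


def word_tokens(text):
    """Tokenize with the custom word pattern (hypothetical helper)."""
    return re.findall(WORD_PATTERN, text)


sample = 'Monty Python\'s "Flying Circus", 1969.'
print(default_tokens(sample))  # punctuation stays in the tokens
print(word_tokens(sample))     # punctuation is dropped
```

With the default pattern the comma and the trailing dot end up in tokens, so a query for 1969 would only match as a substring of "1969."; the word pattern avoids that at the cost of discarding punctuation.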

Index for multiple models

It is also possible to create an index for multiple models by using model inheritance. Create a single concrete descendant of IndexEntry as the root of the index, then derive one model from it for each indexed model.

You can add common fields to the root model to be used for filtering the entries, but do not add an object field to it. Each of the derived classes should have an object field, which points to the model to be indexed, and a search_template.

I recommend adding some extra fields to your root index model, so that you can filter entries of any kind or display the results without an additional query for the descendant models. You can fill these fields with data by overriding the save method in your index model.

It should also be possible to use GenericForeignKey to define the object field, but I haven't tried it.

Multiple indexes

Each direct descendant of IndexEntry is a separate index, so you can have multiple independent indexes in your site.

3. Prepare the database

Run the well-known commands:

manage.py makemigrations
manage.py migrate

The index was tested with SQLite and PostgreSQL.

Usage

Usually you use your index to do full-text search within your data. Just remember to fill it with data first.

Feeding the index with data

The only thing you need to do is create an instance of your IndexEntry descendant model and save it.

from book_index.models import BookIndexEntry
from books.models import Book

for book in Book.objects.all():
    BookIndexEntry(object=book).save()

There is a convenient shortcut for indexing querysets:

from book_index.models import BookIndexEntry
BookIndexEntry.objects.rebuild()

You can override the get_index_queryset method in your class to apply select_related, filter, or anything else you need before the queryset is passed on for indexing.

You can call the rebuild method on your index model root class manager, to rebuild all descendant index models.

You may want to create your own management command to run the indexing, but in practice you will rarely need it, because the index can be kept current at runtime.

Runtime index updates

The indexing should be fast enough to run at runtime on every save of the indexed model. Just connect a handler to the post_save signal:

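from django.db.models.signals import post_save

from books.models import Book
from django_native_search.models import IndexEntry

class BookIndexEntry(IndexEntry):
    ...
    @classmethod
    def update_index(cls, instance, **kwargs):
        # Re-index the instance that was just saved.
        cls.objects.refresh([instance])

post_save.connect(BookIndexEntry.update_index, sender=Book)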

Now your index will always be up to date.

Searching

You can search the index by calling the manager's search method. The query is tokenized using the same tokenize method as when indexing. All tokens must be found in a document to consider it matched:

qs = BookIndexEntry.objects.search('Monty Python')

This will return a QuerySet of BookIndexEntry objects which contain both "Monty" and "Python", matched case-sensitively. If you want your search to be case-insensitive, provide the query in lowercase:

qs = BookIndexEntry.objects.search('circus')
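The rule that all tokens must be found in a document can be pictured with a small standalone sketch. This is a toy model in plain Python, not the engine's ORM-based implementation: the function names are hypothetical, it handles whole-word matching for multi-keyword queries only, and it does not model case handling or substring search.

```python
import re


def tokenize(text):
    # The default tokenizer pattern quoted earlier in this document.
    return re.findall(r'[^\s"]+', text)


def matches_all(query, document):
    """Return True if every query token occurs as a whole token in the document."""
    document_tokens = set(tokenize(document))
    return all(token in document_tokens for token in tokenize(query))


title = "Monty Python and the Holy Grail"
print(matches_all("Monty Python", title))  # both tokens present
print(matches_all("Holy Monty", title))    # order does not matter
print(matches_all("Monty Circus", title))  # "Circus" is missing
```

The first two calls return True and the last one returns False, because a single missing token is enough to reject the document.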

You can filter the search results, just like any other QuerySet:

qs = BookIndexEntry.objects.search('circus').filter(object__release_date__year__gt=1970)

By default, search returns matches only for whole words. If the query consists of a single keyword, however, the engine does a substring search, so the results may contain documents with words that match the keyword or merely contain it.

For example searching for "yth" may return documents containing "python", "pythonic", "myth", "demythologization".

Substring search works fine in SQLite. In PostgreSQL there is a problem with using the database index, so searching might be too slow.
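As a rough illustration of single-keyword substring search, the sketch below looks up a keyword in a small in-memory vocabulary. It is a toy model, not the engine's query: the function name and the vocabulary are made up, and the fallback to an exact match for keywords that are too short is an assumption. The minimum length mirrors the documented default of SEARCH_MIN_SUBSTR_LENGTH.

```python
MIN_SUBSTR_LENGTH = 2  # mirrors the documented default of SEARCH_MIN_SUBSTR_LENGTH


def substring_candidates(keyword, vocabulary, min_length=MIN_SUBSTR_LENGTH):
    """Return the indexed words that contain the keyword, sorted alphabetically.

    Assumption: keywords shorter than min_length fall back to exact matching.
    """
    if len(keyword) < min_length:
        return sorted(word for word in vocabulary if word == keyword)
    return sorted(word for word in vocabulary if keyword in word)


vocabulary = {"python", "pythonic", "myth", "demythologization", "circus"}
print(substring_candidates("yth", vocabulary))
print(substring_candidates("y", vocabulary))  # too short for a substring search
```

The first call lists "demythologization", "myth", "python" and "pythonic"; the second returns an empty list, because "y" is shorter than the minimum and is not itself an indexed word.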

Putting multiple words inside quotes forces a search for these words appearing next to each other, in the given order:

qs = BookIndexEntry.objects.search('"Monty Python\'s Flying Circus"')

This will return a QuerySet of BookIndexEntry objects which contain the word "Monty" followed by "Python's", followed by "Flying", followed by "Circus".
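The quoted-phrase rule amounts to looking for the phrase tokens as a consecutive run inside the document tokens. The sketch below shows that idea in plain Python; it is a toy model with hypothetical function names and says nothing about how the engine implements the lookup in the database.

```python
import re


def tokenize(text):
    # The default tokenizer pattern quoted earlier in this document.
    return re.findall(r'[^\s"]+', text)


def contains_phrase(phrase, document):
    """Return True if the phrase tokens occur consecutively, in order, in the document."""
    needle = tokenize(phrase)
    haystack = tokenize(document)
    if not needle:
        return False
    length = len(needle)
    return any(
        haystack[start:start + length] == needle
        for start in range(len(haystack) - length + 1)
    )


document = "Tonight: Monty Python's Flying Circus on BBC"
print(contains_phrase("Monty Python's Flying Circus", document))  # consecutive run
print(contains_phrase("Python's Circus", document))               # not adjacent
print(contains_phrase("Flying Monty", document))                  # wrong order
```

Only the first call returns True; the other two phrases contain words from the document, but not as an adjacent, ordered run.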

Search form

SearchFormMixin is available to help you create your search view easily:

from django.views.generic.base import TemplateView
from book_index.models import BookIndexEntry
from django_native_search.forms import SearchFormMixin, searchform_factory

class SearchView(SearchFormMixin, TemplateView):
    template_name = "books_index/search.html"
    form_class = searchform_factory(BookIndexEntry)

The searchform_factory function will use all fields with db_index=True in BookIndexEntry to create a MultipleChoiceField for each of them in your form. These fields can be used to filter the results. Each filtering field in your form will contain all possible values of the field in the database.

Search template

The template referred to by template_name is rendered with form, containing the form instance, and results, containing the queryset of search results if the form is valid.

{% block content %}
    <h2>Search</h2>
    <form method="get" action=".">
        <table>
            {{ form }}
            <tr>
                <td>&nbsp;</td>
                <td>
                    <input type="submit" value="Search" class="btn"/>
                </td>
            </tr>
        </table>
    </form>
    {% if form.is_valid %}
        <br/>
        <h3>Found {{ results.count }} results</h3>
        <ul>
            {% for result in results %}
                <li class="search-result">
                    <ul>
                        <li class="result-link">
                            <a href="{{result.object.get_absolute_url}}">{{ result.object.title }}</a>
                        </li>
                        <li class="result-excerpt">
                            {{result.excerpt}}
                        </li>
                    </ul>
                </li>
            {% endfor %}
        </ul>
    {% endif %}
{% endblock %}

The excerpt member of an index entry instance returns a fragment of the indexed document with occurrences of the search keywords highlighted with <em>.
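To picture what such an excerpt might look like, here is a minimal standalone sketch that cuts a window of tokens around each keyword hit and wraps the hits in <em>. It is not the library's implementation. In particular, it assumes that the start and end offsets count tokens relative to the matched keyword, which this document does not state; the function and parameter names are hypothetical, with defaults borrowed from the excerpt settings listed under Settings.

```python
import re
from html import escape


def excerpt(text, keywords, start_offset=-2, end_offset=5, max_fragments=5):
    """Build a highlighted excerpt (assumption: offsets are token counts around a hit)."""
    tokens = re.findall(r'[^\s"]+', text)
    fragments = []
    covered_until = -1  # index of the last token already included in a fragment
    for position, token in enumerate(tokens):
        if token not in keywords or position <= covered_until:
            continue
        if len(fragments) == max_fragments:
            break
        start = max(0, position + start_offset)
        end = min(len(tokens), position + end_offset + 1)
        fragments.append(" ".join(
            f"<em>{escape(word)}</em>" if word in keywords else escape(word)
            for word in tokens[start:end]
        ))
        covered_until = end - 1
    return " ... ".join(fragments)


text = "one two three spam four five six seven eight nine ten eleven spam twelve"
print(excerpt(text, {"spam"}, start_offset=-1, end_offset=1))
```

With one token of context on each side, the call prints two fragments joined by " ... ", each with "spam" wrapped in <em>. Hits that already fall inside a previous fragment do not start a new one.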

Settings

There are several settings to tweak the search engine.

SEARCH_MIN_SUBSTR_LENGTH

Default : 2

Minimum number of characters in keyword to run substring search.

SEARCH_MAX_SUBTSTR_COUNT_IN_QUERY

Default : 300

Maximum number of indexed words containing the substring to run substring search.

SEARCH_MAX_EXCERPT_FRAGMENTS

Default : 5

Maximum number of fragments containing keywords to be returned in excerpt.

SEARCH_EXCERPT_FRAGMENT_START_OFFSET

Default : -2

Offset of excerpt fragment start.

SEARCH_EXCERPT_FRAGMENT_END_OFFSET

Default : 5

Offset of excerpt fragment end.

SEARCH_MAX_RANKING_KEYWORDS_COUNT

Default : 3

Maximum number of keywords to be used for ranking the results. If the query contains more keywords, only the first ones will be used to calculate the ranking of results.
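For reference, the fragment below collects the settings listed above with their documented defaults. It assumes they are ordinary Django settings defined in your project's settings.py, which this document implies but does not state explicitly. The names are copied verbatim from the headings above, and you would only set the ones you want to change.

```python
# settings.py (fragment): the values shown are the documented defaults.
SEARCH_MIN_SUBSTR_LENGTH = 2               # shortest keyword that triggers substring search
SEARCH_MAX_SUBTSTR_COUNT_IN_QUERY = 300    # name spelled exactly as in the heading above
SEARCH_MAX_EXCERPT_FRAGMENTS = 5           # fragments returned in an excerpt
SEARCH_EXCERPT_FRAGMENT_START_OFFSET = -2  # offset of excerpt fragment start
SEARCH_EXCERPT_FRAGMENT_END_OFFSET = 5     # offset of excerpt fragment end
SEARCH_MAX_RANKING_KEYWORDS_COUNT = 3      # keywords used for ranking the results
```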

Search API

To be described...

Look into the code to check what you can do with it.

Performance

Despite the naive design, the index performs surprisingly well, even with quite large datasets. It can search through 100k documents containing 10M words in a fraction of a second.
