Commit

update documentation
leonardbinet committed Mar 7, 2020
1 parent 8d84c6f commit da46969
Showing 9 changed files with 238 additions and 55 deletions.
1 change: 1 addition & 0 deletions docs/source/IMDB.md
10 changes: 8 additions & 2 deletions docs/source/advanced-usage.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,11 @@

##############
Advanced usage
==============
##############

TODO
.. note::

This is a work in progress. Some sections still need to be completed.

* node and tree deserialization order
* compound query insertion
5 changes: 0 additions & 5 deletions docs/source/imdb.rst

This file was deleted.

4 changes: 2 additions & 2 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -11,10 +11,10 @@ pandagg
:hidden:
:maxdepth: 4

Introduction <self>
introduction
user-guide
advanced-usage
Usage example <imdb>
Tutorial dataset <IMDB>
API reference <reference/pandagg>
Contributing <CONTRIBUTING>

47 changes: 47 additions & 0 deletions docs/source/introduction.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
##########
Principles
##########

.. note::

This is a work in progress. Some sections still need to be completed.


**pandagg** is designed both for "regular" code repository usage and for "interactive" usage (ipython or jupyter
notebook usage, with autocompletion features inspired by `pandas <https://github.com/pandas-dev/pandas>`_ design).

This library focuses on two principles:

* stick to the **tree** structure of Elasticsearch objects
* provide simple and flexible interfaces that make interactive use easy and intuitive


*****************************
Elasticsearch tree structures
*****************************

Many Elasticsearch objects have a **tree** structure, i.e. they are built from a hierarchy of **nodes**:

* a `mapping <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html>`_ (tree) is a hierarchy of `fields <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html>`_ (nodes)
* a `query <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html>`_ (tree) is a hierarchy of query clauses (nodes)
* an `aggregation <https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html>`_ (tree) is a hierarchy of aggregation clauses (nodes)
* an aggregation response (tree) is a hierarchy of response buckets (nodes)

This library aims to stick to that structure by providing a flexible syntax distinguishing **trees** and **nodes**.
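
To make the tree/node distinction concrete, here is a minimal sketch in plain Python that walks a raw mapping dict and yields one entry per field (node). It is not pandagg's implementation: the helper name ``iter_fields`` and the small ``example_mapping`` are hypothetical and only serve as an illustration.

```python
def iter_fields(mapping, depth=0):
    """Yield (depth, name, type) for each field node of a raw mapping dict."""
    # Children live under 'properties' (object/nested fields)
    # or under 'fields' (multi-fields such as a keyword variant).
    for key in ("properties", "fields"):
        for name, body in sorted(mapping.get(key, {}).items()):
            # A field that declares no explicit type is treated as an object here.
            yield depth, name, body.get("type", "object")
            yield from iter_fields(body, depth + 1)


# Hypothetical, shortened mapping used only for this illustration.
example_mapping = {
    "properties": {
        "name": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
        "roles": {
            "type": "nested",
            "properties": {"gender": {"type": "keyword"}},
        },
    }
}

for depth, name, field_type in iter_fields(example_mapping):
    print("    " * depth + f"{name} ({field_type})")
```

The whole dict plays the role of the tree, while each ``(depth, name, type)`` entry corresponds to one node.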

*****************
Interactive usage
*****************

Some classes are intended only for interactive use (ipython or jupyter notebook): their purpose is to provide
auto-completion features and convenient representations, and they serve no purpose outside of an interactive session.

Namely:

* `pandagg.mapping.IMapping`: used to interactively navigate a mapping and run quick aggregations on some fields
* `pandagg.client.Elasticsearch`: used to discover cluster indices, and optionally navigate their mappings or run quick aggregations or queries
* `pandagg.agg.AggResponse`: used to interactively navigate an aggregation response

These use cases will be detailed in the following sections.
1 change: 1 addition & 0 deletions docs/source/ressources
207 changes: 168 additions & 39 deletions docs/source/user-guide.rst
Original file line number Diff line number Diff line change
@@ -2,40 +2,12 @@
User Guide
##########

.. toctree::


.. note::

Examples will be based on :doc:`IMDB` data.
This is a work in progress. Some sections still need to be completed.

**********
Philosophy
**********

**pandagg** is designed both for "regular" code repository usage and for "interactive" usage (ipython or jupyter
notebook usage, with autocompletion features inspired by `pandas <https://github.com/pandas-dev/pandas>`_ design).

This library focuses on two principles:

* stick to the **tree** structure of Elasticsearch objects
* provide simple and flexible interfaces that make interactive use easy and intuitive

Elasticsearch tree structures
=============================

Many Elasticsearch objects have a **tree** structure, i.e. they are built from a hierarchy of **nodes**:

* a `mapping <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html>`_ (tree) is a hierarchy of `fields <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html>`_ (nodes)
* a `query <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html>`_ (tree) is a hierarchy of query clauses (nodes)
* an `aggregation <https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html>`_ (tree) is a hierarchy of aggregation clauses (nodes)
* an aggregation response (tree) is a hierarchy of response buckets (nodes)

This library aims to stick to that structure by providing a flexible syntax distinguishing **trees** and **nodes**.

Interactive usage
=================


*****
Query
@@ -144,16 +116,6 @@ Eventually, you can also use regular Elasticsearch dict syntax:
└── terms, field=genres, values=['Action', 'Thriller']


*******
Mapping
*******

Mapping declaration
===================

Mapping navigation
==================

***********
Aggregation
***********
@@ -166,8 +128,175 @@ Aggregation response

TODO

*******
Mapping
*******

Here is a portion of :doc:`IMDB` example mapping:

>>> imdb_mapping = {
...     'dynamic': False,
...     'properties': {
...         'movie_id': {'type': 'integer'},
...         'name': {
...             'type': 'text',
...             'fields': {
...                 'raw': {'type': 'keyword'}
...             }
...         },
...         'year': {
...             'type': 'date',
...             'format': 'yyyy'
...         },
...         'rank': {'type': 'float'},
...         'genres': {'type': 'keyword'},
...         'roles': {
...             'type': 'nested',
...             'properties': {
...                 'role': {'type': 'keyword'},
...                 'actor_id': {'type': 'integer'},
...                 'gender': {'type': 'keyword'},
...                 'first_name': {
...                     'type': 'text',
...                     'fields': {
...                         'raw': {'type': 'keyword'}
...                     }
...                 },
...                 'last_name': {
...                     'type': 'text',
...                     'fields': {
...                         'raw': {'type': 'keyword'}
...                     }
...                 }
...             }
...         }
...     }
... }

Mapping DSL
===========

The :class:`~pandagg.tree.mapping.Mapping` class provides a more compact view, which can help when dealing with large mappings:

>>> from pandagg.mapping import Mapping
>>> m = Mapping(imdb_mapping)
>>> m
<Mapping>
{Object}
├── genres Keyword
├── movie_id Integer
├── name Text
│ └── raw ~ Keyword
├── rank Float
├── roles [Nested]
│ ├── actor_id Integer
│ ├── first_name Text
│ │ └── raw ~ Keyword
│ ├── gender Keyword
│ ├── last_name Text
│ │ └── raw ~ Keyword
│ └── role Keyword
└── year Date


With pandagg DSL, an equivalent declaration would be the following:

>>> from pandagg.mapping import Mapping, Object, Nested, Float, Keyword, Date, Integer, Text
>>>
>>> dsl_mapping = Mapping(properties=[
...     Integer('movie_id'),
...     Text('name', fields=[
...         Keyword('raw')
...     ]),
...     Date('year', format='yyyy'),
...     Float('rank'),
...     Keyword('genres'),
...     Nested('roles', properties=[
...         Keyword('role'),
...         Integer('actor_id'),
...         Keyword('gender'),
...         Text('first_name', fields=[
...             Keyword('raw')
...         ]),
...         Text('last_name', fields=[
...             Keyword('raw')
...         ])
...     ])
... ])

This declaration is exactly equivalent to the initial mapping:

>>> dsl_mapping.serialize() == imdb_mapping
True


Interactive mapping
===================

In an interactive context, the :class:`~pandagg.interactive.mapping.IMapping` class provides navigation features with
autocompletion to quickly explore a large mapping:

>>> from pandagg.mapping import IMapping
>>> m = IMapping(imdb_mapping)
>>> m.roles
<IMapping subpart: roles>
roles [Nested]
├── actor_id Integer
├── first_name Text
│ └── raw ~ Keyword
├── gender Keyword
├── last_name Text
│ └── raw ~ Keyword
└── role Keyword
>>> m.roles.first_name
<IMapping subpart: roles.first_name>
first_name Text
└── raw ~ Keyword

To get the complete field definition, just call it:

>>> m.roles.first_name()
<Mapping Field first_name> of type text:
{
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
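
For comparison, the same lookup can be done on the raw mapping dict without any interactive helper. The sketch below is plain Python, not pandagg's implementation; the helper name ``get_field`` and the shortened ``raw_mapping`` are hypothetical.

```python
def get_field(mapping, path):
    """Return the raw definition of the field at a dotted path such as 'roles.first_name'."""
    node = mapping
    for name in path.split("."):
        # Children are stored under 'properties' (object/nested fields)
        # or under 'fields' (multi-fields such as a keyword 'raw' variant).
        children = {**node.get("properties", {}), **node.get("fields", {})}
        if name not in children:
            raise KeyError(f"no field {name!r} in path {path!r}")
        node = children[name]
    return node


# Hypothetical, shortened version of the mapping used in this guide.
raw_mapping = {
    "properties": {
        "roles": {
            "type": "nested",
            "properties": {
                "first_name": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},
                }
            },
        }
    }
}

print(get_field(raw_mapping, "roles.first_name"))
```

Attribute access with autocompletion replaces the dotted string in the interactive case, which is what makes exploring a large mapping quicker.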

An **IMapping** instance can be bound to an Elasticsearch client to get quick access to aggregation computation on mapping fields.

Suppose you have the following client:

>>> from elasticsearch import Elasticsearch
>>> client = Elasticsearch(hosts=['localhost:9200'])

The client can be bound either at instantiation:

>>> m = IMapping(imdb_mapping, client=client, index_name='movies')

or afterwards through the `bind` method:

>>> m = IMapping(imdb_mapping)
>>> m.bind(client=client, index_name='movies')

Doing so generates an **a** attribute on mapping fields; this attribute lists all aggregations available for that
field type (with autocompletion):

>>> m.roles.gender.a.terms()
[('M', {'key': 'M', 'doc_count': 2296792}),
('F', {'key': 'F', 'doc_count': 1135174})]


.. note::

Nested clauses will be automatically taken into account.
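
As an illustration of what the note above means, a terms aggregation on a field located under a ``nested`` field has to be wrapped in a ``nested`` aggregation in the Elasticsearch request. The sketch below builds such a request body by hand. It is an assumption about the general shape of the request, not a dump of what pandagg actually sends; the helper name ``nested_terms_body`` and the aggregation names are hypothetical.

```python
import json


def nested_terms_body(path, field, size=10):
    """Build a search body running a terms aggregation on a field under a nested path."""
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "aggs": {
            f"nested_{path}": {
                # The nested aggregation switches context to the nested documents.
                "nested": {"path": path},
                "aggs": {
                    f"{field}_terms": {
                        "terms": {"field": f"{path}.{field}", "size": size}
                    }
                },
            }
        },
    }


body = nested_terms_body("roles", "gender")
print(json.dumps(body, indent=2))
```

With the ``elasticsearch`` Python client shown earlier, such a body could be sent with ``client.search(index='movies', body=body)``, assuming a client version that still accepts the ``body`` argument.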


*************************
Cluster indices discovery
*************************

TODO

16 changes: 10 additions & 6 deletions examples/imdb/README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Explore IMDB with ElasticSearch
# IMDB dataset

You might know the Internet Movie Database, commonly called [IMDB](https://www.imdb.com/).

@@ -7,7 +7,7 @@ Well it's a good simple example to showcase ElasticSearch capabilities.
In this case, relational databases (SQL) are a good fit for storing this kind of data consistently.
Yet indexing some of this data in an optimized search engine will allow more powerful queries.

## Goal of exercice
## Query requirements
In this example, we'll assume that most queries revolve around the concept of a movie (rather than around
fetching actors or directors, even though that will still be possible with this data structure).

@@ -21,7 +21,9 @@ The index should provide good performances trying to answer these kind question


## Data source
I exported following SQL tables from MariaDB as described in https://relational.fit.cvut.cz/dataset/IMDb:
I exported the following SQL tables from MariaDB [following these instructions](https://relational.fit.cvut.cz/dataset/IMDb).

The relational schema is the following:

![imdb tables](ressources/imdb_ijs.svg)

@@ -61,6 +63,7 @@ https://www.elastic.co/fr/blog/strings-are-dead-long-live-strings)

#### Final mapping

*TODO -> use [copy_to](https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html) parameter to build **full_name***
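
As a rough sketch of what this TODO could look like, the `copy_to` mapping parameter lets several fields feed a single searchable field. The snippet below shows it on a flat, hypothetical set of person properties. It is not the final mapping, and how it should be combined with the nested `directors` and `roles` fields is left aside here.

```python
import json

# Hypothetical person properties: both name parts are copied into `full_name`.
person_properties = {
    "first_name": {"type": "text", "copy_to": "full_name"},
    "last_name": {"type": "text", "copy_to": "full_name"},
    # Target field: searchable at query time, populated from the two fields above.
    "full_name": {"type": "text"},
}

mapping = {"properties": person_properties}
print(json.dumps(mapping, indent=2))
```

Values copied this way are indexed in `full_name` but are not added to the document `_source`, so the field is meant for searching rather than for display.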
```
.
├── directors [Nested]
@@ -139,6 +142,7 @@ python examples/imdb/load.py


#### Explore pandagg notebooks
```
jupyter notebook
```

An example notebook showcasing some of `pandagg`'s functionality is available [here](https://gistpreview.github.io/?4cedcfe49660cd6757b94ba491abb95a).

The code is in the `examples/imdb/IMDB exploration.py` file.
2 changes: 1 addition & 1 deletion pandagg/interactive/_field_agg_factory.py
Original file line number Diff line number Diff line change
@@ -57,7 +57,7 @@ def _operate(self, agg_node, index, execute, output, query):
for nested in nesteds:
raw_response = raw_response[nested]
result = list(agg_node.extract_buckets(raw_response[agg_node.name]))
if output is None:
if output == 'raw':
return result
elif output == 'dataframe':
try:
