update documentation

alkemics · Mar 7, 2020 · da46969
1 parent 8d84c6f

Showing 9 changed files with 238 additions and 55 deletions.
diff --git a/docs/source/IMDB.md b/docs/source/IMDB.md
@@ -0,0 +1 @@
+../../examples/imdb/README.md
diff --git a/docs/source/advanced-usage.rst b/docs/source/advanced-usage.rst
@@ -1,5 +1,11 @@
 
+##############
 Advanced usage
-==============
+##############
 
-TODO
+.. note::
+
+    This is a work in progress. Some sections still need to be furnished.
+
+    * node and tree deserialization order
+    * compound query insertion
diff --git a/docs/source/imdb.rst b/docs/source/imdb.rst
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -11,10 +11,10 @@ pandagg
    :hidden:
    :maxdepth: 4
 
-   Introduction <self>
+   introduction
    user-guide
    advanced-usage
-   Usage example <imdb>
+   Tutorial dataset <IMDB>
    API reference <reference/pandagg>
    Contributing <CONTRIBUTING>
 

diff --git a/docs/source/introduction.rst b/docs/source/introduction.rst
@@ -0,0 +1,47 @@
+##########
+Principles
+##########
+
+.. note::
+
+    This is a work in progress. Some sections still need to be furnished.
+
+
+**pandagg** is designed both for "regular" code repository usage and for "interactive" usage (ipython or jupyter
+notebook usage with autocompletion features inspired by `pandas <https://github.com/pandas-dev/pandas>`_ design).
+
+This library focuses on two principles:
+
+* stick to the **tree** structure of Elasticsearch objects
+* provide simple and flexible interfaces that make interactive use easy and intuitive
+
+
+*****************************
+Elasticsearch tree structures
+*****************************
+
+Many Elasticsearch objects have a **tree** structure, i.e. they are built from a hierarchy of **nodes**:
+
+* a `mapping <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html>`_ (tree) is a hierarchy of `fields <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html>`_ (nodes)
+* a `query <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html>`_ (tree) is a hierarchy of query clauses (nodes)
+* an `aggregation <https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html>`_ (tree) is a hierarchy of aggregation clauses (nodes)
+* an aggregation response (tree) is a hierarchy of response buckets (nodes)
+
+This library aims to stick to that structure by providing a flexible syntax distinguishing **trees** and **nodes**.
+
+*****************
+Interactive usage
+*****************
+
+Some classes are intended only for interactive mode (ipython), since their purpose is to provide
+auto-completion features and convenient representations.
+
+They are of no use outside interactive sessions.
+
+Namely:
+
+* `pandagg.mapping.IMapping`: used to interactively navigate a mapping and run quick aggregations on some fields
+* `pandagg.client.Elasticsearch`: used to discover cluster indices, and optionally navigate their mappings or run quick aggregations or queries
+* `pandagg.agg.AggResponse`: used to interactively navigate an aggregation response
+
+These use cases are detailed in the following sections.
diff --git a/docs/source/ressources b/docs/source/ressources
@@ -0,0 +1 @@
+../../examples/imdb/ressources
diff --git a/docs/source/user-guide.rst b/docs/source/user-guide.rst
@@ -2,40 +2,12 @@
 User Guide
 ##########
 
-.. toctree::
-
 
 .. note::
 
+    Examples will be based on :doc:`IMDB` data.
     This is a work in progress. Some sections still need to be furnished.
 
-**********
-Philosophy
-**********
-
-**pandagg** is designed for both for "regular" code repository usage, and "interactive" usage (ipython or jupyter
-notebook usage with autocompletion features inspired by `pandas <https://github.com/pandas-dev/pandas>`_ design).
-
-This library focuses on two principles:
-
-* stick to the **tree** structure of Elasticsearch objects
-* provide simple and flexible interfaces to make it easy and intuitive to use in an interactive usage
-
-Elasticsearch tree structures
-=============================
-
-Many Elasticsearch objects have a **tree** structure, ie they are built from a hierarchy of **nodes**:
-
-* a `mapping <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html>`_ (tree) is a hierarchy of `fields <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html>`_ (nodes)
-* a `query <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html>`_ (tree) is a hierarchy of query clauses (nodes)
-* an `aggregation <https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html>`_ (tree) is a hierarchy of aggregation clauses (nodes)
-* an aggregation response (tree) is a hierarchy of response buckets (nodes)
-
-This library aims to stick to that structure by providing a flexible syntax distinguishing **trees** and **nodes**.
-
-Interactive usage
-=================
-
 
 *****
 Query
@@ -144,16 +116,6 @@ Eventually, you can also use regular Elasticsearch dict syntax:
         └── terms, field=genres, values=['Action', 'Thriller']
 
 
-*******
-Mapping
-*******
-
-Mapping declaration
-===================
-
-Mapping navigation
-==================
-
 ***********
 Aggregation
 ***********
@@ -166,8 +128,175 @@ Aggregation response
 
 TODO
 
+*******
+Mapping
+*******
+
+Here is a portion of the :doc:`IMDB` example mapping:
+
+    >>> imdb_mapping = {
+    ...     'dynamic': False,
+    ...     'properties': {
+    ...         'movie_id': {'type': 'integer'},
+    ...         'name': {
+    ...             'type': 'text',
+    ...             'fields': {
+    ...                 'raw': {'type': 'keyword'}
+    ...             }
+    ...         },
+    ...         'year': {
+    ...             'type': 'date',
+    ...             'format': 'yyyy'
+    ...         },
+    ...         'rank': {'type': 'float'},
+    ...         'genres': {'type': 'keyword'},
+    ...         'roles': {
+    ...             'type': 'nested',
+    ...             'properties': {
+    ...                 'role': {'type': 'keyword'},
+    ...                 'actor_id': {'type': 'integer'},
+    ...                 'gender': {'type': 'keyword'},
+    ...                 'first_name':  {
+    ...                     'type': 'text',
+    ...                     'fields': {
+    ...                         'raw': {'type': 'keyword'}
+    ...                     }
+    ...                 },
+    ...                 'last_name':  {
+    ...                     'type': 'text',
+    ...                     'fields': {
+    ...                         'raw': {'type': 'keyword'}
+    ...                     }
+    ...                 }
+    ...             }
+    ...         }
+    ...     }
+    ... }
+
+Mapping DSL
+===========
+
+The :class:`~pandagg.tree.mapping.Mapping` class provides a more compact view, which can help when dealing with large mappings:
+
+    >>> from pandagg.mapping import Mapping
+    >>> Mapping(imdb_mapping)
+    <Mapping>
+                                                                 {Object}
+    ├── genres                                                    Keyword
+    ├── movie_id                                                  Integer
+    ├── name                                                      Text
+    │   └── raw                                                 ~ Keyword
+    ├── rank                                                      Float
+    ├── roles                                                    [Nested]
+    │   ├── actor_id                                              Integer
+    │   ├── first_name                                            Text
+    │   │   └── raw                                             ~ Keyword
+    │   ├── gender                                                Keyword
+    │   ├── last_name                                             Text
+    │   │   └── raw                                             ~ Keyword
+    │   └── role                                                  Keyword
+    └── year                                                      Date
+
+
+With the pandagg DSL, an equivalent declaration would be the following:
+
+    >>> from pandagg.mapping import Mapping, Object, Nested, Float, Keyword, Date, Integer, Text
+    >>>
+    >>> dsl_mapping = Mapping(properties=[
+    ...     Integer('movie_id'),
+    ...     Text('name', fields=[
+    ...         Keyword('raw')
+    ...     ]),
+    ...     Date('year', format='yyyy'),
+    ...     Float('rank'),
+    ...     Keyword('genres'),
+    ...     Nested('roles', properties=[
+    ...         Keyword('role'),
+    ...         Integer('actor_id'),
+    ...         Keyword('gender'),
+    ...         Text('first_name', fields=[
+    ...             Keyword('raw')
+    ...         ]),
+    ...         Text('last_name', fields=[
+    ...             Keyword('raw')
+    ...         ])
+    ...     ])
+    ... ])
+
+This is exactly equivalent to the initial mapping:
+
+    >>> dsl_mapping.serialize() == imdb_mapping
+    True
+
+
+Interactive mapping
+===================
+
+In an interactive context, the :class:`~pandagg.interactive.mapping.IMapping` class provides navigation features with autocompletion to quickly explore a large
+mapping:
+
+    >>> from pandagg.mapping import IMapping
+    >>> m = IMapping(imdb_mapping)
+    >>> m.roles
+    <IMapping subpart: roles>
+    roles                                                    [Nested]
+    ├── actor_id                                              Integer
+    ├── first_name                                            Text
+    │   └── raw                                             ~ Keyword
+    ├── gender                                                Keyword
+    ├── last_name                                             Text
+    │   └── raw                                             ~ Keyword
+    └── role                                                  Keyword
+    >>> m.roles.first_name
+    <IMapping subpart: roles.first_name>
+    first_name                                            Text
+    └── raw                                             ~ Keyword
+
+To get the complete field definition, just call it:
+
+    >>> m.roles.first_name()
+    <Mapping Field first_name> of type text:
+    {
+        "type": "text",
+        "fields": {
+            "raw": {
+                "type": "keyword"
+            }
+        }
+    }
+
+An **IMapping** instance can be bound to an Elasticsearch client to get quick access to aggregation computations on mapping fields.
+
+Suppose you have the following client:
+
+    >>> from elasticsearch import Elasticsearch
+    >>> client = Elasticsearch(hosts=['localhost:9200'])
+
+The client can be bound either at instantiation:
+
+    >>> m = IMapping(imdb_mapping, client=client, index_name='movies')
+
+or afterwards through the `bind` method:
+
+    >>> m = IMapping(imdb_mapping)
+    >>> m.bind(client=client, index_name='movies')
+
+Doing so generates an **a** attribute on mapping fields, which lists all aggregations available for that
+field type (with autocompletion):
+
+    >>> m.roles.gender.a.terms()
+    [('M', {'key': 'M', 'doc_count': 2296792}),
+    ('F', {'key': 'F', 'doc_count': 1135174})]
+
+
+.. note::
+
+    Nested clauses will be automatically taken into account.
+
+
 *************************
 Cluster indices discovery
 *************************
 
 TODO
+
diff --git a/examples/imdb/README.md b/examples/imdb/README.md
@@ -1,4 +1,4 @@
-# Explore IMDB with ElasticSearch
+# IMDB dataset
 
 You might know the Internet Movie Database, commonly called [IMDB](https://www.imdb.com/).
 
@@ -7,7 +7,7 @@ Well it's a good simple example to showcase ElasticSearch capabilities.
 In this case, relational databases (SQL) are a good fit to store with consistence this kind of data.
 Yet indexing some of this data in a optimized search engine will allow more powerful queries.
 
-## Goal of exercice
+## Query requirements
 In this example, we'll suppose most usage/queries requirements will be around the concept of movie (rather than usages 
 focused on fetching actors or directors, even though it will still be possible with this data structure).
 
@@ -21,7 +21,9 @@ The index should provide good performances trying to answer these kind question
 
 
 ## Data source
-I exported following SQL tables from MariaDB as described in https://relational.fit.cvut.cz/dataset/IMDb:
+I exported the SQL tables below from MariaDB, [following these instructions](https://relational.fit.cvut.cz/dataset/IMDb).
+
+The relational schema is the following:
 
 ![imdb tables](ressources/imdb_ijs.svg) 
 
@@ -61,6 +63,7 @@ https://www.elastic.co/fr/blog/strings-are-dead-long-live-strings)
 
 #### Final mapping
 
+*TODO -> use [copy_to](https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html) parameter to build **full_name***
 ```
 .                                                            
 ├── directors                         [Nested]
@@ -139,6 +142,7 @@ python examples/imdb/load.py
 
 
 #### Explore pandagg notebooks
-```
-jupyter notebook
-```
+
+An example notebook is available to showcase some of `pandagg`'s functionality: [here it is](https://gistpreview.github.io/?4cedcfe49660cd6757b94ba491abb95a).
+
+The code is in the `examples/imdb/IMDB exploration.py` file.
diff --git a/pandagg/interactive/_field_agg_factory.py b/pandagg/interactive/_field_agg_factory.py
@@ -57,7 +57,7 @@ def _operate(self, agg_node, index, execute, output, query):
         for nested in nesteds:
             raw_response = raw_response[nested]
         result = list(agg_node.extract_buckets(raw_response[agg_node.name]))
-        if output is None:
+        if output == 'raw':
             return result
         elif output == 'dataframe':
             try:
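
The last hunk changes the raw-output check from `output is None` to `output == 'raw'`. Below is a minimal sketch of that dispatch, for illustration only. `format_buckets` is a hypothetical stand-in, not pandagg's `_operate`; the DataFrame construction and the `ValueError` fallback are assumptions, since the rest of the function is not shown in this diff.

```python
def format_buckets(buckets, output="raw"):
    """Hypothetical stand-in for the tail of `_operate` (not pandagg's code)."""
    result = list(buckets)
    if output == "raw":
        # After this commit, the raw list of (key, bucket) tuples is
        # returned for output == 'raw' rather than for output is None.
        return result
    if output == "dataframe":
        try:
            import pandas as pd  # imported lazily, only for this output mode
        except ImportError as exc:
            raise ImportError("output='dataframe' requires pandas") from exc
        # Assumption: one row per bucket, built from the bucket dicts.
        return pd.DataFrame.from_records([bucket for _, bucket in result])
    # Assumption: any other value is rejected explicitly.
    raise ValueError(f"unsupported output: {output!r}")


# Buckets shaped like the `m.roles.gender.a.terms()` example in the user guide.
buckets = [
    ("M", {"key": "M", "doc_count": 2296792}),
    ("F", {"key": "F", "doc_count": 1135174}),
]
raw = format_buckets(buckets, output="raw")
```

In this sketch `output=None` no longer selects the raw result, so a caller that relied on the old check would need to pass `'raw'` explicitly. Whether pandagg itself behaves this way depends on the signature default, which this diff does not show.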