From da469692921bce602effb6b73a2b39f073fb3fa0 Mon Sep 17 00:00:00 2001
From: Leonard Binet
Date: Sat, 7 Mar 2020 22:44:43 +0100
Subject: [PATCH] update documentation

---
 docs/source/IMDB.md                       |   1 +
 docs/source/advanced-usage.rst            |  10 +-
 docs/source/imdb.rst                      |   5 -
 docs/source/index.rst                     |   4 +-
 docs/source/introduction.rst              |  47 +++++
 docs/source/ressources                    |   1 +
 docs/source/user-guide.rst                | 207 ++++++++++++++++++----
 examples/imdb/README.md                   |  16 +-
 pandagg/interactive/_field_agg_factory.py |   2 +-
 9 files changed, 238 insertions(+), 55 deletions(-)
 create mode 120000 docs/source/IMDB.md
 delete mode 100644 docs/source/imdb.rst
 create mode 100644 docs/source/introduction.rst
 create mode 120000 docs/source/ressources

diff --git a/docs/source/IMDB.md b/docs/source/IMDB.md
new file mode 120000
index 00000000..3a3d5b30
--- /dev/null
+++ b/docs/source/IMDB.md
@@ -0,0 +1 @@
+../../examples/imdb/README.md
\ No newline at end of file
diff --git a/docs/source/advanced-usage.rst b/docs/source/advanced-usage.rst
index 1a2def99..16b8ed43 100644
--- a/docs/source/advanced-usage.rst
+++ b/docs/source/advanced-usage.rst
@@ -1,5 +1,11 @@
+##############
 Advanced usage
-==============
+##############
 
-TODO
+.. note::
+
+    This is a work in progress. Some sections still need to be furnished.
+
+    * node and tree deserialization order
+    * compound query insertion
 
diff --git a/docs/source/imdb.rst b/docs/source/imdb.rst
deleted file mode 100644
index ae120d41..00000000
--- a/docs/source/imdb.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-Usage example on IMDB
-=====================
-
-An example based on publicly available IMDB data is documented in repository `examples/imdb` directory, with
-a jupyter notebook to showcase some of `pandagg` functionalities: `here it is `_.
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 6caad5c0..c5ef9c3e 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -11,10 +11,10 @@ pandagg
     :hidden:
     :maxdepth: 4
 
-    Introduction
+    introduction
     user-guide
     advanced-usage
-    Usage example
+    Tutorial dataset
     API reference
     Contributing
 
diff --git a/docs/source/introduction.rst b/docs/source/introduction.rst
new file mode 100644
index 00000000..dd0b6352
--- /dev/null
+++ b/docs/source/introduction.rst
@@ -0,0 +1,47 @@
+##########
+Principles
+##########
+
+.. note::
+
+    This is a work in progress. Some sections still need to be furnished.
+
+
+**pandagg** is designed for both for "regular" code repository usage, and "interactive" usage (ipython or jupyter
+notebook usage with autocompletion features inspired by `pandas `_ design).
+
+This library focuses on two principles:
+
+* stick to the **tree** structure of Elasticsearch objects
+* provide simple and flexible interfaces to make it easy and intuitive to use in an interactive usage
+
+
+*****************************
+Elasticsearch tree structures
+*****************************
+
+Many Elasticsearch objects have a **tree** structure, ie they are built from a hierarchy of **nodes**:
+
+* a `mapping `_ (tree) is a hierarchy of `fields `_ (nodes)
+* a `query `_ (tree) is a hierarchy of query clauses (nodes)
+* an `aggregation `_ (tree) is a hierarchy of aggregation clauses (nodes)
+* an aggregation response (tree) is a hierarchy of response buckets (nodes)
+
+This library aims to stick to that structure by providing a flexible syntax distinguishing **trees** and **nodes**.
+
+*****************
+Interactive usage
+*****************
+
+Some classes are not intended to be used elsewhere than in interactive mode (ipython), since their purpose is to serve
+auto-completion features and convenient representations.
+
+They won't serve you for any other usage than interactive ones.
+
+Namely:
+
+* `pandagg.mapping.IMapping`: used to interactively navigate in mapping and run quick aggregations on some fields
+* `pandagg.client.Elasticsearch`: used to discover cluster indices, and eventually navigate their mappings, or run quick access aggregations or queries.
+* `pandagg.agg.AggResponse`: used to interactively navigate in an aggregation response
+
+These use case will be detailed in following sections.
diff --git a/docs/source/ressources b/docs/source/ressources
new file mode 120000
index 00000000..5b88db8f
--- /dev/null
+++ b/docs/source/ressources
@@ -0,0 +1 @@
+../../examples/imdb/ressources
\ No newline at end of file
diff --git a/docs/source/user-guide.rst b/docs/source/user-guide.rst
index 53e90817..d980d10b 100644
--- a/docs/source/user-guide.rst
+++ b/docs/source/user-guide.rst
@@ -2,40 +2,12 @@
 User Guide
 ##########
 
-.. toctree::
-
 .. note::
 
+    Examples will be based on :doc:`IMDB` data.
     This is a work in progress. Some sections still need to be furnished.
 
 
-**********
-Philosophy
-**********
-
-**pandagg** is designed for both for "regular" code repository usage, and "interactive" usage (ipython or jupyter
-notebook usage with autocompletion features inspired by `pandas `_ design).
-
-This library focuses on two principles:
-
-* stick to the **tree** structure of Elasticsearch objects
-* provide simple and flexible interfaces to make it easy and intuitive to use in an interactive usage
-
-Elasticsearch tree structures
-=============================
-
-Many Elasticsearch objects have a **tree** structure, ie they are built from a hierarchy of **nodes**:
-
-* a `mapping `_ (tree) is a hierarchy of `fields `_ (nodes)
-* a `query `_ (tree) is a hierarchy of query clauses (nodes)
-* an `aggregation `_ (tree) is a hierarchy of aggregation clauses (nodes)
-* an aggregation response (tree) is a hierarchy of response buckets (nodes)
-
-This library aims to stick to that structure by providing a flexible syntax distinguishing **trees** and **nodes**.
-
-Interactive usage
-=================
-
 
 *****
 Query
@@ -144,16 +116,6 @@ Eventually, you can also use regular Elasticsearch dict syntax:
     └── terms, field=genres, values=['Action', 'Thriller']
 
 
-*******
-Mapping
-*******
-
-Mapping declaration
-===================
-
-Mapping navigation
-==================
-
 ***********
 Aggregation
 ***********
@@ -166,8 +128,175 @@ Aggregation response
 TODO
 
 
+*******
+Mapping
+*******
+
+Here is a portion of :doc:`IMDB` example mapping:
+
+    >>> imdb_mapping = {
+    >>>     'dynamic': False,
+    >>>     'properties': {
+    >>>         'movie_id': {'type': 'integer'},
+    >>>         'name': {
+    >>>             'type': 'text',
+    >>>             'fields': {
+    >>>                 'raw': {'type': 'keyword'}
+    >>>             }
+    >>>         },
+    >>>         'year': {
+    >>>             'type': 'date',
+    >>>             'format': 'yyyy'
+    >>>         },
+    >>>         'rank': {'type': 'float'},
+    >>>         'genres': {'type': 'keyword'},
+    >>>         'roles': {
+    >>>             'type': 'nested',
+    >>>             'properties': {
+    >>>                 'role': {'type': 'keyword'},
+    >>>                 'actor_id': {'type': 'integer'},
+    >>>                 'gender': {'type': 'keyword'},
+    >>>                 'first_name': {
+    >>>                     'type': 'text',
+    >>>                     'fields': {
+    >>>                         'raw': {'type': 'keyword'}
+    >>>                     }
+    >>>                 },
+    >>>                 'last_name': {
+    >>>                     'type': 'text',
+    >>>                     'fields': {
+    >>>                         'raw': {'type': 'keyword'}
+    >>>                     }
+    >>>                 }
+    >>>             }
+    >>>         }
+    >>>     }
+    >>> }
+
+Mapping DSL
+===========
+
+The :class:`~pandagg.tree.mapping.Mapping` class provides a more compact view, which can help when dealing with large mappings:
+
+    >>> from pandagg.mapping import Mapping
+    >>> m = Mapping(imdb_mapping)
+
+    {Object}
+    ├── genres                   Keyword
+    ├── movie_id                 Integer
+    ├── name                     Text
+    │   └── raw                ~ Keyword
+    ├── rank                     Float
+    ├── roles                   [Nested]
+    │   ├── actor_id             Integer
+    │   ├── first_name           Text
+    │   │   └── raw            ~ Keyword
+    │   ├── gender               Keyword
+    │   ├── last_name            Text
+    │   │   └── raw            ~ Keyword
+    │   └── role                 Keyword
+    └── year                     Date
+
+
+With pandagg DSL, an equivalent declaration would be the following:
+
+    >>> from pandagg.mapping import Mapping, Object, Nested, Float, Keyword, Date, Integer, Text
+    >>>
+    >>> dsl_mapping = Mapping(properties=[
+    >>>     Integer('movie_id'),
+    >>>     Text('name', fields=[
+    >>>         Keyword('raw')
+    >>>     ]),
+    >>>     Date('year', format='yyyy'),
+    >>>     Float('rank'),
+    >>>     Keyword('genres'),
+    >>>     Nested('roles', properties=[
+    >>>         Keyword('role'),
+    >>>         Integer('actor_id'),
+    >>>         Keyword('gender'),
+    >>>         Text('first_name', fields=[
+    >>>             Keyword('raw')
+    >>>         ]),
+    >>>         Text('last_name', fields=[
+    >>>             Keyword('raw')
+    >>>         ])
+    >>>     ])
+    >>> ])
+
+Which is exactly equivalent to initial mapping:
+
+    >>> dsl_mapping.serialize() == imdb_mapping
+    True
+
+
+Interactive mapping
+===================
+
+In interactive context, the :class:`~pandagg.interactive.mapping.IMapping` class provides navigation features with autocompletion to quickly discover a large
+mapping:
+
+    >>> from pandagg.mapping import IMapping
+    >>> m = IMapping(imdb_mapping)
+    >>> m.roles
+
+    roles                       [Nested]
+    ├── actor_id                 Integer
+    ├── first_name               Text
+    │   └── raw                ~ Keyword
+    ├── gender                   Keyword
+    ├── last_name                Text
+    │   └── raw                ~ Keyword
+    └── role                     Keyword
+    >>> m.roles.first_name
+
+    first_name                   Text
+    └── raw                    ~ Keyword
+
+To get the complete field definition, just call it:
+
+    >>> m.roles.first_name()
+    of type text:
+    {
+        "type": "text",
"fields": { + "raw": { + "type": "keyword" + } + } + } + +A **IMapping** instance can be bound to an Elasticsearch client to get quick access to aggregations computation on mapping fields. + +Suppose you have the following client: + + >>> from elasticsearch import Elasticsearch + >>> client = Elasticsearch(hosts=['localhost:9200']) + +Client can be bound either at initiation: + + >>> m = IMapping(imdb_mapping, client=client, index_name='movies') + +or afterwards through `bind` method: + + >>> m = IMapping(imdb_mapping) + >>> m.bind(client=client, index_name='movies') + +Doing so will generate a **a** attribute on mapping fields, this attribute will list all available aggregation for that +field type (with autocompletion): + + >>> m.roles.gender.a.terms() + [('M', {'key': 'M', 'doc_count': 2296792}), + ('F', {'key': 'F', 'doc_count': 1135174})] + + +.. note:: + + Nested clauses will be automatically taken into account. + + ************************* Cluster indices discovery ************************* TODO + diff --git a/examples/imdb/README.md b/examples/imdb/README.md index e0da9b4d..7c458783 100644 --- a/examples/imdb/README.md +++ b/examples/imdb/README.md @@ -1,4 +1,4 @@ -# Explore IMDB with ElasticSearch +# IMDB dataset You might know the Internet Movie Database, commonly called [IMDB](https://www.imdb.com/). @@ -7,7 +7,7 @@ Well it's a good simple example to showcase ElasticSearch capabilities. In this case, relational databases (SQL) are a good fit to store with consistence this kind of data. Yet indexing some of this data in a optimized search engine will allow more powerful queries. -## Goal of exercice +## Query requirements In this example, we'll suppose most usage/queries requirements will be around the concept of movie (rather than usages focused on fetching actors or directors, even though it will still be possible with this data structure). 
@@ -21,7 +21,9 @@ The index should provide good performances trying to answer these kind question
 
 
 ## Data source
-I exported following SQL tables from MariaDB as described in https://relational.fit.cvut.cz/dataset/IMDb:
+I exported following SQL tables from MariaDB [following these instructions](https://relational.fit.cvut.cz/dataset/IMDb).
+
+Relational schema is the following:
 
 ![imdb tables](ressources/imdb_ijs.svg)
 
@@ -61,6 +63,7 @@ https://www.elastic.co/fr/blog/strings-are-dead-long-live-strings)
 
 
 #### Final mapping
+*TODO -> use [copy_to](https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html) parameter to build **full_name***
 ```
 .
 ├── directors [Nested]
@@ -139,6 +142,7 @@ python examples/imdb/load.py
 
 
 #### Explore pandagg notebooks
-```
-jupyter notebook
-```
+
+An example notebook is available to showcase some of `pandagg` functionalities: [here it is](https://gistpreview.github.io/?4cedcfe49660cd6757b94ba491abb95a).
+
+Code is present in `examples/imdb/IMDB exploration.py` file.
diff --git a/pandagg/interactive/_field_agg_factory.py b/pandagg/interactive/_field_agg_factory.py
index 90e7fcd2..98805e02 100644
--- a/pandagg/interactive/_field_agg_factory.py
+++ b/pandagg/interactive/_field_agg_factory.py
@@ -57,7 +57,7 @@ def _operate(self, agg_node, index, execute, output, query):
         for nested in nesteds:
             raw_response = raw_response[nested]
         result = list(agg_node.extract_buckets(raw_response[agg_node.name]))
-        if output is None:
+        if output == 'raw':
             return result
         elif output == 'dataframe':
             try: