Release v0.2.4 #115

Open. Wants to merge 29 commits into base: master.

Commits:
6484639
Handle deprecation warnings in https://github.com/elastic/elasticsear…
leonardbinet Feb 14, 2022
bac7837
Black formatting update
leonardbinet Feb 14, 2022
6b21f52
Fix Search.scan method to allow passing or additional parameters to u…
leonardbinet Feb 16, 2022
6ff6baa
Merge pull request #109 from leonardbinet/deprecation-warnings
leonardbinet Mar 8, 2022
f09878d
Fix composite-aggs scan, by retrieving first batch.
leonardbinet Feb 14, 2022
10266dd
Merge pull request #110 from leonardbinet/agg-scan
leonardbinet Mar 8, 2022
1724219
Add match_all aggregation clause
leonardbinet Feb 16, 2022
d6ac305
Add match_all and match_none query clauses
leonardbinet Feb 16, 2022
41005e5
Fix Search.scan method to allow passing or additional parameters to u…
leonardbinet Feb 16, 2022
c6415c5
Activate 'github actions' on dev branch
leonardbinet Mar 8, 2022
0b8c182
Merge pull request #113 from leonardbinet/actions
leonardbinet Mar 8, 2022
ab176c3
Merge pull request #112 from leonardbinet/query_match
leonardbinet Mar 8, 2022
3e25ac5
v0.2.4
leonardbinet Mar 8, 2022
16683dc
Merge pull request #111 from leonardbinet/scan
leonardbinet Mar 8, 2022
06fd465
Merge pull request #114 from leonardbinet/actions
leonardbinet Mar 8, 2022
75f64f3
Migrate flake8 configuration from makefile to setup.cfg
leonardbinet Mar 9, 2022
1863fc0
Update examples for easier use (imdb can be ingested via command line)
leonardbinet Mar 9, 2022
122174a
Merge pull request #116 from leonardbinet/examples
leonardbinet Mar 9, 2022
72c2d7c
Introduction of changelog (begin at v0.2.4).
leonardbinet Mar 11, 2022
23fff0a
Merge pull request #117 from leonardbinet/changelog
leonardbinet Mar 11, 2022
b469618
Improve discovery.discover docstring
leonardbinet Mar 15, 2022
7ea2954
Document docstring improvement, and clearer __repr__
leonardbinet Mar 15, 2022
10971ae
pandagg.index docstring improvements
leonardbinet Mar 15, 2022
397b5d6
NYC restaurants example comment clarification
leonardbinet Mar 15, 2022
2484883
Merge pull request #118 from leonardbinet/doc_improve_1
leonardbinet Mar 15, 2022
34e6d7d
Apply import sort via isort
leonardbinet Mar 16, 2022
8de08fd
Add isort pre-commit
leonardbinet Mar 16, 2022
0c7f1fa
CI - ensure imports are sorted
leonardbinet Mar 16, 2022
e49d32d
Merge pull request #119 from leonardbinet/isort
leonardbinet Mar 16, 2022
8 changes: 6 additions & 2 deletions .github/workflows/python-3-tests.yml
@@ -3,9 +3,9 @@ name: Python 3 Tests

on:
push:
branches: [ master ]
branches: [ master, dev ]
pull_request:
branches: [ master ]
branches: [ master, dev ]

jobs:
static_analysis:
@@ -32,6 +32,10 @@ jobs:
pip install mypy
pip install -e ".[develop]"
mypy --install-types --non-interactive pandagg
- name: Isort check
run: |
pip install isort
isort pandagg examples tests -c

run_tests:
runs-on: ubuntu-latest
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -17,3 +17,9 @@ repos:
pass_filenames: false
language: system
types: [ python ]
- repo: https://github.com/pycqa/isort
rev: 5.10.1
hooks:
- id: isort
name: isort (python)
args: ["--profile", "black", "--filter-files"]
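
As a rough illustration of what this hook changes, the sketch below reproduces the reordering visible in `examples/NY-restaurants/ingest.py` with a plain alphabetical sort. It only approximates isort's effect on a single-line `from ... import` statement; real isort also groups imports into sections and handles wrapping and case, which this sketch does not attempt. The helper name `sort_from_import` is hypothetical and is not part of isort.

```python
def sort_from_import(line: str) -> str:
    """Alphabetically sort the names of a single-line `from X import a, b` statement."""
    head, names = line.split(" import ", 1)
    ordered = sorted(name.strip() for name in names.split(","))
    return f"{head} import {', '.join(ordered)}"


# Import line as it appears before this pull request.
before = "from os.path import abspath, join, dirname, exists"
after = sort_from_import(before)
print(after)
```

The result matches the sorted import line introduced in the `ingest.py` hunk of this diff.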
34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,34 @@

# Change Log
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/)
and this project adheres to [Semantic Versioning](http://semver.org/).

## [Unreleased] - 2022-02-11

### Added
- Use isort to automatically sort imports

### Changed

### Fixed

## [0.2.4] - 2022-02-11

Introduction of the repository changelog.

### Added

- GitHub Actions run when pushing changes or opening a pull request to the `dev` branch ([#113](https://github.com/alkemics/pandagg/pull/113)).
- `match_all` and `match_none` query clauses ([#103](https://github.com/alkemics/pandagg/issues/103#issuecomment-1040425685), [#112](https://github.com/alkemics/pandagg/pull/112)).
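
The clause names mirror the Elasticsearch Query DSL `match_all` and `match_none` queries. As a minimal sketch, the request bodies they correspond to can be written as plain dictionaries. This does not use pandagg's own classes, whose exact names are not shown in this diff; the function names below are hypothetical.

```python
import json


def match_all_query() -> dict:
    """Request body matching every document (Elasticsearch `match_all`)."""
    return {"query": {"match_all": {}}}


def match_none_query() -> dict:
    """Request body matching no document (Elasticsearch `match_none`)."""
    return {"query": {"match_none": {}}}


print(json.dumps(match_all_query()))
print(json.dumps(match_none_query()))
```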

### Changed

- Handle deprecation warnings introduced in [elasticsearch-py](https://github.com/elastic/elasticsearch-py/issues/1698) ([#109](https://github.com/alkemics/pandagg/pull/109)).
- Improved the IMDB and NY-restaurants examples: both can now be ingested into a client cluster with a single command line ([#116](https://github.com/alkemics/pandagg/pull/116)).

### Fixed

- Fix aggregation scan via composite aggregation: the first batch was not yielded ([#101](https://github.com/alkemics/pandagg/issues/101), [#110](https://github.com/alkemics/pandagg/pull/110)).
- Fix search scan by allowing parameters to be passed ([#103](https://github.com/alkemics/pandagg/issues/103#issuecomment-1040445479), [#111](https://github.com/alkemics/pandagg/pull/111)).
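
The composite-aggregation fix above addresses a common pagination pitfall: following `after_key` to the next page without first yielding the page already in hand. The sketch below shows the intended pattern under stated assumptions. `fetch_page` is a hypothetical callable standing in for the Elasticsearch request, and the page shape (`buckets`, `after_key`) follows the composite aggregation response. It is not pandagg's actual implementation.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Page = Dict[str, Any]


def scan_composite(fetch_page: Callable[[Optional[dict]], Page]) -> Iterator[dict]:
    """Yield every bucket of a paginated composite aggregation, first page included."""
    after_key: Optional[dict] = None
    while True:
        page = fetch_page(after_key)
        buckets = page.get("buckets", [])
        if not buckets:
            return
        # Yield the current page *before* requesting the next one,
        # so the first batch is never skipped.
        yield from buckets
        after_key = page.get("after_key")
        if after_key is None:
            return


# Hypothetical in-memory stand-in for the cluster: two pages, then an empty one.
_PAGES: List[Page] = [
    {"buckets": [{"key": {"genre": "a"}}, {"key": {"genre": "b"}}], "after_key": {"genre": "b"}},
    {"buckets": [{"key": {"genre": "c"}}], "after_key": {"genre": "c"}},
    {"buckets": []},
]


def fake_fetch(after_key: Optional[dict]) -> Page:
    """Return the page that follows `after_key` in the fake dataset."""
    if after_key is None:
        return _PAGES[0]
    if after_key == {"genre": "b"}:
        return _PAGES[1]
    return _PAGES[2]


keys = [bucket["key"]["genre"] for bucket in scan_composite(fake_fetch)]
print(keys)
```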
8 changes: 6 additions & 2 deletions Makefile
@@ -13,14 +13,18 @@ lint-diff:
git diff upstream/master --name-only -- "*.py" | xargs flake8

lint:
# ignore "line break before binary operator", and "invalid escape sequence '\_'" useful for doc
flake8 --count --ignore=W503,W605 --show-source --statistics pandagg
flake8 --count --show-source --statistics pandagg
# on tests, be more lenient: allow "missing whitespace after ','" and "line too long"
flake8 --count --ignore=W503,W605,E231,E501 --show-source --statistics tests

black:
black examples docs pandagg tests setup.py

isort:
isort examples docs pandagg tests setup.py

format: isort black lint

develop:
-python -m pip install -e ".[develop]"

16 changes: 8 additions & 8 deletions docs/source/conf.py
@@ -19,14 +19,14 @@

# -- Project information -----------------------------------------------------

project = u"pandagg"
copyright = u"2020, Léonard Binet"
author = u"Léonard Binet"
project = "pandagg"
copyright = "2020, Léonard Binet"
author = "Léonard Binet"

# The short X.Y version
version = u""
version = ""
# The full version, including alpha/beta/rc tags
release = u"0.1"
release = "0.1"


# -- General configuration ---------------------------------------------------
@@ -130,15 +130,15 @@
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, "pandagg.tex", u"pandagg Documentation", u"Léonard Binet", "manual")
(master_doc, "pandagg.tex", "pandagg Documentation", "Léonard Binet", "manual")
]


# -- Options for manual page output ------------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, "pandagg", u"pandagg Documentation", [author], 1)]
man_pages = [(master_doc, "pandagg", "pandagg Documentation", [author], 1)]


# -- Options for Texinfo output ----------------------------------------------
@@ -150,7 +150,7 @@
(
master_doc,
"pandagg",
u"pandagg Documentation",
"pandagg Documentation",
author,
"pandagg",
"One line description of project.",
22 changes: 3 additions & 19 deletions examples/NY-restaurants/ingest.py
@@ -3,11 +3,11 @@
"""Script that downloads a public dataset and streams it to an Elasticsearch cluster"""

import csv
from os.path import abspath, join, dirname, exists
from os.path import abspath, dirname, exists, join

import urllib3
from elasticsearch import Elasticsearch

from pandagg.index import DeclarativeIndex
from model import NYCRestaurants

NYC_RESTAURANTS = (
"https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD"
@@ -16,22 +16,6 @@
CHUNK_SIZE = 16384


class NYCRestaurants(DeclarativeIndex):
name = "nyc-restaurants"
mappings = {
"properties": {
"name": {"type": "text"},
"borough": {"type": "keyword"},
"cuisine": {"type": "keyword"},
"grade": {"type": "keyword"},
"score": {"type": "integer"},
"location": {"type": "geo_point"},
"inspection_date": {"type": "date", "format": "MM/dd/yyyy"},
}
}
settings = {"number_of_shards": 1}


def download_dataset():
"""Downloads the public dataset if not locally downloaded
and returns the number of rows in the .csv file.
31 changes: 31 additions & 0 deletions examples/NY-restaurants/model.py
@@ -0,0 +1,31 @@
from pandagg.document import DocumentSource
from pandagg.index import DeclarativeIndex
from pandagg.mappings import Date, GeoPoint, Integer, Keyword, Text


class Inspection(DocumentSource):
name = Text()
borough = Keyword()
cuisine = Keyword()
grade = Keyword()
score = Integer()
location = GeoPoint()
inspection_date = Date(format="MM/dd/yyyy")


class NYCRestaurants(DeclarativeIndex):
name = "nyc-restaurants"
document = Inspection
# Note: the "mappings" attribute takes precedence over the "document" attribute in the mappings definition
mappings = {
"properties": {
"name": {"type": "text"},
"borough": {"type": "keyword"},
"cuisine": {"type": "keyword"},
"grade": {"type": "keyword"},
"score": {"type": "integer"},
"location": {"type": "geo_point"},
"inspection_date": {"type": "date", "format": "MM/dd/yyyy"},
}
}
settings = {"number_of_shards": 1}
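
To make the precedence note in `model.py` concrete, here is a plain-Python sketch of the rule it describes: an explicit `mappings` value wins, and the document's fields are used only when no explicit mappings are given. It deliberately avoids importing pandagg and is not its implementation; `resolve_mappings` and `DOCUMENT_FIELDS` are hypothetical names, and the field types are copied from the `Inspection` class above.

```python
from typing import Optional

# Field types as declared on the `Inspection` document in this diff.
DOCUMENT_FIELDS = {
    "name": {"type": "text"},
    "borough": {"type": "keyword"},
    "cuisine": {"type": "keyword"},
    "grade": {"type": "keyword"},
    "score": {"type": "integer"},
    "location": {"type": "geo_point"},
    "inspection_date": {"type": "date", "format": "MM/dd/yyyy"},
}


def resolve_mappings(explicit: Optional[dict], document_fields: dict) -> dict:
    """Return the explicit mappings when provided, else derive them from the document."""
    if explicit is not None:
        return explicit
    return {"properties": dict(document_fields)}


derived = resolve_mappings(None, DOCUMENT_FIELDS)
overridden = resolve_mappings({"properties": {"name": {"type": "keyword"}}}, DOCUMENT_FIELDS)
print(sorted(derived["properties"]))
print(overridden)
```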
65 changes: 10 additions & 55 deletions examples/imdb/README.md
@@ -21,7 +21,7 @@ The index should provide good performances trying to answer these kind question


## Data source
I exported following SQL tables from MariaDB [following these instructions](https://relational.fit.cvut.cz/dataset/IMDb).
https://relational.fit.cvut.cz/dataset/IMDb.

Relational schema is the following:

@@ -98,66 +98,21 @@ _
## Steps to start playing with your index


You can either directly use the demo index available [here]('https://beba020ee88d49488d8f30c163472151.eu-west-2.aws.cloud.es.io:9243/')
with credentials user: `pandagg`, password: `pandagg`:
Follow the steps below to install it locally.

Access it with following client instantiation:
```
from elasticsearch import Elasticsearch
client = Elasticsearch(
hosts=['https://beba020ee88d49488d8f30c163472151.eu-west-2.aws.cloud.es.io:9243/'],
http_auth=('pandagg', 'pandagg')
)
```


Or follow below steps to install it yourself locally.
In this case, you can either generate yourself the files, or download them from [here](https://drive.google.com/file/d/1po3T18l9QoYxPEGh-iKV4oN3DslWGu8-/view?usp=sharing) (file md5 `b363dee23720052501e24d15361ed605`).

#### Dump tables
Follow instruction on bottom of https://relational.fit.cvut.cz/dataset/IMDb page and dump following tables in a
directory:
- movies.csv
- movies_genres.csv
- movies_directors.csv
- directors.csv
- directors_genres.csv
- roles.csv
- actors.csv

#### Clone pandagg and setup environment
```
# clone repo
git clone git@github.com:alkemics/pandagg.git
cd pandagg

# create and activate your virtual environment using virtualenv or any similar tool
virtualenv env
python setup.py develop
pip install pandas simplejson jupyter seaborn
```
Then copy `conf.py.dist` file into `conf.py` and edit variables as suits you, for instance:
```
# your cluster address
ES_HOST = 'localhost:9200'
source env/bin/activate

# where your table dumps are stored, and where serialized output will be written
DATA_DIR = '/path/to/dumps/'
OUTPUT_FILE_NAME = 'serialized.json'
```

#### Serialize movie documents and insert them

```
# generate serialized movies documents, ready to be inserted in ES
# can take a while
python examples/imdb/serialize.py
# install dependencies for this example
make develop
pip install pandas simplejson mysqlclient mariadb

# create index with mappings if necessary, bulk insert documents in ES
python examples/imdb/load.py
# run ingestion script (type `python examples/imdb/ingest.py --help` for options)
python examples/imdb/ingest.py
```


#### Explore pandagg notebooks

An example notebook is available to showcase some of `pandagg` functionalities: [here it is](https://gistpreview.github.io/?4cedcfe49660cd6757b94ba491abb95a).

Code is present in `examples/imdb/IMDB exploration.py` file.
10 changes: 0 additions & 10 deletions examples/imdb/conf.py.dist

This file was deleted.
