
Ml sklearn structure (integration to the AB) #385

Open. tzamalisp wants to merge 66 commits into master from ml_sklearn_structure.
Conversation

@tzamalisp (Contributor) commented Aug 20, 2020

New Machine Learning Infrastructure - Tool Integration (Google Summer of Code 2020)

The main goal of this Pull Request (PR) is the integration of the newly built Machine Learning tool into the AcousticBrainz platform. Several aspects have already been completed, and others remain, before the tool is fully integrated into the AcousticBrainz project. This PR follows the previous PR, "New Machine Learning Infrastructure - Tool Development and Code Review by the Mentor" (now closed), which covered the GSoC phase of tool development and code review by the Mentor.

The implemented Machine Learning tool, and the branch where the integration part takes place, can be found in the forked AcousticBrainz repository at the link below, which also includes a detailed description of how the tool works:

So far, the tool has been developed successfully, and several models have been trained offline on datasets provided by the Mentor. The training process works correctly, with performance results (e.g. model accuracy) almost identical to those of the old machine learning infrastructure provided by the gaia tool.

In the integration part, a new Docker image, server_dataset_evaluator_sklearn, is created so that the new ML infrastructure can run in Python 3 with the corresponding library dependencies (the previous tool runs in Python 2). Several other compatibility issues (as the commits show) are solved so that there are no conflicts between the simultaneous execution of Python 2 and Python 3, depending on which tool the AcousticBrainz platform uses. Additionally, the user interface and the relevant back-end services are updated: the user can now choose which ML tool (gaia or sklearn) will execute the music classification task and train the model.
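As a rough illustration of that choice, the back end might dispatch an evaluation job to one of the two engines based on the option picked in the UI. This is only a minimal sketch; the names run_gaia_evaluation, run_sklearn_evaluation, and the "tool" field are hypothetical, not the actual AcousticBrainz API.

```python
# Minimal sketch of dispatching a dataset evaluation job to one of the
# two ML engines. All names here are hypothetical, for illustration only.

def run_gaia_evaluation(job):       # hypothetical: Python 2 / gaia path
    raise NotImplementedError

def run_sklearn_evaluation(job):    # hypothetical: Python 3 / sklearn path
    raise NotImplementedError

ENGINES = {
    "gaia": run_gaia_evaluation,
    "sklearn": run_sklearn_evaluation,
}

def evaluate(job):
    # "tool" would come from the option the user selected in the UI.
    engine = ENGINES.get(job.get("tool", "gaia"))
    if engine is None:
        raise ValueError("unknown ML tool: %r" % job["tool"])
    return engine(job)
```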

The next steps to be done are:

  • Continue checking whether any issues arise when the tool is triggered to train models.
  • Train the basic music classification models that will later be used for the high-level data predictions of a new instance stored in the database (a rough sketch of such a training step follows this list).
  • Integrate the tool's prediction functionality.
  • Train some other models.
  • Implement the corresponding functions so that the high-level data from the predicted values are created and shown in the UI.
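For context, the sklearn-based training step mentioned above boils down to a standard supervised classification pipeline. The sketch below uses plain scikit-learn with made-up feature and label arrays; it is not the tool's actual code, only an outline of the technique (grid-searched classification, which the commit history references).

```python
# Outline of a grid-searched classification step with scikit-learn.
# The data here is synthetic; the real tool works on low-level features
# of AcousticBrainz recordings labelled by a ground-truth file.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(200, 20)          # stand-in for low-level features
y = np.random.randint(0, 2, 200)     # stand-in for class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC()),
])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```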

Finally, during the GSoC project period, some other PRs with code changes were created and implemented, corresponding to the following:

  • PEP 8 issues fix 02
  • remove gaia best models params
  • issue fixed - tracks in gt file do not exist in low-level
  • predict with MBID only (not the whole API low-level url)
  • add sklearn requirements.txt
  • create gaia evaluation method
  • change name of the model directory
  • load updated files
  • change the path saves - reduce print/logging messages in evaluation
  • requirements.txt --> lowercase
  • remove requirements.txt
  • add requirements.txt with lowercase
  • use of now.isoformat()
  • reports creation - datetime to start of the report, code improvements
  • review updates 01
  • syntax for documentation strings as in AB fixed
  • ground_truth file typo fixed
  • use of os.makedirs(full_path, exist_ok=True)
  • create results_dict = {....} in a single call
  • simplify processing step dict creation - documentation added
  • add lower-case in all process steps params
  • save best model after training to the whole data - add README
  • split evaluation in separate methods
  • split grid classification into separate methods
  • single logger setup
  • change logging set from int to str
  • dynamic project yaml saving - update readme predict section
  • add new arguments in readme
  • readme - add predict invoking
  • readme MBID in prediction
  • create project - default values declaration - documentation
  • relative imports
  • update predict readme and script arguments
  • import classification_project
  • readme path file required
  • readme, how it works section - remove requirements jupyter notebook
  • requirements.txt - add requests, remove tensorflow
  • readme - add how training and predicting modes work
  • add new dockerfile for ML tool
  • readme - update doc only for parameters
  • add exports_directory in config when specified in arguments
  • sklearn model inserted in AB
pep8speaks commented Aug 20, 2020

Hello @tzamalisp! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 44:5: E303 too many blank lines (2)

Line 51:5: E303 too many blank lines (2)

Line 27:5: E303 too many blank lines (2)

Line 52:48: E128 continuation line under-indented for visual indent
Line 53:48: E128 continuation line under-indented for visual indent
Line 54:48: E128 continuation line under-indented for visual indent
Line 55:48: E128 continuation line under-indented for visual indent

Line 45:1: W391 blank line at end of file

Line 41:5: E303 too many blank lines (2)

Line 33:5: E303 too many blank lines (2)

Line 8:1: E302 expected 2 blank lines, found 1

Comment last updated at 2021-07-08 11:27:54 UTC

tzamalisp force-pushed the ml_sklearn_structure branch from 36eb75e to 7b81889 on August 24, 2020 12:20
amCap1712 added 30 commits on June 30, 2021 14:00. Among the commit messages:
The export_path parameter should contain the complete path where the results should be stored. Everywhere we use exports_dir, we append it to the export_path. If it's a matter of user convenience, we can add it back to the command line, but merge it with export_path asap.
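A minimal sketch of that convention, assuming the hypothetical helper name resolve_exports_path (only export_path and exports_dir come from the commit message):

```python
import os

def resolve_exports_path(export_path, exports_dir=None):
    # export_path is the complete destination for all results; if an
    # exports_dir is still accepted for convenience, it is simply
    # appended rather than treated as an independent location.
    return os.path.join(export_path, exports_dir) if exports_dir else export_path
```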
Making a class just to call a util method does not seem useful. Also, the directories are always created by the ClassificationTaskManager and DatasetExplorer, so we can just refer to their path with os.path.join elsewhere. We already have a similar function in AB utils, but that is used on Python 2.7, which does not have the exist_ok arg available. Hence, keeping this.
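For reference, the exist_ok difference is the standard-library one; a sketch of both idioms (the path and helper name are illustrative, not AB's actual code):

```python
import errno
import os

# Python 3: create the directory tree, tolerating an existing one.
os.makedirs("/tmp/ab_exports/results", exist_ok=True)

# Python 2.7 has no exist_ok, so a helper must swallow EEXIST itself.
def create_path(path):
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
```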
Revert the change from the previous commit for the logging directory, because we set up logging before trying to create all the other output directories.
Instead of setting up the logger again and again, we pass the logger around manually. In subsequent steps, this will be entirely replaced to utilise Python's hierarchical logging. Can't do this just now because the file handler requires the dataset name.
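The hierarchical logging mentioned there is the stdlib pattern where each module asks for a named child logger and only the application attaches handlers; a minimal sketch (the dataset-named log file is an assumption drawn from the commit message):

```python
import logging

# In each module: obtain a child logger by name, attach no handlers here.
logger = logging.getLogger(__name__)

def setup_logging(dataset_name):
    # At application start: configure the root logger once. The file
    # handler needs the dataset name, which is why this step must wait.
    handler = logging.FileHandler("%s.log" % dataset_name)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)
```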
Eliminate unused methods and move useful ones to individual methods.

We never really access these dicts other than by iterating with `.items()`, so the defaultdicts are redundant and only add confusion.
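That last point is easy to see in a tiny sketch with made-up data:

```python
from collections import defaultdict

# If the dict is only ever iterated, defaultdict buys nothing:
results = defaultdict(list)
results["accuracy"].append(0.91)

# ...because iteration never triggers the default factory.
for key, values in results.items():
    print(key, values)

# A plain dict built up-front expresses the same thing more directly.
results = {"accuracy": [0.91]}
```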