
Ml sklearn structure (integration to the AB) #385

Open. tzamalisp wants to merge 66 commits into master from ml_sklearn_structure.
Conversation

@tzamalisp (Contributor) commented Aug 20, 2020

New Machine Learning Infrastructure - Tool Integration (Google Summer of Code 2020)

The main goal of this Pull Request (PR) is the integration of the newly built Machine Learning tool into the AcousticBrainz platform. Several aspects have already been completed, and others remain, before the tool is fully integrated into the AcousticBrainz project. This PR follows the previous PR, "New Machine Learning Infrastructure - Tool Development and Code Review by the Mentor" (now closed), which covered the GSoC phase of tool development and code review by the Mentor.

The implemented Machine Learning tool, and the branch where the integration part takes place, can be found in the forked AcousticBrainz repository at the link below, which also includes a detailed description of how the tool works:

So far, the tool has been developed successfully, and several models have been trained offline on datasets provided by the Mentor. The training process works correctly, with performance results (e.g. model accuracy) almost identical to those of the old machine learning infrastructure provided by the gaia tool.

In the integration part, a new Docker image, server_dataset_evaluator_sklearn, is created so that the new ML infrastructure can run in Python 3 with the corresponding library dependencies (the previous tool runs in Python 2). Several other compatibility issues (as the commits show) are solved so that there are no conflicts between the simultaneous execution of Python 2 and Python 3, depending on which tool the AcousticBrainz platform uses. Additionally, the user interface and the relevant back-end services are updated: the user can now choose which ML tool (gaia or sklearn) will execute the music classification task and train the model.
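As a rough illustration of that choice, the back end might dispatch an evaluation job to one of the two engines based on the option picked in the UI. This is only a minimal sketch; the names run_gaia_evaluation, run_sklearn_evaluation, and the "tool" field are hypothetical, not the actual AcousticBrainz API.

```python
# Minimal sketch of dispatching a dataset evaluation job to one of the
# two ML engines. All names here are hypothetical, for illustration only.

def run_gaia_evaluation(job):       # hypothetical: Python 2 / gaia path
    raise NotImplementedError

def run_sklearn_evaluation(job):    # hypothetical: Python 3 / sklearn path
    raise NotImplementedError

ENGINES = {
    "gaia": run_gaia_evaluation,
    "sklearn": run_sklearn_evaluation,
}

def evaluate(job):
    # "tool" would come from the option the user selected in the UI.
    engine = ENGINES.get(job.get("tool", "gaia"))
    if engine is None:
        raise ValueError("unknown ML tool: %r" % job["tool"])
    return engine(job)
```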

The next steps to be done are:

  • Continue checking whether any issues arise when the tool is triggered to train models.
  • Train the basic music classification models that will later be used for the high-level data predictions of a new instance stored in the database (a rough sketch of such a training step follows this list).
  • Integrate the tool's prediction functionality.
  • Train some other models.
  • Implement the corresponding functions so that the high-level data from the predicted values are created and shown in the UI.
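For context, the sklearn-based training step mentioned above boils down to a standard supervised classification pipeline. The sketch below uses plain scikit-learn with made-up feature and label arrays; it is not the tool's actual code, only an outline of the technique (grid-searched classification, which the commit history references).

```python
# Outline of a grid-searched classification step with scikit-learn.
# The data here is synthetic; the real tool works on low-level features
# of AcousticBrainz recordings labelled by a ground-truth file.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(200, 20)          # stand-in for low-level features
y = np.random.randint(0, 2, 200)     # stand-in for class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC()),
])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```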

Finally, during the GSoC project period, some other PRs with code changes were created and implemented, corresponding to the following:

  • PEP 8 issues fix 02
  • remove gaia best models params
  • issue fixed - tracks in gt file do not exist in low-level
  • predict with MBID only (not the whole API low-level url)
  • add sklearn requirements.txt
  • create gaia evaluation method
  • change name of the model directory
  • load updated files
  • change the path saves - reduce print/logging messages in evaluation
  • requirements.txt --> lowercase
  • remove requirements.txt
  • add requirements.txt with lowercase
  • use of now.isoformat()
  • reports creation - datetime to start of the report, code improvements
  • review updates 01
  • syntax for documentation strings as in AB fixed
  • ground_truth file typo fixed
  • use of os.makedirs(full_path, exist_ok=True)
  • create results_dict = {....} in a single call
  • simplify processing step dict creation - documentation added
  • add lower-case in all process steps params
  • save best model after training to the whole data - add README
  • split evaluation in separate methods
  • split grid classification into separate methods
  • single logger setup
  • change logging set from int to str
  • dynamic project yaml saving - update readme predict section
  • add new arguments in readme
  • readme - add predict invoking
  • readme MBID in prediction
  • create project - default values declaration - documentation
  • relative imports
  • update predict readme and script arguments
  • import classification_project
  • readme path file required
  • readme, how it works section - remove requirements jupyter notebook
  • requirements.txt - add requests, remove tensorflow
  • readme - add how training and predicting modes work
  • add new dockerfile for ML tool
  • readme - update doc only for parameters
  • add exports_directory in config when specified in arguments
  • sklearn model inserted in AB
pep8speaks commented Aug 20, 2020

Hello @tzamalisp! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 44:5: E303 too many blank lines (2)

Line 51:5: E303 too many blank lines (2)

Line 27:5: E303 too many blank lines (2)

Line 52:48: E128 continuation line under-indented for visual indent
Line 53:48: E128 continuation line under-indented for visual indent
Line 54:48: E128 continuation line under-indented for visual indent
Line 55:48: E128 continuation line under-indented for visual indent

Line 45:1: W391 blank line at end of file

Line 41:5: E303 too many blank lines (2)

Line 33:5: E303 too many blank lines (2)

Line 8:1: E302 expected 2 blank lines, found 1

Comment last updated at 2021-07-08 11:27:54 UTC

tzamalisp force-pushed the ml_sklearn_structure branch from 36eb75e to 7b81889 on August 24, 2020 12:20
amCap1712 added 30 commits on June 30, 2021 14:00. Among the commit messages:
The export_path parameter should contain the complete path where the results should be stored. Everywhere we use exports_dir, we append it to the export_path. If it's a matter of user convenience, we can add it back to the command line, but merge it with export_path asap.
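A minimal sketch of that convention, assuming the hypothetical helper name resolve_exports_path (only export_path and exports_dir come from the commit message):

```python
import os

def resolve_exports_path(export_path, exports_dir=None):
    # export_path is the complete destination for all results; if an
    # exports_dir is still accepted for convenience, it is simply
    # appended rather than treated as an independent location.
    return os.path.join(export_path, exports_dir) if exports_dir else export_path
```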
Making a class just to call a util method does not seem useful. Also, the directories are always created by the ClassificationTaskManager and DatasetExplorer, so we can just refer to their path with os.path.join elsewhere. We already have a similar function in AB utils, but that is used on Python 2.7, which does not have the exist_ok arg available. Hence, keeping this.
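For reference, the exist_ok difference is the standard-library one; a sketch of both idioms (the path and helper name are illustrative, not AB's actual code):

```python
import errno
import os

# Python 3: create the directory tree, tolerating an existing one.
os.makedirs("/tmp/ab_exports/results", exist_ok=True)

# Python 2.7 has no exist_ok, so a helper must swallow EEXIST itself.
def create_path(path):
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
```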
Revert the change from the previous commit for the logging directory, because we set up logging before trying to create all the other output directories.
Instead of setting up the logger again and again, we pass the logger around manually. In subsequent steps, this will be entirely replaced to utilise Python's hierarchical logging. Can't do this just now because the file handler requires the dataset name.
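The hierarchical logging mentioned there is the stdlib pattern where each module asks for a named child logger and only the application attaches handlers; a minimal sketch (the dataset-named log file is an assumption drawn from the commit message):

```python
import logging

# In each module: obtain a child logger by name, attach no handlers here.
logger = logging.getLogger(__name__)

def setup_logging(dataset_name):
    # At application start: configure the root logger once. The file
    # handler needs the dataset name, which is why this step must wait.
    handler = logging.FileHandler("%s.log" % dataset_name)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)
```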
Eliminate unused methods and move useful ones to individual methods.

We never really access these dicts other than by iterating with `.items()`, so the defaultdicts are redundant and only add confusion.
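That last point is easy to see in a tiny sketch with made-up data:

```python
from collections import defaultdict

# If the dict is only ever iterated, defaultdict buys nothing:
results = defaultdict(list)
results["accuracy"].append(0.91)

# ...because iteration never triggers the default factory.
for key, values in results.items():
    print(key, values)

# A plain dict built up-front expresses the same thing more directly.
results = {"accuracy": [0.91]}
```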