DOCUMENTATION

==========================
 mldata developer's guide
==========================
:Info: See <https://github.com/open-machine-learning/mldata> for git repository.
:Version: 0.1.1

Django Modules
==============
List of applications used in the project.

=================  ===========================================================
Module name        Description
=================  ===========================================================
about              Module used for defining static pages. Content of each
                   static page is defined in corresponding template.
blog               Blog application. Displays blog posts and manages them
                   through admin interface.
captcha            Application for captcha in registration. Based on 
                   <http://code.google.com/p/django-recaptcha/>, slightly
                   adjusted.
django_authopenid  Application for logging in by open-id. Currently disabled,
                   however registration uses it's views.
forum              Forum application.
preferences        ??? Simple model to store maximum data sizes ???
registration       Module responsible for managing user's registration and
                   account activation. Fork of django-registration
                   <http://code.google.com/p/django-registration/>
repository         Website's main module. Described below.
tagging            Module responsible for tagging repository entities. Fork of
                   <http://code.google.com/p/django-tagging/>
user               Views for user's account management.  
=================  ===========================================================

Repository module
=================
Major part of website's logic is contained in the repository module. 

Design goals
------------
List of some design decisions made in mldata.org

* Main models, as Data, Task, Method and Challenge, share exactly the same Slug
space. It is impossible to create Task and Result with exactly the same slug.

* Each item in mldata can have various version. Number of version is increased
after each edit. One of the design choices was not to delete anything from
database and using hiding of old items wherever it is possible 

* Since instances of Data, Task, Model and Challenge are similar in many senses
and they share many feature as related publications, ratings, description,
we want to store and present them in similar fashion.

* All datasets are converted into one format if only it is possible, due to
consistency reasons. As the main format hdf5 was chosen because of its
generality. Idea is that after such conversion it is easier to manage
all conversion using the schema:
    input format -> HDF5 -> output format
while storing only one file.

* Only one file is stored as a dataset - all conversions are handled on-line.

* Data files and conversions are provided by other application called
mldata-utils which consists bunch of various converters into one tool.
Conversion is then held in three lines in python shell:
> from ml2h5.converter import Converter
> conv = Converter("inputfile.type", "outputfile.type")
> conv.run()
For details about mldata-utils check it's documentation.

* No edit is available when any dependency to the item appears. For instance
if we create dataset and task relying on it, then dataset cannot be changed.

* Forking mechanism instead of edit is provided.

Structure
---------
Mldata's data entities are separated into 4 classes:
* Data,
* Task,
* Method,
* Challenge.

All four entities inherit the Base model, since they share similar behavior.
Base contains such attributes as date, ratings, number of hits, etc.

Base model is also responsible for handling common operations as checking
user's rights to access an entity, increasing number of hits, managing forks,
etc.

Implementation of Base model is stored in mldata/repository/models/base.py.
Other models implemented there as
* Publication,
* Licence,
* Slug,
* Rating (this is base rating class of which other inherits for providing
  more detailed reviewing)
are also used commonly among whole repository. 

In following subsections main repository models are described in details.

Data
~~~~
Model representing dataset entities. Saving instances is consisted of two
steps:
1) filling standard information form,
2) managing conversion.
More info about views which implement this transaction is provided below.
If transaction fails than dataset is stored anyway, however in this case
it is marked as private and user is asked for repair.

Method responsible for conversion is called approved. It's name is related to
the idea that after filling information form and submitting a file, user is
asked for checking if everything was uploaded correctly and for providing
details about how would he like the file to be converted.

Task
~~~~
Entities represents data split for machine learning task. It is related to
a dataset and defines which instances should be used for training and which
for validation and testing. It also possible to define multiple splits in the
same dataset.

Service allows user to export the task to the hdf5 file using mldata-utils.

Task contains also prediction measure, which is stored as an string.
All performance measures are hardcoded in mleval package of mldata-utils.
mlevel.evaluation provides a list of those functions indexed with strings.

Method
~~~~~~
Method is an entity which allows user to describe methodology of his
approach for tackling a machine learning problem. It is specified in
repository/models/method.py.

Apart of that in the same file Result is defined. This model allows user
to provide output of application of his Method for given machine learning
Task. Result is then evaluated using the Task related evaluation function.
Evaluation process is defined in function predict of Result model.


Challenge
~~~~~~~~~
This model constitutes a set of tasks.


Views implementation
--------------------
Similary to models, repository views also inherit from base views.
Listings and item details are almost the same for all models - major
difference appear only in controllers of dataset edit and save.

Each repository view returns in context following values
(among others):
* request
* klass, which is name of the class
* 'data', 'task', 'method' or 'challenge' variable with current
  'page' object
* tagcloud
* section equal to 'repository'