-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathDOCUMENTATION
160 lines (130 loc) · 6.25 KB
/
DOCUMENTATION
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
==========================
mldata developer's guide
==========================
:Info: See <https://github.com/open-machine-learning/mldata> for git repository.
:Version: 0.1.1
Django Modules
==============
List of applications used in the project.
================= ===========================================================
Module name Description
================= ===========================================================
about Module used for defining static pages. Content of each
static page is defined in corresponding template.
blog Blog application. Displays blog posts and manages them
through admin interface.
captcha Application for captcha in registration. Based on
<http://code.google.com/p/django-recaptcha/>, slightly
adjusted.
django_authopenid Application for logging in by open-id. Currently disabled,
however registration uses it's views.
forum Forum application.
preferences ??? Simple model to store maximum data sizes ???
registration Module responsible for managing user's registration and
account activation. Fork of django-registration
<http://code.google.com/p/django-registration/>
repository Website's main module. Described below.
tagging Module responsible for tagging repository entities. Fork of
<http://code.google.com/p/django-tagging/>
user Views for user's account management.
================= ===========================================================
Repository module
=================
Major part of website's logic is contained in the repository module.
Design goals
------------
List of some design decisions made in mldata.org
* Main models, as Data, Task, Method and Challenge, share exactly the same Slug
space. It is impossible to create Task and Result with exactly the same slug.
* Each item in mldata can have various version. Number of version is increased
after each edit. One of the design choices was not to delete anything from
database and using hiding of old items wherever it is possible
* Since instances of Data, Task, Model and Challenge are similar in many senses
and they share many feature as related publications, ratings, description,
we want to store and present them in similar fashion.
* All datasets are converted into one format if only it is possible, due to
consistency reasons. As the main format hdf5 was chosen because of its
generality. Idea is that after such conversion it is easier to manage
all conversion using the schema:
input format -> HDF5 -> output format
while storing only one file.
* Only one file is stored as a dataset - all conversions are handled on-line.
* Data files and conversions are provided by other application called
mldata-utils which consists bunch of various converters into one tool.
Conversion is then held in three lines in python shell:
> from ml2h5.converter import Converter
> conv = Converter("inputfile.type", "outputfile.type")
> conv.run()
For details about mldata-utils check it's documentation.
* No edit is available when any dependency to the item appears. For instance
if we create dataset and task relying on it, then dataset cannot be changed.
* Forking mechanism instead of edit is provided.
Structure
---------
Mldata's data entities are separated into 4 classes:
* Data,
* Task,
* Method,
* Challenge.
All four entities inherit the Base model, since they share similar behavior.
Base contains such attributes as date, ratings, number of hits, etc.
Base model is also responsible for handling common operations as checking
user's rights to access an entity, increasing number of hits, managing forks,
etc.
Implementation of Base model is stored in mldata/repository/models/base.py.
Other models implemented there as
* Publication,
* Licence,
* Slug,
* Rating (this is base rating class of which other inherits for providing
more detailed reviewing)
are also used commonly among whole repository.
In following subsections main repository models are described in details.
Data
~~~~
Model representing dataset entities. Saving instances is consisted of two
steps:
1) filling standard information form,
2) managing conversion.
More info about views which implement this transaction is provided below.
If transaction fails than dataset is stored anyway, however in this case
it is marked as private and user is asked for repair.
Method responsible for conversion is called approved. It's name is related to
the idea that after filling information form and submitting a file, user is
asked for checking if everything was uploaded correctly and for providing
details about how would he like the file to be converted.
Task
~~~~
Entities represents data split for machine learning task. It is related to
a dataset and defines which instances should be used for training and which
for validation and testing. It also possible to define multiple splits in the
same dataset.
Service allows user to export the task to the hdf5 file using mldata-utils.
Task contains also prediction measure, which is stored as an string.
All performance measures are hardcoded in mleval package of mldata-utils.
mlevel.evaluation provides a list of those functions indexed with strings.
Method
~~~~~~
Method is an entity which allows user to describe methodology of his
approach for tackling a machine learning problem. It is specified in
repository/models/method.py.
Apart of that in the same file Result is defined. This model allows user
to provide output of application of his Method for given machine learning
Task. Result is then evaluated using the Task related evaluation function.
Evaluation process is defined in function predict of Result model.
Challenge
~~~~~~~~~
This model constitutes a set of tasks.
Views implementation
--------------------
Similary to models, repository views also inherit from base views.
Listings and item details are almost the same for all models - major
difference appear only in controllers of dataset edit and save.
Each repository view returns in context following values
(among others):
* request
* klass, which is name of the class
* 'data', 'task', 'method' or 'challenge' variable with current
'page' object
* tagcloud
* section equal to 'repository'