The idea of dtsql is to take data from a database, train a decision tree on it, format the decision tree in the form a SQL and then run the SQL on a database to predict the column of your choice.
For more information see this two-part blog post,
Dtsql requires Python and scikit-learn.
To run dtsql locally,
python dtsql.py sample/iris.csv
You can also pull the demo docker image to play with it,
docker run tilayealemu/dtsql
docker run -i -t tilayealemu/dtsql /bin/bash
python dtsql.py sample/iris.csv
Use csv files to train a data. Have your target in the last
column. For example in the iris data the column to predict
is Name
. All columns except the last one will be used
to train the decision tree model.
SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor
There is a helper script to fetch training data from your database and put it in a csv. At the moment only mysql is supported. But you can easily tweak the script for other databases.
cd helper
python fetch_mysql.py mymysqlserver mydbname myuname mypassword "select * from table"
Write your query such that
- Columns that are completely unrelated to your target column are removed. When in doubt though better to keep columns than remove them.
- Your target column is selected in the end
Suppose you have these columns and you want to predict average_temperature
.
day_of_week, sunny_or_not, average_temperature, month,order_id
- Leave out day of week and order ID. They are unlikely to be useful
- Put temperature column in the end.
So the query becomes,
python fetch_mysql.py mymysqlserver mydbname myuname mypassword "select sunny_or_not, month, average_temperature from table"
Which will output training.csv
. Then run dtsql on it,
python dtsql.py training.csv