Pietro Michiardi edited this page Sep 16, 2014 · 9 revisions

Welcome to the Treelib wiki!

If you are not familiar with decision trees, use the following resources:

  • An Introduction to Statistical Learning, by G. James, D. Witten, T. Hastie and R. Tibshirani (Springer texts in Statistics)

  • Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman (Springer series in Statistics)

Next is a short overview of how to use the library. For those of you interested in some of the internals, follow this link [todo].

Features:

Treelib helps you build classification or regression trees by executing model training in parallel on the Apache Spark execution engine. Beyond model building, additional utilities include tree pruning and random forests. Treelib implements state-of-the-art algorithms, including binary tree models (using the CART algorithm) and multi-way tree models (using ID3).

How to use:

You can use the library by importing the package that suits your requirements. Every tree builder (RegressionTree, CART classification tree, ID3 classification tree) has the same prototype. In the next sections we show, for each type of tree builder, how to declare a tree; at the end we give an example of how to build a classification tree (the others are similar) with the very popular PlayGolf dataset.

Declare tree builder

Regression Tree

First, import the required package with import treelib.cart._ and then create a tree builder: val tree = new RegressionTree()

Classification Tree with CART

After importing the library with import treelib.cart._, you can create a new binary classification tree with: val tree = new ClassificationTree()

Multi-way Tree with ID3

After importing the library with import treelib.id3._, you can create a new multi-way classification tree with: val tree = new ID3TreeBuilder()

Using tree builder

The first step in building a tree model is to set the training data with tree.setDataset(data).

Note: our library only supports data WITHOUT a header (because all lines in the data play the same role).

You can set a header for the data programmatically with tree.setFeatureNames(<Array_string_name_of_features>); otherwise the feature names default to the format "Column1", "Column2", ...

To start building the tree, use the function buildTree(yFeature, SetOfXFeatures), where yFeature is the name of the target feature and SetOfXFeatures is the set of predictors. Both parameters are optional: if you simply call tree.buildTree(), the last feature is taken as the target and the rest as predictors.

For example, assume we have training data with the schema "Temperature", "Humidity", "Outlook", "Money", "Month", "DayOfWeek", "PlayGolf", in that order, where:

  • "Temperature" is a float number indicating degrees Celsius;

  • "Humidity" takes values in {High, Normal};

  • "Outlook" takes values in {Overcast, Rainy, Sunny};

  • "Money" is a float number indicating how much money the person has, in euros;

  • "Month" takes values in the range [1, 12];

  • "DayOfWeek" takes values in {0, 1, 2, 3, 4, 5, 6}, meaning {Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday} respectively;

  • "PlayGolf" takes the two values "Yes" and "No".
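For illustration, a headerless CSV file with this schema might look like the following (these rows are invented for this example, not taken from an actual dataset):

```
25.5,High,Sunny,120.0,7,2,Yes
18.0,Normal,Rainy,35.5,11,6,No
30.2,Normal,Overcast,80.0,8,0,Yes
```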

// read training data
val playgolf_data = context.textFile("data/playgolf.csv", 1)
// create a classification tree
val tree = new ClassificationTree()
// set the training data
tree.setDataset(playgolf_data)
// set the header for the training data
tree.setFeatureNames(Array[String]("Temperature", "Humidity", "Outlook", "Money", "Month", "DayOfWeek", "PlayGolf"))
// build the tree with the last feature (PlayGolf) as target and the rest as predictors
tree.buildTree()

Or, if you want to build a tree model with target feature "PlayGolf" and predictors "Temperature", "Humidity", "Outlook", "Money", "DayOfWeek" (we don't use the attribute Month), we can call:

tree.buildTree("PlayGolf", Set[Any]("Temperature", "Humidity", "Outlook", "Money", "DayOfWeek"))

If you don't specify the types of the predictors, the algorithm detects them automatically from their values. For instance, with the above statement, "Temperature", "Money", and "DayOfWeek" are treated as numerical features and the rest as categorical features. If you want to make a feature's type explicit, you can use an "as" object, similar to R. For example:

tree.buildTree("PlayGolf", Set[Any]("Temperature", "Humidity", "Outlook", as.Number("Money"), as.String("DayOfWeek")))

You can also set some parameters before building the tree:

tree.setDelimiter("\t")    // set the field delimiter to tab; the default is ","
tree.setMinSplit(1)        // only grow a node if it has more than 1 record
tree.setThreshold(0.3)     // only grow a node if the coefficient of variation of Y > 0.3; CV = Deviation(Y)/E(Y)
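To make the threshold rule concrete, here is a minimal plain-Scala sketch (no Spark, and not the library's actual code; the names are illustrative) of the stopping criterion it controls: a node keeps splitting only while the coefficient of variation of its target values exceeds the threshold.

```scala
object CvSketch {
  // CV = standard deviation of Y divided by mean of Y
  def coefficientOfVariation(ys: Seq[Double]): Double = {
    val mean = ys.sum / ys.size
    val variance = ys.map(y => math.pow(y - mean, 2)).sum / ys.size
    math.sqrt(variance) / mean
  }

  // the node is grown further only while CV(Y) exceeds the threshold
  def shouldGrow(ys: Seq[Double], threshold: Double): Boolean =
    coefficientOfVariation(ys) > threshold

  def main(args: Array[String]): Unit = {
    val spread  = Seq(1.0, 5.0, 9.0)   // high relative spread: keep splitting
    val uniform = Seq(5.0, 5.0, 5.1)   // nearly constant target: stop
    println(shouldGrow(spread, 0.3))
    println(shouldGrow(uniform, 0.3))
  }
}
```

Intuitively, a node whose target values are nearly constant gains little from further splits, which is what a CV threshold captures in a scale-free way.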

Prediction

After building the tree, we can use it to predict a new instance with the function:

`predictOneInstance(record: Array[String], ignoreBranchIDs: Set[BigInt] = Set[BigInt]())`
 where record is an array of predictor values, and ignoreBranchIDs is a set of node IDs that we want to force to predict as leaf nodes instead of using their children (this parameter is optional).

tree.predictOneInstance(Array[String]("cool","sunny","normal","false","30"))

Evaluation

Additionally, you can evaluate the tree on another dataset after making predictions:

val predictRDDOfTheFullTree = tree.predict(testingData)
val actualValueRDD = testingData.map(line => line.split(',').last) // the last feature is the target
Evaluation.evaluate(predictRDDOfTheFullTree, actualValueRDD)

The metrics we use for evaluation are the mean, deviation, and squared error of the difference between the predicted and actual values.
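These three quantities can be sketched in plain Scala on local collections (Evaluation.evaluate computes the equivalent on RDDs; this sketch only illustrates the formulas, and it assumes "SE" means the mean squared error of the differences):

```scala
object MetricsSketch {
  // returns (mean, standard deviation, mean squared error) of predicted - actual
  def metrics(predicted: Seq[Double], actual: Seq[Double]): (Double, Double, Double) = {
    val diffs = predicted.zip(actual).map { case (p, a) => p - a }
    val n = diffs.size
    val mean = diffs.sum / n
    val variance = diffs.map(d => math.pow(d - mean, 2)).sum / n
    val mse = diffs.map(d => d * d).sum / n
    (mean, math.sqrt(variance), mse)
  }

  def main(args: Array[String]): Unit = {
    val (m, sd, mse) = metrics(Seq(1.0, 2.0, 4.0), Seq(1.5, 2.0, 3.5))
    println(f"Mean of difference: $m%.6f")
    println(f"Deviation of difference: $sd%.6f")
    println(f"SE of difference: $mse%.6f")
  }
}
```

A mean near zero with a small deviation indicates predictions that are both unbiased and tight around the actual values.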

Example of evaluation result of a regression tree:

Mean of different:-0.000000
Deviation of different:2.596211
SE of different:0.036566

Writing Tree model to file and reload

We can save and reload our tree model by:

// tree is created by 'val tree = new RegressionTree()'

// write model to file
tree.writeModelToFile("tree2.model")

// create a new regression tree
val tree3 = new RegressionTree()
    
// load tree model from a file
tree3.loadModelFromFile("tree2.model")

Note

For a full demonstration, please see the file src/main/scala/test/Test.scala

Regression Tree is implemented in src/main/scala/treelib/cart/RegressionTree.scala

Testing data is stored in data/

View output in console

Let's build a tree with the 'bodyfat' dataset.

With our regression tree, we can use the following code:

val context = new SparkContext("local", "SparkContext")

val trainingData = context.textFile("data/bodyfat.csv", 1)
             
val tree = new RegressionTree()
tree.setDataset(trainingData)
tree.setFeatureNames(Array[String]("","age","DEXfat","waistcirc","hipcirc","elbowbreadth","kneebreadth","anthro3a","anthro3b","anthro3c","anthro4"))

tree.setMinSplit(10)

var stime = System.nanoTime()

println(tree.buildTree("DEXfat", Set("age", "waistcirc","hipcirc","elbowbreadth","kneebreadth")))

println("Build tree in %f second(s)".format((System.nanoTime() - stime)/1e9))

And we get the result:

waistcirc( < 88.400000)
|-(yes)-hipcirc( < 96.250000)
|    |-(yes)--age( < 59.500000)
|    |    |-(yes)---waistcirc( < 67.300000)
|    |    |    |-(yes)----11.21
|    |    |    |-(no)----17.015555555555558
|    |    |-(no)---22.328333333333333
|    |-(no)--waistcirc( < 80.750000)
|    |    |-(yes)---kneebreadth( < 8.550000)
|    |    |    |-(yes)----20.30333333333333
|    |    |    |-(no)----25.279000000000003
|    |    |-(no)---29.372000000000003
|-(no)-kneebreadth( < 11.300000)
|    |-(yes)--hipcirc( < 109.900000)
|    |    |-(yes)---35.27846153846154
|    |    |-(no)---42.95437500000001
|    |-(no)--61.370000000000005
Build tree in 2.120787 second(s)
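The printed tree can be read by hand: each internal line is a numeric split, and the "yes" branch is taken when the condition holds. The following plain-Scala sketch hard-codes the right-hand subtree above to show how a single prediction is traced (it is not library code; the left subtree is omitted for brevity):

```scala
object TraceSketch {
  // trace a record down the right-hand subtree of the printed model
  def predict(waistcirc: Double, hipcirc: Double, kneebreadth: Double): Double =
    if (waistcirc < 88.4)
      ??? // left subtree omitted in this sketch
    else if (kneebreadth < 11.3)
      if (hipcirc < 109.9) 35.27846153846154 else 42.95437500000001
    else 61.370000000000005

  def main(args: Array[String]): Unit =
    // waistcirc >= 88.4, kneebreadth < 11.3, hipcirc < 109.9 leads to the 35.278... leaf
    println(predict(waistcirc = 100.0, hipcirc = 105.0, kneebreadth = 10.5))
}
```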

Let's check against R with the following code:

install.packages("mboost")
library("mboost")
library("rpart")
data("bodyfat", package = "mboost")
#help("bodyfat", package = "mboost")
bodyfat_rpart <- rpart(DEXfat ~ age + waistcirc + hipcirc + 
      elbowbreadth + kneebreadth, data=bodyfat,
      control=rpart.control(minsplit=10))

install.packages("partykit")
library("partykit")
plot(as.party(bodyfat_rpart), tp_args = list(id=FALSE))

And the result is:

[Figure: Result with R — the bodyfat_rpart tree plotted via partykit]
