More technical information can be found at the web site : www.shih.be
1 Shih Data Miner Application Window
2.2 Format of the testing and training file
2.3 Probabilities and node classification
2.5 Pruning, avoiding overfitting
2.7 Misclassification costs and altered priors
2.8.1.1 Minimum Parent Node Size
2.8.1.2 Sample test from training
2.8.3 Class parameters tab ( Misclassification matrix and Priors )
2.8.3.1 Misclassification Matrix
3.1.3 Classification information tab
This manual tries to be synchronized with the main application which is
continuously evolving. You might find features in the program that are
still not documented. Specially the latest developments. If some part is
missing please do not doubt to reach us at info@shih.be.
The program is written in java and has been tested in the following platforms :
- Windows 98/XP/2000/NT
- Linux Suse, RedHat
- FreeBSD stable 4.7, 4.8, 4.9
- MacOSX
- SunOS
The decision tree algorithms are inspired from CART as described in the book by Breiman et. al. The
association rules generation is based on the Apriori algorithm as described by
Agrawal et. al.
Launching the application
The program runs on any system with Java 1.4 or higher. To run it in windows double click from the explorer
the file : Shih.jar
or execute the bat file : runShihWindows.bat
To run it on Linux or Unix or Sun (on the Xwindows platform
of the operative system) execute the file : runShihLinuxUnix.sh
Both files contain the basic comand to launch the
application : java -jar -Xmx512M Shih.jar
The benefit of executing runShihWindows.bat or runShihLinuxUnix.sh
is that the java virtual machine allocates 512 Megas of memory to
run the application allowing to work with bigger files. This is
bigger than the 64 Megas allocated by default when the -Xmx command
is not explicitely given.
I guess from what the users say that in Mac OS X it's easy to run it,
I have not done it personally. Information about
this subject is appreciated, if you have any, please do not doubt to reach us.
Once executed the main window of the application appears, it is a container for all the other windows of the
application. From here you can see the XOutput console which shows the output
of the program.
You access to the main functionality of the program through the menus :
- File : Lets you access decision trees and association rules functionality and lets you connect to databases
- Preferences : Let's you change the Look and Feel, select the language, configure the generated output and the
way windows should be moved
- Windows : Keeps a list of the opened windows inside the main window
- Help : Lets you access the about box and the online manual of the application if you have an internet connection
enabled
- Tree -> Set parameters and build : This menu lets you initiate and build a new tree setting all its parameters,
or reading them from a previously saved XML tree model. After selecting this menu item,
a new command window will appear where the user can set the model parameters of the new model to build.
- Tree -> Build from XML model : Lets you build a tree from a saved XML model. This has the advantage that the
tree algorithm will not have to be calculated again. The tree will be built by reading at the XML file and the
data defined in it. This is much faster than calculating a tree from the parameters by applying the algorithm,
specially for large data files. Ideally once you find a good model after having fine tuned the parameters and
worked out several alternatives the model should be saved in order to be able to rebuild it fast using this menu.
- Tree -> Predict with XML model : This allows to load a tree PMML model file and new data to score it using a
saved model. This is used to apply model to new data via the GUI. An XML tree predictor window will appear.
Once there select the data file with the pmml model, for example IncomeTrain.txt.xml and the file with the
new data to score, for example IncomeTest.txt. Then press Predict and the result will be saved in a file
called IncomeTest.txt.sco (sco=score) in the same directory where IncomeTest.txt is located.
The results from the prediction will give the probability of each class and also the probability of the
profile each case falls into.
Ex: Case 1 wins less than 50 thousand with probability
68.3% and falls in the profile 26 which is defined by the
rule 26 (or the node 26) which has probability 2.8%.
Score P( <=50K.) P( >50K.) Profile/Node P(Node)
1 <=50K. 0.683 0.317 26 0.028097667
2 >50K. 0.1608 0.8392 22 0.369663
3 <=50K. 0.9377 0.0623 23 0.29405504
4 <=50K. 0.8182 0.1818 1 0.029206714
5 >50K. 0.1608 0.8392 22 0.369663
The same can also be done via the command
line in order to automate the process of using models to score data. Once the PMML model is created you can
score or make predictions with new cases from the command prompt in windows or from a terminal in Linux/Unix by
executing the following command :
java -cp Shih.jar i "xml model file" "file with new cases"
This will send its output to the console from where it can be redirected
with a pipe. You can find a working example editing the file prediction.bat for Windows or
prediction.sh for Unix.
- Generate associations : This brings us to the association rules window where you can set all the parameters to generate associations.
- JdbcGui data connector : Enables to connect to several
popular data database sources such as Oracle, MySQL, Postgre SQL, and any that
supports a Java Data Base Connectivity (JDBC) driver. This can be used to
process and get information via SQL and save it in format that can be used by
Shih Data Miner, namely tab separated columns with the variable names in the
first row.
- Exit: the user can end his/her session and exit the program by selecting the Exit-menu item.
By selecting an item of the Look & Feel (L&F) menu, the user can change the look &
feel of the whole program, corresponding to the OS on which the user is
working, or corresponding to the personal preferences of the user. The user can
select the Java, the Sun OS, the linux GTK, and the Windows L&F. Windows XP/2000 users will see the look and feel
corresponding to them. If you use Macintosh
systems you can select the Macintosh L&F. Apart from these operative
systems looks you can set two additional open source look and feels namely the
Kunststoff and the Metouia L&F.
The user can control output preferences deciding if to close the XOutput console, though it
is recommended to be left open, as it is by default, since it gives important
information on the output of the program. System output can be set to "enable system
output", this means that the system console can be set to display all the output
results that are being also shown in the Xouput console. By selecting non verbose
the output will be smaller showing just the necessary information.
The language menu item pops up a window that allows to set up the language of the
application.
Currently the application has been translated to the following
languages:
English
German
Dutch
French
Spanish
Italian
This is an on going effort that tries to be synchronized with the development so sometimes you will
see sentences that are not yet translated from english.
Once you select the language you want to work with (by default it starts with English) you must
restart the application in order to see the changes. From then on the language will
always be the one you selected even if you restart the application. Of course you can always
change it again. If you are interested in translating to another language you can reach us at
info@shih.be.
The drag window content controls the way windows are dragged inside the main application, if
this option is selected the full window will be dragged, otherwise only the
borders (making it usually faster on slow systems).
The windows menu shows the currently open windows in the main application and allows to select any window currently open. Sometimes this is handy when there are many windows opened covering one window. In this case you can select the covered window you want to see putting it above the rest.
This chapter explains the functionality to build trees. To access it the first step is to call the Tree Builder Parameters window to set up a model to build a tree. This is done by selecting the Tree -> Set parameters and build menu item on the File menu of the main window. Once the Tree Builder Parameters Window appears, showing the “Model” tab, the user is able to specify all the model parameters of the data model to be constructed. The first thing he should do is to load the train data from which the model will be constructed. Or optionally load a previously saved model. After loading a file the tabs “Variables” and “Class parameters” become active.

Shih Data Miner Tree Builder Parameters Window displays all the information and all the current settings from which the model will be build after pushing the Create Model button. The information and the settings can be changed by addressing the menus and their items or by setting information through the radio buttons and text boxes.
A popup menu shows up by right clicking on the window. By selecting the File menu, the user
can change the input files which will be used during the construction of the data model. This can also be achieved
by clicking on the open file icons. The user can change the input files that will be used as training data and
test data or the model file from which the data and the parameters will be read :
- Train: The user can change the input file on which
the model will be trained and constructed. Clicking on this menu item calls a train file
input box where the appropriate text file can be selected.
- Test: By selecting this menu item, the user can change the input file which is going to be used to test the
generated model. The appropriate text file can be selected in the test file
input box.
- Model: By selecting this menu the user can specify a previously saved model file that already contains all the parameters
necessary to build a tree. This is handy when the default parameters were modified to avoid doing all that work
once again.
- Create model : Once the data is set the Create model menu item becomes active, selecting it will build
the model. This has the same effect as pushing the Create Model button.
- Exit : Closes the Tree Builder parameters window.
There are two ways of using a test dataset during the construction of the data model, either
by selecting an independent test file or by sampling a proportion of the
training data. The sample proportion defines how much of the train file will be used as
test data. The sample seed is used to recreate a sample between two different sessions of the program,
if the same seed is used the same sample will be created. If a different seed is used the sample will change.
Shih Data Miner uses tab delimited txt files. It can generate this txt files by connecting to a
database and saving the results of an SQL query or by saving an Excel
spreadsheet or Access table in the CSV format using tab delimiters. Both the
training and the testing file must have the same structure:
- The first line in the txt-input file contains the names of the variables.
- The columns in the file represent the variables, categorical and numerical can be mixed.
By default the last variable or column will be used as the dependent variable unless another
one is specified by the user through the “Variables” tab. Any variable with
only numbers will be considered as a numerical variable unless it is specified
as categorical through the “Variables” tab. Shih Data Miner accepts
input files with missing values, the missing is identified by no character at
all or a blank space and not any other special symbol. If the
input files should contain irrelevant variables such as an ID columns, these
can be unselected through the “Variables” tab by unchecking its "Use" checkbox.

There is not simple and safe replacement solution for the cases
where some of the attributes have significant number of missing values.
Generally, it is good to experiment with and without these attributes in the
modeling phase, in order to find out the importance of the missing values.
Simple solutions replacement are:
a) replacing all missing values with a single global constant,
b) replace a missing value with its feature mean,
c) replace a missing value with its feature and class mean.
The main flaw of simple solutions is that substituted value is not
the correct value. This means that the data will be biased. If the missing
values can be isolated to only a few features, then we can try a solution by
deleting examples containing missing values, or delete attributes containing
most of the missing values.
Another solution, the more sophisticated one adopted by Shih Data
Miner is to try to predict missing values with a statistical imputation algorithm, what we roughly
do everytime we encounter a missing is to fill it by sampling a random value from the distribution
of the non missing values. In this
case predicting missing values is a special data mining prediction problem.
When we are standing at a given node and we find an example that belongs to a
certain class and which has a missing value for the variable to be splited we
dynamically assign a value to that example. It is done dynamically by sampling
a value that follows the distribution of the known variable’s values at the
given node and for the given class of the example with the missing value.
In the definition of impurity measures, that will be explained later, conditional
probabilities are used in each node of the tree. To build a tree
frequency estimates are used for the corresponding probabilities.
Definitions:
N : the total number of cases
Nj : number of cases in class j
N(t) : Number of cases in node t
Nj(t) : Number of cases in node t and class j
p(t) : probability of occurrence of node t
p(j) : prior probability of class j
To see how to assign a class to a given node there are 2 cases: With misclassification
costs and without.
If there are misclassification costs we will chose as class the one that minimizes
the expected cost i.e.:
Σ C(i|j) p(j|t) adding over all the classes j
If there are no misclassification costs at a given node t the class that is
assigned is simply the most probable class, i.e. the class i where
p(i|t) >= p(j|t) for all j
or
Ni(t) p(i) N / ( N(t) Ni )
>= Nj(t) p(j) N / ( N (t) Nj ) for all j
or
Ni(t) p(i) / Ni >= Nj(t) p(j) / Nj for all j
Where p(j|t) = p(j,t) / p(t) = p(t|j) p(j) / p(t) = Nj(t) p(j) N / (N(t) Nj)
Priors are sometimes calculated from the data by the proportions Nj/N,
p(j) = Nj / N
But also they can be set to be equal for each class in order to have the same classification
accuracy for each class. This is the default adopted by our application.
A predicted class for a given node t, depends therefore on 4 factors:
- Prior class probabilities
- Cost matrix
- Number of cases per class in the training data
- Number of cases per class and per node in the training data
Shih Data Miner has the the possibility to set the priors and the cost matrix to make
an optimal classification for your problem based on statistical theory.
The purpose of splitting a node is to partition a node into sub nodes (children)
who are more preferred than the father node in some sense. Desirable splits are
the ones for which the distributions of the outcome in the child nodes are more
homogeneous (purer) than in the root node. This is depicted in the figure
below. Put in other words: the measure to evaluate a potential splitter is
diversity. It can be formalized by an impurity function.
The impurity in a node t is formally denoted by i(t).

Starting at the root node, each node is split with the help of a splitting criterion.
For a given criterion, one could try to find directly the smallest decision
tree consistent with a training set. This problem is however enormous and
involves a search that cannot be feasible nowadays. Consequently, most tree
induction methods use greedy algorithms.. A heuristic that evaluates impurity
reductions is used to select the next most appropriate split. This implies that
a test, which is used at a node to partition the training cases, must be
evaluated without exploring subsequent divisions lower in the tree.
Consequently, the only information available for guidance is the distribution
of classes in the training set, in the parent node, and in the child nodes
(along with the respective number of cases). The goodness of a split s in a node t is defined
to be a weighted decrease in impurity, i.e.,
D i(s,t) = i(t) - Σ pdescendants i(descendant|s) the sum over all the descendants
(child nodes, in our case of binary trees there are only two)
Where pdescendant|s is the proportion of cases that go into
the corresponding descendant node for a given split s (probability of the descendant). Impurity of a tree
T can be defined as the sum of the weighted impurities of the terminal nodes, where the weighted node
impurity in a node t, with probability of occurrence of p(t) is given by :
I(t) = p(t)i(t), where p(t) is the probability of occurrence of the node t.
In the case of a binary tree, the set of questions is binary and some things simplify.
The words ‘left’ and ‘right’ descendants make sense now. By convention, cases
answering “yes” go to the left descendant node, those answering “no” go to the
right descending node.
At each node t, all candidate splits s
are explored. The split s* that
gives the largest decrease in impurity is then applied.
The greedy algorithm looks only one step ahead, but it has a major drawback, which
can be seen by a checkerboard configuration. This configuration is an example
for which no impurity function can be found that results in a good splitting.
Therefore, tree classifiers are easily stuck in local minima. Another
disadvantage of any splitting rule is that it is subjective to the sampling
variation. The influence of the sampling variation is more pronounced when
there are fewer cases in a node. Thus, especially at the bottom of a tree
classifier, splits are less reliable than at the top. This, of course, is
linked with overfitting. Pruning back an overfitted tree solves this problem to
a great extent. To compute the decrease in impurity, a specific impurity
function has to be agreed on.
The user can change the algorithm, and consequently the splitting measure, which will be
used to build the data model. Therefore, he/she has to choose one of the
available items in the Algorithm menu or by checking the appropriate radio box
button.
At this moment, four classification algorithms are implemented. Classification means that the
dependent variable is not a number that has an associated. Three of this
algorithms, namely Gini, Symmetric Gini and Entropy are based on a measure
which is characterizes the purity of an arbitrary collection of examples in a
node. Gini and Entropy and Twoing deal with misclassification costs through
altered priors. Symmetric Gini deals with misclassification costs by
considering them in the formula instead of altering the priors. Twoing is not
based on a purity measure. It is used when there are more than 2 classes (if
there are 2 classes the towing criterion is the same as the Gini criterion). It
groups classes into 2 super classes and applies the Gini criterion to them.
Several experiments have shown that within a wide range of splitting criteria the
properties of the final selected tree are insensitive to the choice of the
splitting rule. The criterion used to prune is much more important.
For regression trees, i.e. when the dependent variable is a number. The algorithm is variance
reduction. In other words the split selected is the one that maximizes the
decrease of variance, of the dependent variable, between the parent node and
the child nodes.
The splits that are searched in order to define the child nodes can be derived from a numerical
or a categorical variable. In case of a categorical variables splits possible
splits are formed by selecting subsets of values to assign to the left son and
the complement for the right son. In case of a numerical variable all the
values are observed to create the possible splits and each split is defined by
the middle point between two subsequent values.
This statistical, distance-based measure defines the impurity of a node t. The distance here refers to the distance between class probability distributions. The feature evaluation criteria in this class measures separability, divergence or discrimination between classes.
GINI(t) = 1 - S[p(j|t)]2 with p(j|t) the relative frequency of class j at node t
Or the same can be written as:
GINI(t) = S p(i|t) p(j|t) adding over i,j with i different than j
GINI(t) reaches its maximum (1 - 1/nc), with nc the number of classes, when records are equally distributed among all classes, implying least interesting information. The measure of impurity of a node is minimal at 0 when all records belong to one class, implying most interesting information.
When GINI is used as a splitting criterion to develop a tree, one has to minimize the Gini Index of each split. When a node p is split into k partitions (children), the quality of split is computed as
GINIsplit = GINI(i) S(ni/n) where,
ni = number of records at child i, and n = number of records at node p
When there are misclassification costs the priors ( which are implicitly considered in p(i|t) ) are altered and the same formula is applied.
When there are misclassification costs an altered Gini criterion in used with the following formula (instead of altering the priors):
GINI(t) = SC(i|j) p(i|t) p(j|t) adding over i,j with i different than j
Where C(i|j) is the cost of misclassifying as i a class j case.
This expression is used as the Gini measure of node impurity for variable misclassification costs. In the 2-class problem the formula reduces to (C(2|1)+ C(1|2))p(1|t) p(2|t) giving exactly the same splitting criterion as in the unit cost case. This points up a difficulty, in the way in which the Gini index deals with variable costs. The coefficient of p(i|t) p(j|t) is C(i|j)+C(j|i). The index therefore depends only on the symmetrized cost matrix and does not appropriately adjust to nonsymmetric costs.
This measure is commonly used in information theory and characterizes the (im)purity of an arbitrary collection of examples. It is often associated with the average amount of information needed to identify the class of a case in a given set cases (in the node t). Given a collection S, containing positive and negative examples of some target/classification-variable, the entropy of S relative to this boolean classification is
Entropy(S) º - p+ log2(p+) - p- log2(p-)
Or, for the general case
Entropy(S) º - å pi log2(pi)
Notice that the entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p+ = 1), then p is 0, and Entropy(S) = - 1×log2(1) - 0×log2(0) = 0. Note the entropy reaches its maximum value 1 when the collection contains an equal number of positive and negative examples. If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1. The figure below shows the form of the entropy function relative to a boolean classification, as p+ varies between 0 and 1.

Given entropy as a measure of the impurity in a collection of training samples, one could now define a measure of effectiveness of an attribute in classifying the training data. The measure which is used in Shih Data Miner, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. Thus, the tree is constructed by locally maximizing the information gain, or the reduction in entropy due to splitting each individual node. More precisely, the information gain, Gain(S,A) of an attribute A, relative to a collection of examples S, is defined as:
Gain(S,A) º Entropy(S) - |Sv|/|S| å Entropy(Sv)
Note that the first term in the equation is just the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A. The expected entropy described by this second term is simply the sum of the entropies of each subset Sv, weighted by the fraction of examples |Sv|/|S| that belong to Sv. Gain(S,A) is therefore the expected reduction in entropy caused by knowing the value of attribute A. Put another way, Gain(S,A) is the information provided about the target function value, given the value of some other attribute A. The value of Gain(S,A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
The idea is at every node to select that conglomeration of classes into two superclasses so that considered as a two class problem, the greatest decrease in node impurity is realized. This approach to the problem has one significant advantage: It gives “strategic” splits and informs the user of class similarities. At each node, it sorts the classes into those two groups which in some sense are most dissimilar and outputs to the user the optimal grouping of classes as well as the best split. Near the top of the tree, this criterion attempts to group together large number of classes that are similar in some characteristic. Near the bottom of the tree it attempts to isolate single classes.
At a node t, with s splitting t into tl and tr, choose the split that maximizes:
( pl pr / 4 ) ( å | p( j| tl ) - p( j| tr )| )2
This will give the best towing split and the superclasses are C1 and C2 where C1 = { j | p( j| t*l ) >= p( j| t*r )} where t*l and t*r are the nodes obtained at the maximum.
If the dependent variable is continuous, regression trees can be used. They can be though of as a histogram estimate of a regression surface. Regression trees are simpler because the same impurity criterion is used to grow and prune the tree, and each case has the same weight in ‘misclassification’. Since there are no classes there are no priors or misclassification costs. In regression trees we find the split that maximizes the decrease of variance between the father node and the child nodes with respect to the dependent variable.
An example is given in the book of Breiman et al. [1984]. It is based on a paper from 1978 from Harrison and Rubinfeld. Boston housing values data were gathered to see if there was any effect of air pollution concentration on housing values. Fourteen variables were measured. The response was the median value of homes (in thousands of dollars). Hence, the response is clearly continuous. A regression tree was grown, part of it is shown in the figure below.
A number within each node is the average of the response. The numbers in the middle of the arcs show the cases that go left or right, and the numbers below the leaves are their standard deviations of the corresponding response. The test at each internal node, which is denoted by a circle, is written just below it. The example showed that only 4 of the 13 predictors were retained for splits.

While a misclassification rate has an intuitive interpretation, a mean squared error has not. To remove scale dependency, an estimated relative mean squared error is defined by dividing the mean squared error by the sample variance.
Hence, the best split is the one that minimises the weighted variance over the child nodes. More details can be found in [Breiman et al. 1984].
Backward pruning is based minimal error cost-complexity, which is very similar to the tree classifier case, but with the misclassification rate simply replaced by the error measure. It is typical that the valley containing the minimum value of the misclassification error is flatter and wider than in classification trees.
Regression trees are reported to be competitive with linear regression. They can be more accurate on non-linear problems, but less on linear problems, which is consistent with the general principle of using parametric models when the parametric assumptions hold.
There are three ways of estimating through tests how the tree will generalize on unseen data :
1) Selecting an independent test file that will be used to estimate the classification error of the tree built with the training data.
2) Sampling a proportion of the training data to be used as test data used to estimate classification error.
3) Cross-validation
A training file is always chosen to build the classification or regression tree. A test file is used as independent dataset to prevent overfitting of the tree (via the pruning of the tree). An overfitted classification model will be able to classify the training data very well, it’s resubstitution cost will be low. The resubstitution cost is the error of the tree estimated with the same data it used to be built. Therefore it is a bad estimator of the real error since its overoptimistic. An overfitted tree is not able to create general rules that will be valid with new “unseen” data since it focuses too much on the cases it learned with in the training data (the figure below shows this principle). Too much splitting would yield a generally too large tree to be useful and not only with regard to the comprehensibility, but also with regard to prediction errors. A pruning strategy has to be designed to prevent overfitting because a tree classifier is important, not because it summarises what is known (the training set), but because it hopefully predicts new cases correctly.
There are essentially two pruning techniques available:
1. forward pruning
2. backward pruning
The implementation relies on backward pruning as proposed in CART. In this case one may first split further and further until one arrives at single-class subsets, even if all or most of them contain a single training case. This is prohibited for it would overfit the training set. Hence, a significant number of cases at each leaf must be retained. This is an option in the implementation where one can set the Minimum Parent Node Size. It represents the minimum size below which a node will not be split. This value is set to 10 by default. Increasing allowable parent node sizes enables the user to control tree growth and to potentially fit larger problems into a finite work space. This parameter is an example of forward pruning.
In backward pruning, a quality for a tree has to be defined in which the size of the tree also plays a role, as it stands for the complexity of the tree. A pruning sequence is constructed with the train file by starting with the maximal tree built until no more growing is possible due to perfect classification or due to a restriction in the minimum parent node side. From that initial tree the algorithm selects a non terminal node to eliminate it’s sons (i.e. prune) such that it will reduce the complexity of the tree and that will not affect too much the prediction accuracy. Like that the next tree in the pruning sequence is obtained, the procedure is repeated again and again until no more pruning is possible since we reached the root node. With the testing data, for each given tree in the pruning sequence the terminal nodes are assigned errors that have to be minimised over the pruning sequence. The tree in the pruning sequence with the smallest error for the test data is selected as the best tree with highest generalization power. The backward pruning approach is slower in computation time, but a more thorough exploration of possible partitions can be made, resulting in a better class probability tree. For this to be possible a test file or a cross validation technique is needed. Pruning will prune starting from the largest tree until a tree which minimises the error on the testing set (or the cross validation error) is found. Since the test set was not used to construct the tree this tree will probably generalize good if enough data of good quality was available for it’s construction.

The generalization property is one of the most interesting and important properties of trees. This property can be effectively observed and used in order to control the training process of a decision tree. One divides randomly the registered and classified data into two sets: the training set and the test set. The training set is used to construct a maximal tree during training (with no pruning), only limited by the minimum parent node size. As one prunes the branches of the built tree and evaluates the error on the training set the error always decreases. For the test set the error will decrease initially due to the train file overfit, but only until a certain point where it will increase again since if you keep on pruning the tree looses knowledge, and classifies poorly. An empirical study of the evolution of the average accumulated squared error between the desired and the actual output/classification for the test set as you evaluate the error in subsequent pruned trees shows that this error for the test set usually decreases first along with that for the training set, but from a certain point in time the test error starts to increase. If one chooses a bigger tree in the pruning sequence after this point, the resulting tree has a better performance for the training set, but not for the test set. In other words, the resulting tree doesn’t generalize well. This implies that at a certain point during training, the tree starts to learn more properties that are specific for the training set that are not representative for the test set. This adaptation after a certain point during training is called overtraining or memorization and is certainly to be avoided. In fact, one should continue to train until the average error of the test set is minimal. In many cases however users are tempted to use all the classified samples for training and leave no examples for testing. This is ineffective, because one has in this way no idea about the crucial generalization property of the generated decision tree.
The class priors represents the prior probability of each class. It can be viewed as the real percentage of cases in the population (if these are different from the sample). The Priors is one of the most important options you can set in shaping a classification analysis. The user can specify the prior class probabilities for classification trees. The option is ignored and disabled for regression trees.
The priors are a useful set of parameters, and intelligent selection and adjustment of them can assist in constructing a desirable classification tree. In some studies, the data set may be very unbalanced between classes. The priors can be used to adjust the individual class misclassification rates in any desired direction. For example, taking equal priors tends to equalize the misclassification rates. Putting a larger prior on a class will tend to decrease its misclassification rate, and vice versa.
Remember: the priors affect how accurate the user wants to be in a given class. E.g. using equal priors will force the model towards having the same error/accuracy for each class. On the other hand, if the prior is set high for a particular class, the error will be low for that class.
The Priors can be altered by selecting the appropriate menu item in the Priors menu of Shih Data Miner Model Parameters window. The user can choose out of 6 settings:
- Equal Priors: By default, the priors are set to equal, which means that the classes are treated as if they were uniformly distributed in the population regardless of their distribution in the sample. This default setting frequently gives the most satisfactory analyses because each class is treated as equally important for classification accuracy.
Note that with the equal priors default setting, Shih Data Miner pays no attention to how common or how rare a class is in the dataset. Instead, Shih Data Miner looks at whether a given node is more or less rich in that class compared to the root node. For example, consider a root note with 5 percent low birth weight (LBW) cases and 95 non-LBW cases. A child node with more than 5 percent LBW is classified as LBW even if the actual number of LBW cases is less than the number of non-LBW cases.
With equal priors, the probabilities reported for each node in the text output measure changes in relative risk. In the example above, the probability of LBW and non-LBW cases is 0.05 and 0.95, respectively, in the root node and the relative odds are 0.053 (0.05/0.95). If the reported probabilities in a child node are 0.20 and 0.80, the relative odds have changed to one-fourth (0.20/0.80) of the root node odds. Note that the probabilities do not mean that 20% of the cases belong to the LBW class.
- Train Priors: Priors match observed sample frequencies in training data only.
- Test Priors: Priors match observed sample frequencies in test data only.
- Data Priors [train + test]: Priors match observed sample frequencies in combined learn and test data
- Mix Priors [(data + equal)/2]: Priors set to the average of the DATA Priors and EQUAL Priors option. It uses a middle point between data priors and using equal priors. This is done in order to avoid preferences for frequent classes and at the same time take into consideration low frequent classes.
- User Specified Priors: Priors set to user-specified positive numbers in the Class Parameterstab. If the user selects to specify the priors, he/she must enter a value for each level of his/her target variable. All these values have to sum up to 1.
The user can enter these values by going to the Class Parameters tab, and altering the values of the Class Priors Array by double clicking on a cell of the array (or press space bar), changing the value, and pressing enter to confirm the specified value.
NOTE:
* tab can be used to go to the next cell
* shift tab to go to the previous cell
* space bar or double clicking to enter a cell so that its value can be changed
* enter or clicking outside the cell to confirm the user specified value.
Misclassification costs can be dealt with by directly incorporating them into the split formula as Symmetric Gini does or by altering the priors in such a way that the resubstitution error remains the same (except a constant ponderator) and working with a unit misclassification matrix in order to be able to use all the split algorithms that work with unit cost matrix, namely Twoing, Gini and Entropy. The idea is very simple, we just have to alter the priors in such a way that the Cost matrix is unitary and that the resubstitution cost (the cost estimated with the train data is not affected by the change). This is an attractive strategy since the priors can affect directly the classification accuracy of the classes taking into account the cost structure.
Suppose that in a 2 class problem with equal priors, it is twice as expensive to misclassify a class 2 case as it is to misclassify a class 1 case; that is, C(1|2) = 2, c(2|1) = 1. As compared to the equal cost situation, we want a tree that misclassifies fewer class 2 cases. Another way to look at it is that every case in class 2 misclassified counts double, so the situation is similar to that if the prior on class 2 is taken twice as large as the prior on class 1.
Pursuing this idea, let Q(i|j) be the proportion of the class j cases in the
training data classified as i by a tree T. Then the resubstitution estimate
(error calculated with the train data) for the expected misclassification cost
is:
R(T) = Σ C(i|j)Q(i|j)Prior(j)
Let Prior'(j) and C'(i|j)
be altered forms of Prior(j) and C(i|j) such that:
C'(i|j)Prior'(j) = α C(i|j)Prior(j)
(4.13)
Then R(T) remains the same, in the sense the error minimizing it will
yield the same results, when computed using Prior'(j) and C'(i|j)
Take C'(i|j) as the unit cost matrix and suppose that altered priors can
be found satisfying 4.13. Then the cost structure of T is equivalent, in the
above sense, to a unit cost problem with the Prior(j) replaced by the Prior'(j).
If the costs are such that for each class j, there is a constant misclassification cost C(j) regardless of how it is misclassified, that is, if
C(i|j) = C(j)
Then C'(i|j) can be taken as unit costs with the altered
priors :
Prior'(j) = C(j)Prior(j) / (Σ C(j)Prior(j)
) (4.14)
This suggests that a natural way to deal with a problem having a constant cost
structure C(j) for the j-th class is to redefine priors by 4.14 and proceed as
though it were a unit cost problem then for both gini, entropy and twoing, the
formula does not change, only the priors.
In general, the Prior'(j) should be chosen so that C'(i|j) are as
close as possible to unit cost. This has been implemented in SHIH (and in CART)
by defining the Prior'(j) through (4.14) using C(j) = Σ C(i|j)
This section explains how to use the software, by now you should have enough theory to easily understand this. The fist thing to do is to load the training data to build the model, after this one can set additional parameters that depend on the data. The additional parameters to be set are the misclassification costs, the priors the classification variable, the variables to use in the model and which automatically defined numerical variables will be used as categorical.
In the model parameters tab, some parameters can be set which can influence the tree that will be created, those parameters are eg. the minimum parent node size, sample percentage, looking forward, the algorithm that is used, the type of priors. By default the parameters will work in most of the cases. Better results can be obtained by changing some of them depending on the specific problem.
The minimum parent node size represents the minimum size below which a node will not be split. This value is set to ten by default. Increasing allowable parent node sizes enables the user to control tree growth and to potentially fit larger problems into finite memory resources. If you have enough resources it is recommended to put a low value since pruning will ultimately find the best tree and it the node size is small pruning will have a greater possibility of finding the optimum tree.
- Sample Proportion
Fraction of sample cases selected at random from the training file, in order to use for testing, and not for training. This is a number between 0 and 1, by default 0.33 of the cases will be used for testing.
If the user selects not to use an independent test file, he/she is able to control the tree building by using a test file that was created by taking a sample proportion of the training file which will only be used to test the model, and not to train the model.
- Sample Seed
Allows the user to set the random number seed and to specify whether the seed is to remain in effect after the tree is built. By default the seed is set to 13671. Using the same seed in another run will generate the same random sample. This is used to reproduce results. Changing the seed will generate a different sample.
Allows the user to set several “folds” to cross validate, in other words test, the constructed tree with all the data. This is used when there is not much data available and we want to use it all to construct the tree and at the same time have a good estimator of the real error. This means that the tree is constructed with all the data and the error is estimated the following way : the data is divided in, lets say 10 folds or partitions, then 10 trees are constructed by using 9/10ths of the data as train and the remaining 1/10th as test. Like this test data is generated to estimate the error without overfitting and all the data is used to construct the tree.
In the normal case, decision trees are built with local optimizations (greedy search). This means that, to split a node, the algorithm takes only local information into account. The algorithm selects the best split for a certain node, but does not look how this will affect the splits in the child nodes. To try to reach a solution that does not only focus on local information, we included the options to look forward one or two levels :
- Don’t Look Forward : The standard algorithm is used. Only the current split is taken into account to find the best split.
- Looking Forward One Level, all the variables : not only the split with the best variable at a level is considered, but the splits of all the variables (suppose number of variables is n) are calculated at level x.
With each of these possible splits, the best split is created at level (x+1), the total measure will be calculated regarding level x and x +1 now, i.e. not only taking into account the node at level x but also taking into account their sons, therefore giving the decrease of impurity of a grandfather node with respect to their grandchildren. At the end, the selected variable at level x will be the variable with the best total measure.
After that, the same procedure is calculated at level x+1, taking all variables into account at level x+1, and the best variable at level x+2.
Of course, these calculations will take O(n) times more time than the not looking forward case.
- Looking Forward Two Level, for a certain amount of variables : here, not only level x and x+1 are considered, but also level x+2 to decide which variable to use to split the node at level x. The problem is now that the calculation time of one split is now of O(n²), which is in most cases too long. Therefore, not all variables are considered, but only a certain amount of variables, e.g. the five best (by default).
In the variable selection tab, the user can select the variables that should be used to create the tree, which variables have to be treated as numerical and which as categorical (a numerical values could eg be interpreted as categories), and which variable is the target variable.
The default values are extracted from file, the application checks automatically which variables are originally categorical variables and which are originally numerical variables. The last variable in the data is set by default to be the target variable.
There are however some obvious restrictions. For example if a variable is set to be the target variable, it cant be set to don’t use, reciprocally if a variable is set as don’t use, it can’t be set as target. If a variable was originally categorical, it cant be used as a numerical variable.

The class parameters tab consists of two parts : the misclassification matrix, and the priors. These affect the builded tree in order to be more precise in classifying certain classes with respect to others.

This matrix indicates how much it costs if you have a misclassification. For example classifying as class A an example that is of class B. The diagonal should contain 0’s since the cost of classifying correctly is 0.
For example, misclassifying a person who has cancer is more costly, since the person can die without being treated, than misclassifying patients who don't have cancer, in which case the person will not dye, even though he will have a very unpleasant experience thinking he might.
Some one could argue that it is 5 times more costly to misclassify "has cancer" than to misclassify "doesn't have cancer". The relation is 5 to 1 for this example but this is a subjective decision from the modeler and his experience on the problem, other modelers might argue that the relation is much bigger since the risk of death has an infinite cost.
Supposing the user wants to assign a penalty of 5 to misclassifying a cancer-patient and a penalty of 1 to one who doesn't have cancer. Then he will use the misclassification costs matrix to set this values.
This means that it is possible to guide the tree in a certain way, you can prevent the tree of misclassifying the people which have cancer (in that case you will indicate too many people as having cancer, but the overall cost will be lower).
Default Cost settings: correct classifications are assigned a cost of zero and incorrect classifications a cost of one (applicable to classification and class probability trees only).
To change costs anywhere in the matrix, the user can change the default values by going to the Class Parameters tab, and altering the values of the Misclassification Matrix by double clicking on a cell of the matrix (or press space bar), changing the value, and pressing enter to confirm the specified value.
- tab can be used to go to the next cell
- shift tab to go to the previous cell
- space bar or double clicking to enter a cell so that its value can be changed
- enter or clicking outside the cell to confirm the user specified value.
This represents the prior probability of each class. It can be viewed as the real percentage of cases in the population.
The user can set the priors to user-specified positive numbers. If the user selects to specify the priors, he/she must enter a value for each level of his/her target variable in the Class Priors Array. All these values have to add up to 1.
The user can enter these values by going to the Class Parameters tab, and altering the values of the Class Priors Array by double clicking on a cell of the array (or press space bar), changing the value, and pressing enter to confirm the specified value.
NOTE:
* tab can be used to go to the next cell
* shift tab to go to the previous cell
* space bar or double clicking to enter a cell so that its value can be changed
* enter or clicking outside the cell to confirm the user specified value.
After pushing the create model button, the tree is built (this can take a while, depending on the size of the training file).
The user will see the following window :

This window displays all the results of the constructed tree model generated from the model parameters set in the Tree Builder Parameters window. The window is divided into 4 movable parts:
1. Tree Panel (upper left corner): this panel displays all the information related to one particular tree.
2. Node Info Panel (upper right corner): this panel displays all the information related to one particular active node, and related to the segmentation variables.
3. Pruning Sequence Panel (bottom left corner): this panel displays the pruning sequence that was generated from the model parameters which were set in the Model Parameters Window.
4. Tree Navigation Panel (bottom right corner): this panel displays a rescaled copy of the tree in the Tree View. The user can use the Tree Navigation Panel to easily navigate the big tree displayed in the upper left corner.
In this panel, the tree is drawn. The information in the nodes is :
- the variable used to split the node
- the classification of this node
- percentages of the classes (first in train data, then in test data)
- the amount of samples classified in this node (train test)


The table displays the following information for each terminal node :
Profile : terminal node reference number, corresponds to a profile given by the rule that defines the path
from the root node to the terminal node.
Class # cases : Number of cases in the node that are class >50K.
Class % Profile : Percentage of cases in the node that are class >50K.
Class % in Pop. : Number of class >50K cases in the node as a percentage of the total number of class >50K
cases in the data (in this case in the train data since train is selected).
Cum % Class : Cummulative number of class >50K cases as a percentage of the total number of cases in
the population (in this case in the train population since train is selected).
Cum % Pop : Cummulative percentage of the total number of cases in the population.
Pop % profile : Percentage of cases in the profile (node) with respect to the population.
Cases in profile : Total number of cases in the node.
Cum. lift : Cumulative percentage of class >50K cases divided by the cumulative share of the number of
total cases. i.e. Cum % Class / Cum % Pop
Lift index : Percentage of class >50K cases in the node divided by the percentage of the total number of cases
in the node. i.e. Class % in Pop. / Pop % profile
The classification info panel shows how good each class is classified, it shows the error and correctness percentage for each class.

The scoring data shows for each example in the train and test file how the classification of that example is in the tree (prediction), how it should be (real) and what rule was applied for this classification.

The panel at the right top can be used to see how a variable performs in that node and to force a split of a node on a specified variable at a certain value (cut off)

To be able to use the split panel, you select a node and click on GOS (Goodness of Split).
Then you see a list of variables that can be used to split that node (if NA is written after the variable, this variable is not applicable anymore). After the variable you can see C or N, which means numerical or categorical variable, and a value, which is an indication of how good it is to use that variable to split the node, the higher the better.
Now you can select one (numerical) variable, and click on the icon under the list. What you see then is the goodness of split for each possible value of that variable (yellow : actual value, blue : normalized).
What you can do now is e.g. change the cut off. You just have to change the variable in the text field. You can see how much this affects the goodness of split (click on GOS on CO (cut off)). If you actually want to perform the split, click Split.
The graph indicates the difference in classification between the test and the training data (in that node).
To start creating a tree manually from scratch, the user has to prune the first node. The tree can be completely built up again as wanted by the user. If you want to start building manually from a certain node, just prune that node and start from there. Pruning can be done by right clicking the mouse on a node.
After splitting a node manually, you can continue splitting either manually or let it be done automatically.
What you see at the left bottom is a sequence of trees with its test cost (test cost is the percentage of errors on the test data).
Sequence of trees? Indeed, but to understand this, it is better to give a short explanation about what happens if you build a tree.
A tree is built till no nodes can be splitted anymore (not enough cases in that node, or all the cases have the same class).
The problem is then that the tree might be overfitted to the data it learned with. Therefore, the tree is pruned using an independent sample. This means that branches which don’t give much information to a tree are cut. Like that the complexity and the cost on the testing data go down until a point when it starts going up again. That point is the point that define the optimal tree.
The tree that is selected is then the best tree in the pruning sequence (and is shown in the tree window).
If a tree is at some point built manually and then further constructed automatically, the tree is too big; therefore, it is possible to create the pruning sequence for this tree, which will generate the best sub tree of that tree, the one that minimizes the error or cost in the independent test sample. To create this sequence, just click the icon on the right side of the pruning sequence graph to get the new pruning sequence.
In the pruning sequence panel the user has the possibility to store a tree temporally, this is not to disk, but only for playing with the tool. Like this the user can compare with previously created trees and switch between several temporally stored trees.

With the little zoomed view, it is very easy to have an overview of the tree. It is also possible to explore the tree by dragging the mouse inside this panel. While dragging, the general view of the tree will also be updated, in that way it is very easy to select nodes or look at certain parts of the tree.

Saving rules
Once the tree is built, you can generate the rules of that tree in sql or if then rules.
Therefore go to the tree menu and select save rules. There is also the option to store the path save path. The difference between the path and the rule of a terminal node is that a rule is the compacted way of writing the path. This means that if e.g. a variable is used twice along a path, the rules are taken together in that case.
Suppose e.g. that we have first AGE < 70 and more down in the path AGE > 25, then in the rules, the user will see 25 < AGE < 70. In the path, both rules will exist.
The save model menu option gives the option to save the tree to file. Like that the user is able to load that tree in a later session.
Saving the tree in PMML format
By selecting this item the tree is saved in a format that is human readable and follows the industry standard. Write a name that ends with xml so that it can be interpreted by xml browsers (such as Mozilla or Internet Explorer) and click on the button save model to save the model in xml (PMML) format. This model format is the standard way of specifying models in the PMML (Predictive Modeling Markup Language) language which is an XML language to specify predictive models such as regression, association rules, decision trees, neural networks and more. The complete specification and examples can be found at www.dmg.org, more specific in http://www.dmg.org/v2-0/GeneralStructure.html. One the tree model is saved it can be viewed by any xml browser. Even by a normal text editor since the model is saved in an xml file which is a text file.
The goal of saving the tree is in order to use it to predict new data by selecting the model file and the new data. This is done by selecting File -> Predict with XML tree in the main application window. This will pop up a window where you can select the data to score (in the same format as the data used to build the model) and the XML model file. The score results will be saved with the same name as the data score with the added extension .scr
Computing variables statistics
This shows the variables used in the tree and their statistics for both the training and the testing files. For numerical variables it shows the average and the standard deviation and for categorical values it shows the values counts.
The exit option closes the window.
The node menu gives some operations that have to be performed on a terminal node :
- split one level splits the node one level, with the best variable for this node (best node found by the algorithm)
- split automatic splits the node till there are no more descendants splittable anymore.
- prune prunes a node.
These options are also available when the user right clicks on a node.
The branch menu gives the operation that can be performed on any node, either on an internal or a terminal node :
- expand one level : expands the terminal nodes of the sub tree starting at the selected node with one level.
- expand two levels : expands the terminal nodes of the sub tree starting at the selected node with two levels.
- expand three levels : expands the terminal nodes of the sub tree starting at the selected node with three levels.
- expand all : expands the subtree starting at the selected node till no more terminal nodes can be splitted.
- prune : prune a branch.
In the detail menu the user can set way he wants to see the tree.
There are tree options :
- class breakdown : shows the details of the classification in every node (this is the default view if the target variable has less than twenty values)
- class distribution : highlights the nodes which have most chance to be classified as a certain class. This class can be selected in the lift curves tab. The probability of being classified as that class is indicated with a color, from white to a specified color. The nodes with 0% probability to be classified as that class will be white.

- Low : in the low detail view, no detailed information is shown about the probabilities of the different classes. This is the default view for trees with a classification variable that has more than twenty different values.

The zoom menu can show the tree in different sizes, so that is eg. to see first the big structure, and later on see more details of the nodes (or vice versa of course). Another responsibility is to attach or detach the small tree view to the tree panel.
- Detach zoomed view : detach the mini tree panel.
- Attach zoomed view : attach the mini tree panel.
- View on top : shows the mini tree panel on top of the tree panel.
- 50% 150% zooming in or out the tree view with a certain percentage.
- default size sets the size of the tree back to the default value.
Sets the color of the nodes.
In this menu the user can change the view of the tree panel. This means, the splitting control center and the pruning sequence can be removed or added to the panel.
- hide/show pruning sequence : removes or adds the pruning sequence from the panel.
When the pruning sequence is removed, this is the view the user gets :

- hide/show control panel : with this option, the splitting control center is removed from or added to the tab. This is the view the user gets :

when both are combined, the user has the possibility to only see the full tree :

In the first case, a very small file is used. We compare what can be the difference between automatic creation and manual creation of a tree.
We try to predict if a house will be priced in the top 20% or the bottom 80%, this with regard to some variables, eg. the crime level in the region.
The automatic tree looks like this :

Everything is created automatically and then the best tree is found with the pruning sequence. In the root node, the best variable seems to be var5.
Now, instead of taking the best variable, we split the node on the second best variable (var11) and create the best tree then again with first automatically creating the complete tree and pruning it afterwards.
This is the resulting tree :

The tree looks a bit different (one extra split). What is more important is the difference in test and train cost : 0.1 and 0.0975, which is remarkably better than the default case where we had 0.1061 and 0.1451.
Breiman, L., Friedman, J.H., Olshen R.A., & Stone, C.J. (1984). Classification and regression trees (pp. 112). New York: Chapman and Hall